
Initializing a new project

Before starting, ensure you are familiar with the core concepts of dlt, as this will provide essential context for working with dlt projects.

Overview

The dlt project init command is a powerful way to create a structured starting point for a dlt project. A dlt project revolves around a YAML specification, defining the entities and configurations needed to build pipelines, sources, and destinations as described in the core concepts.

Creating a New Project

Generating your first project

Start by creating a new folder for your project. Then, navigate to the folder in your terminal.

mkdir tutorial && cd tutorial

Run the following command to initialize a new dlt project:

# Initialize a dlt project named "tutorial"; the name is derived from the folder name
dlt project init arrow duckdb

This command generates a project named tutorial with:

  • One pipeline
  • One Arrow source defined in sources/arrow.py
  • One DuckDB destination
  • One dataset on the DuckDB destination

The Generated Folder Structure

After running the command, the following folder structure is created:

.
├── .dlt/            # your dlt settings including profile settings
│   ├── config.toml
│   ├── dev.secrets.toml
│   └── secrets.toml
├── _storage/        # local storage for your project, excluded from git
├── destinations/    # your destinations, empty in this example
├── sources/         # your sources, contains the code for the arrow source
│   └── arrow.py
├── .gitignore
└── dlt.yml          # the main project manifest
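
The sources/arrow.py module contains the Python code behind the arrow source that dlt.yml refers to. As a rough illustration, a source of this kind could look like the sketch below; the actual code generated by dlt project init may differ, and the column names and row count used here are only assumptions:

# sources/arrow.py - illustrative sketch only; the generated template may differ
import random

import pyarrow as pa

import dlt


@dlt.source
def source():
    """A source with a single resource that yields an in-memory Arrow table."""

    @dlt.resource(name="items")
    def items():
        # Build a small Arrow table with a few example columns.
        names = ["jim", "alice", "jerry", "jenny"]
        yield pa.table(
            {
                "id": list(range(100)),
                "name": [random.choice(names) for _ in range(100)],
                "age": [random.randint(20, 70) for _ in range(100)],
            }
        )

    return items

The `type: sources.arrow.source` entry in dlt.yml below points at exactly this kind of callable: the `source` attribute of the `sources.arrow` module.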

Understanding dlt.yml

The dlt.yml file is the central configuration for your dlt project. It defines the pipelines, sources, and destinations. In the generated project, the file looks like this:

# project settings
project:
  name: tutorial

profiles:
  # profiles allow you to configure different settings for different environments
  dev: {}

# your sources are the data sources you want to load from
sources:
  arrow:
    type: sources.arrow.source

# your destinations are the databases where your data will be saved
destinations:
  duckdb:
    type: duckdb

# your datasets are the datasets on your destinations where your data will go
datasets: {}

# your pipelines orchestrate data loading actions
pipelines:
  my_pipeline:
    source: arrow
    destination: duckdb
    dataset_name: my_pipeline_dataset
Tip

If you do not want to start with a source, destination, and pipeline, you can simply run dlt project init tutorial. This will generate a project with empty sources, destinations, and pipelines.

The following project variables are substituted when the project configuration is loaded:

  • project_dir - the root directory of the project, i.e., the directory where the dlt.yml file is located
  • tmp_dir - the directory for storing temporary files; it can be configured in the project section as seen above and defaults to ${project_dir}_storage
  • name - the name of the project, which can be configured in the project section as seen above
  • default_profile - the name of the default profile, which can be configured in the project section as seen above
  • current_profile - the name of the current profile; this is set automatically when a profile is selected

Some details about the project structure above:

  • The project section could be omitted entirely in this case; it is generated to make the default settings explicit.
  • The runtime section is analogous to the [runtime] section of config.toml and could also be omitted in this case.
  • The profiles section does not do much in this case. Two implicit profiles, dev and tests, are present in any project; we will learn about profiles in more detail later.

Understanding the basics of the project context

The dlt.yml file marks the root of a project. Projects can also be nested. When you run any dlt project CLI command, dlt searches the filesystem tree for the project root, starting from the current working directory, and runs all operations on the project it finds. So if your dlt.yml is in the tutorial folder, you can run dlt pipeline my_pipeline run from that folder or any of its subfolders, and it will run the pipeline on the tutorial project.

Running the Pipeline

Once the project is initialized, you can run the pipeline using:

dlt pipeline my_pipeline run

This command:

  • Locates the pipeline named my_pipeline in dlt.yml
  • Executes it, populating the duckdb destination, which is configured to store its data in ${tmp_dir}my_data.duckdb (a quick way to verify this from Python is sketched below).
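
If you prefer to verify the load programmatically, you can open the database file with the duckdb Python package and list the tables that were created. This is only a minimal sketch; the file path _storage/my_data.duckdb is an assumption based on the default tmp_dir, and the duckdb package is not part of the generated project:

import duckdb

# Path assumed from ${tmp_dir}my_data.duckdb with the default tmp_dir (the _storage folder).
con = duckdb.connect("_storage/my_data.duckdb", read_only=True)

# List the schemas and tables created by the pipeline run.
for schema, table in con.sql(
    "SELECT table_schema, table_name FROM information_schema.tables"
).fetchall():
    print(f"{schema}.{table}")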

Inspecting the Results

Use the dlt dataset command to interact with the dataset stored in the DuckDB destination. For example:

Counting the loaded rows:

To count rows in the dataset, run:

dlt dataset duckdb_dataset row-counts

This shows the number of rows in the items table produced by the arrow source. Additionally, the internal dlt tables are shown.

            table_name  row_count
0                items        100
1         _dlt_version          1
2           _dlt_loads          2
3  _dlt_pipeline_state          1

View Data

To view the first five rows of the items table:

dlt dataset duckdb_dataset head items

This displays the top entries in the items table, enabling quick validation of the pipeline's output. The output will be something like this:

Loading first 5 rows of table items.
    id   name  age       _dlt_load_id         _dlt_id
0  100    jim   56  1737465323.617184  /qaxfQ/rbD/KcQ
1  100  alice   39  1737465323.617184  H996PcWDbMuDbQ
2  100  jerry   64  1737465323.617184  R27cQDLTQQ+dxg
3  100  jenny   50  1737465323.617184  9eKG60Ok0fbTpA
4  100  jerry   51  1737465323.617184  Wj9m7VGQzzLi3w

To show more rows, use the --limit flag.

dlt dataset duckdb_dataset head items --limit 50
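
You can also pull the same data into a pandas DataFrame with a plain SQL query, for example through the duckdb Python package. As in the earlier sketch, the file path and the schema name my_pipeline_dataset are assumptions derived from the configuration in dlt.yml, not part of the generated project:

import duckdb

con = duckdb.connect("_storage/my_data.duckdb", read_only=True)

# Read the first 50 rows of the items table into a DataFrame;
# adjust the schema name if your dataset is named differently.
df = con.sql("SELECT * FROM my_pipeline_dataset.items LIMIT 50").df()
print(df.head())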

Learn more

Next chapter: Adding entities

