Initializing a new project
Before starting, ensure you are familiar with the core concepts of dlt, as this will provide essential context for working with dlt projects.
Overview
The dlt project init command is a powerful way to create a structured starting point for a dlt project. A dlt project revolves around a YAML specification that defines the entities and configurations needed to build pipelines, sources, and destinations, as described in the core concepts.
Creating a new project
Generating your first project
Start by creating a new folder for your project. Then, navigate to the folder in your terminal.
mkdir tutorial && cd tutorial
Run the following command to initialize a new dlt project:
# Initialize a dlt project named "tutorial"; the name is derived from the folder name
dlt project init arrow duckdb
This command generates a project named tutorial with:
- One pipeline
- One Arrow source defined in sources/arrow.py
- One DuckDB destination
- One dataset on the DuckDB destination
The generated folder structure
After running the command, the following folder structure is created:
.
├── .dlt/            # your dlt settings, including profile settings
│   ├── config.toml
│   ├── dev.secrets.toml
│   └── secrets.toml
├── _storage/        # local storage for your project, excluded from git
├── destinations/    # your destinations, empty in this example
├── sources/         # your sources, contains the code for the arrow source
│   └── arrow.py
├── .gitignore
└── dlt.yml          # the main project manifest
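The exact contents of sources/arrow.py come from the template and may change between versions. As a rough sketch only, such a source could look like the following, assuming a single items resource that yields a PyArrow table; the column names and the row count of 100 are chosen to match the sample output shown later in this tutorial:

import random

import dlt
import pyarrow as pa


@dlt.source
def source():
    # a single resource named "items" yielding one Arrow table;
    # the values are made up for illustration
    @dlt.resource
    def items():
        yield pa.table(
            {
                "id": [100] * 100,
                "name": [random.choice(["jim", "alice", "jerry", "jenny"]) for _ in range(100)],
                "age": [random.randint(18, 70) for _ in range(100)],
            }
        )

    return items

Note how the type sources.arrow.source in dlt.yml below points at this source function by its module path.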
Understanding dlt.yml
The dlt.yml file is the central configuration for your dlt project. It defines the pipelines, sources, and destinations. In the generated project, the file looks like this:
# project settings
project:
  name: template
profiles:
  # profiles allow you to configure different settings for different environments
  dev: {}
# your sources are the data sources you want to load from
sources:
  arrow:
    type: sources.arrow.source
# your destinations are the databases where your data will be saved
destinations:
  duckdb:
    type: duckdb
# your datasets are the datasets on your destinations where your data will go
datasets: {}
# your pipelines orchestrate data loading actions
pipelines:
  my_pipeline:
    source: arrow
    destination: duckdb
    dataset_name: my_pipeline_dataset
If you do not want to start with a source, destination, and pipeline, you can simply run dlt project init tutorial. This will generate a project with empty sources, destinations, and pipelines.
The following project variables are substituted during loading:
- project_dir - the root directory of the project, i.e. the directory where the dlt.yml file is located
- tmp_dir - the directory for storing temporary files; can be configured in the project section as seen above, by default set to ${project_dir}_storage
- name - the name of the project, can be configured in the project section as seen above
- default_profile - the name of the default profile, can be configured in the project section as seen above
- current_profile - the name of the current profile, set automatically when a profile is used
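Conceptually, this substitution behaves like simple template expansion. The Python snippet below is an illustration only (dlt performs the expansion internally while loading dlt.yml) and uses a hypothetical project root:

from string import Template

# illustration only: dlt expands ${...} placeholders in dlt.yml values,
# conceptually similar to Python's string.Template substitution
default_tmp_dir = Template("${project_dir}_storage").substitute(
    project_dir="/home/user/tutorial/"  # hypothetical project root
)
print(default_tmp_dir)  # /home/user/tutorial/_storage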
Some details about the project structure above:
- The project section could be fully omitted in this case; it is generated to make the default settings explicit.
- The runtime section is analogous to the [runtime] section of config.toml and could also be omitted in this case.
- The profiles section is not doing much in this case. There are two implicit profiles, dev and tests, that are present in any project; we will learn about profiles in more detail later.
Understanding the basics of the project context
The dlt.yml file marks the root of a project. Projects can also be nested. If you run any dlt project CLI command, dlt will search for the project root in the filesystem tree, starting from the current working directory, and run all operations on the project it finds. So if your dlt.yml is in the tutorial folder, you can run dlt pipeline my_pipeline run from this folder or any of its subfolders, and it will run the pipeline on the tutorial project.
Running the pipeline
Once the project is initialized, you can run the pipeline using:
dlt pipeline my_pipeline run
This command:
- Locates the pipeline named my_pipeline in dlt.yml
- Executes it, populating the DuckDB destination, which is configured to store its database in ${tmp_dir}my_data.duckdb
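For orientation, the runner's behavior corresponds roughly to the following core dlt calls. This is a sketch only, assuming it is executed from the project root; the real runner additionally applies the settings and active profile from dlt.yml:

import dlt

# import the generated source; assumes the current working directory is the
# project root so that the sources package is importable
from sources.arrow import source

pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="duckdb",
    dataset_name="my_pipeline_dataset",
)
# load the source, roughly what `dlt pipeline my_pipeline run` triggers
info = pipeline.run(source())
print(info)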
Inspecting the results
Use the dlt dataset command to interact with the dataset stored in the DuckDB destination. For example:
Counting the loaded rows
To count rows in the dataset, run:
dlt dataset my_pipeline_dataset row-counts
This shows the number of rows in the items table produced by the arrow source. The internal dlt tables are shown as well.
            table_name  row_count
0                items        100
1         _dlt_version          1
2           _dlt_loads          2
3  _dlt_pipeline_state          1
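You can double-check these counts directly against the DuckDB database. The file path below is an assumption based on the ${tmp_dir} default described above; adjust it if your destination stores the database elsewhere:

import duckdb

# path assumed from the ${tmp_dir} default; dlt loads the items table into a
# schema named after the dataset
con = duckdb.connect("_storage/my_data.duckdb")
print(con.sql("SELECT count(*) AS row_count FROM my_pipeline_dataset.items"))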
Viewing the data
To view the first five rows of the items table:
dlt dataset my_pipeline_dataset head items
This displays the top entries in the items table, enabling quick validation of the pipeline's output. The output will look something like this:
Loading first 5 rows of table items.
    id   name  age       _dlt_load_id         _dlt_id
0  100    jim   56  1737465323.617184  /qaxfQ/rbD/KcQ
1  100  alice   39  1737465323.617184  H996PcWDbMuDbQ
2  100  jerry   64  1737465323.617184  R27cQDLTQQ+dxg
3  100  jenny   50  1737465323.617184  9eKG60Ok0fbTpA
4  100  jerry   51  1737465323.617184  Wj9m7VGQzzLi3w
To show more rows, use the --limit flag:
dlt dataset my_pipeline_dataset head items --limit 50
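The same data can also be read from Python with dlt's dataset access API. A sketch, assuming the pipeline has already run and that this is executed from the project directory so the pipeline's working state can be found:

import dlt

# attach to the pipeline by name and read from the dataset it loaded
pipeline = dlt.pipeline(pipeline_name="my_pipeline")
items = pipeline.dataset()["items"]
print(items.head().df())     # first five rows, like `head items`
print(items.limit(50).df())  # more rows, like `--limit 50`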