
dlt+ package tutorial

info

dlt+ is under active development and is not yet stable. For this reason, the documentation is split into a conceptual overview and a tutorial example repository with a package that showcases all current features.

In this tutorial, we will go through the steps of understanding the dlt+ package from the example repository, how to run CLI commands that execute the entities defined in the package, and how to extend the package with your own sources, pipelines, transformations, and profiles.

A short glossary

In addition to the entities and concepts introduced in the main docs glossary, we will introduce a few more that are specific to dlt+.

dlt+ package

A dlt project with an opinionated structure: a YAML/Python manifest file (dlt_project.yml) that declares dlt sources, destinations, pipelines, transformations, and other entities. It also introduces profiles to switch easily between dev, staging, prod, and personal configurations.

dlt+ profile

A dlt+ profile is a configuration for a dlt+ package. It is defined in the dlt_project.yml file and can be selected when running dlt+ commands to switch between different environments (dev, staging, prod) or personal configurations. One package may have multiple profiles.

Dataset

While datasets are already present in the open-source version of dlt, they are an important concept in the dlt+ package. A dataset is a physical collection of data and dlt metadata (including the schema) on a destination. One destination can hold multiple datasets; for now, datasets are bound to a physical destination, but this may change in future iterations.

dlt+ cache

The dlt+ cache is essentially a local DuckDB database instance. The cache is automatically populated from a dlt dataset and pushed to the same or another dlt dataset after transformations are completed. Currently, the cache always has one input and one output dataset. The cache may also be used by data engineers and scientists to run local transformations on a subset of remote data for analytical purposes, without ever pushing the data back to a remote destination. The cache can also reference remote data that is only retrieved on demand when a transformation needs it.

dlt+ transformation

A dlt+ transformation is a collection of functions that modify the data stored in a dlt+ cache, either by running SQL commands or by extracting, working on, and yielding Arrow tables or pandas dataframes. Together, the cache and transformations can be used for downstream processing: data that dlt has loaded to a given bucket is transformed and written to a new destination.

Quickstart

Setup and run

  1. The package tutorial uses the uv Python dependency manager. Install uv by following the official installation instructions:
# example for linux/macos
curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Clone the dlt+ package tutorial repository:
git clone https://github.com/dlthub/dlt-package-tutorial
  3. Install the dependencies by running the make dev command in the repository root:
make dev
  4. Download the GitHub example data by running the make download-example-data command in the repository root:
make download-example-data

The data will be stored in the ./_data directory.

  5. Run the make run command to load some data into an example dataset, run transformations on it, and push it back to another example dataset:
make run

You will now see two folders in the ./_storage directory, corresponding to the two filesystem-based datasets used in this example.

Run command breakdown

Here is a quick breakdown of what dlt does when you run make run. You can see the individual commands being executed in the Makefile.

We are using example GitHub Archive data for this demo, which you downloaded with make download-example-data, and we will run some SQL- and Arrow-table-based transformations on it.

  • dlt project clean - Removes any cached files from a possible previous run.

  • dlt pipeline -l - Lists all pipelines defined in the project.

  • dlt pipeline events_to_lake run - Runs the events_to_lake pipeline, which loads the GitHub data into a delta table in the events_lake folder.

  • dlt transformation . run - Populates the local cache and runs the defined transformations on this cache. These transformations are defined in the transformations folder. Then, the transformed tables are flushed to the csv_results_dataset folder.

  • dlt dataset csv_results_dataset info - Lists information about the dataset in the csv_results_dataset folder, which is the final destination of the transformed data.

A closer look at the example package

All the above commands use entities and relationships defined in the dlt_project.yml file. Let's have a look at the package and the files within it.

The folder structure

The example package has the following folder structure:

dlt_example_package/
├── dlt_project.yml # the main manifest file
├── __init__.py # the main package init file
├── sources/ # contains code for dlt sources
├── transformations/ # contains transformation code

The main manifest file dlt_project.yml

This file contains the main configuration for the package. It defines all the dlt entities that make up the package. The sections of this file are described below.
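
To orient yourself, the manifest roughly groups these sections under top-level keys like the ones below. This is an illustrative skeleton only; the authoritative version is the dlt_project.yml in the example repository.

sources:
  # dlt source declarations
destinations:
  # dlt destination declarations
datasets:
  # datasets bound to destinations
pipelines:
  # pipelines wiring sources to datasets
caches:
  # local cache definitions
transformations:
  # transformations operating on caches
profiles:
  # per-environment overrides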

Sources

The sources section defines dlt sources, which can be used from pipeline objects in a similar way to how you would use dlt sources in a pure Python dlt project. You can see how the events source references a Python implementation supplied in the sources folder.
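
As an illustration, a source entry could look roughly like this; the type key and the module path are assumptions, so check the actual entry in the repository's dlt_project.yml:

sources:
  events:
    # hypothetical reference to the Python implementation in the sources/ folder
    type: sources.events.source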

Destinations

The destinations section defines dlt destinations in a similar way to how you would define them in a pure Python dlt project. The example package contains two named filesystem destinations. Destinations are referenced by datasets. You can see the default local folders defined for the destinations in the file.
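
A sketch of what the two filesystem destinations could look like; apart from delta_lake, the destination names, the type key, and the local paths are illustrative assumptions:

destinations:
  delta_lake:
    type: filesystem
    bucket_url: ./_storage/events_lake
  csv_destination:
    type: filesystem
    bucket_url: ./_storage/csv_results_dataset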

Datasets

The datasets section defines datasets that live on a destination defined in the destinations section. Pipelines load data into datasets, and the cache both populates from them and writes results back to them. In this example, we have two datasets, one on each destination.
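
A hedged sketch of the datasets section; the dataset and destination names (apart from csv_results_dataset) and the exact keys are assumptions:

datasets:
  events_lake_dataset:          # placeholder name for the input dataset
    destination:
      - delta_lake
  csv_results_dataset:
    destination:
      - csv_destination         # placeholder destination name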

Pipelines

Pipelines can be used to load data from sources to destinations. The example package contains one pipeline, events_to_lake, which loads data from the events source to a delta table in the events_lake folder on the delta_lake destination.
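
The pipeline entry could look roughly like this; apart from the events_to_lake and events names taken from the text above, the keys and the dataset name are assumptions:

pipelines:
  events_to_lake:
    source: events
    destination: delta_lake
    dataset_name: events_lake_dataset   # placeholder; use the dataset name from your manifest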

Caches

The cache defined in the example governs which tables are loaded from the input dataset (the delta_lake dataset) and which tables are written to the output dataset (the csv_results_dataset).
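
A rough, illustrative sketch of a cache definition; the inputs/outputs structure and the table mappings shown here are assumptions, only meant to convey the idea of one input and one output dataset:

caches:
  github_events_cache:
    inputs:
      - dataset: events_lake_dataset     # placeholder input dataset name
        tables:
          events: events                 # remote table -> cache table
    outputs:
      - dataset: csv_results_dataset
        tables:
          aggregated_events: aggregated_events   # cache table -> output table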

Transformations

In the transformations section, there is one Arrow-based transformation that operates on the cache defined above. The actual transformation code lies in the transformations folder.
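
Sketched as an assumption, the transformation entry ties a transformation type to a cache, roughly like this:

transformations:
  github_events_transformations:
    engine: arrow                  # assumed key for the transformation type
    cache: github_events_cache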

Profiles

Profiles are inheritable configuration settings that can be conditionally applied to dlt+ CLI commands. For example, you could run loads without a profile in this example, and the filesystem destinations would use the local folders defined directly on the destination objects. If you select the prod profile, dlt connects to the S3 bucket defined in that profile, provided you have supplied the required credentials in secrets.toml.
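
A hedged example of how a prod profile might override the filesystem destination with an S3 bucket; the override structure and the placeholder bucket URL are assumptions, and credentials still come from secrets.toml:

profiles:
  dev: {}                          # no overrides: local folders from the destination definitions are used
  prod:
    destinations:
      delta_lake:
        bucket_url: s3://<your-bucket>/events_lake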

Misc settings

It is also possible to add further dlt settings to this file that mirror the config.toml settings. In this example, we are changing the normalizer behavior for our purposes.

The main package init file ./__init__.py

The example package is set up to be used as a starting point for your own pip-installable package. The __init__.py file contains the main dlt package configuration and some helper functions. You can read the docstrings in the file for more information on what it does.

Available dlt+ CLI commands

Having the dlt+ package installed provides a few additional dlt+ CLI commands that orchestrate the entities defined in your package. These are:

  • dlt project - commands for managing dlt+ packages
  • dlt pipeline - commands for managing and running dlt+ pipelines
  • dlt dataset - commands for inspecting dlt+ datasets
  • dlt transformation - commands for managing and running dlt+ transformations
  • dlt cache - commands for managing and syncing dlt+ caches

Each of these commands has a few subcommands. You can discover these by running:

dlt transformation --help
# or if you need to run within the uv context
uv run dlt transformation --help

Commands that take an entity name may accept a . as the name, which selects the first (or only) entity found. In this example, running:

dlt transformation . run

will run the one transformation found in this package.

Selecting a different profile when executing dlt+ commands

You can add --profile <profile_name> to a dlt+ CLI command to select a different profile. If you do not provide a profile, the command will run without a profile.

dlt transformation --profile prod . run 

The command above runs the transformation with the prod profile applied.

Case study: Extending the example package with a new set of transformations

In this section, we walk through how to provide a different set of transformations on the same data.

1. Duplicate the existing cache and transformation config

Create a new cache named github_events_cache_2 with the same content as the existing github_events_cache in your dlt_project.yml file.

Create a new transformation named github_events_transformations_2 as a copy of github_events_transformations in your dlt_project.yml file. Make sure that this new transformation references your new cache, not the old one. Keep the transformation type the same; you could switch to dbt, but for this tutorial we will stick with Arrow tables.
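
Sketched below is roughly how the duplicated entries could look in dlt_project.yml; the engine and cache keys are assumptions, so copy the exact structure from the existing entries in your file:

caches:
  github_events_cache_2:
    # copy of the github_events_cache entry (same inputs and outputs)

transformations:
  github_events_transformations_2:
    engine: arrow                    # keep the same transformation type
    cache: github_events_cache_2     # reference the new cache, not the old one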

2. Implement the new transformations

The dlt+ CLI provides a handy tool to scaffold your new transformations. You need to already have data in your input dataset; otherwise, dlt will not know which schema to scaffold from. Run the render-t-layer command on your transformation:

uv run dlt transformation github_events_transformations_2 render-t-layer

A new set of transformations will now appear in your ./transformations folder. It contains transformation functions that manage incremental loading state based on the dlt_load_id, plus two transformation functions that form the actual implementation: a staging view, which pre-selects only the rows eligible for the current transformation run, and the main output table, which for now simply forwards all incoming rows unchanged.

Please note that if you run these new transformations without making any changes, the run will fail: your cache expects an aggregated table that exists in the provided github_events_transformations but not in your newly scaffolded transformations. You can either update the settings in your copied cache or implement a transformation that produces the expected table.

3. Understanding incremental transformations

The default transformations generated by the scaffolding command work incrementally, using the load_id from the incoming dataset. The dlt_loads table is automatically made available on the cache, so the transformation layer can see which load_ids exist in the incoming dataset and select only those that are not yet recorded in the processed_load_ids table of the transformed dataset. After all transformations have completed, the processed_load_ids table is updated with all load_ids processed in the current run, and the cache saves it to the output dataset. When syncing the input dataset, the cache also loads the processed_load_ids table from the output dataset if it is present. This way, incremental transformation also works on ephemeral machines where the cache is not retained between runs.

