Usage
dlt+ is under active development and is not yet stable. For this reason, the documentation is split into a conceptual overview and a tutorial example repository with a package that showcases all current features.
New building blocks
Built on top of open-source dlt, dlt+ introduces two new building blocks:
- dlt package with an opinionated structure: a yaml/Python manifest file (dlt_project.yml) to declare dlt sources, destinations, pipelines, transformations, etc. It also introduces profiles to easily switch across dev, staging, prod, and personal configurations.
- local transformation cache which automates data transformations the same way dlt pipelines automate extract and load.
dlt+ also extends datasets:
- dlt datasets with contracts: existing dlt datasets become an independent entity that provides convenient data access and transformations, and accepts granular data and schema contracts that may be applied per individual data user.
dlt package
Currently, the package layout is fully compatible with a standard Python package approach and may be distributed via PyPI or from a git repository.
- It contains pyproject.toml, which is the Python manifest of the package. It specifies the dependencies, source files, and the package build system.
- It contains dlt_project.yml, which is a manifest of data platform entities: sources, destinations, datasets, pipelines, transformations, etc.
- It contains Python modules with source code and tests. We propose a strict layout of the modules (i.e., source code lives in the ./sources folder, etc.); see the layout sketch below.
The package and project manager uv is used for packaging.
[🚧 WIP!] dlt init can create a new package with everything you need to develop, test, run, and deploy.
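A minimal sketch of that layout, assembled only from the files and folders mentioned on this page (module names and nesting may differ in practice):

```
my_dlt_package/
├── pyproject.toml          # Python package manifest: dependencies, build system
├── dlt_project.yml         # manifest of sources, destinations, datasets, pipelines, ...
├── .dlt/
│   ├── config.toml         # local configuration
│   └── dev.secrets.toml    # secrets scoped to the dev profile
├── sources/                # source code of dlt sources
└── tests/                  # package tests
```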
dlt_project.yml and profiles
The manifest file declares data platform entities.
The manifest supports profiles that override the top-level platform entities. The default profile name is dev. If you want to use the dlt test helpers, you must define a tests profile as well.
On top of that:
- You can include other files, both Python and yaml.
- You can refer to secret values via $VAR.
Python interface to use and share the data
The dlt package exposes a standard Python API. When used, it automatically switches to the access profile, which provides a connection to production data. Access is limited by the data and schema contracts defined on top of dlt datasets.
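The package-level Python API is still evolving, so the sketch below falls back to the open-source dlt dataset interface as a rough analogue of this kind of access; the pipeline, dataset, and table names are placeholders.

```python
import dlt

# Attach to the pipeline that loaded the data (all names are placeholders).
pipeline = dlt.pipeline(
    pipeline_name="github_events",
    destination="duckdb",
    dataset_name="github_data",
)

# The dataset object gives read access to the tables the pipeline loaded.
dataset = pipeline.dataset()

# Materialize one table as a pandas DataFrame (Arrow is available via .arrow()).
issues_df = dataset.issues.df()
print(issues_df.head())
```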
Package dependencies
For code resources, packages can be combined like any other Python dependencies.
Combining manifests, data catalogs, and profiles is [🚧 WIP!].
Pipeline runner
We provide a simple runner for declared pipelines.
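What the runner automates corresponds roughly to the plain open-source dlt calls below; the source and pipeline names are placeholders, not entities from a real dlt_project.yml.

```python
import dlt

@dlt.resource(table_name="events")
def events():
    # Placeholder data; a declared pipeline would pull this from the package sources.
    yield [{"id": 1, "kind": "created"}, {"id": 2, "kind": "closed"}]

# Running a declared pipeline boils down to the usual extract, normalize, and load.
pipeline = dlt.pipeline(
    pipeline_name="events_pipeline",
    destination="duckdb",
    dataset_name="events_data",
)
load_info = pipeline.run(events())
print(load_info)
```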
[🚧 WIP!] A more advanced, production-grade runner is on the dlt+ roadmap. It will expose running options similar to the existing Airflow helper (i.e., various forms of parallelism and backfilling).
Config and secrets
The dlt package manifest file is also a configuration file. Its content is transformed, cleaned up, and used to resolve dlt configurations of sources, destinations, pipelines, etc. Existing providers are also supported (a usage sketch follows the list):
- environ provider.
- .dlt/config.toml provider, including the global config.
- .dlt/<profile_name>.secrets.toml: a secrets toml provider, but scoped to a particular profile. The secrets.toml file is ignored; a per-profile version is sought instead, i.e., dev.secrets.toml.
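For example, a credential can come from any of these providers and is read with the standard open-source dlt accessors; the key names below are placeholders.

```python
import os
import dlt

# Environ provider: config keys map to upper-case environment variables,
# with double underscores separating the sections.
os.environ["SOURCES__MY_SOURCE__API_KEY"] = "example-key"  # placeholder value

# The same key could instead live in .dlt/config.toml or, per this page,
# in a profile-scoped secrets file such as .dlt/dev.secrets.toml.
api_key = dlt.secrets["sources.my_source.api_key"]
print(api_key)
```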
Dataset with data and schema contracts
dlt packages require (by default) that you declare the datasets you intend to use in the package and specify the destinations where they may be materialized. Datasets in the package create a data catalog that can be used to discover schemas, read, and write data.
Through datasets, dlt+ fully leverages the schema inference of dlt. Cataloging is automated.
Datasets are a fundamental unit of governance in the package:
- You can enable and disable them per profile.
- You can set schema contracts, also per profile (see the sketch after this list).
- [🚧 WIP!] You can set data contracts (i.e., read-only tables).
- [🚧 WIP!] You can apply a different set of contracts per particular user.
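Schema contracts already exist in open-source dlt; the sketch below shows what such a contract looks like there (resource and table names are placeholders), as a baseline for the per-profile and per-user variants described above.

```python
import dlt

@dlt.resource(
    table_name="orders",
    # Allow new tables to appear, but reject new columns and data type changes.
    schema_contract={"tables": "evolve", "columns": "freeze", "data_type": "freeze"},
)
def orders():
    yield [{"order_id": 1, "amount": 100.0}]

pipeline = dlt.pipeline(
    pipeline_name="contracts_demo",
    destination="duckdb",
    dataset_name="shop",
)
pipeline.run(orders())
```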
Local transformation cache
If you are familiar with dlt pipelines, the concept of local transformations is easy to grasp. Pipelines simplify and automate the loading of data. Local transformations simplify and automate the transformation of data, primarily locally. In a nutshell:
- You pass a set of input dlt datasets to the transformation cache.
- The cache discovers the inputs (source schemas) for your transformations.
- The cache exposes your data locally using duckdb (we support VIEWs for data lakes and a full table copy for other destinations).
- You can use the cache and duckdb as a query engine to run your transformations (currently, we support dbt and anything Python: pandas, arrow, polars, etc.); see the sketch after this list.
- The cache infers the output schema (if not declared) and syncs the results of the transformations to the output dataset.
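The cache API itself is still taking shape, so the sketch below recreates the flow with open-source pieces only: it pulls a table from a dlt dataset into Arrow, runs a transformation through duckdb as the local query engine, and loads the result into an output dataset. All pipeline, dataset, and table names are placeholders.

```python
import dlt
import duckdb

# Input dataset: attach to the pipeline that loaded it.
input_pipeline = dlt.pipeline(
    pipeline_name="events_pipeline",
    destination="duckdb",
    dataset_name="events_data",
)
events = input_pipeline.dataset().events.arrow()  # raw data as an Arrow table

# Transform locally with duckdb as the query engine.
con = duckdb.connect()
con.register("events", events)
counts_by_kind = con.execute(
    "SELECT kind, count(*) AS event_count FROM events GROUP BY kind"
).arrow()

# Output dataset: sync the transformed table with a second pipeline run.
output_pipeline = dlt.pipeline(
    pipeline_name="reports_pipeline",
    destination="duckdb",
    dataset_name="reports",
)
output_pipeline.run(counts_by_kind, table_name="event_counts")
```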
Why you should use it:
- Automatic source schema discovery.
- Save costs by transforming locally.
- No egress cost when close to the data.
- The same engine (i.e., the same SQL dialect) no matter what the final destination is.
- Python transformations (Hamilton/Kedro).
- dbt, sqlmesh, sdf supported.
- Metadata propagation from input to output dataset, automatic cataloging.
Currently, many of the things below are WIP:
- A local (ad hoc) data catalog and a data cache for larger, distributed data (see your data lake and report tables in one place).
- A local query engine (duckdb): a universal schema and SQL dialect for transformations.
- Arrow/polars transformations (via Python modules).
- Incremental transformations [partial 🚧 WIP! - _dlt_load_id currently supported]; see the sketch after this list.
- Syncing the cache back to output datasets.
- Declarative cache behavior [🚧 WIP!].
- Convenient Python interface [🚧 WIP!].
- Many input and output datasets [🚧 WIP!].
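As an illustration of the incremental pattern mentioned above, the sketch below keeps only rows whose _dlt_load_id has not been processed yet; the bookkeeping of processed load ids is a hypothetical placeholder, not a dlt+ API.

```python
import dlt
import duckdb

# Attach to the dataset loaded by a pipeline (placeholder names).
pipeline = dlt.pipeline(
    pipeline_name="events_pipeline",
    destination="duckdb",
    dataset_name="events_data",
)
events = pipeline.dataset().events.arrow()

# Hypothetical bookkeeping: load ids already handled by previous transformation runs.
processed_load_ids = ["1737041000.123456"]

con = duckdb.connect()
con.register("events", events)
placeholders = ", ".join(["?"] * len(processed_load_ids))
new_rows = con.execute(
    f"SELECT * FROM events WHERE _dlt_load_id NOT IN ({placeholders})",
    processed_load_ids,
).arrow()
print(f"{new_rows.num_rows} new rows to transform")
```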