Usage
dlt+ is under active development and is not yet stable. For this reason, the documentation is split into a conceptual overview and a tutorial example repository with a package that showcases all current features.
New building blocks
Built on top of open-source dlt, dlt+ introduces two new building blocks:
- dlt package with an opinionated structure: a yaml/Python manifest file (dlt_project.yml) to declare dlt sources, destinations, pipelines, transformations, etc. It also introduces profiles to easily switch across dev, staging, prod, and personal configurations.
- local transformation cache which automates data transformations the same way dlt pipelines automate extract and load.
dlt+ also extends datasets:
- dlt datasets with contracts: existing dlt datasets become an independent entity that provides convenient data access and transformations, and accepts granular data and schema contracts that may be applied per individual data user.
dlt package
Currently, the package layout is fully compatible with a standard Python package approach and may be distributed via PyPI or from a git repository.
- It contains pyproject.toml, which is the Python manifest of the package. It specifies the dependencies, source files, and the package build system.
- It contains dlt_project.yml, which is a manifest of data platform entities: sources, destinations, datasets, pipelines, transformations, etc.
- It contains Python modules with source code and tests. We propose a strict layout of the modules (i.e., source code lives in the ./sources folder, etc.); see the layout sketch below.
The package and project manager uv is used for packaging.
[🚧 WIP!] dlt init can create a new package with everything you need to develop, test, run, and deploy.
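A minimal sketch of that layout, assembled only from the files and folders mentioned on this page (module names and nesting may differ in practice):

```
my_dlt_package/
├── pyproject.toml          # Python package manifest: dependencies, build system
├── dlt_project.yml         # manifest of sources, destinations, datasets, pipelines, ...
├── .dlt/
│   ├── config.toml         # local configuration
│   └── dev.secrets.toml    # secrets scoped to the dev profile
├── sources/                # source code of dlt sources
└── tests/                  # package tests
```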
dlt_project.yml and profiles
The manifest file declares data platform entities.
The manifest supports profiles that override the top-level platform entities. The default profile name is dev. If you want to use the dlt test helpers, you must define a tests profile as well.
On top of that:
- You can include other files, both Python and yaml.
- You can refer to secret values via $VAR.
Python interface to use and share the data
The dlt package exposes a standard Python API. When used, it automatically switches to the access profile, which provides a connection to production data. Access is limited by the data and schema contracts defined on top of dlt datasets.
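The package-level Python API is still evolving, so the sketch below falls back to the open-source dlt dataset interface as a rough analogue of this kind of access; the pipeline, dataset, and table names are placeholders.

```python
import dlt

# Attach to the pipeline that loaded the data (all names are placeholders).
pipeline = dlt.pipeline(
    pipeline_name="github_events",
    destination="duckdb",
    dataset_name="github_data",
)

# The dataset object gives read access to the tables the pipeline loaded.
dataset = pipeline.dataset()

# Materialize one table as a pandas DataFrame (Arrow is available via .arrow()).
issues_df = dataset.issues.df()
print(issues_df.head())
```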
Package dependencies
For code resources, packages can be combined like any other Python dependencies.
Combining manifests, data catalogs, and profiles is [🚧 WIP!].
Pipeline runner
We provide a simple runner for declared pipelines.
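What the runner automates corresponds roughly to the plain open-source dlt calls below; the source and pipeline names are placeholders, not entities from a real dlt_project.yml.

```python
import dlt

@dlt.resource(table_name="events")
def events():
    # Placeholder data; a declared pipeline would pull this from the package sources.
    yield [{"id": 1, "kind": "created"}, {"id": 2, "kind": "closed"}]

# Running a declared pipeline boils down to the usual extract, normalize, and load.
pipeline = dlt.pipeline(
    pipeline_name="events_pipeline",
    destination="duckdb",
    dataset_name="events_data",
)
load_info = pipeline.run(events())
print(load_info)
```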
[🚧 WIP!] A more advanced, production-grade runner is on the dlt+ roadmap. It will expose running options similar to the existing Airflow helper (i.e., various forms of parallelism and backfilling).
Config and secrets
The dlt package manifest file is also a configuration file. Its content is transformed, cleaned up, and used to resolve dlt configurations of sources, destinations, pipelines, etc. Existing providers are also supported (a usage sketch follows the list):
- environ provider.
- .dlt/config.toml provider, including the global config.
- .dlt/<profile_name>.secrets.toml: a secrets toml provider, but scoped to a particular profile. The secrets.toml file is ignored; a per-profile version is sought instead, i.e., dev.secrets.toml.
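For example, a credential can come from any of these providers and is read with the standard open-source dlt accessors; the key names below are placeholders.

```python
import os
import dlt

# Environ provider: config keys map to upper-case environment variables,
# with double underscores separating the sections.
os.environ["SOURCES__MY_SOURCE__API_KEY"] = "example-key"  # placeholder value

# The same key could instead live in .dlt/config.toml or, per this page,
# in a profile-scoped secrets file such as .dlt/dev.secrets.toml.
api_key = dlt.secrets["sources.my_source.api_key"]
print(api_key)
```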
Dataset with data and schema contracts
dlt packages require (by default) that you declare the datasets you intend to use in the package and specify the destinations where they may be materialized. Datasets in the package create a data catalog that can be used to discover schemas, read, and write data.
Through datasets, dlt+ fully leverages the schema inference of dlt. Cataloging is automated.
Datasets are a fundamental unit of governance in the package:
- You can enable and disable them per profile.
- You can set schema contracts, also per profile (see the sketch after this list).
- [🚧 WIP!] You can set data contracts (i.e., read-only tables).
- [🚧 WIP!] You can apply a different set of contracts per particular user.
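Schema contracts already exist in open-source dlt; the sketch below shows what such a contract looks like there (resource and table names are placeholders), as a baseline for the per-profile and per-user variants described above.

```python
import dlt

@dlt.resource(
    table_name="orders",
    # Allow new tables to appear, but reject new columns and data type changes.
    schema_contract={"tables": "evolve", "columns": "freeze", "data_type": "freeze"},
)
def orders():
    yield [{"order_id": 1, "amount": 100.0}]

pipeline = dlt.pipeline(
    pipeline_name="contracts_demo",
    destination="duckdb",
    dataset_name="shop",
)
pipeline.run(orders())
```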
Local transformation cache
If you are familiar with dlt pipelines, the concept of local transformations is easy to grasp. Pipelines simplify and automate the loading of data. Local transformations simplify and automate the transformation of data, primarily locally. In a nutshell:
- You pass a set of input dlt datasets to the transformation cache.
- The cache discovers the inputs (source schemas) for your transformations.
- The cache exposes your data locally using duckdb (we support VIEWs for data lakes and a full table copy for other destinations).
- You can use the cache and duckdb as a query engine to run your transformations (currently, we support dbt and anything Python: pandas, arrow, polars, etc.); see the sketch after this list.
- The cache infers the output schema (if not declared) and syncs the results of the transformations to the output dataset.
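The cache API itself is still taking shape, so the sketch below recreates the flow with open-source pieces only: it pulls a table from a dlt dataset into Arrow, runs a transformation through duckdb as the local query engine, and loads the result into an output dataset. All pipeline, dataset, and table names are placeholders.

```python
import dlt
import duckdb

# Input dataset: attach to the pipeline that loaded it.
input_pipeline = dlt.pipeline(
    pipeline_name="events_pipeline",
    destination="duckdb",
    dataset_name="events_data",
)
events = input_pipeline.dataset().events.arrow()  # raw data as an Arrow table

# Transform locally with duckdb as the query engine.
con = duckdb.connect()
con.register("events", events)
counts_by_kind = con.execute(
    "SELECT kind, count(*) AS event_count FROM events GROUP BY kind"
).arrow()

# Output dataset: sync the transformed table with a second pipeline run.
output_pipeline = dlt.pipeline(
    pipeline_name="reports_pipeline",
    destination="duckdb",
    dataset_name="reports",
)
output_pipeline.run(counts_by_kind, table_name="event_counts")
```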
Why you should use it:
- Automatic source schema discovery.
- Save costs by transforming locally.
- No egress cost when close to the data.
- The same engine (i.e., the same SQL dialect) no matter what the final destination is.
- Python transformations (Hamilton/Kedro).
- dbt, sqlmesh, sdf supported.
- Metadata propagation from input to output dataset, automatic cataloging.
Currently, many of the things below are WIP:
- A local (ad hoc) data catalog and a data cache for larger, distributed data (see your data lake and report tables in one place).
- A local query engine (duckdb): a universal schema and SQL dialect for transformations.
- Arrow/polars transformations (via Python modules).
- Incremental transformations [partial 🚧 WIP! - _dlt_load_id currently supported]; see the sketch after this list.
- Syncing the cache back to output datasets.
- Declarative cache behavior [🚧 WIP!].
- Convenient Python interface [🚧 WIP!].
- Many input and output datasets [🚧 WIP!].
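As an illustration of the incremental pattern mentioned above, the sketch below keeps only rows whose _dlt_load_id has not been processed yet; the bookkeeping of processed load ids is a hypothetical placeholder, not a dlt+ API.

```python
import dlt
import duckdb

# Attach to the dataset loaded by a pipeline (placeholder names).
pipeline = dlt.pipeline(
    pipeline_name="events_pipeline",
    destination="duckdb",
    dataset_name="events_data",
)
events = pipeline.dataset().events.arrow()

# Hypothetical bookkeeping: load ids already handled by previous transformation runs.
processed_load_ids = ["1737041000.123456"]

con = duckdb.connect()
con.register("events", events)
placeholders = ", ".join(["?"] * len(processed_load_ids))
new_rows = con.execute(
    f"SELECT * FROM events WHERE _dlt_load_id NOT IN ({placeholders})",
    processed_load_ids,
).arrow()
print(f"{new_rows.num_rows} new rows to transform")
```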