Usage

info

dlt+ is under active development and is not yet stable. For this reason, the documentation is split into a conceptual overview and a tutorial example repository with a package that showcases all current features.

New building blocks

Built on top of open-source dlt, dlt+ introduces two new building blocks:

  1. dlt package with an opinionated structure: a yaml/Python manifest file (dlt_project.yml) to declare dlt sources, destinations, pipelines, transformations, etc. It also introduces profiles to easily switch across dev, staging, prod, and personal configurations.
  2. local transformation cache which automates data transformations the same way dlt pipelines automate extract and load.

dlt+ also extends datasets:

  1. dlt datasets with contracts: existing dlt datasets become independent entities that allow convenient data access and transformations, and accept granular data and schema contracts that may be applied per data user.

dlt package

Currently, the package layout is fully compatible with a standard Python package approach and may be distributed via PyPI or from a git repository.

  1. It contains pyproject.toml that is a Python manifest of the package. It specifies the dependencies, source files, and package build system.
  2. It contains dlt_project.yml that is a manifest of data platform entities: sources, destinations, datasets, pipelines, transformations, etc.
  3. It contains Python modules with source code and tests. We propose a strict layout of the modules (i.e., source code is in the ./sources folder, etc.)

Packaging is handled by uv, a Python package and project manager.

[🚧 WIP!] dlt init can create a new package with everything you need to develop, test, run, and deploy.

dlt_project.yml and profiles

The manifest file declares data platform entities.

The manifest supports profiles that override the top-level platform entities. The default profile name is dev. If you want to use dlt test helpers, you must define a tests profile as well.

On top of that:

  • You can include other files, both Python and yaml.
  • You can refer to secret values via $VAR.
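The override and $VAR behavior described above can be illustrated with a minimal, stdlib-only sketch. It is not the dlt+ implementation: the manifest is written as a Python dict instead of YAML, and the key names (destinations, warehouse, credentials), the WAREHOUSE_DSN variable, and the helper functions are all hypothetical. It only shows the general idea of merging a profile over the top-level entities and then substituting environment variables.

```python
import copy
import os
from string import Template


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of `base` with `override` merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def substitute_env(node, env):
    """Replace $VAR placeholders in every string of a nested structure."""
    if isinstance(node, dict):
        return {key: substitute_env(value, env) for key, value in node.items()}
    if isinstance(node, list):
        return [substitute_env(value, env) for value in node]
    if isinstance(node, str):
        # safe_substitute leaves unknown placeholders untouched instead of raising.
        return Template(node).safe_substitute(env)
    return node


def resolve_profile(manifest: dict, profile: str = "dev", env=None) -> dict:
    """Apply one profile on top of the top-level entities, then resolve $VAR."""
    env = os.environ if env is None else env
    top_level = {k: v for k, v in manifest.items() if k != "profiles"}
    overrides = manifest.get("profiles", {}).get(profile, {})
    return substitute_env(deep_merge(top_level, overrides), env)


# Hypothetical manifest contents (assumption: not a real dlt_project.yml schema).
MANIFEST = {
    "destinations": {
        "warehouse": {"type": "duckdb", "credentials": "local_dev.duckdb"},
    },
    "profiles": {
        "dev": {},
        "prod": {
            "destinations": {
                "warehouse": {"type": "postgres", "credentials": "$WAREHOUSE_DSN"},
            },
        },
    },
}

dev_config = resolve_profile(MANIFEST)
prod_config = resolve_profile(
    MANIFEST, "prod", env={"WAREHOUSE_DSN": "postgresql://example"}
)
```

Passing env explicitly keeps the example deterministic; leaving it out falls back to the process environment.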

Python interface to use and share the data

The dlt package exposes a standard Python API. When used, it automatically switches to the access profile, which provides a connection to production data. Access is limited by the data and schema contracts defined on top of dlt datasets.

Package dependencies

For code resources, packages can be combined like any other Python dependencies.

Combining manifests, data catalogs, and profiles is [🚧 WIP!].

Pipeline runner

We provide a simple runner for declared pipelines.

[🚧 WIP!] A more advanced, production-grade runner is on the dlt+ roadmap. It will expose running options similar to the existing Airflow helper (e.g., various forms of parallelism and backfilling).

Config and secrets

The dlt package manifest file is also a configuration file. Its content is transformed, cleaned up, and used to resolve dlt configurations of sources, destinations, pipelines, etc. Existing providers are also supported:

  1. environ provider
  2. .dlt/config.toml provider, including the global config
  3. .dlt/<profile_name>.secrets.toml: a secrets TOML provider scoped to a particular profile. The plain secrets.toml file is ignored; a per-profile file is looked up instead (e.g., dev.secrets.toml).
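As a rough illustration of the per-profile secrets lookup, the sketch below resolves the secrets file for a given profile and deliberately never falls back to secrets.toml. The function name find_profile_secrets and the temporary project directory are hypothetical; this is not how dlt+ implements its provider, only a picture of the file naming rule.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_profile_secrets(project_dir, profile: str = "dev") -> Optional[Path]:
    """Return .dlt/<profile>.secrets.toml if it exists, otherwise None.

    A plain .dlt/secrets.toml is never returned, mirroring the rule that
    it is ignored in favor of the per-profile file.
    """
    candidate = Path(project_dir) / ".dlt" / f"{profile}.secrets.toml"
    return candidate if candidate.is_file() else None


# Build a throwaway project layout to demonstrate the lookup.
with tempfile.TemporaryDirectory() as project_dir:
    dlt_dir = Path(project_dir) / ".dlt"
    dlt_dir.mkdir()
    (dlt_dir / "secrets.toml").write_text("# ignored\n")
    (dlt_dir / "dev.secrets.toml").write_text("# used for the dev profile\n")

    dev_secrets = find_profile_secrets(project_dir, "dev")
    prod_secrets = find_profile_secrets(project_dir, "prod")
    dev_secrets_name = dev_secrets.name if dev_secrets else None
```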

Dataset with data and schema contracts

dlt packages require (by default) that you declare the datasets you intend to use in the package and specify the destinations where they may be materialized. Datasets in the package create a data catalog that can be used to discover schemas, read, and write data.

Through datasets, dlt+ fully leverages schema inference of dlt. Cataloging is automated.

Datasets are a fundamental unit of governance in the package:

  • You can enable and disable them per profile.
  • You can set schema contracts also per profile.
  • [🚧 WIP!] You can set data contracts (i.e., read-only tables).
  • [🚧 WIP!] You can apply a different set of contracts per particular user.
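To make the idea of a per-profile schema contract more concrete, here is a small stdlib-only sketch. The mode names (evolve, freeze, discard_row, discard_value) follow the schema contract modes of open-source dlt, but the function, the exception class, and the PROFILE_CONTRACTS mapping are hypothetical and do not reflect how dlt or dlt+ enforces contracts internally.

```python
class ContractViolation(Exception):
    """Raised when a frozen schema receives an unknown column."""


MODES = {"evolve", "freeze", "discard_row", "discard_value"}


def apply_column_contract(rows, known_columns, mode):
    """Apply a column-level contract mode to a list of dict rows."""
    if mode not in MODES:
        raise ValueError(f"unknown contract mode: {mode}")
    known = set(known_columns)
    accepted = []
    for row in rows:
        unknown = set(row) - known
        if not unknown or mode == "evolve":
            # No new columns, or new columns are allowed to pass through.
            accepted.append(dict(row))
        elif mode == "freeze":
            raise ContractViolation(f"unexpected columns: {sorted(unknown)}")
        elif mode == "discard_value":
            # Keep the row but drop the values of unknown columns.
            accepted.append({k: v for k, v in row.items() if k in known})
        # mode == "discard_row": the whole row is dropped.
    return accepted


# Hypothetical per-profile setting: permissive in dev, strict in prod.
PROFILE_CONTRACTS = {"dev": "evolve", "prod": "freeze"}

rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b", "extra": True}]
known = ["id", "name"]

dev_rows = apply_column_contract(rows, known, PROFILE_CONTRACTS["dev"])
trimmed_rows = apply_column_contract(rows, known, "discard_value")
kept_rows = apply_column_contract(rows, known, "discard_row")

try:
    apply_column_contract(rows, known, PROFILE_CONTRACTS["prod"])
    prod_rejected = False
except ContractViolation:
    prod_rejected = True
```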

Local transformation cache

If you are familiar with dlt pipelines, the concept of local transformations is easy to grasp. Pipelines simplify and automate the loading of data. Local transformations simplify and automate the transformation of data, primarily locally. In a nutshell:

  1. You pass a set of input dlt datasets to the transformation cache.
  2. The cache discovers the inputs (source schemas) for your transformations.
  3. The cache exposes your data locally using duckdb (we support VIEWs for data lakes and full table copy for other destinations).
  4. You can use the cache and duckdb as a query engine to run your transformations (currently, we support dbt and anything Python: pandas, arrow, polars, etc.).
  5. The cache infers the output schema (if not declared) and syncs the results of the transformations to the output dataset.
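The five steps above can be sketched end to end with a toy example. To keep it runnable with the standard library only, sqlite3 stands in for duckdb as the local query engine, and plain lists of dicts stand in for dlt datasets. The function run_local_transformation and the orders table are hypothetical; this shows the shape of the workflow, not the dlt+ cache API.

```python
import sqlite3


def run_local_transformation(input_tables, transformation_sql, output_table):
    """Copy inputs into a local engine, transform with SQL, return the results.

    Assumes every input table has at least one row and that all rows of a
    table share the same keys.
    """
    con = sqlite3.connect(":memory:")
    try:
        # Steps 1 and 3: pass in the input tables and expose them locally
        # (a full table copy; a real cache could use views instead).
        for name, rows in input_tables.items():
            columns = list(rows[0])
            column_list = ", ".join(f'"{c}"' for c in columns)
            placeholders = ", ".join("?" for _ in columns)
            con.execute(f'CREATE TABLE "{name}" ({column_list})')
            con.executemany(
                f'INSERT INTO "{name}" VALUES ({placeholders})',
                [tuple(row[c] for c in columns) for row in rows],
            )

        # Step 2: discover the input schemas from the local engine.
        input_schemas = {
            name: [info[1] for info in con.execute(f'PRAGMA table_info("{name}")')]
            for name in input_tables
        }

        # Step 4: run the transformation using the local engine's SQL dialect.
        con.execute(f'CREATE TABLE "{output_table}" AS {transformation_sql}')

        # Step 5: infer the output schema and collect rows to sync downstream.
        cursor = con.execute(f'SELECT * FROM "{output_table}"')
        output_columns = [description[0] for description in cursor.description]
        output_rows = [dict(zip(output_columns, row)) for row in cursor.fetchall()]
        return input_schemas, output_columns, output_rows
    finally:
        con.close()


orders = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 7},
]

input_schemas, output_columns, output_rows = run_local_transformation(
    {"orders": orders},
    "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer",
    "customer_totals",
)
```

In this sketch the "sync" step simply returns the rows; in practice they would be written to the output dataset.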

Why you should use it:

  • Automatic source schema discovery.
  • Save costs by transforming locally.
  • No egress cost when close to the data.
  • The same engine (i.e., SQL dialect) no matter what the final destination.
  • Python transformations (Hamilton/Kedro).
  • dbt, sqlmesh, sdf supported.
  • Metadata propagation from input to output dataset, automatic cataloging.

Many of the items below are still WIP:

  • A local (ad hoc) data catalog and a data cache for larger, distributed data (see your data lake and report tables in one place).
  • A local query engine (duckdb): universal schema and SQL dialect for transformations.
  • Arrow/polars transformations (via Python modules).
  • Incremental transformations [partial 🚧 WIP! - _dlt_load_id currently supported].
  • Syncing the cache back to output datasets.
  • Declarative cache behavior [🚧 WIP!].
  • Convenient Python interface [🚧 WIP!].
  • Many input and output datasets [🚧 WIP!].

This demo works on GitHub Codespaces, a development environment available for free to anyone with a GitHub account. You'll be asked to fork the demo repository, and from there the README guides you through the next steps.
The demo uses the Continue VSCode extension.

Off to codespaces!
