Photo by Wolfgang Weiser on Unsplash
Data Load Tool (dlt) Series: Getting Started with Pipelines Part 1
Introduction
Data Load Tool (dlt) is an open-source Python library designed to simplify extracting, loading, and transforming (ELT) data from various, often messy data sources into well-structured, live datasets. dlt loads data from a wide range of sources, including REST APIs, SQL databases, cloud storage, and Python data structures, into a destination of your choice. It supports a variety of popular destinations and lets you add custom destinations to create reverse ETL pipelines. It can be deployed anywhere Python runs, be it on Airflow, serverless functions, or any other cloud deployment of your choice. It offers a lightweight interface that infers schemas and data types, normalizes the data, and handles nested data structures, making it easy to use, flexible, and scalable.
In this article, I will describe:
How to run a simple pipeline with toy data.
How to explore the loaded data using:
DuckDB connection
dlt’s sql_client
dlt datasets
To get started, I recommend working in a virtual environment, as with any Python project. This way, the dependencies for your current project are isolated from the packages of other projects.
Pipeline with toy dataset
A pipeline is a connection that moves data from your Python code to a destination. The pipeline accepts dlt sources or resources, as well as generators, async generators, lists, and any other iterables. Once the pipeline runs, all resources are evaluated and the data is loaded into the destination.
Let’s start with a simple pipeline using a small dataset: Pokémon data represented as a list of Python dictionaries.
# Sample data containing pokemon details
data = [
    {"id": "1", "name": "bulbasaur", "size": {"weight": 6.9, "height": 0.7}},
    {"id": "4", "name": "charmander", "size": {"weight": 8.5, "height": 0.6}},
    {"id": "25", "name": "pikachu", "size": {"weight": 6, "height": 0.4}},
]
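As noted above, a pipeline also accepts generators, not just lists. The following is a minimal sketch of the same records supplied lazily; `pokemon_rows` is a hypothetical name chosen for illustration and is not part of dlt.

```python
from typing import Any, Iterator


def pokemon_rows() -> Iterator[dict[str, Any]]:
    """Yield one Pokémon record at a time instead of building the whole list first."""
    yield {"id": "1", "name": "bulbasaur", "size": {"weight": 6.9, "height": 0.7}}
    yield {"id": "4", "name": "charmander", "size": {"weight": 8.5, "height": 0.6}}
    yield {"id": "25", "name": "pikachu", "size": {"weight": 6, "height": 0.4}}


# The generator is consumed here only to show what it yields.
rows = list(pokemon_rows())
print(len(rows))
```

A generator like this could take the place of the `data` list when the pipeline is run, which matters more once the records come from an API or a large file rather than a hard-coded list.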
Import dlt and create a simple pipeline:
import dlt
pipeline = dlt.pipeline(
    pipeline_name="resource",
    destination="duckdb",
    dataset_name="pokeman_data",
    dev_mode=True,
)
pipeline_name: This is the name you give to your pipeline. It helps you track and monitor your pipeline, and it is used to restore the pipeline's state and data schemas on future runs. If you don’t give a name, dlt derives one from the name of the Python script you’re running (prefixed with dlt_).
destination: The name of the destination to which dlt will load the data. It may also be provided to the run method of the pipeline. The destination here is DuckDB.
dataset_name: This is the name of the group of tables (or dataset) where your data will be sent. You can think of a dataset like a folder that holds many files, or a schema in a relational database. You can also specify this later when you run or load the pipeline. If you don’t provide a name, it will default to the name of your pipeline.
dev_mode: If you set this to True, dlt appends a datetime suffix to your dataset name every time you create a pipeline instance. This means each new pipeline instance starts from scratch and loads into a fresh dataset, which is convenient while experimenting.
Run your pipeline and print the load info
# Run the pipeline with data and table name
load_info = pipeline.run(data, table_name="pokemon")
print(load_info)
The first run of a pipeline will scan the data that goes through it and generate a schema. To convert nested data into a relational format, dlt flattens dictionaries and unpacks nested lists into sub-tables.
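To make the flattening step concrete, here is a simplified illustration of how nested dictionaries can be turned into flat columns. It is a sketch, not dlt's actual normalizer: the `flatten` helper is hypothetical, it only mirrors dlt's convention of joining nested keys with a double underscore, and it does not cover nested lists.

```python
from typing import Any


def flatten(record: dict[str, Any], parent_key: str = "") -> dict[str, Any]:
    """Flatten nested dictionaries into one level, joining keys with '__'."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        column = f"{parent_key}__{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the nested dictionary and merge its columns
            flat.update(flatten(value, column))
        else:
            flat[column] = value
    return flat


data = [
    {"id": "1", "name": "bulbasaur", "size": {"weight": 6.9, "height": 0.7}},
    {"id": "4", "name": "charmander", "size": {"weight": 8.5, "height": 0.6}},
    {"id": "25", "name": "pikachu", "size": {"weight": 6, "height": 0.4}},
]

rows = [flatten(record) for record in data]
print(rows[0])
```

With the sample data, the nested size dictionary ends up as two columns, size__weight and size__height, next to id and name.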
For this example, dlt created a dataset (a schema in DuckDB) named after the dataset_name passed to the pipeline, with a datetime suffix because dev_mode is enabled, and stored the ‘pokemon’ table in it.
DuckDB connection
Explore the loaded data
import duckdb
# A database '<pipeline_name>.duckdb' was created in working directory so just connect to it
# Connect to the DuckDB database
conn = duckdb.connect(f"{pipeline.pipeline_name}.duckdb")
# Set search path to the dataset
conn.sql(f"SET search_path = '{pipeline.dataset_name}'")
# Describe the dataset
conn.sql("DESCRIBE").df()
You can see:
- the pokemon table,
- and 3 special dlt tables:
  - _dlt_loads,
  - _dlt_pipeline_state,
  - _dlt_version.
View the table in DuckDB
# Fetch all data from 'pokemon' as a DataFrame
table = conn.sql("SELECT * FROM pokemon").df()
# Display the DataFrame
table
dlt’s sql_client
The dlt SQL client is a component of dlt that lets you interact with the SQL database behind your destination. It provides an abstraction for running queries, executing SQL commands, and handling database connections as part of your dlt pipeline.
You open a connection to your database with pipeline.sql_client() and execute a query to get all data from the pokemon table:
# Query data from 'pokemon' using the SQL client
with pipeline.sql_client() as client:
    with client.execute_query("SELECT * FROM pokemon") as cursor:
        data = cursor.df()
# Display the data
data
dlt datasets
A dlt dataset represents the structured collection of tables that a pipeline has loaded. It gives you Python-native access to that data: each table is available as an attribute of the dataset and can be read directly, without writing SQL.
Here’s an example of how to retrieve data from a pipeline and load it into a pandas DataFrame:
dataset = pipeline.dataset(dataset_type="default")
dataset.pokemon.df()
We explored the fundamentals of the Data Load Tool (dlt) by creating a simple pipeline with Pokémon data, running it with a lightweight DuckDB destination, and exploring the loaded data through a direct DuckDB connection, dlt’s SQL client, and the dataset abstraction.
dlt takes much of the complexity out of data extraction, loading, and transformation by inferring schemas, normalizing nested data, and handling the destination for you.