Data Load Tool (dlt) Series: Creating Pipelines from Different Sources, Part 2

In the first article of this series,

  • I created a pipeline, loaded toy data into DuckDB, and viewed the load info.

  • I used the dlt.pipeline() and pipeline.run() methods.

  • I used DuckDB, the sql_client, and the dlt dataset to view tables and query data.

In this article, I will discuss

  • How to run a simple pipeline with different types of data sources, such as dataframes, databases, and REST APIs

  • How to use dlt.resource, dlt.source, and dlt.transformer

In the previous post, I simply used a list of dictionaries that essentially represents the pokemon table:

import dlt



# Sample data containing pokemon details

data = [

    {"id": "1", "name": "bulbasaur", "size": {"weight": 6.9, "height": 0.7}},

    {"id": "4", "name": "charmander", "size": {"weight": 8.5, "height": 0.6}},

    {"id": "25", "name": "pikachu", "size": {"weight": 6, "height": 0.4}},

]




# Set pipeline name, destination, and dataset name

pipeline = dlt.pipeline(

    pipeline_name="quick_start",

    destination="duckdb",

    dataset_name="mydata",

)



# Run the pipeline with data and table name

load_info = pipeline.run(data, table_name="pokemon")



print(load_info)
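As a quick check (reusing the sql_client approach from the first article), I can query the freshly loaded table. This is only a sketch: it assumes dlt flattened the nested size dict into its usual double-underscore columns, size__weight and size__height.

# Quick sanity check with the sql_client from part 1.
# dlt flattens the nested "size" dict into size__weight / size__height columns.
with pipeline.sql_client() as client:
    with client.execute_query(
        "SELECT name, size__weight, size__height FROM pokemon"
    ) as cursor:
        print(cursor.fetchall())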

A better way is to wrap the data in the @dlt.resource decorator:

import dlt


# Create a dlt resource from the data
@dlt.resource(table_name='pokemon_new')  # <--- we set a new table name
def my_dict_list():
    data = [
        {"id": "1", "name": "bulbasaur", "size": {"weight": 6.9, "height": 0.7}},
        {"id": "4", "name": "charmander", "size": {"weight": 8.5, "height": 0.6}},
        {"id": "25", "name": "pikachu", "size": {"weight": 6, "height": 0.4}},
    ]
    yield data


# Set pipeline name, destination, and dataset name
pipeline = dlt.pipeline(
    pipeline_name="quick_start",
    destination="duckdb",
    dataset_name="mydata",
)

The dlt.resource decorator denotes a logical grouping of data within a data source, typically holding data of similar structure and origin.

Run the pipeline with the my_dict_list resource:

# Run the pipeline and print load info

load_info = pipeline.run(my_dict_list)

print(load_info)

Check what was loaded to the pokemon_new table:

pipeline.dataset(dataset_type="default").pokemon_new.df()

Commonly used arguments of @dlt.resource (a short example follows the list):

  • name: The resource name and the name of the table generated by this resource. Defaults to the decorated function name.

  • table_name: The name of the table, if different from the resource name.

  • write_disposition: Controls how data is written to the table. Defaults to "append".
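To make these arguments concrete, here is a small, purely illustrative resource that sets all three; the names pokemon_raw and cleaned_pokemon are hypothetical and not part of the pipeline built in this article.

# Hypothetical resource combining the three arguments: the resource is named
# "pokemon_raw", it writes to the table "cleaned_pokemon", and every run
# replaces the table instead of appending to it.
@dlt.resource(
    name="pokemon_raw",
    table_name="cleaned_pokemon",
    write_disposition="replace",
)
def pokemon_raw():
    yield [
        {"id": "7", "name": "squirtle"},
        {"id": "39", "name": "jigglypuff"},
    ]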

Dataframes

To create a pipeline that loads a pandas dataframe:

import pandas as pd



# Define a resource to load data from a CSV

@dlt.resource(name='df_data')

def my_df():

  sample_df = pd.read_csv("https://people.sc.fsu.edu/~jburkardt/data/csv/hw_200.csv")

  yield sample_df




# Run the pipeline with the defined resource

load_info = pipeline.run(my_df)

print(load_info)



# Query the loaded data from 'df_data'

pipeline.dataset(dataset_type="default").df_data.df()

Running this loads the CSV rows into the df_data table and returns them as a pandas DataFrame.
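For larger files, the same idea works with chunked reads, so the whole CSV never has to sit in memory at once. The following is my own variation on the resource above, using pandas' chunksize argument:

# Stream the CSV in chunks of 50 rows; each chunk is a dataframe and dlt
# appends them all to the same table.
@dlt.resource(name='df_data_chunked')
def my_df_chunked():
    for chunk in pd.read_csv(
        "https://people.sc.fsu.edu/~jburkardt/data/csv/hw_200.csv",
        chunksize=50,
    ):
        yield chunk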

Database

To create a pipeline from an SQL database query, you would:

  1. Install the PyMySQL package: pip install pymysql

  2. Create and run a pipeline to fetch data from the SQL database, then query the loaded data:

import dlt

from sqlalchemy import create_engine



# Define a resource to fetch genome data from the database

@dlt.resource(table_name='genome_data')

def get_genome_data():

  engine = create_engine("mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam")

  with engine.connect() as conn:

      query = "SELECT * FROM genome"

      rows = conn.execution_options(yield_per=100).exec_driver_sql(query)

      yield from map(lambda row: dict(row._mapping), rows)



# Run the pipeline with the genome data resource

load_info = pipeline.run(get_genome_data)

print(load_info)



# Query the loaded data from 'genome_data'

pipeline.dataset(dataset_type="default").genome_data.df()
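Writing the SQLAlchemy query by hand works fine, but recent dlt versions also bundle a sql_database source that reflects tables for you. The snippet below is only a sketch, assuming your dlt installation includes that source; it reuses the PyMySQL driver installed above and the same public Rfam credentials.

# Sketch: the built-in sql_database source reflects the "genome" table and
# loads it without hand-written SQL (requires a recent dlt version).
from dlt.sources.sql_database import sql_database

rfam_source = sql_database(
    "mysql+pymysql://rfamro@mysql-rfam-public.ebi.ac.uk:4497/Rfam",
    table_names=["genome"],
)

load_info = pipeline.run(rfam_source)
print(load_info)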

REST API

Extracting data from an API is straightforward with dlt. You provide the base URL, define the resources you want to fetch, and dlt will handle the pagination, authentication, and data loading. For REST API endpoints, create a pipeline as follows:

from dlt.sources.helpers import requests




# Define a resource to fetch pokemons from PokeAPI

@dlt.resource(table_name='pokemon_api')

def get_pokemon():

    url = "https://pokeapi.co/api/v2/pokemon"

    response = requests.get(url)

    yield response.json()["results"]




# Run the pipeline using the defined resource

load_info = pipeline.run(get_pokemon)

print(load_info)



# Query the loaded data from 'pokemon_api' table

pipeline.dataset(dataset_type="default").pokemon_api.df()

from dlt.sources.helpers import requests: This imports the requests helper from dlt, a wrapper around the requests HTTP library that dlt provides for fetching data from APIs.

Defining the resource

  • @dlt.resource: This decorator marks the function as a resource in DLT. A resource is essentially a data-fetching function that can be used in the pipeline.

  • table_name='pokemon_api': Specifies that the fetched data will be stored in a table named pokemon_api in the destination database.

  • Inside the function:

    • It makes a GET request to the Pokémon API at https://pokeapi.co/api/v2/pokemon.

    • The yield keyword returns the list of Pokémon found in the results field of the JSON response. Using yield lets the resource behave as an iterator, which handles large data efficiently.

Running the pipeline

  • pipeline.run(get_pokemon): This command runs the pipeline by fetching data using the get_pokemon resource.

  • The pipeline processes the data (e.g., transforms it, validates it) and then loads it into the destination (e.g., a database, file storage, etc.).

  • load_info: This variable stores metadata about the load operation, such as the number of rows loaded, errors, or warnings.

Querying the data

  • pipeline.dataset(dataset_type="default"): Accesses the dataset where the data was loaded. The dataset_type="default" refers to the default dataset created by the pipeline.

  • .pokemon_api.df(): Refers to the pokemon_api table and converts it into a Pandas DataFrame (df()). This allows you to view or manipulate the Pokémon data that was fetched and loaded.
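The get_pokemon resource above fetches only the first page by hand. For the pagination and authentication handling mentioned earlier, dlt offers a declarative rest_api source; the sketch below assumes a dlt version that bundles it.

# Sketch: the declarative rest_api source builds one resource per endpoint
# and handles pagination for us.
from dlt.sources.rest_api import rest_api_source

pokemon_source = rest_api_source({
    "client": {"base_url": "https://pokeapi.co/api/v2/"},
    "resources": ["pokemon", "berry"],
})

load_info = pipeline.run(pokemon_source)
print(load_info)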

dlt sources

A source is a logical grouping of resources, e.g., endpoints of a single API. The most common approach is to define it in a separate Python module.

  • A source is a function decorated with @dlt.source that returns one or more resources.

  • A source can optionally define a schema with tables, columns, performance hints, and more.

  • The source Python module typically contains optional customizations and data transformations.

  • The source Python module typically contains the authentication and pagination code for a particular API.

You declare a source by decorating a function that returns or yields one or more resources with @dlt.source:

@dlt.source

def all_data():

  return my_df, get_genome_data, get_pokemon


# load everything into a separate database using a new pipeline

pipeline = dlt.pipeline(

    pipeline_name="resource_source_new",

    destination="duckdb",

    dataset_name="all_data"

)



# Run the pipeline

load_info = pipeline.run(all_data())



# Print load info

print(load_info)
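Because the source is just a logical grouping, I can also pick a subset of its resources before running. with_resources selects resources by their resource names, which here are the function names (plus the explicit name='df_data'):

# Load only two of the three resources grouped in all_data; resources are
# selected by resource name, not table name.
selected = all_data().with_resources("df_data", "get_pokemon")
load_info = pipeline.run(selected)
print(load_info)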

dlt transformers

dlt resources can be grouped into a dlt source, represented as:

          Source           
      /          \          
Resource 1  ...  Resource N

Imagine a scenario where you need an additional step in between:


          Source
         /      \
      step       \
     /            \
 Resource 1  ...  Resource N

This step could arise, for example, in a situation where:

  • Resource 1 returns a list of pokemon IDs, and you need to use each of those IDs to retrieve detailed information about each pokemon from a separate API endpoint.

In such cases, you would use dlt transformers — special dlt resources that can be fed data from another resource:

          Source
         /      \
  transformer    \
       |          \
 Resource 1  ...  Resource N

# 'data' is the pokemon list defined at the start of the article; 'requests'
# was imported from dlt.sources.helpers in the REST API example above
@dlt.resource(table_name='pokemon')
def my_dict_list():
    yield from data  # <--- This yields one item at a time


@dlt.transformer(data_from=my_dict_list, table_name='detailed_info')
def details(data_item):  # <--- The transformer receives one item at a time
    pokemon_id = data_item["id"]
    url = f"https://pokeapi.co/api/v2/pokemon/{pokemon_id}"
    response = requests.get(url)
    yield response.json()


# Run the pipeline; data flows from my_dict_list into details
load_info = pipeline.run(details())
print(load_info)
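As an aside, dlt also lets you attach a resource to a transformer with the pipe operator instead of data_from. Here is a sketch of that style, with the parent resource bound at run time; details_piped is just an illustrative name.

# Same transformer logic, but without data_from in the decorator; the parent
# resource is attached with the | operator when the pipeline runs.
@dlt.transformer(table_name='detailed_info')
def details_piped(data_item):
    pokemon_id = data_item["id"]
    response = requests.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon_id}")
    yield response.json()

load_info = pipeline.run(my_dict_list | details_piped)
print(load_info)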

Reduce the nesting level of generated tables

You can limit how deep dlt goes when generating nested tables and flattening dicts into columns. By default, the library will descend and generate nested tables for all nested lists, without limit.

You can set the nesting level for all resources at the source level:

@dlt.source(max_table_nesting=1)
def all_data():  
    return my_df, get_genome_data, get_pokemon

or for each resource separately:

@dlt.resource(table_name='pokemon_new', max_table_nesting=1)
def my_dict_list():    
    yield data

In the example above, we want only 1 level of nested tables to be generated (so there are no nested tables of a nested table). Typical settings:

  • max_table_nesting=0 will not generate nested tables and will not flatten dicts into columns at all. All nested data will be represented as JSON.

  • max_table_nesting=1 will generate nested tables of root tables and nothing more. All nested data in nested tables will be represented as JSON.
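One way to verify the effect is to inspect the schema dlt inferred after a run. The snippet below assumes the pipeline has run at least once and that your dlt version exposes to_pretty_yaml() on the schema:

# Print the inferred schema; with max_table_nesting=1, lists nested deeper
# than one level show up as JSON columns instead of extra child tables.
print(pipeline.default_schema.to_pretty_yaml())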

In this article, I demonstrated how to run a simple data pipeline in dlt using various data sources such as Python dictionaries, dataframes, SQL databases, and REST APIs. I also introduced key concepts like dlt.resource, dlt.source, and dlt.transformer, which make data extraction, transformation, and loading more modular and scalable. Finally, I showed how to reduce the nesting level of the tables dlt generates.