define Python transformations with Hamilton #1261

zilto · 2024-04-22T20:31:35Z

Feature description

With "Extract, Transform, Load" (ETL) as a frame of reference, dlt does "EL" and Hamilton does "T".

What is Hamilton

In short, Hamilton is a library to define a DAG of data transformations in Python. It is similar in scope to dbt, but it supports all Python types, not just tables/dataframes/SQL constructs. Users can write transformations with Python primitives, pandas, polars, Spark, Ibis, etc. Many users adopt Hamilton for feature engineering (jaffle shop example). It also allows users to define machine learning and LLM dataflows.

It uses a declarative API, which essentially consists of three steps:

define your DAG in a Python module
pass the DAG to the Driver responsible for execution
request nodes from the DAG to be executed (e.g., features, tables, models to train)
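
To make the three steps concrete, here is a toy sketch of the pattern using only the standard library. It is not Hamilton's implementation or API: `ToyDriver` and the node functions (`raw_values`, `total`, `mean`) are hypothetical names made up for illustration. The only idea it borrows is that a function's name is a node and its parameter names are that node's dependencies.

```python
import inspect
from typing import Any, Callable, Dict, List


# Step 1: define the DAG as plain functions.
# The function name is the node name; parameter names are its dependencies.
def raw_values(numbers: List[int]) -> List[int]:
    """Keep only non-negative inputs."""
    return [n for n in numbers if n >= 0]


def total(raw_values: List[int]) -> int:
    return sum(raw_values)


def mean(total: int, raw_values: List[int]) -> float:
    return total / len(raw_values)


class ToyDriver:
    """Hypothetical stand-in for a driver (illustration only, no cycle detection)."""

    def __init__(self, *functions: Callable[..., Any]) -> None:
        self.nodes: Dict[str, Callable[..., Any]] = {f.__name__: f for f in functions}

    def execute(self, final_vars: List[str], inputs: Dict[str, Any]) -> Dict[str, Any]:
        # Computed values are cached so each node runs at most once per call.
        cache: Dict[str, Any] = dict(inputs)

        def resolve(name: str) -> Any:
            if name not in cache:
                if name not in self.nodes:
                    raise KeyError(f"unknown node or missing input: {name}")
                fn = self.nodes[name]
                # Resolve upstream nodes by matching parameter names to node names.
                kwargs = {p: resolve(p) for p in inspect.signature(fn).parameters}
                cache[name] = fn(**kwargs)
            return cache[name]

        return {name: resolve(name) for name in final_vars}


# Step 2: pass the DAG definition to the driver.
dr = ToyDriver(raw_values, total, mean)

# Step 3: request only the nodes you need; upstream nodes run automatically.
results = dr.execute(["mean"], inputs={"numbers": [4, -1, 6, 2]})
print(results)  # {'mean': 4.0}
```

With Hamilton itself, steps 2 and 3 map onto `driver.Builder().with_modules(...).build()` and `dr.execute(...)`, as in the snippet further down.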

Integration ideas

dlt plugin for Hamilton

We already added a dlt plugin in Hamilton allowing users to load dlt.Resource as input and save outputs to dlt.Destination. This is useful for Hamilton users who want to start using dlt and run both as a unified pipeline. Also, some Hamilton DAG nodes might be "incompatible with dlt" (e.g., an XGBoost model).

Hamilton helper for dlt

It appears to make sense to have a "Hamilton helper" in dlt, similar to the dbt runner. It would help dlt users to package their Hamilton code and bundle it with their dlt pipeline to be executed. A typical pattern would look like this (full ref):

import dlt
from hamilton import driver
import slack  # NOTE this is dlt code, not an official Slack library
import transform  # module containing dataflow definition

# EXTRACT & LOAD
pipeline = dlt.pipeline(
    pipeline_name="slack",
    destination="duckdb",
    dataset_name="slack_community_backup",
)
source = slack.slack_source(
    selected_channels=["general"], replies=True
)
load_info = pipeline.run(source)

# TRANSFORM
dr = driver.Builder().with_modules(transform).build()
results = dr.execute(
    ["insert_threads"],  # request the `insert_threads` node
    inputs=dict(pipeline=pipeline),  # pass the dlt pipeline object
)

Action

get a sense of what dlt users are looking for, their needs regarding Python transforms and usage patterns
define an API and work towards a Hamilton helper in dlt
maybe we only need to publish docs and guides
