define Python transformations with Hamilton #1261

Open
zilto opened this issue Apr 22, 2024 · 0 comments

Feature description

With "Extract, Transform, Load" (ETL) as a frame of reference, dlt does "EL" and Hamilton does "T".

What is Hamilton

In short, Hamilton is a library to define a DAG of data transformations in Python. It is similar in scope to dbt, but it supports all Python types, not just tables, dataframes, or SQL constructs. Users can write transformations with Python primitives, pandas, polars, Spark, Ibis, etc. Many users adopt Hamilton for feature engineering (jaffle shop example). It also allows users to define machine learning and LLM dataflows.

It uses a declarative API, which essentially consists of three steps:

  1. define your DAG in a Python module
  2. pass the DAG to the Driver responsible for execution
  3. request nodes from the DAG to be executed (e.g., features, tables, models to train)
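
To make these steps concrete, here is a minimal standard-library sketch of the declarative idea. It does not use Hamilton and is not how Hamilton is implemented. `ToyDriver` and the node functions `cleaned_messages`, `message_count`, and `average_length` are hypothetical names chosen for illustration. The point is only that function names act as nodes and parameter names act as dependencies.

```python
import inspect

# Step 1: define the DAG. Each function is a node; each parameter names
# either an upstream node or an external input. (Hypothetical nodes.)
def cleaned_messages(raw_messages: list) -> list:
    """Strip whitespace and drop empty messages."""
    return [m.strip() for m in raw_messages if m.strip()]


def message_count(cleaned_messages: list) -> int:
    """Count the messages that survived cleaning."""
    return len(cleaned_messages)


def average_length(cleaned_messages: list, message_count: int) -> float:
    """Average number of characters per cleaned message."""
    return sum(len(m) for m in cleaned_messages) / message_count


# Step 2: a toy stand-in for the Driver (an assumption, not Hamilton's class).
class ToyDriver:
    def __init__(self, *functions):
        self.nodes = {fn.__name__: fn for fn in functions}

    def execute(self, final_vars, inputs):
        cache = dict(inputs)  # computed values, seeded with external inputs

        def resolve(name):
            if name not in cache:
                if name not in self.nodes:
                    raise KeyError(f"unknown node or missing input: {name}")
                fn = self.nodes[name]
                # Resolve every parameter by name before calling the node.
                kwargs = {p: resolve(p) for p in inspect.signature(fn).parameters}
                cache[name] = fn(**kwargs)
            return cache[name]

        return {name: resolve(name) for name in final_vars}


# Step 3: request only the nodes you need; upstream nodes run as required.
dr = ToyDriver(cleaned_messages, message_count, average_length)
results = dr.execute(
    ["message_count", "average_length"],
    inputs={"raw_messages": [" hi ", "", "hello"]},
)
```

With Hamilton itself, steps 2 and 3 correspond to the `driver.Builder().with_modules(...).build()` and `dr.execute(...)` calls shown later in this issue.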

Integration ideas

dlt plugin for Hamilton

We already added a dlt plugin in Hamilton allowing users to load dlt.Resource as input and save outputs to dlt.Destination. This is useful for Hamilton users who want to start using dlt and run both as a unified pipeline. Also, some Hamilton DAG nodes might be "incompatible with dlt" (e.g., an XGBoost model).

Hamilton helper for dlt

It would make sense to have a "Hamilton helper" in dlt, similar to the dbt runner. It would help dlt users package their Hamilton code and bundle it with their dlt pipeline for execution. A typical pattern would look like this (full ref):

import dlt
from hamilton import driver

import slack  # NOTE: this is dlt source code, not an official Slack library
import transform  # module containing the Hamilton dataflow definition

# EXTRACT & LOAD
pipeline = dlt.pipeline(
    pipeline_name="slack",
    destination="duckdb",
    dataset_name="slack_community_backup",
)
source = slack.slack_source(
    selected_channels=["general"], replies=True
)
load_info = pipeline.run(source)

# TRANSFORM
dr = driver.Builder().with_modules(transform).build()
results = dr.execute(
    ["insert_threads"],  # request the `insert_threads` node
    inputs=dict(pipeline=pipeline),  # pass the dlt pipeline object
)
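
Hamilton dataflow modules are plain Python functions, so the `transform` module imported above could look roughly like the sketch below. This is not the module from the linked example. The table name `general_message`, the columns `thread_ts` and `text`, and the upstream nodes `messages` and `threads` are assumptions made for illustration. Only the `insert_threads` node and the `pipeline` input come from the snippet above.

```python
# transform.py: hypothetical sketch; table and column names are assumptions
from collections import Counter
from typing import Any


def messages(pipeline: Any) -> list:
    """Read back the rows that dlt loaded (assumed table `general_message`)."""
    with pipeline.sql_client() as client:
        rows = client.execute_sql("SELECT thread_ts, text FROM general_message")
    return list(rows or [])


def threads(messages: list) -> list:
    """Aggregate messages into one record per thread with a message count."""
    counts = Counter(thread_ts for thread_ts, _text in messages)
    return [
        {"thread_ts": thread_ts, "message_count": n}
        for thread_ts, n in sorted(counts.items())
    ]


def insert_threads(threads: list, pipeline: Any) -> Any:
    """Write the aggregated records back to the destination as a new table."""
    return pipeline.run(threads, table_name="threads")
```

Requesting `insert_threads` would then pull in `threads` and `messages` automatically, with `pipeline` supplied through `inputs`. A dlt-side helper could wrap exactly this wiring.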

Action

  • get a sense of what dlt users are looking for, their needs regarding Python transforms and usage patterns
  • define an API and work towards a Hamilton helper in dlt
  • maybe we only need to publish docs and guides