
🚧 Adds MLflow materializer #358

Draft · bryangalindo wants to merge 4 commits into main
Conversation

bryangalindo (Contributor)

🚧 WIP 🚧

Changes

How I tested this

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

bryangalindo (Contributor, Author) commented Sep 19, 2023

Model flavors can be found here or listed below (but `crate` seems to be missing?):

>>> import mlflow
>>> mlflow.__version__
'2.7.1'
>>> [attr for attr in dir(mlflow) if hasattr(getattr(mlflow, attr), 'log_model')]
[
    'catboost', 'diviner', 'fastai', 'gluon', 'h2o', 'johnsnowlabs', 'langchain', 
    'lightgbm', 'mleap', 'onnx', 'openai', 'paddle', 'pmdarima', 'prophet', 
    'pyfunc', 'pytorch', 'sentence_transformers', 'sklearn', 'spacy', 'spark', 
    'statsmodels', 'tensorflow', 'transformers', 'xgboost'
]

Top three flavors (probably): sklearn, tensorflow, pytorch. No hard data, just vibes.

bryangalindo (Contributor, Author) commented Sep 19, 2023

Example of the save/load flow for the sklearn model flavor, from the MLflow quickstart:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

import mlflow
from mlflow.models import infer_signature

db = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(db.data, db.target)

with mlflow.start_run() as run:
    rf = RandomForestRegressor(n_estimators=100, max_depth=6, max_features=3)
    rf.fit(X_train, y_train)
    save_predictions = rf.predict(X_test)
    signature = infer_signature(X_test, save_predictions)
    # "model" is the artifact path; it becomes part of the runs:/ URI below
    mlflow.sklearn.log_model(rf, "model", signature=signature)
    run_id = run.info.run_id

# the model URI must include the artifact path ("model"), not just the run id
model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")
load_predictions = model.predict(X_test)

# == on numpy arrays is elementwise; collapse to a single bool for the assert
assert np.array_equal(save_predictions, load_predictions)

disclaimer: I have not tested this
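One pitfall worth calling out in the snippet above: comparing NumPy arrays with `==` yields an elementwise boolean array, and truth-testing that array (as a bare `assert` does) raises a ValueError, so the comparison should be collapsed to a single bool with `np.array_equal` or `.all()`. A minimal illustration:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 3.0])

# Elementwise comparison returns an array, not a single bool
elementwise = a == b  # array([ True,  True,  True])

# `assert a == b` would raise "The truth value of an array with more than
# one element is ambiguous" -- collapse to one bool instead:
assert np.array_equal(a, b)
assert (a == b).all()
```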


skrawcz (Collaborator) commented Sep 19, 2023

@bryangalindo we should come up with the Hamilton UX to help guide this. i.e. what's the API we want to expose for Hamilton?

bryangalindo (Contributor, Author) replied:

> @bryangalindo we should come up with the Hamilton UX to help guide this. i.e. what's the API we want to expose for Hamilton?

Ok let's chat during our sync. Thanks!

@bryangalindo bryangalindo changed the title Adds MLflow materializer → 🚧 Adds MLflow materializer Sep 19, 2023
bryangalindo (Contributor, Author) commented Sep 21, 2023

High-level tasks:

Analysis:

  • (1 hour) Create "hello, world!" version of load_model/log_model to understand mlflow (debug, print stmts, etc).
  • (30 min) Observe directories/files created from log_model (see hamilton/plugins/mlruns/0/0b9e9b23c3ef443ba638d23e4318b58e)
  • (3 hours) Get high-level understanding of Hamilton driver, see hamilton/driver.py.
  • (3 hours) Get high-level understanding of e.g., regressors
  • (1 hour) Read through files in examples/materialization
  • (15 min) Decide on what reader/writer type makes sense (e.g., MLflowRegressorReader/MLflowRegressorWriter)
  • (15 min) Decide on the applicable type (e.g., dataframe, classifiers, regressors)
  • (30 min) Decide what metadata to save from model (see hamilton/plugins/mlruns/0/0b9e9b23c3ef443ba638d23e4318b58e/artifacts/model)
  • (15 min) Discover kwargs for log_model and load_model.

Reader/Writer Development:

  • (2 hours) Write reader
  • (2 hours) Write writer
  • (1 hour) Write get metadata function
  • (1 hour) Write unit tests for get metadata function
  • (1 hour) Write unit tests for reader/writer

Materializer Development:

  • (30 min) Write data loader module (see examples/materialization/data_loaders.py)
  • (1 hour) Write model_training module (see examples/materialization/model_training.py)
  • (1 hour) Write run.py module (see examples/materialization/run.py)
  • (1 hour) Write jupyter notebook example (see examples/materialization/notebook.ipynb)
  • (5 min) Write requirements.txt
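To make the reader/writer tasks above concrete, here is a rough sketch of the shape such a writer might take, modeled on the dataclass pattern used by existing Hamilton savers. Everything here is an assumption, not the final API: the class name, field names, and metadata keys are invented, and the real class would subclass hamilton.io.data_adapters.DataSaver and call mlflow rather than the placeholders below.

```python
import dataclasses
from typing import Any, Dict


@dataclasses.dataclass
class MLFlowModelSaver:
    """Hypothetical sketch only. A real implementation would subclass
    hamilton.io.data_adapters.DataSaver and register as a plugin."""

    artifact_path: str = "model"

    @classmethod
    def name(cls) -> str:
        # The string users would reference, e.g. to.mlflow(...)
        return "mlflow"

    def save_data(self, data: Any) -> Dict[str, Any]:
        # A real writer would call e.g. mlflow.sklearn.log_model(
        #     data, self.artifact_path, ...) inside an active run, then
        # return metadata (run id, model URI, flavor, ...) for lineage.
        run_id = "<run-id-from-mlflow>"  # placeholder, not a real run id
        return {
            "run_id": run_id,
            "model_uri": f"runs:/{run_id}/{self.artifact_path}",
        }
```

The returned metadata dict mirrors what other Hamilton savers do: the materializer surfaces it so downstream tooling can locate the logged artifact.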

elijahbenizzy (Collaborator) commented Sep 28, 2023

Hey @bryangalindo -- a thought on a feature that might be helpful. Here's an outline of what the API should look like -- the data saver/materialization implementation should support this.

from hamilton import driver
from hamilton.function_modifiers import source
from hamilton.io.materialization import to

dr = driver.Driver(...)
dr.materialize(
    to.mlflow(
        id="mlflow_save",
        dependencies=["my_cool_model"],
        model_input=source("training_data"),
        model_output=source("predictions"),
    )
)

Then the materializer would call infer_signature with the model_input and model_output -- these would be taken from nodes called training_data and predictions. The DAG would look like:

training_data -> mlflow_save
predictions -> mlflow_save

and possibly more connections. Does this make sense? This is all supported btw -- materializers can take in source/value type parameters, and if they're passed something that isn't a source or value, it will just resolve to a literal value.
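To illustrate the resolution behavior described here, a toy stand-in (not Hamilton's actual implementation -- the `source` dataclass and `resolve` function below are simplified local sketches of the assumed semantics):

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class source:
    """Toy stand-in for hamilton.function_modifiers.source."""
    node_name: str


def resolve(param: Any, node_outputs: Dict[str, Any]) -> Any:
    """A source(...) pulls the value of an upstream node;
    any other argument is treated as a literal value."""
    if isinstance(param, source):
        return node_outputs[param.node_name]
    return param


node_outputs = {"training_data": [[1.0], [2.0]], "predictions": [0.1, 0.2]}

resolve(source("predictions"), node_outputs)  # -> [0.1, 0.2]
resolve(42, node_outputs)                     # -> 42
```

So in the `to.mlflow(...)` sketch, `model_input=source("training_data")` would wire an edge from the `training_data` node, while a plain value would be passed through as-is.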
