Pyspark DecisionTreeRegressionModel bundle does not include all attributes #871

Open
anigmo97 opened this issue Feb 2, 2024 · 7 comments

anigmo97 commented Feb 2, 2024

Issue Description

PySpark's DecisionTreeRegressionModel loses attribute values (node predictions, impurities, impurity stats) after being serialized to an MLeap bundle and loaded back.

Minimal Reproducible Example

mleap version: 0.23.1
pyspark version: 3.3.0
Python version: 3.10.6

import os

import pyspark
import mleap.pyspark
from mleap.pyspark.spark_support import SimpleSparkSerializer  # attaches serializeToBundle/deserializeFromBundle to Spark models

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import DecisionTreeRegressor, DecisionTreeRegressionModel

# Step 1: Create a Spark session
spark = SparkSession.builder \
    .config('spark.jars.packages', 'ml.combust.mleap:mleap-spark_2.12:0.23.1') \
    .getOrCreate()

# Step 2: Prepare Data
data = [(1.0, 2.0, 3.0), (2.0, 3.0, 4.0), (3.0, 4.0, 5.0)]
columns = ["feature1", "feature2", "label"]
df = spark.createDataFrame(data, columns)

# Step 3: Feature Vector Assembly
assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
df = assembler.transform(df)

# Step 4: Split Data
(trainingData, testData) = df.randomSplit([0.8, 0.2], seed=1234)

# Step 5: Create and Train Decision Tree Model
dt = DecisionTreeRegressor(featuresCol="features", labelCol="label")
model = dt.fit(trainingData)

# Step 6: Make Predictions
predictions = model.transform(testData)

If we take a look at the trained model, we can see that its nodes have their attributes populated:

print(model._to_java().rootNode().toString())
print(model._java_obj.rootNode().toString())

InternalNode(prediction = 4.0, impurity = 0.6666666666666666, split = org.apache.spark.ml.tree.ContinuousSplit@3ff80000)
InternalNode(prediction = 4.0, impurity = 0.6666666666666666, split = org.apache.spark.ml.tree.ContinuousSplit@3ff80000)

If I save the model as an MLeap bundle and load it back, the results are:

model_path = f"{os.getcwd()}/tree_regressor.zip"
model.serializeToBundle(f"jar:file:{model_path}", predictions)
print(f"Model Saved as MLeap bundle at: {model_path}")

loaded_model = DecisionTreeRegressionModel.deserializeFromBundle(f"jar:file:{model_path}")

print(loaded_model._to_java().rootNode().toString())
print(loaded_model._java_obj.rootNode().toString())
print(loaded_model._to_java().rootNode().impurityStats())

InternalNode(prediction = 0.0, impurity = 0.0, split = org.apache.spark.ml.tree.ContinuousSplit@3ff80000)
InternalNode(prediction = 0.0, impurity = 0.0, split = org.apache.spark.ml.tree.ContinuousSplit@3ff80000)
None

jsleight (Contributor) commented Feb 2, 2024

# Step 1: Create a Spark session
spark = SparkSession.builder \
    .config('spark.jars.packages', 'ml.combust.mleap:mleap-spark_2.12:0.19.0') \
    .getOrCreate()

Your example is using mleap 0.19.0. Does the issue go away if you use the latest version? Also note that v0.23.1 is tested against Spark 3.4. I'd suspect it still works with Spark 3.3, but that combination is untested/unsupported.

anigmo97 (Author) commented Feb 2, 2024

Hello @jsleight,
You're right, I used the v0.19.0 jar by mistake.

I have tested using the correct jar:

spark = SparkSession.builder \
    .config('spark.jars.packages', 'ml.combust.mleap:mleap-spark_2.12:0.23.1') \
    .getOrCreate()

The results remain the same: the attributes are lost.

jsleight (Contributor) commented Feb 2, 2024

Looks like the op isn't serializing the impurities right now.

Looking at what withImpurities does, the impurities appear to be extra metadata that can aid in debugging but isn't critical to inference tasks. Excluding the impurities keeps the bundle sizes down.
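
One way to confirm this from the Python side is to open the bundle zip and inspect what was actually written. A minimal sketch, assuming the typical MLeap bundle layout (a bundle.json at the root plus per-model JSON files; the exact entry names are an assumption and may vary across versions):

import zipfile

# List the bundle's entries and dump the tree's serialized form to see
# which node attributes made it into the bundle. Entry names are assumed
# from the usual MLeap layout and may differ between versions.
with zipfile.ZipFile("tree_regressor.zip") as bundle:
    for name in bundle.namelist():
        print(name)
    # e.g. print(bundle.read("root/model.json").decode())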

anigmo97 (Author) commented Feb 2, 2024

Hello @jsleight

The impurities are important for explainability, for example: the SHAP library uses them to calculate SHAP values.
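
For illustration, a minimal sketch of that use case, assuming a shap release whose TreeExplainer accepts PySpark tree models (it reads the tree's node statistics, including the impurity information that the bundle round-trip drops):

import shap

# TreeExplainer walks the model's node statistics to compute SHAP values;
# assumes a shap version with PySpark tree-model support installed.
explainer = shap.TreeExplainer(model)
features = df.select("feature1", "feature2").toPandas()
shap_values = explainer.shap_values(features)
print(shap_values)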

jsleight (Contributor) commented Feb 2, 2024

Yeah, for sure. But I'd argue that serializing to mleap is for inference tasks. To do evaluation and introspection you could just:

model.save(path)
loaded = DecisionTreeRegressionModel.load(path)

using Spark's built-in persistence, then serializeToBundle when you're ready to productionize the model.
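
A minimal sketch of that split workflow, reusing the model and predictions from the reproduction above (the paths are illustrative):

import os

# Spark-native persistence keeps all node statistics, impurities included,
# so use it while evaluating and introspecting the model.
native_path = f"{os.getcwd()}/tree_regressor_native"
model.write().overwrite().save(native_path)
restored = DecisionTreeRegressionModel.load(native_path)
print(restored._to_java().rootNode().impurityStats())  # populated, unlike after the bundle round-trip

# Serialize to an MLeap bundle only when shipping the model for inference.
bundle_path = f"{os.getcwd()}/tree_regressor.zip"
model.serializeToBundle(f"jar:file:{bundle_path}", predictions)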

anigmo97 (Author) commented Feb 5, 2024

Hello @jsleight

I have no knowledge of Scala, but I think I understand how objects are serialized internally.

What do you think about the possibility of an additional parameter in serializeToBundle and deserializeFromBundle that lets us pass a Map with:

Key: the canonical name of the class you want to serialize in a special way.
Value: the custom op to apply to that class.

BundleRegistry would then check whether a class is in the new map and, if it is not, use the defaults.

With this, users could perhaps create their own ops to add and change attributes.
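
A hypothetical sketch of what that could look like from the Python side; nothing here exists in mleap today, and the parameter and op names are invented purely for illustration:

# Hypothetical API, not part of mleap: custom_ops maps a class's canonical
# name to a user-supplied op that BundleRegistry would prefer over its default.
custom_ops = {
    "org.apache.spark.ml.regression.DecisionTreeRegressionModel": MyTreeOpWithImpurities,  # hypothetical op
}
model.serializeToBundle(f"jar:file:{model_path}", predictions, custom_ops=custom_ops)
loaded = DecisionTreeRegressionModel.deserializeFromBundle(f"jar:file:{model_path}", custom_ops=custom_ops)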

jsleight (Contributor) commented Feb 5, 2024

Ah, in mleap you can do that exact process by altering the ops registry. We use it for xgboost in order to allow xgboost models to be serialized in different ways depending on how you want to serve them. See this readme and associated xgboost-runtime code as an example.

Using this process, your approach would be to:

  1. Create a custom Op
  2. Specify the new op in the resources.conf file (a rough sketch of this step follows below)
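
For step 2, an illustrative resources.conf fragment; the exact registry key names are an assumption on my part, so treat the linked readme and the xgboost-runtime reference.conf as the authoritative form:

# resources.conf (illustrative; the registry key names are assumptions)
ml.combust.mleap.registry.default.ops += "my.domain.mleap.ops.MyDecisionTreeOp"
ml.combust.mleap.spark.registry.default.ops += "my.domain.spark.ops.MyDecisionTreeOp"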
