MleapSpringBoot - Multi-input StringIndexer not supported yet #784

Open
inardini opened this issue Oct 14, 2021 · 1 comment

Comments

@inardini

inardini commented Oct 14, 2021

To whom it may concern,

I'm trying to deploy a PySpark pipeline using an MLeap bundle with the combustml/mleap-spring-boot:0.19.0-SNAPSHOT Docker image, and I get this error:

[MleapSpringBoot-akka.actor.default-dispatcher-6] [akka://MleapSpringBoot/user/transform/model] 
Cannot load bundle because: java.lang.UnsupportedOperationException: Multi-input StringIndexer not supported yet.

Any insight into how I can fix it?

The bundle has the following structure

model
├── bundle.json
└── root
    ├── RandomForestClassifier_e24b4862ceb2.node
    │   ├── model.json
    │   ├── node.json
    │   ├── tree0
    | .......
    │   └── tree9
    │       ├── model.json
    │       └── tree.json
    ├── StandardScaler_a24a7bb9bb7b.node
    │   ├── model.json
    │   └── node.json
    ├── StringIndexer_07ad6a29446e.node
    │   ├── model.json
    │   └── node.json
    ├── StringIndexer_397d06fcffaa.node
    │   ├── model.json
    │   └── node.json
    ├── VectorAssembler_56af20ae6ed6.node
    │   ├── model.json
    │   └── node.json
    ├── VectorAssembler_c118350511db.node
    │   ├── model.json
    │   └── node.json
    ├── model.json
    └── node.json

and it was trained using ml.combust.mleap:mleap-runtime_2.12:0.18.1 and ml.combust.mleap:mleap-spark_2.12:0.18.1 with spark version: 3.1.2.

Thanks

@jsleight
Contributor

The error means you are using StringIndexer in its multi-column input/output form, i.e., you set the inputCols parameter (and maybe the outputCols parameter). This feature was added in Spark 3. MLeap does support Spark 3, but doesn't yet support 100% of its capabilities (we try to throw exceptions like this one when support isn't available yet).
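To find which stages trigger this before exporting the bundle, something like the following could be used. It is only a sketch: the helper name `find_multi_column_indexers` is made up for illustration, and it assumes each stage exposes the standard PySpark `hasParam` / `isSet` methods and that matching on the class name is good enough to pick out StringIndexer stages.

```python
def find_multi_column_indexers(stages):
    """Return the StringIndexer stages that have inputCols explicitly set.

    `stages` is any iterable of pipeline stages, e.g. the list passed to
    Pipeline(stages=...). Hypothetical helper, not part of PySpark or MLeap.
    """
    flagged = []
    for stage in stages:
        # Match both StringIndexer and the fitted StringIndexerModel by name.
        if not type(stage).__name__.startswith("StringIndexer"):
            continue
        # isSet is True only when the parameter was set explicitly,
        # not when it merely has a default value.
        if stage.hasParam("inputCols") and stage.isSet("inputCols"):
            flagged.append(stage)
    return flagged
```

Any stage this returns is one that would need to be rewritten in single-column form.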

As a workaround, you can replace your multi-column StringIndexer with multiple single-column StringIndexers. E.g., supposing you have code like this right now:

indexer = StringIndexer(inputCols=["foo", "bar", "baz"], outputCols=["a", "b", "c"])
pipe = Pipeline(stages=[...,indexer,...])

Then change it to:

indexer1 = StringIndexer(inputCol="foo", outputCol="a")
indexer2 = StringIndexer(inputCol="bar", outputCol="b")
indexer3 = StringIndexer(inputCol="baz", outputCol="c")
pipe = Pipeline(stages=[...,indexer1, indexer2, indexer3, ...])

This is functionally equivalent and is supported in MLeap.
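If there are many columns, the single-column indexers can be generated rather than written out by hand. A minimal sketch, assuming the two column lists are the same length; the helper name `build_single_column_indexers` and its `indexer_cls` argument are made up for illustration and are not part of PySpark or MLeap:

```python
def build_single_column_indexers(input_cols, output_cols, indexer_cls=None):
    """Create one single-column indexer per (input, output) column pair.

    Hypothetical helper. `indexer_cls` defaults to pyspark's StringIndexer
    and exists only so another class can be swapped in.
    """
    if len(input_cols) != len(output_cols):
        raise ValueError("input_cols and output_cols must have the same length")
    if indexer_cls is None:
        # Imported lazily so the helper can be defined without Spark installed.
        from pyspark.ml.feature import StringIndexer
        indexer_cls = StringIndexer
    # One indexer per pair, using only the single-column parameters.
    return [
        indexer_cls(inputCol=in_col, outputCol=out_col)
        for in_col, out_col in zip(input_cols, output_cols)
    ]
```

For the example above, `build_single_column_indexers(["foo", "bar", "baz"], ["a", "b", "c"])` would give three indexers that can be spliced into the stages list in place of the original multi-column one.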
