This repository has been archived by the owner on May 12, 2021. It is now read-only.

[PIO-138] Fix batchpredict for custom PersistentModel #447

Open
wants to merge 1 commit into base: develop

Conversation

mars (Member) commented Nov 17, 2017

Fixes PIO-138

Switches batch query processing from a Spark RDD to a Scala parallel collection. As a result, the pio batchpredict command changes in the following ways:

  • --query-partitions option is no longer available; parallelism is now managed by Scala's parallel collections (see the sketch after this list)
  • --input option is now read as a plain, local file
  • --output option is now written as a plain, local file
  • because the input & output files are no longer parallelized through Spark, memory limits may require that large batch jobs be split into multiple command runs.
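For illustration, here is a minimal sketch of that flow, assuming a hypothetical Engine handle and placeholder query/result types and helpers; it is not the actual pio batchpredict implementation:

```scala
// A minimal sketch of the parallel-collection flow described above, not the
// real pio batchpredict code. The Engine trait, Query/PredictedResult types,
// and the parse/format helpers are hypothetical stand-ins.
import scala.io.Source
import java.io.PrintWriter

object BatchPredictSketch {
  case class Query(user: String, num: Int)
  case class PredictedResult(itemScores: Seq[(String, Double)])

  // Hypothetical engine handle exposing a synchronous predict call.
  trait Engine { def predict(q: Query): PredictedResult }

  // Placeholder helpers so the sketch is self-contained; real code would use
  // a JSON library to parse queries and serialize results.
  def parseQuery(line: String): Query = Query(line.trim, 10)
  def formatResult(r: PredictedResult): String = r.itemScores.mkString(",")

  def runBatch(engine: Engine, inputPath: String, outputPath: String): Unit = {
    // Read the whole local input file into memory. Without Spark partitioning,
    // very large batches may need to be split across several command runs.
    val lines = Source.fromFile(inputPath).getLines().toVector

    // .par turns the Vector into a parallel collection (Scala 2.12 and earlier;
    // 2.13+ needs the scala-parallel-collections module), so predictions run
    // concurrently across the machine's cores instead of across a cluster.
    val results = lines.par.map { line =>
      formatResult(engine.predict(parseQuery(line)))
    }.seq

    // Write results as plain lines to the local output file.
    val out = new PrintWriter(outputPath)
    try results.foreach(r => out.println(r)) finally out.close()
  }
}
```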

This solves the root problem that certain custom PersistentModels, such as the ALS Recommendation template, may themselves contain RDDs, which cannot be nested inside the batch-queries RDD (see SPARK-5063).
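For context, the forbidden pattern looks roughly like the commented sketch below (the model field and RDD names are hypothetical): mapping over an RDD of queries while the model itself wraps an RDD nests RDD operations inside a task, which Spark rejects at runtime.

```scala
// Illustrative only, with hypothetical names: the nesting that SPARK-5063 forbids.
//
//   val predictions = queriesRDD.map { query =>
//     // model.userFeatures is itself an RDD (as in the ALS Recommendation
//     // template). Calling lookup() here runs an RDD action inside another
//     // RDD's task, so Spark fails at runtime with "RDD transformations and
//     // actions can only be invoked by the driver, not inside of other
//     // transformations" (SPARK-5063).
//     model.userFeatures.lookup(query.user)
//   }
```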

mars (Member, Author) commented Nov 17, 2017

I'm currently testing this change with various engines and large batches.

mars changed the title from "Fix batchpredict for custom PersistentModel" to "[PIO-138] Fix batchpredict for custom PersistentModel" on Nov 17, 2017
mars (Member, Author) commented Nov 18, 2017

Tested this new pio batchpredict with all three model types:

  • ✅ custom PersistentModel (ALS Recommendation)
  • ✅ built-in, default model serialization (Classification)
  • ✅ null model (Universal Recommender)

This PR is ready to go!

mars (Member, Author) commented Nov 18, 2017

BTW, I found that the performance of a large 250K-query batch on a single multi-core machine is equivalent to the previous Spark RDD-based implementation.

mars (Member, Author) commented Dec 14, 2017

This PR stalled due to @dszeto's concerns about removing the distributed processing capability from pio batchpredict. I agree that distributed batch processing is optimal, but I do not have a solution for the nested-RDD problem encountered with RDD-based persistent models.
