Is it possible to share data between multiple transformer instances? #825

Open
zhenglaizhang opened this issue Aug 30, 2022 · 4 comments

Comments

@zhenglaizhang

Hello mleap experts,

I have built a custom transformer that maps a key to a vector using a Map. The map is not small (~100K), and the custom transformer is used multiple times in the same mleap pipeline. Each instance is serialized separately, so the underlying Map is duplicated. Is it possible for multiple transformer instances to share the same underlying data, so that only one copy is stored in the bundle file and only one copy is held in memory?

@zhenglaizhang zhenglaizhang changed the title Is it possible to share data between multiple transformer instance? Is it possible to share data between multiple transformer instances? Aug 30, 2022
@jsleight
Contributor

You can definitely have the transformer instances share the map state, e.g. by storing the map as part of a companion object.
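A minimal sketch of the companion-object idea in plain Scala. `KeyToVector`, `install`, and `shared` are hypothetical names for illustration and are not part of the MLeap API; a real transformer would call something like `install` from its own load path.

```scala
// Hypothetical names throughout; this is not MLeap API.
final class KeyToVector(val inputCol: String) {
  // Every instance reads from the single map held by the companion object.
  def apply(key: String): Option[Array[Double]] = KeyToVector.shared.get(key)
}

object KeyToVector {
  // One copy per JVM (strictly, per classloader), set once at load time.
  @volatile private var table: Map[String, Array[Double]] = Map.empty

  def shared: Map[String, Array[Double]] = table

  // The first caller installs the map; later calls are ignored.
  def install(m: Map[String, Array[Double]]): Unit = synchronized {
    if (table.isEmpty) table = m
  }
}
```

With this shape, two instances created for different columns return the very same array object for a given key, so memory holds one copy regardless of how many instances the pipeline contains.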

Storing the map only once in the bundle file is trickier. I think it can be done, but it would be kind of ugly. Maybe something like adding and storing a "shouldWriteMap" parameter on the transformer, which you set to true on exactly one instance of the transformer in your pipeline. It might be easier to just store the map multiple times within the bundle.
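The "shouldWriteMap" idea could look roughly like the sketch below. It models the bundle as a plain in-memory `Map` rather than MLeap's real bundle format, and `Node`, `StoredNode`, and `FlaggedStore` are hypothetical names introduced only for illustration.

```scala
// Hypothetical model: a node is one transformer instance in the pipeline.
final case class Node(name: String, shouldWriteMap: Boolean, table: Map[String, Vector[Double]])

// What ends up "on disk" for each node; only the flagged node carries the table.
final case class StoredNode(shouldWriteMap: Boolean, table: Option[Map[String, Vector[Double]]])

object FlaggedStore {
  def store(nodes: Seq[Node]): Map[String, StoredNode] = {
    require(nodes.count(_.shouldWriteMap) == 1, "exactly one node must write the map")
    nodes.map { n =>
      n.name -> StoredNode(n.shouldWriteMap, if (n.shouldWriteMap) Some(n.table) else None)
    }.toMap
  }

  def load(bundle: Map[String, StoredNode]): Seq[Node] = {
    // Find the single stored table and hand the same reference to every node.
    val table = bundle.values.flatMap(_.table).headOption
      .getOrElse(sys.error("no node stored the map"))
    bundle.toSeq.sortBy(_._1).map { case (name, s) => Node(name, s.shouldWriteMap, table) }
  }
}
```

The `require` makes the awkward part explicit: whoever builds the pipeline has to set the flag on exactly one instance, otherwise the table is either missing or duplicated.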

Another option you could consider is to make your transformer multiple input/output, so that you only need to use the transformer once in your pipeline.
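As a rough illustration of the multiple input/output option, here is a sketch in which one instance holds the table once and maps several input columns to output columns. `MultiKeyToVector` is a hypothetical class, and a row is modelled as a simple `Map` instead of MLeap's actual row type.

```scala
// Hypothetical class; not MLeap API.
final class MultiKeyToVector(
    table: Map[String, Vector[Double]],
    columns: Seq[(String, String)] // (inputCol, outputCol) pairs
) {
  // Looks up each input column's key in the single shared table.
  def transform(row: Map[String, String]): Map[String, Option[Vector[Double]]] =
    columns.map { case (in, out) => out -> row.get(in).flatMap(table.get) }.toMap
}
```

Because the pipeline then contains a single node, the table is serialized once without any extra bookkeeping.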

@zhenglaizhang
Author

zhenglaizhang commented Aug 30, 2022

Thanks @jsleight for the quick response!

The map is not small compared with the overall bundle file; most of the bundle's size comes from the duplicated maps.

> store the map as part of a companion object.

This is a good idea that I can use to reduce the memory footprint. I could keep another map in the object that stores the different embeddings, with exactly one copy of each, and then, when loading a duplicated instance, just point it to the existing map in the object.
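One way to sketch the "exactly one copy of each embedding" idea is an interning pool held in an object. `EmbeddingPool` and `intern` are hypothetical names; the sketch assumes the tables are immutable `Map`s of `Vector`s, which compare structurally (this would not work with `Array` values, which compare by reference).

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical helper; not MLeap API.
object EmbeddingPool {
  type Table = Map[String, Vector[Double]]

  // Keyed by the table itself: equal contents map to the same pooled instance.
  private val pool = TrieMap.empty[Table, Table]

  // Returns the pooled copy if an equal table was seen before, else pools this one.
  def intern(t: Table): Table = pool.putIfAbsent(t, t).getOrElse(t)

  def size: Int = pool.size
}
```

Each deserialized instance would pass its freshly loaded table through `intern` and keep the returned reference, letting the duplicate be garbage collected. Hashing a large table is not free, but it happens once per instance at load time.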

Adding a shouldWriteMap parameter would make the transformer a little harder for the ML team to use, since they would need to maintain the flag. I am wondering whether it is possible to store the common data in the root and have multiple instances point to that common data, but this seems to break the mleap serialization design philosophy?

Updating the transformer to be multiple input/output is also a solution, but I would prefer to see whether I can change the serialization/deserialization internally to achieve the goal, as there is already a bunch of client code using the custom transformer.

@zhenglaizhang
Author

zhenglaizhang commented Aug 30, 2022

I looked through the mleap code. It seems I can customize the serialization of a single transformer with store(), but I cannot customize the overall node serialization to add data-sharing logic.

@jsleight
Contributor

Yeah, to my knowledge the mleap APIs don't really have a good mechanism for storing global state in the bundle. I wouldn't be opposed to adding such a capability, though, if you want to submit a PR.

Perhaps by adding new APIs such as writeGlobal and readGlobal (or something like that) which the Ops can use. We would probably need to rely on transformers providing unique keys in the global bundle namespace, but I think that should be acceptable.
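To make the proposal concrete, here is a purely illustrative sketch of what a keyed global store could look like. Nothing here exists in MLeap today: `GlobalStore` is a hypothetical in-memory stand-in for a bundle-level namespace, and the signatures of `writeGlobal` and `readGlobal` are assumptions.

```scala
import scala.collection.mutable

// Hypothetical: neither this class nor writeGlobal/readGlobal exists in MLeap.
final class GlobalStore {
  private val entries = mutable.Map.empty[String, Array[Byte]]

  // Writes once per key. `bytes` is by-name, so later writers with the same
  // key skip the (potentially expensive) serialization entirely.
  def writeGlobal(key: String, bytes: => Array[Byte]): Unit =
    if (!entries.contains(key)) entries(key) = bytes

  def readGlobal(key: String): Option[Array[Byte]] = entries.get(key)
}
```

Under this shape every instance of the transformer could call `writeGlobal` with the same key during store and `readGlobal` during load, with no per-instance flag to maintain. The unique-key requirement mentioned above then becomes the transformer author's responsibility.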
