Does torcharrow support industry-level large scale data? #476

circlecrystal · 2022-08-19T22:24:35Z

I'm asking for myself and also for the algorithm team members at my company.
We currently have PB-level data, split into Parquet files across different remote HDFS paths (one per day), that we need to train on.

We would really like an answer to this question: how well does torcharrow perform on data at this scale in industry?

Does it download a few remote Parquet files, store them locally, and then release them, or does it just transfer data over the network without any local caching?
How well does it handle extremely large datasets, from TB-level to PB-level, maybe even EB-level? Is it performant compared to other solutions?

wenleix · 2022-08-20T23:41:25Z

Thanks for the interest! We have an internal scalable distributed system called Data PreProcessing Service (DPP) [1] that executes traced TorchArrow programs at Meta scale.

It's an open question whether and how we can open source the distributed mode, as DPP is deeply integrated with Meta's infrastructure. It may be possible to open source just the tracer (think PyTorch FX Tracer), with separate integrations into the OSS big data ecosystem.

For your use case, is there a preferred big data stack you would like to integrate with to execute traced TA programs (e.g. Spark, Kafka, Ray, or a customized distributed runtime)?

cc @dracifer, @msaroufim, @damianr99

[1] https://arxiv.org/pdf/2108.09373.pdf

circlecrystal · 2022-08-24T02:24:01Z

Thanks for taking the time to answer my question. Our current stack mostly uses Spark or Ray to execute distributed programs. The difficulty is that a solution is still missing for training a large model across multiple training containers with large-scale training data in the PyTorch framework.

circlecrystal changed the title ~~Does torcharrow support industry-level massive data?~~ Does torcharrow support industry-level large scale data? Aug 19, 2022
