
Implement a pytorch dataloader that filters and downloads at run time #39

Open
rom1504 opened this issue Sep 16, 2021 · 9 comments

rom1504 commented Sep 16, 2021

This is an online version of #31.
Combine the whole pipeline not as one big batch job, but as a data loader that:

  • queries/filters in a knn index + metadata structure
  • downloads
  • resizes
  • feeds the samples to training

It makes sense in particular when model training is slow; DALL-E is such a model.
For CLIP it could make less sense.

If it works, it could be a lot more convenient than downloading TBs of webdataset:

  1. download a 16GB knn index and 50GB of metadata
  2. write your best keywords and how many samples of each you'd like (with CLIP thresholds)
  3. start training on up to 400M samples
rom1504 pinned this issue Sep 19, 2021

rom1504 commented Sep 25, 2021

Related: rom1504/img2dataset#56

I'm thinking of implementing the download + resize inside img2dataset, since those features are already there.
To pass the data to PyTorch, a good way would be to add a writer to img2dataset that takes a multiprocessing queue as an attribute (https://docs.python.org/3/library/multiprocessing.html#pipes-and-queues) and then to consume that queue in an iterable dataset (https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset).
Since the queue is process- and thread-safe, it works on the producer side (img2dataset produces from multiple processes) and on the consumer side (the torch dataloader can apply any resizing/batching in different processes).

img2dataset would not need to depend on pytorch, since implementing an iterable dataset only requires a class with an __iter__ method.
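A minimal sketch of what that consumer side could look like, assuming a hypothetical img2dataset queue writer on the producer side; the sample format and the None sentinel are illustrative assumptions, not an existing API:

```python
import multiprocessing

from torch.utils.data import DataLoader, IterableDataset


class QueueStreamingDataset(IterableDataset):
    """Iterates over samples pushed into a process- and thread-safe queue."""

    def __init__(self, queue):
        self.queue = queue

    def __iter__(self):
        while True:
            sample = self.queue.get()
            if sample is None:  # sentinel the producer would push when done
                return
            yield sample


# Producer side (hypothetical img2dataset queue writer):
#   queue.put((image_bytes, metadata)) for each downloaded + resized image
#   queue.put(None) once all shards are written
queue = multiprocessing.Queue(maxsize=10_000)
dataset = QueueStreamingDataset(queue)
loader = DataLoader(dataset, batch_size=256, num_workers=0)
```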


rom1504 commented Sep 25, 2021

The filtering / retrieving-from-an-index part would however make more sense to live here, so clip-retrieval could depend on img2dataset and wrap its UrlStreamingDataset into a FilteredUrlStreamingDataset.

Let's hope this can be made to work at the same speed as img2dataset (1300 samples/s).
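A sketch of what that filtering layer could look like, wrapping the (not yet existing) UrlStreamingDataset; the predicate shown here, a clip-similarity threshold on precomputed metadata, is only an example:

```python
from torch.utils.data import IterableDataset


class FilteredUrlStreamingDataset(IterableDataset):
    """Keeps only the (url, metadata) samples accepted by a predicate,
    e.g. a knn-index / clip-score filter."""

    def __init__(self, url_dataset, predicate):
        self.url_dataset = url_dataset  # e.g. img2dataset's UrlStreamingDataset
        self.predicate = predicate

    def __iter__(self):
        for url, metadata in self.url_dataset:
            if self.predicate(metadata):
                yield url, metadata


# Example with a plain list standing in for the real url stream
filtered = FilteredUrlStreamingDataset(
    url_dataset=[("http://example.com/a.jpg", {"similarity": 0.35})],
    predicate=lambda meta: meta["similarity"] >= 0.3,
)
```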


rom1504 commented Feb 11, 2022

rom1504/img2dataset#82

It could be interesting to investigate this path:

  1. img2dataset is a (multi-instance-per-machine) REST service that takes as input a path to a URL shard and returns a path to an image shard when it's done
  2. clip inference is a (multi-instance-per-machine) REST service that takes as input a path to an image shard and returns a path to an embedding shard when it's done
  3. autofaiss is a (multi-instance-per-machine) REST service that takes as input a path to an embedding file and returns a path to an index when it's done

The img2dataset service can also expose a shard endpoint that takes as input some URL and caption files and turns them into shard files.
The autofaiss service can also expose a train endpoint and a merge endpoint.
The clip inference service can also expose a combine endpoint to turn N embedding files into one.

Then all that is needed is an orchestrator with a metadata database that makes sure all the shards are properly processed.
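A minimal sketch of the img2dataset service under this design, assuming FastAPI; the /shard endpoint name, the request fields and the download_shard placeholder are hypothetical, not an existing API:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class ShardRequest(BaseModel):
    url_shard_path: str  # path to a shard of urls + captions
    output_path: str     # where the image shard should be written


def download_shard(url_shard_path: str, output_path: str) -> str:
    """Placeholder: would run img2dataset's download + resize on one shard."""
    return output_path


@app.post("/shard")
def process_shard(request: ShardRequest):
    image_shard_path = download_shard(request.url_shard_path, request.output_path)
    return {"image_shard_path": image_shard_path}
```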

Benefits:

  • easy separation of concerns
  • easy deployment of the services
  • easy scaling
  • easy to combine various features
  • provide both streaming and batch modes with one implementation
  • possible to use it only to get a few shards
  • simpler to test
  • logic in each service is limited, no need to redo the orchestration every time

To check:


rom1504 commented Feb 26, 2022

New idea: rethink all these tools as dataflow/stream transformers that take a collection as input and produce an output collection, with optional caching and back-pressure.

reader:

  • url/meta in parquet, csv,.. -> shards of url/meta
  • images in files, tar, parquet -> shards of image/meta
  • embeddings in npy, parquet -> shards of embeddings
  • indices in .index -> shards of indices

writer:

  • shards of url/meta -> url/meta in parquet, csv, ..
  • shards of image/meta -> images in files, tar, parquet
  • shards of embeddings -> embeddings in npy, parquet
  • shards of indices -> indices in .index

transformer:

  • shards of url/meta -> shards of image/meta
  • shards of image/meta -> shards of embeddings / meta
  • shards of embeddings / meta -> shards of indices

These bricks could then be naturally composed to form downloaders, inference pipelines, and indexers.
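A sketch of how such bricks could compose, using plain generators over shards; all names and the shard format are illustrative placeholders:

```python
from typing import Iterator


def parquet_url_reader(path: str) -> Iterator[list]:
    """reader: url/meta in parquet -> shards of url/meta (placeholder)."""
    yield [("http://example.com/a.jpg", {"caption": "a cat"})]


def download_transformer(url_shards: Iterator[list]) -> Iterator[list]:
    """transformer: shards of url/meta -> shards of image/meta (placeholder)."""
    for shard in url_shards:
        yield [({"image": b""}, meta) for _url, meta in shard]


def tar_writer(image_shards: Iterator[list]) -> None:
    """writer: shards of image/meta -> images in tar files (placeholder)."""
    for shard in image_shards:
        print(f"writing a shard of {len(shard)} images")


# a downloader is just reader -> transformer -> writer
tar_writer(download_transformer(parquet_url_reader("urls.parquet")))
```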

Define good interfaces for each subtool, then make each tool a separate package, well tested and with good examples.

Check if https://docarray.jina.ai/fundamentals/documentarray/ could be helpful to build this

This new structure should make it possible to make all these tools both more powerful and more reusable


rom1504 commented Feb 26, 2022


rom1504 commented Feb 26, 2022

Let's first try and check how to read a large file in parallel with fsspec.


rom1504 commented Feb 27, 2022

Reading a large file with fsspec works by seeking and then reading up to a given length; it's much faster.
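A minimal sketch of such a ranged read with fsspec; the path and byte offsets are made up:

```python
import fsspec

path = "s3://some-bucket/embeddings/emb_0.npy"  # hypothetical remote file
offset, length = 1024, 4 * 1024 * 1024          # start byte, number of bytes

with fsspec.open(path, "rb") as f:
    f.seek(offset)
    chunk = f.read(length)  # only this byte range is fetched, not the whole file
```

Several such ranged reads can then be issued in parallel (e.g. from a thread pool), one per slice of the file.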


rom1504 commented Feb 27, 2022

The next step will be implementing a clean embedding-reader package.


rom1504 commented Feb 27, 2022

Independently, I think that https://towardsdatascience.com/data-pipelines-with-apache-beam-86cd8eb55fd8 (data pipelines with Apache Beam) looks good.
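For reference, a minimal Beam pipeline in the spirit of the reader -> transformer -> writer decomposition above; the element values and the download_and_resize placeholder are hypothetical:

```python
import apache_beam as beam


def download_and_resize(url):
    """Placeholder for the download + resize transformer."""
    return {"url": url, "image": b""}


with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadUrls" >> beam.Create(["http://example.com/a.jpg", "http://example.com/b.jpg"])
        | "DownloadResize" >> beam.Map(download_and_resize)
        | "WriteShards" >> beam.Map(print)  # stand-in for a real shard writer
    )
```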
