Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streaming refacto #353

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Streaming refacto #353

wants to merge 6 commits into from

Conversation

rom1504
Copy link
Owner

@rom1504 rom1504 commented Oct 8, 2023

Refactor img2dataset in 3 parts:

  • core: provides the core functionalities as functions with no user interface
  • batch: looks like the current CLI, can download N millions samples, by splitting in shards and running distributed if needed.
  • service: a fastapi API, that can download a shard. It can also queue. It also contains a load balancer to be able to use N workers

The main ideas here are:

  • make the core functionality be a simple function/process that takes one shard and process it
  • bundle that as batch as usual
  • now provides also a service which can provide similar but more flexible functionalities: streaming download, potentially more flexible distribution, ...

The implementation is mostly working but requires refinement before considering merging.

Was implemented some 8 months ago, but I'm thinking to finish it now

@rom1504
Copy link
Owner Author

rom1504 commented Oct 8, 2023

#339 related issue

@rom1504
Copy link
Owner Author

rom1504 commented Oct 8, 2023

one alternative I'm considering (but maybe that can simply live next to batch and service instead of replacing them): implement things as tasks and then use a distributed queue like celery to run things. I don't like making installing a service like reddis part of user requirements though

@rom1504
Copy link
Owner Author

rom1504 commented Oct 8, 2023

maybe ray https://stackoverflow.com/a/54738967 is an alternative to celery

@rom1504
Copy link
Owner Author

rom1504 commented Oct 8, 2023

@rom1504 rom1504 added this to Needs triage in PR Triage Dec 17, 2023
@rom1504 rom1504 moved this from Needs triage to Important to finish in PR Triage Jan 9, 2024
@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

https://github.com/veonua/laion-yt-dlp example for celery

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

I think this branch is an interesting POC but it's too many changes

I think a PR that would only extract the core as a standalone function / process would be a good start. It may need dropping the keys based on shard first

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

maybe https://abseil.io/docs/python/quickstart is a good alternative

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

https://docs.celeryq.dev/en/stable/userguide/tasks.html celery take on how to configure tasks

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

the flags should be much more separated process by process

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

next step here: try to POC a few different ideas of parameter definition instead of pydantic
feels like something simpler is possible

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

anything can be generated into a fastapi model at the end if necessary https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/service/service.py#L37

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

maybe check a few projects using ray or celery and see if something helps

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

configuration for processes

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

one idea:

  • config for the process (init), include feature names
  • input features (image)
  • output features (image)

input and output calls are generated based on feature names

so have a generic config for processes

then apply that here but also any2dataset/video2dataset

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

try to write that out as json maybe

so all actual config here would be one json per process

we would also have support top level for the flat args at top level for legacy reasons

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

write it out and see how that plays out for all the projects

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

being able to configure these processes properly here should make all the wrapper around it (service, batch, distributors) come along naturally

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

resizer options:

        image_size,
        resize_mode,
        resize_only_if_bigger,
        upscale_interpolation="lanczos",
        downscale_interpolation="area",
        encode_quality=95,
        encode_format="jpg",
        skip_reencode=False,
        disable_all_reencoding=False,
        min_image_size=0,
        max_image_area=float("inf"),
        max_aspect_ratio=float("inf"),
        blurrer=None, <- this should be replaced by blurrer config or moved out ?

img_stream, blurring_bbox_list=None <- input feature
img_str, width, height, original_width, original_height, None <- output feature

this is a process that applies to one image

json for this?

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

maybe a typed thing is better here?

  • proto
  • pydantic
  • what else ?

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

@rom1504
Copy link
Owner Author

rom1504 commented Jan 9, 2024

write down some of these and some potential schema and find what sticks

@rom1504
Copy link
Owner Author

rom1504 commented Jan 12, 2024

@rom1504
Copy link
Owner Author

rom1504 commented Jan 12, 2024

@rom1504
Copy link
Owner Author

rom1504 commented Jan 12, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
PR Triage
Important to finish
Development

Successfully merging this pull request may close these issues.

None yet

1 participant