
What do DIV and FLT stand for? #91

Open

vedantroy opened this issue Mar 26, 2024 · 2 comments

Comments

@vedantroy

I see there are 3 subsets: DIV, FLT, and the aesthetic version. What are the filtering criteria used for DIV and FLT, and what do they stand for?

@shepnerd
Member

DIV and FLT stand for diversity sampling and filtering, respectively.

For DIV (diversity sampling), we aim to sample video clips from all available long videos to maximize data diversity. This was done by counting the frequencies of long videos in the segmented clip pool and sampling clips with probabilities inversely proportional to these frequencies.

For FLT (filtering), we applied a series of filtering strategies to the video data alongside DIV sampling:

a) Removing video clips shorter than 1s (approximately 23.15% of the total) or longer than 120s (around 0.84% of the total).
b) Computing a CLIPScore for each video clip with OpenAI's CLIP-ViT-L/14, using a randomly sampled frame from the clip, and keeping the clips within the top 30% of CLIPScores.
c) Sampling 10M clips from the remainder using DIV sampling.

You can refer to Sec. E.1 of the paper's appendix.
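
In code, the described pipeline amounts to something like the minimal sketch below. The field names (`video_id`, `duration`, `clip_score`) and helper functions are hypothetical, and `clip_score` is assumed to be precomputed (the CLIP-ViT-L/14 similarity between a randomly sampled frame and the clip's caption); this is an illustration of the procedure described above, not the project's actual implementation.

```python
import numpy as np
from collections import Counter

# Hypothetical clip record: {"video_id": str, "duration": float (seconds),
# "clip_score": float}. clip_score is assumed precomputed as the
# CLIP-ViT-L/14 similarity between a randomly sampled frame and the caption.

def flt_filter(clips, min_dur=1.0, max_dur=120.0, top_frac=0.30):
    """FLT steps a) and b): drop clips outside [min_dur, max_dur] seconds,
    then keep only the top `top_frac` fraction by CLIPScore."""
    kept = [c for c in clips if min_dur <= c["duration"] <= max_dur]
    kept.sort(key=lambda c: c["clip_score"], reverse=True)
    return kept[: int(len(kept) * top_frac)]

def div_sample(clips, n_samples, seed=0):
    """DIV step: inverse-frequency sampling. Clips whose source video
    contributed many clips to the pool get proportionally lower sampling
    probability, so long videos are not over-represented."""
    rng = np.random.default_rng(seed)
    freq = Counter(c["video_id"] for c in clips)
    weights = np.array([1.0 / freq[c["video_id"]] for c in clips], dtype=float)
    probs = weights / weights.sum()
    idx = rng.choice(len(clips), size=min(n_samples, len(clips)),
                     replace=False, p=probs)
    return [clips[i] for i in idx]

# FLT step c), roughly: div_sample(flt_filter(all_clips), 10_000_000)
```

The inverse-frequency weights mean that each clip from a video that produced many clips is sampled with lower probability, so long source videos do not dominate the resulting subset.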

@vedantroy
Author

vedantroy commented Mar 27, 2024

Got it, and thanks for the fast response! 4 follow-ups (the first one is the most important):

  1. Have you released the JSONL for the full set of 234M clips?

After the filtering, we get total 234M video clips whose durations range from 2s to more than 30s.

  2. Does the aesthetic dataset do any sort of filtering by CLIP score? (I'm guessing not, but wanted to confirm.) Also, how did you determine what counts as a high aesthetic score? (Top 10%? Above some constant? etc.)

  3. Is this passage:

we aim to sample video clips from all available long videos to maximize data diversity. This was done by counting the frequencies of long videos in the segmented clip pool and sampling clips with probabilities inversely proportional to these frequencies

Saying "if there are many clips from the same video, we sample those clips less" (presumably in order to avoid over sampling from longer videos?)

  4. Is there a reason you used CLIPScore computed with CLIP-ViT-L/14 instead of the UMT_Score when calculating video-caption similarity?
