
What do DIV and FLT stand for? #91

Open

vedantroy opened this issue Mar 26, 2024 · 2 comments

Comments

@vedantroy

I see there are 3 subsets: DIV, FLT, and the aesthetic version. What are the filtering criteria used for DIV and FLT, and what do they stand for?

@shepnerd
Member

DIV and FLT stand for diversity sampling and filtering, respectively.

For DIV (diversity sampling), we aim to sample video clips from all available long videos to maximize data diversity. This was done by counting the frequencies of long videos in the segmented clip pool and sampling clips with probabilities inversely proportional to these frequencies.

For FLT (filtering), we applied a series of filtering strategies to the video data alongside DIV sampling:

a) Removing video clips shorter than 1s (approximately 23.15% of the total) or longer than 120s (around 0.84% of the total).
b) Computing a CLIPScore for each video clip with OpenAI's CLIP-ViT-L/14, using a randomly sampled frame from the clip, and keeping the clips within the top 30% of CLIPScores.
c) Sampling 10M clips from the remainder using DIV sampling.

You can refer to Sec. E.1 of the paper's appendix.
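
In code, the described pipeline amounts to something like the minimal sketch below. The field names (`video_id`, `duration`, `clip_score`) and helper functions are hypothetical, and `clip_score` is assumed to be precomputed (the CLIP-ViT-L/14 similarity between a randomly sampled frame and the clip's caption); this is an illustration of the procedure described above, not the project's actual implementation.

```python
import numpy as np
from collections import Counter

# Hypothetical clip record: {"video_id": str, "duration": float (seconds),
# "clip_score": float}. clip_score is assumed precomputed as the
# CLIP-ViT-L/14 similarity between a randomly sampled frame and the caption.

def flt_filter(clips, min_dur=1.0, max_dur=120.0, top_frac=0.30):
    """FLT steps a) and b): drop clips outside [min_dur, max_dur] seconds,
    then keep only the top `top_frac` fraction by CLIPScore."""
    kept = [c for c in clips if min_dur <= c["duration"] <= max_dur]
    kept.sort(key=lambda c: c["clip_score"], reverse=True)
    return kept[: int(len(kept) * top_frac)]

def div_sample(clips, n_samples, seed=0):
    """DIV step: inverse-frequency sampling. Clips whose source video
    contributed many clips to the pool get proportionally lower sampling
    probability, so long videos are not over-represented."""
    rng = np.random.default_rng(seed)
    freq = Counter(c["video_id"] for c in clips)
    weights = np.array([1.0 / freq[c["video_id"]] for c in clips], dtype=float)
    probs = weights / weights.sum()
    idx = rng.choice(len(clips), size=min(n_samples, len(clips)),
                     replace=False, p=probs)
    return [clips[i] for i in idx]

# FLT step c), roughly: div_sample(flt_filter(all_clips), 10_000_000)
```

The inverse-frequency weights mean that each clip from a video that produced many clips is sampled with lower probability, so long source videos do not dominate the resulting subset.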

@vedantroy
Author

vedantroy commented Mar 27, 2024

Got it, and thanks for the fast response! 4 follow-ups (the first one is the most important):

  1. Have you released the JSONL for the full set of 234M clips?

After the filtering, we get total 234M video clips whose durations range from 2s to more than 30s.

  2. Does the aesthetic dataset do any sort of filtering by CLIP score? (I'm guessing not, but wanted to confirm.) Also, how did you determine what counts as a high aesthetic score? (Top 10%? Above some constant? etc.)

  3. Is this passage:

we aim to sample video clips from all available long videos to maximize data diversity. This was done by counting the frequencies of long videos in the segmented clip pool and sampling clips with probabilities inversely proportional to these frequencies

Saying "if there are many clips from the same video, we sample those clips less" (presumably in order to avoid over sampling from longer videos?)

  4. Is there a reason you used CLIPScore computed with CLIP-ViT-L/14 instead of the UMT_Score when calculating video-caption similarity?
