Refined open source dataset by Data-Juicer

We found that existing processed datasets (e.g., RedPajama and The Pile) still contain some "bad" samples. We therefore use Data-Juicer to refine them and feed the refined data to LLMs, aiming for better performance.

We use a simple 3-σ rule to set the hyperparameters of the OPs in each recipe.
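As a rough illustration of how a 3-σ rule can turn a per-sample statistic into filter thresholds, here is a minimal sketch. It assumes the thresholds are taken as the mean plus or minus three standard deviations of the statistic; the function name `three_sigma_bounds` and the sample values are hypothetical and are not part of Data-Juicer.

```python
import statistics


def three_sigma_bounds(values):
    """Return (lower, upper) = mean -/+ 3 * std for a list of per-sample stats."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return mean - 3 * std, mean + 3 * std


# Hypothetical per-sample statistic, e.g. text length in characters:
# 30 ordinary samples around 1000 plus one extreme outlier.
text_lengths = [980, 1010, 1005, 995, 1020, 990, 1000, 1015, 985, 1000] * 3 + [25000]

lower, upper = three_sigma_bounds(text_lengths)

# Keep only samples whose statistic lies inside the 3-sigma interval.
kept = [v for v in text_lengths if lower <= v <= upper]
print(f"bounds = ({lower:.1f}, {upper:.1f}), kept {len(kept)} of {len(text_lengths)}")
```

In a recipe, bounds obtained this way would be written into the corresponding filter OP's minimum and maximum parameters.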

Before and after refining for Pretraining Text Dataset

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---|---|---|---|---|---|---|
| arXiv | 1,724,497 | 1,655,259 | 95.99% | redpajama-arxiv-refine.yaml | Aliyun, ModelScope, HuggingFace | Redpajama |
| Books | 205,182 | 195,983 | 95.51% | redpajama-book-refine.yaml | Aliyun, ModelScope, HuggingFace | Redpajama |
| Wikipedia | 29,834,171 | 26,990,659 | 90.47% | redpajama-wiki-refine.yaml | Aliyun, ModelScope, HuggingFace | Redpajama |
| C4 | 364,868,892 | 344,491,171 | 94.42% | redpajama-c4-refine.yaml | Aliyun, ModelScope, HuggingFace | Redpajama |
| Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | redpajama-cc-2019-30-refine.yaml | Aliyun, ModelScope, HuggingFace | Redpajama |
| Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | redpajama-cc-2020-05-refine.yaml | Aliyun, ModelScope, HuggingFace | Redpajama |
| Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | redpajama-cc-2021-04-refine.yaml | Aliyun, ModelScope, HuggingFace | Redpajama |
| Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | redpajama-cc-2022-05-refine.yaml | Aliyun, ModelScope, HuggingFace | Redpajama |
| Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | redpajama-cc-2023-06-refine.yaml | Aliyun, ModelScope, HuggingFace | Redpajama |
| Github Code | 73,208,524 + 21,387,703 | 49,279,344 | 52.09% | redpajama-code-refine.yaml, stack-code-refine.yaml, redpajama-stack-code-deduplicate.yaml | Aliyun, ModelScope, HuggingFace | Redpajama, The Stack |
| StackExchange | 45,447,328 | 26,309,203 | 57.89% | redpajama-pile-stackexchange-refine.yaml | Aliyun, ModelScope, HuggingFace | Redpajama, The Pile |
| EuroParl | 69,814 | 61,601 | 88.23% | pile-europarl-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
| FreeLaw | 3,562,015 | 2,942,612 | 82.61% | pile-freelaw-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
| HackerNews | 373,027 | 371,331 | 99.55% | pile-hackernews-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
| NIH ExPorter | 939,661 | 858,492 | 91.36% | pile-nih-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
| PhilPapers | 32,782 | 29,117 | 88.82% | pile-philpaper-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
| PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | pile-pubmed-abstract-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
| PubMed Central | 3,098,930 | 2,694,860 | 86.96% | pile-pubmed-central-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
| USPTO | 5,883,024 | 4,516,283 | 76.77% | pile-uspto-refine.yaml | Aliyun, ModelScope, HuggingFace | The Pile |
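The keep ratio column appears to be the number of samples after refining divided by the number before. The sketch below recomputes it for two rows; it assumes that for Github Code the two "before" counts (one per source) are summed. The helper name `keep_ratio` is illustrative only.

```python
def keep_ratio(before, after):
    """Percentage of samples that remain after refining."""
    return 100.0 * after / before


# arXiv row: a single source.
arxiv = keep_ratio(1_724_497, 1_655_259)

# Github Code row: assumes the two "before" counts are added together.
github_code = keep_ratio(73_208_524 + 21_387_703, 49_279_344)

print(f"arXiv: {arxiv:.2f}%")
print(f"Github Code: {github_code:.2f}%")
```

Under these assumptions both values match the table (95.99% and 52.09%).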

Before and after refining for Alpaca-CoT Dataset

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---|---|---|---|---|---|---|
| Alpaca-CoT EN | 136,219,879 | 72,855,345 | 54.48% | alpaca-cot-en-refine.yaml | Aliyun, ModelScope, HuggingFace | 39 Subsets of Alpaca-CoT |
| Alpaca-CoT ZH | 21,197,246 | 9,873,214 | 46.58% | alpaca-cot-zh-refine.yaml | Aliyun, ModelScope, HuggingFace | 28 Subsets of Alpaca-CoT |

Before and after refining for Multimodal Dataset

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---|---|---|---|---|---|---|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | llava-pretrain-refine.yaml | Aliyun, ModelScope, HuggingFace | LLaVA-1.5 |

Evaluation Results

  • LLaVA pretrain (LCS-558k): a model pretrained on the refined dataset and fine-tuned on the original instruction dataset outperforms the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.

| model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-1.5-13B (baseline) | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B (refined pretrain dataset) | 79.94 | 63.5 | 54.09 | 74.20 | 60.82 | 86.67 | 1565.53 | 68.2 | 63.9 | 61.8 | 75.9 | 37.4 |
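The "10 out of 12" figure can be checked directly against the scores above. This short sketch copies the two rows of scores and counts the benchmarks on which the refined model scores higher; it assumes a higher score is better on every benchmark listed.

```python
benchmarks = ["VQAv2", "GQA", "VizWiz", "SQA", "TextVQA", "POPE", "MME",
              "MM-Bench", "MM-Bench-CN", "SEED", "LLaVA-Bench-Wild", "MM-Vet"]

# Scores copied from the evaluation results, in the same column order.
baseline = [80.0, 63.3, 53.6, 71.6, 61.3, 85.9, 1531.3, 67.7, 63.6, 61.6, 72.5, 36.1]
refined = [79.94, 63.5, 54.09, 74.20, 60.82, 86.67, 1565.53, 68.2, 63.9, 61.8, 75.9, 37.4]

# Assumes higher is better for every benchmark.
wins = [name for name, b, r in zip(benchmarks, baseline, refined) if r > b]
losses = [name for name, b, r in zip(benchmarks, baseline, refined) if r < b]

print(f"refined model wins on {len(wins)} of {len(benchmarks)} benchmarks")
print("baseline is ahead on:", ", ".join(losses))
```

The two benchmarks where the baseline stays ahead are VQAv2 and TextVQA, both by a margin of less than 0.5 points.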

For Video Dataset

To help users make better use of video-related OPs, we provide an example video dataset processing recipe in general-video-refine-example.yaml. It applies three types of OPs:

  • Text-Only: to improve the dataset quality according to the video captions.
  • Video-Only: to improve the dataset quality according to the video features.
  • Text-Video: to improve the dataset quality according to the alignment between text and videos.

Users can use this recipe as a starting point for processing their own video datasets.
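To make the three OP types concrete, here is a conceptual sketch of how they combine into one filtering pass. It is not the Data-Juicer API and does not reflect the contents of the recipe: the `VideoSample` fields, the three check functions, and every threshold are hypothetical, and the text-video alignment score is assumed to be precomputed.

```python
from dataclasses import dataclass


@dataclass
class VideoSample:
    caption: str
    duration_sec: float
    text_video_similarity: float  # hypothetical precomputed alignment score


def text_only_ok(sample, min_words=3):
    """Text-only check: the caption must have a minimum number of words."""
    return len(sample.caption.split()) >= min_words


def video_only_ok(sample, min_sec=2.0, max_sec=60.0):
    """Video-only check: the clip duration must fall within a range."""
    return min_sec <= sample.duration_sec <= max_sec


def text_video_ok(sample, min_sim=0.2):
    """Text-video check: caption and video must be sufficiently aligned."""
    return sample.text_video_similarity >= min_sim


def refine(samples):
    """Keep only the samples that pass all three kinds of checks."""
    checks = (text_only_ok, video_only_ok, text_video_ok)
    return [s for s in samples if all(check(s) for check in checks)]


samples = [
    VideoSample("a dog runs across a grassy field", 12.0, 0.31),      # passes all
    VideoSample("video", 15.0, 0.28),                                 # caption too short
    VideoSample("a chef chops vegetables in a kitchen", 0.5, 0.35),   # clip too short
    VideoSample("a red car drives down a highway", 20.0, 0.05),       # poorly aligned
]

kept = refine(samples)
print(f"kept {len(kept)} of {len(samples)} samples")
```

In an actual recipe each check would correspond to one or more configured OPs, with thresholds chosen from the dataset's own statistics.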