# Refined open source datasets by Data-Juicer

We found that there are still some "bad" samples in existing processed datasets (e.g., RedPajama, The Pile), so we used Data-Juicer to refine them, aiming to feed the refined data to LLMs for better performance.
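For instance, any of the refined subsets below can in principle be reproduced by running its recipe through Data-Juicer's processing entry point. A minimal sketch, assuming the `dj-process` CLI from the Data-Juicer package is installed and the recipe YAML has been downloaded locally (the path shown is illustrative):

```python
# Minimal sketch: apply one of the refining recipes listed below with
# Data-Juicer's `dj-process` CLI (the recipe path here is illustrative).
import subprocess

subprocess.run(
    ["dj-process", "--config", "redpajama-arxiv-refine.yaml"],
    check=True,  # raise CalledProcessError if the run fails
)
```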

We use a simple 3-σ rule to set the hyperparameters for the ops in each recipe: a filter's thresholds are set to the mean of the corresponding stat plus/minus three standard deviations, so samples falling outside that range are treated as outliers and removed.
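A minimal sketch of how such thresholds can be derived, assuming the per-sample stats for one field (e.g. an alphanumeric ratio) have already been computed by Data-Juicer's Analyzer and exported as a NumPy array; the data here is synthetic:

```python
import numpy as np

def three_sigma_bounds(stat_values: np.ndarray) -> tuple[float, float]:
    """Filter thresholds from the 3-sigma rule: values outside
    [mean - 3*std, mean + 3*std] are treated as outliers."""
    mean, std = stat_values.mean(), stat_values.std()
    return mean - 3 * std, mean + 3 * std

# Synthetic stand-in for an analyzed stat (e.g. an alnum ratio):
# 500 ordinary samples plus two low-quality outliers.
rng = np.random.default_rng(0)
stats = np.concatenate([rng.normal(0.80, 0.02, 500), [0.11, 0.05]])

lo, hi = three_sigma_bounds(stats)
kept = stats[(stats >= lo) & (stats <= hi)]
print(f"thresholds: [{lo:.3f}, {hi:.3f}], kept {kept.size}/{stats.size} samples")
```

The resulting bounds then serve as the min/max hyperparameters of the corresponding filter op in a recipe.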

## Before and after refining for Pretraining Text Dataset

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
| --- | ---: | ---: | ---: | --- | --- | --- |
| arXiv | 1,724,497 | 1,655,259 | 95.99% | redpajama-arxiv-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | RedPajama |
| Books | 205,182 | 195,983 | 95.51% | redpajama-book-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | RedPajama |
| Wikipedia | 29,834,171 | 26,990,659 | 90.47% | redpajama-wiki-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | RedPajama |
| C4 | 364,868,892 | 344,491,171 | 94.42% | redpajama-c4-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | RedPajama |
| Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | redpajama-cc-2019-30-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | RedPajama |
| Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | redpajama-cc-2020-05-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | RedPajama |
| Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | redpajama-cc-2021-04-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | RedPajama |
| Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | redpajama-cc-2022-05-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | RedPajama |
| Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | redpajama-cc-2023-06-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | RedPajama |
| GitHub Code | 73,208,524<br>+ 21,387,703 | 49,279,344 | 52.09% | redpajama-code-refine.yaml<br>stack-code-refine.yaml<br>redpajama-stack-code-deduplicate.yaml | Aliyun<br>ModelScope<br>HuggingFace | RedPajama<br>The Stack |
| StackExchange | 45,447,328 | 26,309,203 | 57.89% | redpajama-pile-stackexchange-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | RedPajama<br>The Pile |
| EuroParl | 69,814 | 61,601 | 88.23% | pile-europarl-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| FreeLaw | 3,562,015 | 2,942,612 | 82.61% | pile-freelaw-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| HackerNews | 373,027 | 371,331 | 99.55% | pile-hackernews-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| NIH ExPorter | 939,661 | 858,492 | 91.36% | pile-nih-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| PhilPapers | 32,782 | 29,117 | 88.82% | pile-philpaper-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | pile-pubmed-abstract-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| PubMed Central | 3,098,930 | 2,694,860 | 86.96% | pile-pubmed-central-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |
| USPTO | 5,883,024 | 4,516,283 | 76.77% | pile-uspto-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | The Pile |

## Before and after refining for Alpaca-CoT Dataset

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
| --- | ---: | ---: | ---: | --- | --- | --- |
| Alpaca-CoT EN | 136,219,879 | 72,855,345 | 54.48% | alpaca-cot-en-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | 39 Subsets of Alpaca-CoT |
| Alpaca-CoT ZH | 21,197,246 | 9,873,214 | 46.58% | alpaca-cot-zh-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | 28 Subsets of Alpaca-CoT |

## Before and after refining for Multimodal Dataset

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
| --- | ---: | ---: | ---: | --- | --- | --- |
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | llava-pretrain-refine.yaml | Aliyun<br>ModelScope<br>HuggingFace | LLaVA-1.5 |
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | data-juicer-sandbox-optimal.yaml | Aliyun<br>ModelScope<br>HuggingFace | InternVid (606k)<br>Panda-70M (605k)<br>MSR-VTT (6k) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | data-juicer-sandbox-self-evolution.yaml | Aliyun<br>ModelScope | InternVid (606k)<br>Panda-70M (2,599k)<br>Pexels (198k)<br>MSR-VTT (6k) |

## Evaluation Results

- **LLaVA pretrain (LCS-558k)**: models pretrained with the refined dataset and fine-tuned with the original instruct dataset outperform the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.
| model | VQAv2 | GQA | VizWiz | SQA | TextVQA | POPE | MME | MM-Bench | MM-Bench-CN | SEED | LLaVA-Bench-Wild | MM-Vet |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| LLaVA-1.5-13B (baseline) | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B (refined pretrain dataset) | 79.94 | 63.5 | 54.09 | 74.20 | 60.82 | 86.67 | 1565.53 | 68.2 | 63.9 | 61.8 | 75.9 | 37.4 |
- **Data-Juicer (T2V, 147k)** and **Data-Juicer (DJ, 228k)**: models trained with the refined datasets outperform the baseline (T2V-Turbo) on VBench. T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is in turn the teacher model of Data-Juicer (DJ, 228k). Please refer to Sandbox for more details.
| model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | 51.67 | 68.92 |
| Data-Juicer (DJ, 228k) | 82.53 | 83.38 | 79.13 | 97.92 | 99.27 | 98.14 | 97.77 | 38.89 | 67.39 |

| model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| T2V-Turbo | 72.49 | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | 95.60 | 94.06 | 46.95 | 57.57 | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | 96.44 | 64.51 | 95.40 | 95.51 | 47.17 | 57.30 | 25.55 | 26.82 | 29.25 |

## For Video Dataset

We provide an example recipe for video dataset processing in general-video-refine-example.yaml to help users make better use of the video-related OPs. It applies three types of OPs (illustrated by the sketch after this list):

- **Text-Only**: improve dataset quality based on the video captions.
- **Video-Only**: improve dataset quality based on the video features.
- **Text-Video**: improve dataset quality based on the alignment between text and videos.

Users can start processing their own video datasets based on this recipe.
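A minimal, self-contained sketch of these three OP types, assuming a sample is a caption string plus per-video features such as duration and a precomputed text-video similarity score; the field names and thresholds here are illustrative, not the ones used in general-video-refine-example.yaml:

```python
from dataclasses import dataclass

@dataclass
class VideoSample:
    caption: str            # text side of the pair
    duration_s: float       # a video-only feature
    clip_similarity: float  # precomputed text-video alignment score (illustrative)

def text_only_filter(s: VideoSample) -> bool:
    # Text-Only: judge quality from the caption alone.
    return 2 <= len(s.caption.split()) <= 100

def video_only_filter(s: VideoSample) -> bool:
    # Video-Only: judge quality from video features alone.
    return 1.0 <= s.duration_s <= 60.0

def text_video_filter(s: VideoSample) -> bool:
    # Text-Video: judge quality from the text-video alignment.
    return s.clip_similarity >= 0.25

samples = [
    VideoSample("a dog catches a frisbee in the park", 8.2, 0.31),
    VideoSample("video", 0.4, 0.05),  # short caption, tiny clip, poor alignment
]
kept = [s for s in samples
        if text_only_filter(s) and video_only_filter(s) and text_video_filter(s)]
print(f"kept {len(kept)}/{len(samples)} samples")
```

In the actual recipe, each of these checks corresponds to one or more Data-Juicer filter ops whose thresholds are set in the YAML.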