# Dataset curation

A new method for filtering LLM training datasets uses synthetic data to develop classifiers that identify educational content. This approach was applied in training Llama 3 and Phi-3, though its large-scale effect on web data filtering remains underexplored.

The popular Phi-3 models, trained on 3.3 and 4.8 trillion tokens, used "heavily filtered public web data (by educational level) and synthetic LLM-generated data". Similarly, the Llama 3 team leveraged Llama 2 to build text-quality classifiers for Llama 3. However, these classifiers and filtered datasets are not publicly available.

To improve the quality of Chinese corpora, we followed [FineWeb-edu's](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) approach and developed an educational-quality classifier, using annotations from Qwen2-72B-Instruct, to build the [CCI3-HQ](https://huggingface.co/datasets/BAAI/CCI3-HQ) dataset.

## Annotation

We used Qwen2-72B-Instruct to annotate 145,000 web samples, scoring each from 0 to 5 according to its educational quality, with 0 being not educational and 5 being highly educational.

The prompt used for annotation largely reuses the [FineWeb-edu prompt](./prompt.txt).
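
As a rough illustration, here is a minimal sketch of such an annotation loop, assuming Qwen2-72B-Instruct is served behind an OpenAI-compatible endpoint (e.g. with vLLM); the endpoint URL, the `{text}` placeholder in the prompt file, and the score parsing are assumptions rather than the exact pipeline:

```python
import re
from openai import OpenAI

# Assumed: a local OpenAI-compatible server (e.g. vLLM) hosting the model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Assumed: prompt.txt is a FineWeb-edu-style scoring prompt with a
# {text} placeholder for the web sample.
with open("prompt.txt") as f:
    PROMPT_TEMPLATE = f.read()

def annotate(sample: str) -> int:
    """Ask the annotator LLM for an educational-quality score from 0 to 5."""
    response = client.chat.completions.create(
        model="Qwen2-72B-Instruct",
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(text=sample)}],
        temperature=0.0,
    )
    # Assumed reply format: the answer ends with the digit score,
    # e.g. "Educational score: 4".
    match = re.search(r"([0-5])\s*$", response.choices[0].message.content.strip())
    return int(match.group(1)) if match else 0
```
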
## Classifier training

We added a classification head with a single regression output to [BGE-M3](https://huggingface.co/BAAI/bge-m3) and trained the model for 20 epochs with a learning rate of 3e-4. During training, the embedding and encoder layers were frozen to focus training on the classification head, and dropout was not used. The model achieved an F1 score of 73% when converted to a binary classifier using a score threshold of 3.
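
A condensed sketch of this recipe with Hugging Face `transformers`; the backbone, epochs, learning rate, frozen layers, and disabled dropout follow the description above, while the toy dataset and column names are placeholders:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# BGE-M3 backbone with a single regression output: problem_type selects an
# MSE loss against the 0-5 scores, classifier_dropout=0.0 disables dropout.
model = AutoModelForSequenceClassification.from_pretrained(
    "BAAI/bge-m3", num_labels=1, problem_type="regression",
    classifier_dropout=0.0)
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

# Freeze the embedding and encoder layers so only the head is trained.
for param in model.base_model.parameters():
    param.requires_grad = False

# Placeholder data; the real run uses the 145k annotated (text, score) pairs.
raw = Dataset.from_dict({"text": ["An introduction to photosynthesis."],
                         "labels": [4.0]})
train_ds = raw.map(lambda b: tokenizer(b["text"], truncation=True), batched=True)

args = TrainingArguments(output_dir="cci3-hq-classifier",
                         num_train_epochs=20, learning_rate=3e-4)
trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  tokenizer=tokenizer)  # the default collator pads each batch
trainer.train()
```

Freezing the backbone keeps the pretrained embedding space intact and makes the run cheap, since only the small classification head receives gradients.
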
The classifier is available at: https://huggingface.co/BAAI/cci3-hq-classifier
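
Loading it should follow the standard `transformers` sequence-classification pattern (mirroring the FineWeb-edu classifier); the snippet below is a hedged sketch, so check the model card for the authoritative usage:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/cci3-hq-classifier")
model = AutoModelForSequenceClassification.from_pretrained("BAAI/cci3-hq-classifier")

text = "这是一篇关于光合作用原理的教学文章。"  # sample document to score
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze(-1).item()

print(f"educational score: {score:.2f}")
keep = score >= 3  # the binary-filter threshold used for the 73% F1 above
```
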
## Evaluation and results

### Setup

Because the datasets mix Chinese and English, we chose the Qwen2-0.5B model for dataset evaluation, training on 100B tokens in each experiment.

We follow the same evaluation setup for all models, using the [FineWeb setup](https://github.com/huggingface/cosmopedia/tree/main/evaluation) with the [lighteval](https://github.com/huggingface/lighteval) library.
You can check out the [evaluation script](./lighteval_tasks_v2.py).

### Results

We conducted two types of experiments:
1. Mixed Dataset Experiment: the ratio of English, code, and Chinese data is 60% : 10% : 30%.
2. Chinese Dataset Experiment: the Chinese ratio is 100%.

For English data, we uniformly used [FineWeb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/tree/main/sample/100BT). For code data, we used [StarCoder](https://huggingface.co/bigcode/starcoder).
For Chinese data, we selected [wanjuan-v1](https://github.com/opendatalab/WanJuan1.0), [skypile](https://huggingface.co/datasets/Skywork/SkyPile-150B), and [cci3.0](https://huggingface.co/datasets/BAAI/CCI3-Data).
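
To make the mixture concrete, here is a hedged sketch with the `datasets` library; the Hub repository names, configs, and column names are assumptions (wanjuan-v1 is distributed outside the Hub and omitted here), and sampling by example count only approximates a token-level 60/10/30 ratio:

```python
from datasets import load_dataset, interleave_datasets

# Streaming avoids downloading the full corpora; configs are illustrative.
english = load_dataset("HuggingFaceFW/fineweb-edu", "sample-100BT",
                       split="train", streaming=True)
code = load_dataset("bigcode/starcoderdata", data_dir="python",
                    split="train", streaming=True)
chinese = load_dataset("BAAI/CCI3-HQ", split="train", streaming=True)

# Align schemas on a single "text" column before mixing (column names
# are assumptions; check each dataset's actual schema).
code = code.rename_column("content", "text")
sources = [d.select_columns(["text"]) for d in (english, code, chinese)]

# Sample example-by-example at roughly 60% English : 10% code : 30% Chinese.
mixed = interleave_datasets(sources, probabilities=[0.6, 0.1, 0.3], seed=42)
```
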
In the plots below, the cci_edu dataset is from the [CCI 3.0 HQ Dataset](https://data.baai.ac.cn/details/BAAI-CCI3-HQ).

For the Mixed Dataset Experiment, all evaluation metrics are averaged.
![Mixed Dataset Experiment](./datasets_mix_metrics.png)

For the Chinese Dataset Experiment, only the Chinese evaluation metrics are averaged.
![Chinese Dataset Experiment](./chinese_dataset_metrics.png)