What's Changed
- Added c4 badwords filter, added batch tokenization to tokenscounter by @guipenedo in #160
- Add a skip parameter to all readers (defaults to zero) by @rantav in #167
- Adds n-gram based decontamination by @guipenedo in #172
- Fix: Handle Non-dict Objects in to_dict Without Errors by @justHungryMan in #139
- Adds tasks_per_job to slurm executor by @guipenedo in #153
- Unsigned int tokenizer and srun args by @marianna13 in #154
- Enhance BaseReader to allow custom adapters access to instance variables by @justHungryMan in #169
- remove ListFilter from the process_common_crawl_dump example by @QasidSaleem in #181
- Hf dataset update by @hynky1999 in #170
- Optimize URLFilter and add option to disable integrated wordlists by @its5Q in #174
- Add progress for files by @hynky1999 in #176
- Make colorization configurable for both files and console output by @guipenedo in #185
- Migrate dedup to xxhash by @guipenedo in #179
- [WIP] Multi-Lingual Tokenization by @beme248 in #147
- Add more word tokenizers by @vsabolcec in #187
- Speed up CI with uv by @guipenedo in #188
- Url Index + missing hash_config struct inference by @hynky1999 in #191
- Migrate pipeline blocks to new word tokenizers by @guipenedo in #189
- Fix snapshot representation and numeric conversion in example Code (fineweb) by @justHungryMan in #192
- Extend randomize_start feature to local executor by @justHungryMan in #193
- Add description for randomize_start by @justHungryMan in #194
- Allow an integer parameter for 'randomize_start' in executor/base.py by @justHungryMan in #199
- Issues w/ DatatroveFolderDataset by @TJ-Solergibert in #203
- code consistency about randomize_start_duration by @justHungryMan in #207
- feat(ci): add trufflehog secrets detection by @McPatate in #211
- fix(ci): remove unnecessary permissions by @McPatate in #212
- Add label_only option to LanguageFilter by @justHungryMan in #210
- Fixes text normalization by @hynky1999 in #218
- Summary stats by @hynky1999 in #158
- Speedup json writer by @its5Q in #175
- add alternative fasttext lid models by @guipenedo in #226
- Adds paths_file to readers by @guipenedo in #228
- Add an example for filtering an HF dataset and push to hub by @loubnabnl in #201
- checks if min_num_sentences is disabled or not before computing the n… by @QasidSaleem in #232
- DocumentTokenizerContextShuffler fixes by @sippycoder in #229
- add dependencies lid.py, io.py #239 by @aiqwe in #241
- Add withdirs to extra_options only when not using glob_pattern by @olga1988olga in #244
- Add token and char count to histogram stats by @guipenedo in #251
- fix correct type inference for cached filesystems by @hynky1999 in #257
- Simple enhancement for readability by @aiqwe in #253
- Fix test_basic_article_trafilatura test failure by @tylerjthomas9 in #264
- Update MinhashConfig with detailed settings and add default language … by @justHungryMan in #252
- Update README.md by @shizhediao in #276
- Implement zstd Compression Support for JSONL and Parquet Files by @justHungryMan in #230
- Update filter_hf_dataset.py by @shizhediao in #274
- Add expand_metadata Option to JsonlWriter by @justHungryMan in #268
- Add shuffle option on huggingface reader by @justHungryMan in #224
New Contributors
- @rantav made their first contribution in #167
- @QasidSaleem made their first contribution in #181
- @its5Q made their first contribution in #174
- @beme248 made their first contribution in #147
- @vsabolcec made their first contribution in #187
- @TJ-Solergibert made their first contribution in #203
- @McPatate made their first contribution in #211
- @loubnabnl made their first contribution in #201
- @sippycoder made their first contribution in #229
- @aiqwe made their first contribution in #241
- @olga1988olga made their first contribution in #244
- @tylerjthomas9 made their first contribution in #264
- @shizhediao made their first contribution in #276
Full Changelog: v0.2.0...v0.3.0