Release v0.3.0 · huggingface/datatrove

What's Changed

Added c4 badwords filter, added batch tokenization to tokenscounter by @guipenedo in #160
Add a skip parameter to all readers (defaults to zero) by @rantav in #167
Adds n-gram based decontamination by @guipenedo in #172
Fix: Handle Non-dict Objects in to_dict Without Errors by @justHungryMan in #139
Adds tasks_per_job to slurm executor by @guipenedo in #153
Unsigned int tokenizer and srun args by @marianna13 in #154
Enhance BaseReader to allow custom adapters access to instance variables by @justHungryMan in #169
remove ListFilter from the process_common_crawl_dump example by @QasidSaleem in #181
Hf dataset update by @hynky1999 in #170
Optimize URLFilter and add option to disable integrated wordlists by @its5Q in #174
Add progress for files by @hynky1999 in #176
Make colorization configurable for both files and console output by @guipenedo in #185
Migrate dedup to xxhash by @guipenedo in #179
[WIP] Multi-Lingual Tokenization by @beme248 in #147
Add more word tokenizers by @vsabolcec in #187
Speed up CI with uv by @guipenedo in #188
Url Index + missing hash_config struct inference by @hynky1999 in #191
Migrate pipeline blocks to new word tokenizers by @guipenedo in #189
Fix snapshot representation and numeric conversion in example Code (fineweb) by @justHungryMan in #192
Extend randomize_start feature to local executor by @justHungryMan in #193
Add description for randomize_start by @justHungryMan in #194
Allow an integer parameter for 'randomize_start' in executor/base.py by @justHungryMan in #199
Issues w/ DatatroveFolderDataset by @TJ-Solergibert in #203
code consistency about randomize_start_duration by @justHungryMan in #207
feat(ci): add trufflehog secrets detection by @McPatate in #211
fix(ci): remove unnecessary permissions by @McPatate in #212
Add label_only option to LanguageFilter by @justHungryMan in #210
Fixes text normalization by @hynky1999 in #218
Summary stats by @hynky1999 in #158
Speedup json writer by @its5Q in #175
add alternative fasttext lid models by @guipenedo in #226
Adds paths_file to readers by @guipenedo in #228
Add an example for filtering an HF dataset and push to hub by @loubnabnl in #201
checks if min_num_sentences is disabled or not before computing the n… by @QasidSaleem in #232
DocumentTokenizerContextShuffler fixes by @sippycoder in #229
add dependencies lid.py, io.py #239 by @aiqwe in #241
Add withdirs to extra_options only when not using glob_pattern by @olga1988olga in #244
Add token and char count to histogram stats by @guipenedo in #251
fix correct type inference for cached filesystems by @hynky1999 in #257
Simple enhancement for readability by @aiqwe in #253
Fix test_basic_article_trafilatura test failure by @tylerjthomas9 in #264
Update MinhashConfig with detailed settings and add default language … by @justHungryMan in #252
Update README.md by @shizhediao in #276
Implement zstd Compression Support for JSONL and Parquet Files by @justHungryMan in #230
Update filter_hf_dataset.py by @shizhediao in #274
Add expand_metadata Option to JsonlWriter by @justHungryMan in #268
Add shuffle option on huggingface reader by @justHungryMan in #224

New Contributors

@rantav made their first contribution in #167
@QasidSaleem made their first contribution in #181
@its5Q made their first contribution in #174
@beme248 made their first contribution in #147
@vsabolcec made their first contribution in #187
@TJ-Solergibert made their first contribution in #203
@McPatate made their first contribution in #211
@loubnabnl made their first contribution in #201
@sippycoder made their first contribution in #229
@aiqwe made their first contribution in #241
@olga1988olga made their first contribution in #244
@tylerjthomas9 made their first contribution in #264
@shizhediao made their first contribution in #276

Full Changelog: v0.2.0...v0.3.0
