Releases: allenai/dolma
v1.0.13
What's Changed
- Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows in the github_actions group across 1 directory by @dependabot in #198
- Fix bug in length filtering for deduping by @soldni in #197
- Polymorphic span replacement by @undfined in #200
- Dependabot fail, match upload/download action versions by @undfined in #202
- [JSON format error in line 133] Update getting-started.md by @yushengsu-thu in #196
- Revert upload/download to v3 for now by @undfined in #203
- Undfined/runner v3 by @undfined in #204
Full Changelog: v1.0.12...v1.0.13
v1.0.12
What's Changed
- Added tokenizers for length by @soldni in #189
- Update getting-started.md by @yushengsu-thu in #193
- Bump nltk from 3.8.1 to 3.9 in the pip group across 1 directory by @dependabot in #187
- Use NumPy v1.x instead of 2.x by @soldni in #195
New Contributors
- @yushengsu-thu made their first contribution in #193
Full Changelog: v1.0.11...v1.0.12
v1.0.11
v1.0.10
v1.0.9
v1.0.8
v1.0.7
v1.0.6
v1.0.5
v1.0.4
What's Changed
- Bump rustls from 0.21.10 to 0.21.11 in the cargo group across 1 directory by @dependabot in #149
- fix divide by 0 in gopher tagger by @peterbjorgensen in #148
- Fixing dtype option not being correctly propagated by @soldni in #154
- Add support for parsing WARC by @soldni in #153
- Reducing hash calls by @Whattabatt in #156
- Bump rustls from 0.21.11 to 0.21.12 in the cargo group across 1 directory by @dependabot in #155
- Adding Quality Classifier from Dolma 1.7 by @soldni in #163
- Adds ZST support in Deduper and Mixer by @soldni in #170
- Workaround to fix memory leak in HuggingFace tokenizer by @soldni in #169
- Adding partition logic by @Whattabatt in #161
- added option for tokenizer to split on special tokens by @soldni in #176
- Version bump for new release (1.0.4) by @soldni in #179
New Contributors
- @Whattabatt made their first contribution in #156
Full Changelog: v1.0.3...v1.0.4