Simplify how rules in the mixer are provided #50
Comments
Because of the strange requirement on how to specify logical filter rules for fields that do not exist in all documents, I looked into this behaviour, and it turned out to be a bug in jsonpath-rust, which is now fixed. You may want to update to a newer version of jsonpath-rust to pick up the fix.
Dolma still depends on a broken version of jsonpath-rust (0.3.0 or older). The bugfix mentioned above is included in the latest releases. I think the oldest version that includes the fix is 0.3.3. I would recommend bumping the version in Cargo.toml. The latest version is 0.4.0.
This is nice; I will bump in the next version @peterbjorgensen! In the meantime, I recently added support for specifying rules using jq syntax (not the default, but it can be enabled by specifying syntax: jq in the filter):

    streams:
      - name: falcon
        documents:
          - s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/*
        attributes:
          - dedupe_para_ngrams_13_1
          - pii_regex_with_counts_fast_v2
          - tokenizer_repetitions_v2r2
        output:
          max_size_in_bytes: 3_814_697_265
          path: s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v1/documents
          min_text_length: 25
          discard_fields:
            - attributes
        filter:
          include:
            # Compute the average duplication factor and keep only docs with at most 30% duplication
            - >-
              (.attributes.dedupe_para_ngrams_13_1 | length == 0) or
              ((.attributes.dedupe_para_ngrams_13_1 | map(.[2] * (.[1] - .[0])) | add) / (.text | length) <= 0.3)
          exclude:
            # Remove documents with more than 10 repeated ngrams
            - >-
              (.tokenizer_repetitions_v2r2__tokenizer_repetitions_v2r2__doc_max_score_repetition != null) and
              (.tokenizer_repetitions_v2r2__tokenizer_repetitions_v2r2__doc_max_score_repetition[0][-1] > 10)
            # PII filter
            - .attributes.pii_regex_with_counts_fast_v2__pii_regex_with_counts_fast_v2__doc_count[0][-1] > 5
          syntax: jq
    processes: 188
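For readers less familiar with jq, here is a rough Python sketch of what the include rule above computes. It assumes each entry in the attribute is a [start, end, score] triple, which is inferred only from the `.[2] * (.[1] - .[0])` term in the expression. The function names are hypothetical and not part of Dolma.

```python
from typing import Any, Dict

# Attribute name taken from the config above.
ATTR = "dedupe_para_ngrams_13_1"


def duplication_fraction(doc: Dict[str, Any], attr: str = ATTR) -> float:
    """Score-weighted duplicated length divided by the text length.

    Assumes each span is [start, end, score] (inferred from the jq rule).
    """
    spans = doc.get("attributes", {}).get(attr) or []
    if not spans:
        # Mirrors the `length == 0` branch of the jq rule: nothing flagged.
        return 0.0
    weighted = sum(score * (end - start) for start, end, score in spans)
    return weighted / len(doc["text"])


def keep(doc: Dict[str, Any], threshold: float = 0.3) -> bool:
    """Keep the document when its duplication fraction is at most the threshold."""
    return duplication_fraction(doc) <= threshold


if __name__ == "__main__":
    example = {
        "text": "x" * 100,
        "attributes": {ATTR: [[0, 20, 1.0], [50, 60, 0.5]]},
    }
    print(duplication_fraction(example), keep(example))
```

Like the jq expression, the sketch does not guard against an empty text that still has spans; that case would raise a division error here.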
cool, isn't there a Is it also possible to filter on document metadata, such as