Simplify how rules in the mixer are provided #50

Open
soldni opened this issue Sep 25, 2023 · 4 comments

@soldni
Member

soldni commented Sep 25, 2023

No description provided.

@soldni soldni added the enhancement New feature or request label Sep 25, 2023
@soldni soldni self-assigned this Sep 25, 2023
@soldni soldni changed the title from "Simplify how rules in the mixer are proviced" to "Simplify how rules in the mixer are provided" Sep 25, 2023
@soldni soldni added this to the 0.10.0 milestone Sep 25, 2023
@peterbjorgensen
Contributor

Because of the strange requirement on how to specify logical filter rules for fields that do not exist in all documents, I looked into this behaviour. It turned out to be a bug in jsonpath-rust, which is now fixed. You may want to update to a newer version of jsonpath-rust to pick up the fix.
besok/jsonpath-rust#47

@peterbjorgensen
Contributor

Dolma still depends on a broken version of jsonpath-rust (0.3.0 or older). The bugfix mentioned above is included in the latest releases; I think the oldest version that includes it is 0.3.3. I would recommend bumping the version in Cargo.toml. The latest version is 0.4.0.

@soldni
Member Author

soldni commented Apr 5, 2024

This is nice; I will bump it in the next version, @peterbjorgensen! In the meantime, I recently added support for specifying rules using jq syntax. It is not the default, but it can be enabled by setting syntax: jq, for example:

streams:
  - name: falcon
    documents:
      - s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/*
    attributes:
      - dedupe_para_ngrams_13_1
      - pii_regex_with_counts_fast_v2
      - tokenizer_repetitions_v2r2
    output:
      max_size_in_bytes: 3_814_697_265
      path: s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v1/documents
      min_text_length: 25
      discard_fields:
        - attributes
    filter:
      include:
        # computes the average duplication factor and keeps only docs with at most 30% duplication
        - >-
          (.attributes.dedupe_para_ngrams_13_1 | length == 0) or
          ((.attributes.dedupe_para_ngrams_13_1 | map(.[2] * (.[1] - .[0])) | add) / (.text | length) <= 0.3)
      exclude:
        # Remove documents with more than 10 repeated ngrams
        - >-
          (.tokenizer_repetitions_v2r2__tokenizer_repetitions_v2r2__doc_max_score_repetition != null) and
          (.tokenizer_repetitions_v2r2__tokenizer_repetitions_v2r2__doc_max_score_repetition[0][-1] > 10)

        # PII filter
        - .attributes.pii_regex_with_counts_fast_v2__pii_regex_with_counts_fast_v2__doc_count[0][-1] > 5
      syntax: jq


processes: 188
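
To make the include rule in the config above concrete, here is a rough Python sketch of what that jq expression appears to compute. It assumes each entry of dedupe_para_ngrams_13_1 is a [start, end, score] span, which is how the expression indexes it. The function name keep_document and the sample documents are hypothetical and are not part of Dolma.

```python
def keep_document(doc: dict, max_dup_fraction: float = 0.3) -> bool:
    """Hypothetical mirror of the jq include rule shown in the config.

    Keeps a document when it has no duplicate spans, or when the
    score-weighted length of its duplicate spans is at most
    max_dup_fraction of the text length.
    """
    # jq treats the length of a missing (null) field as 0, so a missing
    # attribute is handled like an empty list of spans.
    spans = doc.get("attributes", {}).get("dedupe_para_ngrams_13_1") or []
    if len(spans) == 0:
        return True

    # Assumed span layout: [start, end, score], matching .[0], .[1], .[2].
    weighted = sum(score * (end - start) for start, end, score in spans)
    return weighted / len(doc["text"]) <= max_dup_fraction


# Made-up documents, each with 100 characters of text.
mostly_unique = {
    "text": "a" * 100,
    "attributes": {"dedupe_para_ngrams_13_1": [[0, 20, 1.0]]},
}
heavily_duplicated = {
    "text": "a" * 100,
    "attributes": {"dedupe_para_ngrams_13_1": [[0, 20, 1.0], [50, 70, 1.0]]},
}
no_spans = {"text": "a" * 100, "attributes": {}}

print(keep_document(mostly_unique))       # 20 / 100 = 0.2, kept
print(keep_document(heavily_duplicated))  # 40 / 100 = 0.4, dropped
print(keep_document(no_spans))            # no spans, kept
```

Like the jq expression, this sketch divides by the text length without guarding against empty text, so a document with spans but an empty text field would raise an error rather than be filtered.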

@peterbjorgensen
Contributor

Cool. Isn't there a .attributes missing in the example for the exclude filters? I.e., it should be .attributes.tokenizer_repetitions_v2r2__tokenizer_repetitions_v2r2__doc_max_score_repetition

Is it also possible to filter on document metadata, such as .metadata.sub-source == "mygoodsource"?
