What is the purpose of cutoff.csv in the cc_net pipeline? #106

Open
kemalbastak opened this issue Feb 26, 2024 · 2 comments

Comments

@kemalbastak

I am trying to add Turkish (tr) to cutoff.csv on the rp_v1 branch.
There is little information on how the language score is calculated.
How do we add a custom language score to this CSV?

@mauriceweber
Collaborator

Hi @kemalbastak

As far as I know, the CCNet pipeline does not support Turkish out of the box, but you can probably modify the pipeline to get it to support tr. We never went through that process, but to get there, I think you have to do the following steps:

  1. train your own reference Wikipedia model (check out the Makefile and the ccnet README for that)
  2. collect the percentile statistics of the model's score distribution on the Common Crawl corpus
  3. add those statistics to cutoff.csv
  4. run ccnet with Turkish
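
Steps 2 and 3 can be sketched roughly as follows. This is only an illustration under assumptions, not the pipeline's own code: it assumes cutoff.csv has one row per percentile and one column per language, and the `percentile` header name, the helper functions, and all numbers are hypothetical. Check the layout of the real file before adapting it.

```python
import csv
import io
import statistics


def percentile_column(perplexities):
    """Return {percentile: value} for percentiles 1..99, linearly interpolated."""
    cuts = statistics.quantiles(perplexities, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in range(1, 100)}


def add_language_column(csv_text, lang, column):
    """Append `lang` as a new column to a percentile-by-language CSV string."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header + [lang])
    for row in body:
        # Assumed layout: the first cell of each row is the percentile.
        writer.writerow(row + [f"{column[int(row[0])]:.1f}"])
    return out.getvalue()


# Toy data only: every number below is made up for illustration.
existing = "percentile,en\n30,300.0\n60,600.0\n"
tr_perplexities = [float(x) for x in range(100, 1101, 10)]  # 100, 110, ..., 1100
tr_column = percentile_column(tr_perplexities)
updated = add_language_column(existing, "tr", tr_column)
print(updated)
```

In practice the perplexities would come from scoring Common Crawl documents with the newly trained Turkish model, and the CSV text would be read from and written back to the real cutoff.csv.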

I'd also recommend contacting the ccnet maintainers if you run into issues with this.

I hope this helps!

@kemalbastak
Author

I calculated the percentiles for the 2023-50 CC dump, using the 'perplexity' score on that data.

percentiles = {f"%{i}": np.percentile(all_pp_values, i) for i in range(1, 101)}

The resulting values are close to those of the existing languages in cutoff.csv.
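
For context, here is a minimal sketch of how such percentiles might then be used to split documents into quality buckets. It assumes, as I understand the pipeline, that two percentile rows act as the head and tail thresholds and that lower perplexity means better quality. The `bucket` helper, the choice of the 30th and 60th percentiles, and the numbers are assumptions for illustration.

```python
def bucket(perplexity, percentiles, head="%30", tail="%60"):
    """Assign a document to head/middle/tail from two percentile thresholds."""
    if perplexity < percentiles[head]:
        return "head"    # lowest perplexity, closest to the reference model
    if perplexity < percentiles[tail]:
        return "middle"
    return "tail"        # highest perplexity


# Made-up thresholds, keyed like the `percentiles` dict above.
example_percentiles = {"%30": 400.0, "%60": 700.0}

for pp in (250.0, 550.0, 900.0):
    print(pp, bucket(pp, example_percentiles))
```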

Thanks for the answer
