
How can I use this model on a Chinese dataset? #8

Open

f617452296 opened this issue Feb 17, 2020 · 13 comments

Comments

@f617452296

Can this model be helpful on a Chinese dataset?

@ekQ
Collaborator

ekQ commented Feb 17, 2020

We haven't looked into this, but you could try initializing the model with BERT-Base, Chinese.

@qiuhuiGithub

qiuhuiGithub commented Mar 18, 2020

I tested the model on a Chinese GEC task and it works fine.

@f617452296
Author

f617452296 commented Mar 18, 2020 via email

May I have your email address to ask some questions?

@qiuhuiGithub

You can use any Chinese BERT model by simply replacing the BERT path, and it works fine.
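For concreteness, here is a minimal sketch of that swap (the paths, and the assumption that vocab_size sits at the top level of the config, are mine, not from this thread):

```python
# Sketch, not from this thread: point LaserTagger at BERT-Base, Chinese and
# keep vocab_size in configs/lasertagger_config.json in sync with its vocab.
# BERT_DIR is a placeholder for wherever the checkpoint was unpacked.
import json

BERT_DIR = "chinese_L-12_H-768_A-12"          # BERT-Base, Chinese checkpoint dir
CONFIG_PATH = "configs/lasertagger_config.json"

# vocab.txt lists one token per line; counting lines gives the vocabulary
# size (21128 for BERT-Base, Chinese).
with open(f"{BERT_DIR}/vocab.txt", encoding="utf-8") as f:
    vocab_size = sum(1 for _ in f)

with open(CONFIG_PATH, encoding="utf-8") as f:
    config = json.load(f)

config["vocab_size"] = vocab_size             # replaces the English vocab size

with open(CONFIG_PATH, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
```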

@ekQ
Collaborator

ekQ commented Mar 18, 2020

I tested the model on a Chinese GEC task and it works fine.

Good to know this!

@varepsilon

I tested the model on a Chinese GEC task and it works fine.

Is it a public dataset? If so, could you share a link?

@qiuhuiGithub

Is it a public dataset? If so, could you share a link?

http://tcci.ccf.org.cn/conference/2018/taskdata.php The second task is the GEC task.

@f617452296
Author

http://tcci.ccf.org.cn/conference/2018/taskdata.php The second task is the GEC task.

Could you please tell me how to run this model on the GEC task, e.g., step 1 (Phrase Vocabulary Optimization) and step 2 (Converting Target Texts to Tags)?

@qiuhuiGithub

qiuhuiGithub commented Mar 20, 2020

Could you please tell me how to run this model on the GEC task, e.g., step 1 (Phrase Vocabulary Optimization) and step 2 (Converting Target Texts to Tags)?

First, I suggest you read run_wikisplit_experiment.sh in the project. You can run LaserTagger simply by adapting that script. Here is an example; a conversion sketch follows this list.

  • Convert your data into the WikiSplit format, e.g., "I like you \t I love you".
  • Change all the paths in the script to yours.
  • Change vocab_size in configs/lasertagger_config.json, because the vocabulary size is different for Chinese BERT.
  • Run the script step by step.

Best wishes.
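As a sketch of the first bullet (the file names below are hypothetical placeholders, not from this thread), writing parallel source/target sentences into the tab-separated WikiSplit-style format could look like this:

```python
# Sketch: join a source file and a target file, one sentence per line each,
# into the "source \t target" TSV layout used by the WikiSplit experiment.
def write_wikisplit_tsv(src_path, tgt_path, out_path):
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt, \
         open(out_path, "w", encoding="utf-8") as out:
        for source, target in zip(src, tgt):
            source, target = source.strip(), target.strip()
            if source and target:             # skip empty/misaligned pairs
                out.write(f"{source}\t{target}\n")

# Hypothetical file names for the NLPCC 2018 GEC data:
write_wikisplit_tsv("gec_train.src", "gec_train.tgt", "train.tsv")
```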

@f617452296
Author


It helps a lot! Thank you!

@f617452296
Author

By the way, I'd like to know whether the training data from http://tcci.ccf.org.cn/conference/2018/taskdata.php needs to be segmented into words, or whether I can feed a whole sentence into the model. Thanks!

@qiuhuiGithub

By the way, I'd like to know whether the training data needs to be segmented into words, or whether I can feed a whole sentence into the model. Thanks!

The input to Chinese BERT is separate tokens, so you should cut the sentence into separate tokens first, as in the sketch below.
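A minimal sketch of one such split (illustrative, not from this thread; Chinese BERT's vocabulary is largely character-level, so per-character splitting is a reasonable default):

```python
# Sketch: insert spaces between CJK characters so each Chinese character
# becomes its own whitespace-separated token, while keeping ASCII runs intact.
def segment_chinese(text: str) -> str:
    tokens, buffer = [], []
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":        # basic CJK Unified Ideographs block
            if buffer:
                tokens.append("".join(buffer))
                buffer = []
            tokens.append(ch)
        elif ch.isspace():
            if buffer:
                tokens.append("".join(buffer))
                buffer = []
        else:
            buffer.append(ch)                 # accumulate non-CJK runs (e.g. ASCII)
    if buffer:
        tokens.append("".join(buffer))
    return " ".join(tokens)

print(segment_chinese("我喜欢你 BERT 模型"))  # -> "我 喜 欢 你 BERT 模 型"
```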

@Ivy-C-85

The input to Chinese BERT is separate tokens, so you should cut the sentence into separate tokens.

Hi, I also tested the GEC task, but my model didn't work well. It didn't actually correct anything; it just deleted every difference, and even some identical parts, between the source and target texts. I used jieba to cut my sentences and thought everything was set up fine, but the results were pretty bad. Could you please tell me whether you had the same problem, and which score you used?
