Corpus Bleu evaluation when number of references and hypothesis are large and code runs out of memory #46
So this depends on whether you want to report sentence BLEU or corpus BLEU. Averaging the sentence BLEU computed by compute_individual_metrics would work for the former. For corpus BLEU, the code is currently set up to take a set of references for each hypothesis, so the easiest way to do that would be to replicate the 170k sentences 5k times and pass that to compute_metrics.
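For reference, a minimal sketch of that replication, assuming the in-memory NLGEval class API described in the project's README; the file names and the no_skipthoughts/no_glove flags are assumptions here, and with 170k references and 5k hypotheses this replicated structure is exactly what exhausts memory, as reported below.

```python
# Sketch only: replicate the shared reference set so that every hypothesis
# gets all 170k references, then call the corpus-level compute_metrics.
from nlgeval import NLGEval

hyps = [line.strip() for line in open("hyps.txt")]   # 5k hypotheses (assumed path)
refs = [line.strip() for line in open("refs.txt")]   # 170k shared references (assumed path)

# no_skipthoughts/no_glove are assumed constructor flags to skip the
# embedding-based metrics and only compute the overlap metrics (BLEU etc.).
nlgeval = NLGEval(no_skipthoughts=True, no_glove=True)

# ref_list[i][j] is the i-th reference for hypothesis j, so each shared
# reference is repeated once per hypothesis: 170k x 5k strings in memory.
ref_list = [[ref] * len(hyps) for ref in refs]

metrics = nlgeval.compute_metrics(ref_list, hyps)
print(metrics["Bleu_4"])   # key name as produced by pycocoevalcap's BLEU scorer
```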
Thanks a lot for your reply. I have indeed tried replicating the 170k sentences 5k times and passing that to compute_metrics; however, on a machine with 504 GB of RAM my script ran out of memory and was killed. Is there any way to make the current code more efficient? Thank you once again!
If that runs out of memory, then this is going to require a bit of work. You'll have to run the code for one hypothesis at a time against all the references (nlg-eval/nlgeval/pycocoevalcap/bleu/bleu_scorer.py, lines 221 to 229 at commit 5908b4c) and save comps, testlen, reflen, and totalcomps to disk for each hypothesis. Then you'll have to iterate over all of these files and do the BLEU computation (nlg-eval/nlgeval/pycocoevalcap/bleu/bleu_scorer.py, lines 236 to 261 at commit 5908b4c).
This would be messy, but I can't think of a cleaner way to compute corpus-level BLEU. For sentence-level BLEU your averaging approach would work. Both of these are going to be slow.
We might support this in the code in the future if there is enough demand for it.
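Below is a rough, standalone sketch of that two-phase idea, not a patch to bleu_scorer.py itself: compute the per-hypothesis BLEU statistics (clipped n-gram matches, totals, hypothesis length, closest reference length) one hypothesis at a time, append them to a file on disk, and then aggregate them into corpus BLEU. The file names, helper functions, and the maximum n-gram order of 4 are assumptions for illustration. It also exploits the fact that the reference set is shared by all hypotheses, so the per-order maximum reference n-gram counts only need to be computed once.

```python
# Standalone two-phase corpus BLEU sketch: per-hypothesis statistics are
# written to disk one at a time (phase 1) and aggregated afterwards (phase 2),
# so the full hypothesis x reference cross product never lives in memory.
import math
import pickle
from collections import Counter

MAX_N = 4

def ngram_counts(tokens, n):
    """Counter of the order-n n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def hyp_stats(hyp_tokens, max_ref_counts, ref_lens):
    """Sufficient BLEU statistics for one hypothesis: clipped matches and
    totals per n-gram order, hypothesis length, and closest reference length."""
    stats = {"matches": [], "totals": [], "hyp_len": len(hyp_tokens)}
    for n in range(1, MAX_N + 1):
        counts = ngram_counts(hyp_tokens, n)
        clipped = sum(min(c, max_ref_counts[n].get(g, 0)) for g, c in counts.items())
        stats["matches"].append(clipped)
        stats["totals"].append(sum(counts.values()))
    # closest reference length (ties broken toward the shorter reference)
    stats["ref_len"] = min(ref_lens, key=lambda r: (abs(r - len(hyp_tokens)), r))
    return stats

# ---- Phase 1: one hypothesis at a time, statistics appended to disk ----
refs = [line.split() for line in open("refs.txt")]   # 170k shared references: fits in memory
ref_lens = [len(r) for r in refs]

# The reference set is the same for every hypothesis, so the per-order maximum
# reference n-gram counts are computed once and reused for all hypotheses.
max_ref_counts = {n: Counter() for n in range(1, MAX_N + 1)}
for r in refs:
    for n in range(1, MAX_N + 1):
        for g, c in ngram_counts(r, n).items():
            max_ref_counts[n][g] = max(max_ref_counts[n][g], c)

with open("hyps.txt") as hyps, open("bleu_stats.pkl", "wb") as out:
    for line in hyps:
        pickle.dump(hyp_stats(line.split(), max_ref_counts, ref_lens), out)

# ---- Phase 2: aggregate the saved statistics into corpus BLEU ----
matches, totals = [0] * MAX_N, [0] * MAX_N
hyp_len = ref_len = 0
with open("bleu_stats.pkl", "rb") as f:
    while True:
        try:
            s = pickle.load(f)
        except EOFError:
            break
        matches = [m + x for m, x in zip(matches, s["matches"])]
        totals = [t + x for t, x in zip(totals, s["totals"])]
        hyp_len += s["hyp_len"]
        ref_len += s["ref_len"]

precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
bleu = bp * math.exp(sum(math.log(p) for p in precisions) / MAX_N) if all(precisions) else 0.0
print("corpus BLEU-4:", bleu)
```

The statistics file grows by only a few dozen bytes per hypothesis, so storage stays tiny compared to replicating 170k references 5k times; the cost is instead the per-hypothesis clipping pass, which is why this route is slow.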
Hi, thanks for the nice toolkit! I have a question about evaluating a list of hypothesis sentences (5,000 in total) against a reference corpus of 170,000 sentences. If I try to use compute_metrics, it raises AssertionError: assert len(refs) == len(hyps), which is due to the different number of items in each file. Reading through the documentation, would it be reasonable to use compute_individual_metrics for each sentence in the hypothesis list against the entire reference corpus, retrieve the scores, and average them over all 5,000 hypothesis sentences? If so, what would be an efficient and fast way to do it? Thanks a lot!
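For completeness, a hedged sketch of the averaging idea asked about here, assuming the compute_individual_metrics(references, hypothesis) signature from the README, with placeholder file names. As noted in the answer above, this yields an average of sentence-level scores rather than corpus BLEU, and scoring every hypothesis against 170k references will be slow.

```python
# Sketch only: average compute_individual_metrics over all hypotheses,
# scoring each one against the full shared reference corpus.
from nlgeval import compute_individual_metrics

refs = [line.strip() for line in open("refs.txt")]   # 170k shared references (assumed path)
hyps = [line.strip() for line in open("hyps.txt")]   # 5k hypotheses (assumed path)

totals = {}
for hyp in hyps:
    scores = compute_individual_metrics(refs, hyp)   # per-sentence metrics dict
    for name, value in scores.items():
        totals[name] = totals.get(name, 0.0) + value

averages = {name: value / len(hyps) for name, value in totals.items()}
print(averages)
```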