
Corpus Bleu evaluation when number of references and hypothesis are large and code runs out of memory #46

Open
Crista23 opened this issue Sep 23, 2018 · 4 comments

Crista23 commented Sep 23, 2018

Hi, thanks for the nice toolkit! I have a question about evaluating a list of hypothesis sentences (5,000 in total) against a reference corpus that contains 170,000 sentences. If I try to use compute_metrics, it raises AssertionError: assert len(refs) == len(hyps), because the two files contain different numbers of lines. From reading the documentation, would it be reasonable to use compute_individual_metrics for each sentence in the list of hypotheses against the entire reference corpus, retrieve the scores, and average them over all 5,000 hypothesis sentences? If so, what would be an efficient and fast way to do it? Thanks a lot!

kracwarlock (Member) commented Sep 27, 2018

So this depends on whether you want to report sentence bleu or corpus bleu. Averaging the sentence bleu computed by compute_individual_metrics will not be equivalent to the corpus bleu.
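
To make the distinction concrete, here is a minimal illustration (not part of nlg-eval; it uses NLTK's reference BLEU implementation purely as an example): corpus bleu pools the clipped n-gram counts over the whole test set before taking the geometric mean, so it generally differs from the mean of per-sentence bleu scores.

    from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

    refs = [['the cat sat on the mat'.split()],
            ['there is a dog in the park'.split()]]
    hyps = ['the cat is on the mat'.split(),
            'a dog runs in the park'.split()]

    smooth = SmoothingFunction().method1
    avg_sentence = sum(sentence_bleu(r, h, smoothing_function=smooth)
                       for r, h in zip(refs, hyps)) / len(hyps)
    corpus = corpus_bleu(refs, hyps, smoothing_function=smooth)
    print(avg_sentence, corpus)  # the two numbers generally differ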

For corpus bleu, the code is currently set up to have a set of references for each hypothesis so the easiest way to do that would be to replicate the 170k sentences 5k times and pass that to compute_metrics instead.
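
For concreteness, a rough sketch of what that replication could look like with the NLGEval class (this assumes its compute_metrics(ref_list, hyp_list) signature, where ref_list holds one reference stream per reference sentence, each aligned with the hypothesis list; load_lines is a stand-in helper). The cost is roughly 170k × 5k strings in memory, which is what makes this approach heavy:

    from nlgeval import NLGEval

    hyps = load_lines('hyps.txt')   # 5,000 hypotheses   (load_lines is a stand-in helper)
    refs = load_lines('refs.txt')   # 170,000 references

    # every hypothesis shares the same 170k references, so each reference
    # becomes one "stream" replicated across all 5k hypotheses
    ref_list = [[r] * len(hyps) for r in refs]   # ~170k x 5k strings in memory

    scorer = NLGEval(no_skipthoughts=True, no_glove=True)  # keep only the overlap metrics
    metrics = scorer.compute_metrics(ref_list, hyps)
    print(metrics['Bleu_4'])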

@Crista23 (Author)

Thanks a lot for your reply. I did try replicating the 170k sentences 5k times and passing that to compute_metrics; however, on a machine with 504 GB of RAM my script ran out of memory and was killed. Is there any way to make the current code more efficient? Thank you once again!

@kracwarlock (Member)

If that runs out of memory, then this is going to require a bit of work. You'll have to run the code for one hypothesis at a time against all the references:

testlen = comps['testlen']
self._testlen += testlen

if self.special_reflen is None: ## need computation
    reflen = self._single_reflen(comps['reflen'], option, testlen)
else:
    reflen = self.special_reflen

self._reflen += reflen

and save comps, testlen, reflen, totalcomps to disk for each hypothesis.
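
As a rough illustration of that first pass (per_hypothesis_stats below is a hypothetical wrapper around the BleuScorer bookkeeping quoted above; nlg-eval does not ship such a helper), the per-hypothesis statistics could be pickled one file at a time so nothing large stays in memory:

    import os
    import pickle

    def save_stats(hyps, refs, out_dir):
        for i, hyp in enumerate(hyps):
            # comps holds the clipped n-gram counts ('correct'/'guess') plus lengths
            comps, testlen, reflen = per_hypothesis_stats(hyp, refs)  # hypothetical helper
            with open(os.path.join(out_dir, 'stats_%05d.pkl' % i), 'wb') as f:
                pickle.dump({'comps': comps, 'testlen': testlen, 'reflen': reflen}, f)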

Then you'll have to iterate over all these files and do the bleu computation

bleu = 1.
for k in range(n):
    bleu *= (float(comps['correct'][k]) + tiny) \
            / (float(comps['guess'][k]) + small)
    bleu_list[k].append(bleu ** (1./(k+1)))
ratio = (testlen + tiny) / (reflen + small) ## N.B.: avoid zero division
if ratio < 1:
    for k in range(n):
        bleu_list[k][-1] *= math.exp(1 - 1/ratio)

if verbose > 1:
    print(comps, reflen)

totalcomps['reflen'] = self._reflen
totalcomps['testlen'] = self._testlen

bleus = []
bleu = 1.
for k in range(n):
    bleu *= float(totalcomps['correct'][k] + tiny) \
            / (totalcomps['guess'][k] + small)
    bleus.append(bleu ** (1./(k+1)))
ratio = (self._testlen + tiny) / (self._reflen + small) ## N.B.: avoid zero division
if ratio < 1:
    for k in range(n):
        bleus[k] *= math.exp(1 - 1/ratio)

This would be messy, but I can't think of a cleaner way to compute corpus-level bleu. For sentence-level bleu your averaging approach would work. Both of these are going to be slow.
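
A hedged sketch of that second pass (the file layout follows the hypothetical save_stats helper above; n, tiny and small correspond to the constants used in the quoted scorer code): the saved statistics are summed into totalcomps, which then feeds the corpus-level formula shown in the snippet.

    import glob
    import pickle

    n = 4                      # BLEU up to 4-grams, as in the scorer
    totalcomps = {'correct': [0] * n, 'guess': [0] * n}
    total_testlen, total_reflen = 0, 0

    for path in sorted(glob.glob('stats_dir/stats_*.pkl')):
        with open(path, 'rb') as f:
            stats = pickle.load(f)
        total_testlen += stats['testlen']
        total_reflen += stats['reflen']
        for k in range(n):
            totalcomps['correct'][k] += stats['comps']['correct'][k]
            totalcomps['guess'][k] += stats['comps']['guess'][k]

    # totalcomps, total_testlen and total_reflen now play the roles of
    # totalcomps, self._testlen and self._reflen in the quoted code above.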

@kracwarlock (Member)

We might support this in the code in the future if there is a lot of demand for it.

@kracwarlock kracwarlock changed the title Evaluate when no. references > no. hypothesis sentences Corpus Bleu evaluation when number of references and hypothesis are large and code runs out of memory Sep 27, 2018