Corpus Bleu evaluation when number of references and hypothesis are large and code runs out of memory #46
So this depends on whether you want to report sentence BLEU or corpus BLEU. Averaging the sentence BLEU computed by compute_individual_metrics would work for the former. For corpus BLEU, the code is currently set up to take a set of references for each hypothesis, so the easiest way to do that would be to replicate the 170k sentences 5k times and pass that to compute_metrics.
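For reference, a minimal sketch of that replication, assuming the in-memory NLGEval class API described in the project's README; the file names and the no_skipthoughts/no_glove flags are assumptions here, and with 170k references and 5k hypotheses this replicated structure is exactly what exhausts memory, as reported below.

```python
# Sketch only: replicate the shared reference set so that every hypothesis
# gets all 170k references, then call the corpus-level compute_metrics.
from nlgeval import NLGEval

hyps = [line.strip() for line in open("hyps.txt")]   # 5k hypotheses (assumed path)
refs = [line.strip() for line in open("refs.txt")]   # 170k shared references (assumed path)

# no_skipthoughts/no_glove are assumed constructor flags to skip the
# embedding-based metrics and only compute the overlap metrics (BLEU etc.).
nlgeval = NLGEval(no_skipthoughts=True, no_glove=True)

# ref_list[i][j] is the i-th reference for hypothesis j, so each shared
# reference is repeated once per hypothesis: 170k x 5k strings in memory.
ref_list = [[ref] * len(hyps) for ref in refs]

metrics = nlgeval.compute_metrics(ref_list, hyps)
print(metrics["Bleu_4"])   # key name as produced by pycocoevalcap's BLEU scorer
```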
Thanks a lot for your reply. I have indeed tried replicating the 170k sentences 5k times and passing that to compute_metrics; however, on a machine with 504 GB of RAM my script ran out of memory and was killed. Is there any way to make the current code more efficient? Thank you once again!
If that runs out of memory, then this is going to require a bit of work. You'll have to run the code for one hypothesis at a time against all the references (nlg-eval/nlgeval/pycocoevalcap/bleu/bleu_scorer.py, lines 221 to 229 at commit 5908b4c) and save comps, testlen, reflen, and totalcomps to disk for each hypothesis. Then you'll have to iterate over all of these files and do the BLEU computation (nlg-eval/nlgeval/pycocoevalcap/bleu/bleu_scorer.py, lines 236 to 261 at commit 5908b4c).
This would be messy, but I can't think of a cleaner way to compute corpus-level BLEU. For sentence-level BLEU your averaging approach would work. Both of these are going to be slow.
We might support this in the code in the future if there is enough demand for it.
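Below is a rough, standalone sketch of that two-phase idea, not a patch to bleu_scorer.py itself: compute the per-hypothesis BLEU statistics (clipped n-gram matches, totals, hypothesis length, closest reference length) one hypothesis at a time, append them to a file on disk, and then aggregate them into corpus BLEU. The file names, helper functions, and the maximum n-gram order of 4 are assumptions for illustration. It also exploits the fact that the reference set is shared by all hypotheses, so the per-order maximum reference n-gram counts only need to be computed once.

```python
# Standalone two-phase corpus BLEU sketch: per-hypothesis statistics are
# written to disk one at a time (phase 1) and aggregated afterwards (phase 2),
# so the full hypothesis x reference cross product never lives in memory.
import math
import pickle
from collections import Counter

MAX_N = 4

def ngram_counts(tokens, n):
    """Counter of the order-n n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def hyp_stats(hyp_tokens, max_ref_counts, ref_lens):
    """Sufficient BLEU statistics for one hypothesis: clipped matches and
    totals per n-gram order, hypothesis length, and closest reference length."""
    stats = {"matches": [], "totals": [], "hyp_len": len(hyp_tokens)}
    for n in range(1, MAX_N + 1):
        counts = ngram_counts(hyp_tokens, n)
        clipped = sum(min(c, max_ref_counts[n].get(g, 0)) for g, c in counts.items())
        stats["matches"].append(clipped)
        stats["totals"].append(sum(counts.values()))
    # closest reference length (ties broken toward the shorter reference)
    stats["ref_len"] = min(ref_lens, key=lambda r: (abs(r - len(hyp_tokens)), r))
    return stats

# ---- Phase 1: one hypothesis at a time, statistics appended to disk ----
refs = [line.split() for line in open("refs.txt")]   # 170k shared references: fits in memory
ref_lens = [len(r) for r in refs]

# The reference set is the same for every hypothesis, so the per-order maximum
# reference n-gram counts are computed once and reused for all hypotheses.
max_ref_counts = {n: Counter() for n in range(1, MAX_N + 1)}
for r in refs:
    for n in range(1, MAX_N + 1):
        for g, c in ngram_counts(r, n).items():
            max_ref_counts[n][g] = max(max_ref_counts[n][g], c)

with open("hyps.txt") as hyps, open("bleu_stats.pkl", "wb") as out:
    for line in hyps:
        pickle.dump(hyp_stats(line.split(), max_ref_counts, ref_lens), out)

# ---- Phase 2: aggregate the saved statistics into corpus BLEU ----
matches, totals = [0] * MAX_N, [0] * MAX_N
hyp_len = ref_len = 0
with open("bleu_stats.pkl", "rb") as f:
    while True:
        try:
            s = pickle.load(f)
        except EOFError:
            break
        matches = [m + x for m, x in zip(matches, s["matches"])]
        totals = [t + x for t, x in zip(totals, s["totals"])]
        hyp_len += s["hyp_len"]
        ref_len += s["ref_len"]

precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
bleu = bp * math.exp(sum(math.log(p) for p in precisions) / MAX_N) if all(precisions) else 0.0
print("corpus BLEU-4:", bleu)
```

The statistics file grows by only a few dozen bytes per hypothesis, so storage stays tiny compared to replicating 170k references 5k times; the cost is instead the per-hypothesis clipping pass, which is why this route is slow.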
Hi, thanks for the nice toolkit! I have a question about evaluating a list of hypothesis sentences (5,000 in total) against a reference corpus of 170,000 sentences. If I try to use compute_metrics, it raises AssertionError: assert len(refs) == len(hyps), which is due to the different number of items in each file. Reading through the documentation, would it be reasonable to use compute_individual_metrics for each sentence in the hypothesis list against the entire reference corpus, retrieve the scores, and average them over all 5,000 hypothesis sentences? If so, what would be an efficient and fast way to do it? Thanks a lot!
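For completeness, a hedged sketch of the averaging idea asked about here, assuming the compute_individual_metrics(references, hypothesis) signature from the README, with placeholder file names. As noted in the answer above, this yields an average of sentence-level scores rather than corpus BLEU, and scoring every hypothesis against 170k references will be slow.

```python
# Sketch only: average compute_individual_metrics over all hypotheses,
# scoring each one against the full shared reference corpus.
from nlgeval import compute_individual_metrics

refs = [line.strip() for line in open("refs.txt")]   # 170k shared references (assumed path)
hyps = [line.strip() for line in open("hyps.txt")]   # 5k hypotheses (assumed path)

totals = {}
for hyp in hyps:
    scores = compute_individual_metrics(refs, hyp)   # per-sentence metrics dict
    for name, value in scores.items():
        totals[name] = totals.get(name, 0.0) + value

averages = {name: value / len(hyps) for name, value in totals.items()}
print(averages)
```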