- This paper is a meta-evaluation (an evaluation of evaluation methods) of metrics for text summarization.
- The contributions of the paper include
- Re-evaluate 12 evaluation metrics against expert and crowd-sourced human annotations of outputs from 23 neural summarization models.
- Release the largest collection of summaries generated by models trained on CNN/DailyMail.
- Share a toolkit for evaluating summarization models across a broad range of automatic metrics (an illustrative sketch follows the metric list below).
- Share the largest and most diverse (in terms of model types) collection of human judgments of model-generated summaries, annotated by both experts and crowd-source workers.
- The evaluation metrics considered include
- ROUGE
- ROUGE-WE
- BERTScore
- S^3
- MoverScore
- Sentence Mover's Similarity
- SummaQA
- BLEU
- CHRF
- METEOR
- CIDEr
- Data statistics (e.g., extractive coverage, percentage of novel n-grams)
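As a concrete illustration of what computing the reference-based metrics above looks like in practice, here is a minimal sketch using the public `rouge-score` and `bert-score` packages; it stands in for, and is not, the toolkit released with the paper, and the reference/candidate strings are invented.

```python
# Illustrative sketch only: computes two of the reference-based metrics above
# (ROUGE and BERTScore) with the public rouge-score and bert-score packages.
# This is not the paper's released toolkit; the texts below are made up.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "the cat sat on the mat near the door"
candidate = "a cat was sitting on the mat by the door"

# ROUGE-1/2/L F-measures against a single reference
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

# BERTScore: similarity of contextual embeddings between candidate and reference
P, R, F1 = bert_score([candidate], [reference], lang="en", verbose=False)
print("BERTScore F1:", round(F1.item(), 3))
```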
- Human annotations are collected along 4 dimensions: coherence, consistency, fluency, and relevance.
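A minimal sketch of how such judgments could be aggregated, averaging each model's expert ratings per dimension. The JSONL layout and the field names (`model_id`, `expert_annotations`) are assumptions made for illustration, not necessarily the released schema.

```python
# Sketch under an assumed schema: one JSON record per model summary, each with
# a "model_id" and a list of "expert_annotations" scored on the four dimensions.
import json
from collections import defaultdict
from statistics import mean

DIMS = ["coherence", "consistency", "fluency", "relevance"]

def average_expert_ratings(path):
    per_model = defaultdict(lambda: defaultdict(list))
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            for ann in rec["expert_annotations"]:   # field name is an assumption
                for dim in DIMS:
                    per_model[rec["model_id"]][dim].append(ann[dim])
    # mean rating per model and dimension
    return {m: {d: round(mean(vals), 2) for d, vals in dims.items()}
            for m, dims in per_model.items()}

# Usage (path is illustrative):
# print(average_expert_ratings("model_annotations.jsonl"))
```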
- Findings
- Most metrics achieve their highest correlations with human judgments on the relevance dimension, although even these correlations are only weak to moderate.
- Metric correlations decrease considerably on the other dimensions.
- Extractive coverage and the percentage of novel bi-grams correlate moderately with consistency, suggesting that, within current frameworks, abstraction may be at odds with faithfulness (see the sketch after this list).
- Pegasus, BART, and T5 perform best on most dimensions; notably, they score highest on consistency and fluency while obtaining relatively lower scores on relevance and coherence.
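To make the abstraction-vs-faithfulness point above concrete, the sketch below computes the percentage of novel bi-grams per summary and correlates it with expert consistency ratings using Kendall's tau; the documents, ratings, and choice of correlation coefficient are illustrative assumptions rather than the paper's exact setup.

```python
# Illustrative sketch: percentage of novel bigrams (an abstractiveness statistic)
# vs. human consistency ratings. All data below is made up.
from itertools import tee
from scipy.stats import kendalltau

def bigrams(tokens):
    a, b = tee(tokens)
    next(b, None)
    return set(zip(a, b))

def pct_novel_bigrams(source, summary):
    """Fraction of summary bigrams that never appear in the source article."""
    src, summ = bigrams(source.lower().split()), bigrams(summary.lower().split())
    return len(summ - src) / max(len(summ), 1)

# (source article, model summary, expert consistency rating on a 1-5 scale)
examples = [
    ("the quick brown fox jumps over the lazy dog", "the quick brown fox jumps", 5),
    ("the quick brown fox jumps over the lazy dog", "a purple fox flew to the moon", 1),
    ("stocks fell sharply on friday amid inflation fears", "stocks fell sharply on friday", 5),
    ("stocks fell sharply on friday amid inflation fears", "markets soared after upbeat earnings", 2),
]

novelty = [pct_novel_bigrams(src, summ) for src, summ, _ in examples]
consistency = [rating for _, _, rating in examples]

# If abstraction is at odds with faithfulness, tau should come out negative:
# more novel bigrams, lower consistency.
tau, _ = kendalltau(novelty, consistency)
print(f"Kendall's tau between novelty and consistency: {tau:.2f}")
```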