SummEval: Re-evaluating Summarization Evaluation

  • This paper is a meta-evaluation (an evaluation of evaluation methods) of metrics for text summarization.
  • The contributions of the paper include:
    • Re-evaluating 12 automatic evaluation metrics against expert and crowd-sourced human annotations on 23 neural summarization models.
    • Releasing the largest collection of summaries generated by models trained on the CNN/DailyMail dataset.
    • Sharing a toolkit for evaluating summarization models across a broad range of automated metrics.
    • Sharing the largest and most diverse (in terms of model types) collection of human judgments of model-generated summaries, from both experts and crowd-source workers.
  • The evaluation metrics considered include:
    • ROUGE
    • ROUGE-WE
    • BERTScore
    • S^3
    • MoverScore
    • Sentence Mover's Similarity
    • SummaQA
    • BLEU
    • CHRF
    • METEOR
    • CIDEr
    • Data statistics (e.g., extractive coverage and percentage of novel n-grams)
  • Human annotations were collected along four dimensions: Coherence, Consistency, Fluency, and Relevance.
  • Findings
    • Most metrics correlate best with human judgments on the relevance dimension, although even these correlations are only weak to moderate (a minimal correlation sketch follows this list).
    • Metric correlations decrease considerably across the other dimensions.
    • Extractive coverage and the percentage of novel bi-grams correlate moderately with consistency, suggesting that, within current frameworks, abstraction may be at odds with faithfulness (see the data-statistics sketch after this list).
    • PEGASUS, BART, and T5 perform best on most dimensions. Notably, they score highest on consistency and fluency while obtaining lower scores for relevance and coherence.
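
A minimal sketch of the kind of metric-level meta-evaluation described above: correlating an automatic metric's per-summary scores with human ratings on one dimension. This is not the paper's toolkit; the score lists are hypothetical placeholders, and Kendall's tau is used here as one plausible correlation measure.

```python
from scipy.stats import kendalltau

# Hypothetical per-summary scores for one system's outputs.
rouge_l_scores = [0.31, 0.42, 0.27, 0.38, 0.45]   # automatic metric scores
human_relevance = [3.0, 4.5, 2.5, 4.0, 4.5]       # expert relevance ratings (1-5 scale)

# Rank correlation between the metric and the human judgments.
tau, p_value = kendalltau(rouge_l_scores, human_relevance)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```

In practice this would be repeated per metric and per dimension (coherence, consistency, fluency, relevance) to produce the kind of correlation tables the paper reports.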
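
A minimal sketch of the data statistics referenced in the findings. The novel bi-gram percentage follows the usual definition (summary bi-grams absent from the source); the "coverage" function is a simplified token-overlap proxy, not the exact extractive-fragment coverage used in the paper, and the example texts are made up.

```python
def source_bigrams(tokens):
    """Set of bi-grams in a token list."""
    return {tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)}

def novel_bigram_pct(source, summary):
    """Fraction of summary bi-grams that do not appear in the source."""
    src_bi = source_bigrams(source.split())
    sum_tokens = summary.split()
    sum_bi = [tuple(sum_tokens[i:i + 2]) for i in range(len(sum_tokens) - 1)]
    if not sum_bi:
        return 0.0
    return sum(bg not in src_bi for bg in sum_bi) / len(sum_bi)

def token_coverage(source, summary):
    """Fraction of summary tokens that also appear in the source (a rough proxy)."""
    src_tokens = set(source.split())
    sum_tokens = summary.split()
    if not sum_tokens:
        return 0.0
    return sum(t in src_tokens for t in sum_tokens) / len(sum_tokens)

source = "the quick brown fox jumps over the lazy dog near the river bank"
summary = "a quick fox jumps over a sleepy dog"
print(f"novel bi-grams: {novel_bigram_pct(source, summary):.0%}")
print(f"token coverage: {token_coverage(source, summary):.0%}")
```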