Subjective Bias and Consistency in Human Evaluation of Natural Language Generation
Jacopo Amidei (The Open University, Milton Keynes)
The Natural Language Generation (NLG) community relies on shared
evaluation techniques to understand progress in the field. In this talk, I will
focus on the problem of the reliability of human evaluation studies in NLG.
Based on an analysis of papers published over 10 years (from 2008 to 2018) in
NLG-specific conferences and on an observational study, I will show some
shortcomings of existing approaches to reporting reliability for human
intrinsic evaluation of NLG systems. Then, I will present a new proposal for
reporting reliability based on correlation coefficients, which can be used
to measure the extent to which judges
follow a systematic pattern in their assessments, even when their individual
interpretations of the phenomena are not identical. Our proposal offers a new
approach to measuring judges’ relative consistency, which provides insight
into the trustworthiness of human judgements.
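To make the idea concrete, the following sketch shows how one such coefficient, Spearman's rho, captures relative consistency between two judges. The abstract does not specify which correlation coefficient is used, so the choice of Spearman's rho, the judge names, and the example ratings here are illustrative assumptions, not details from the talk.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(xs):
    """1-based ranks; tied values receive the average of their positions."""
    s = sorted(xs)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in xs]

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Hypothetical ratings on a 1-5 scale. Judge B is systematically one point
# harsher than Judge A, so their scores disagree in absolute terms, but
# they order the items the same way (with one tie at the scale floor).
judge_a = [5, 4, 3, 2, 1]
judge_b = [4, 3, 2, 1, 1]
print(round(spearman(judge_a, judge_b), 3))  # close to 1: high relative consistency
```

A chance-corrected agreement coefficient would penalise Judge B's systematic harshness, whereas the correlation stays near 1 because the judges rank the items consistently; this is exactly the distinction the proposal exploits.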