Subjective Bias and Consistency in Human Evaluation of Natural Language Generation
The Natural Language Generation (NLG) community relies on shared evaluation techniques to understand progress in the field. In this talk, I will focus on the problem of the reliability of human evaluation studies in NLG. Based on an analysis of papers published over 10 years (from 2008 to 2018) in NLG-specific conferences and on an observational study, I will show some shortcomings of existing approaches to reporting the reliability of human intrinsic evaluation of NLG systems. Then, I will present a new proposal for reporting reliability based on the use of correlation coefficients. These coefficients can be used to measure the extent to which judges follow a systematic pattern in their assessments, even when their individual interpretations of the phenomena are not identical. Our proposal offers a new approach to measuring judges' relative consistency, which provides insights into the trustworthiness of human judgements.
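As an illustration of how correlation coefficients might capture judges' relative consistency, the sketch below averages pairwise Spearman's rho over judges' ratings of the same NLG outputs. This is a hypothetical example, not the procedure presented in the talk; the `mean_pairwise_spearman` function, the ratings matrix layout, and the toy scores are assumptions made for illustration.

```python
# Minimal sketch (illustrative only): estimating judges' relative consistency
# with rank correlation. `ratings` is assumed to be a (num_judges x num_items)
# array of intrinsic scores for the same set of generated texts.
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr


def mean_pairwise_spearman(ratings: np.ndarray) -> float:
    """Average Spearman's rho over all judge pairs.

    A high value suggests judges rank the outputs similarly (a systematic
    pattern), even if their absolute scores differ.
    """
    rhos = []
    for i, j in combinations(range(ratings.shape[0]), 2):
        rho, _ = spearmanr(ratings[i], ratings[j])
        rhos.append(rho)
    return float(np.mean(rhos))


# Toy example: three judges scoring five generated texts on a 1-5 scale.
ratings = np.array([
    [1, 2, 3, 4, 5],   # judge A
    [2, 3, 3, 5, 5],   # judge B: harsher scale, but similar ordering
    [1, 1, 2, 4, 4],   # judge C
])
print(mean_pairwise_spearman(ratings))
```

Rank correlation is attractive here because it ignores differences in how judges use the rating scale and focuses only on whether their orderings of the outputs agree.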