Fusion-Eval: Integrating Assistant Evaluators with LLMs

Lei Shu, Nevan Wichers, Liangchen Luo, Yun Zhu, Yinxiao Liu, Jindong Chen, Lei Meng.
In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, Miami, Florida, US, November 2024.

Abstract

Evaluating natural language generation (NLG) systems automatically poses significant challenges. Recent studies have employed large language models (LLMs) as reference-free metrics for NLG evaluation, enhancing adaptability to new tasks. However, these methods still show lower correspondence with human judgments compared to specialized neural evaluators. In this paper, we introduce “Fusion-Eval”, an innovative approach that leverages LLMs to integrate insights from various assistant evaluators. The LLM is given the example to evaluate along with scores from the assistant evaluators. Each of these evaluators specializes in assessing distinct aspects of responses. Fusion-Eval achieves a 0.962 system-level Kendall-Tau correlation with humans on SummEval and a 0.744 turn-level Spearman correlation on TopicalChat, which is significantly higher than baseline methods. These results highlight Fusion-Eval’s significant potential in the realm of natural language system evaluation.
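To make the idea concrete, below is a minimal sketch of how an evaluation prompt combining the example with assistant evaluator scores might be assembled. The prompt wording, the aspect names, and the call_llm helper are illustrative assumptions, not the paper's actual prompt or evaluator set.

# Illustrative sketch only: the prompt format, the evaluator names, and the
# call_llm placeholder are assumptions, not the paper's actual setup.
from typing import Dict

def build_fusion_eval_prompt(source: str, response: str,
                             assistant_scores: Dict[str, float]) -> str:
    """Assemble a prompt giving the LLM the example plus assistant scores."""
    score_lines = "\n".join(
        f"- {name}: {score:.2f}" for name, score in assistant_scores.items()
    )
    return (
        "You are evaluating a generated response.\n\n"
        f"Source:\n{source}\n\n"
        f"Response:\n{response}\n\n"
        "Assistant evaluator scores (each covers one aspect):\n"
        f"{score_lines}\n\n"
        "Considering the text and the scores above, give a final quality "
        "score from 1 to 5."
    )

# Example usage with hypothetical aspect-specific evaluators.
prompt = build_fusion_eval_prompt(
    source="Article text ...",
    response="Candidate summary ...",
    assistant_scores={"coherence": 0.81, "consistency": 0.67},
)
# final_score = call_llm(prompt)  # call_llm is a placeholder, not a real API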

BibTex

@inproceedings{Shu2024FusionEval,
  author = {Shu, Lei and Wichers, Nevan and Luo, Liangchen and Zhu, Yun and Liu, Yinxiao and Chen, Jindong and Meng, Lei},
  title = {Fusion-Eval: Integrating Assistant Evaluators with LLMs},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  month = {November},
  year = {2024},
  address = {Miami, Florida, US},
  pages = {225--238},
  publisher = {Association for Computational Linguistics}
}