Evaluating ChatGPT for research impact assessment: comparative insights from the UK REF

Mudassar Arsalan, Omar Mubin, Abdullah Al Mahmud, Mohammed Raza Mehdi, Uneb Gazder, Imran Ahmed Khan

Research output: Contribution to journal › Article › peer-review

Abstract

This study assesses the capacity of a custom research evaluation tool built on OpenAI’s GPT-4, implemented through the ChatGPT interface, to replicate the qualitative depth and structure of the United Kingdom’s Research Excellence Framework (REF) case studies. The REF offers a detailed, narrative-based approach to research impact assessment, but is resource-intensive to implement at scale. In this evaluation, we tested whether a GPT-4–powered system could generate comparable impact summaries in a scalable manner. Using a mixed-methods design, summaries produced by the custom tool for a stratified random sample of 100 REF case studies were assessed by four expert raters. Statistical analysis showed high inter-rater reliability for thematic alignment (Fleiss’ κ = 0.875) and relevance (κ = 0.685), but notably lower agreement for comprehensiveness (κ = 0.440) and insightfulness (κ = 0.267). Qualitative analysis indicated that, while the GPT-4–generated outputs were thematically coherent, they consistently omitted specific, verifiable evidence—most notably quantitative metrics and named stakeholders—required in formal REF evaluation. Within the scope of this study, the findings suggest that GPT-4, when configured through ChatGPT, was better suited to prospective, formative tasks such as generating potential impact pathways for grant proposal development, rather than replacing expert judgment in retrospective research assessment. This paper provides an empirical benchmark for GPT-4’s role in qualitative research evaluation and proposes guidelines for its responsible use.
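The abstract reports inter-rater agreement as Fleiss' κ. As a minimal illustration of how that statistic is computed (this is not the study's analysis code, and the rating counts below are hypothetical, not the study's data):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a count matrix: one row per rated item, one column
    per category; each cell counts how many of the n raters assigned the
    item to that category (every row sums to n)."""
    N = len(ratings)            # number of rated items
    n = sum(ratings[0])         # raters per item
    k = len(ratings[0])         # number of categories
    # Mean per-item observed agreement, P_bar
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in ratings) / N
    # Chance agreement P_e from the marginal category proportions
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical example: 4 raters judging 3 summaries as aligned / not aligned
counts = [[4, 0], [3, 1], [0, 4]]
print(round(fleiss_kappa(counts), 3))  # → 0.657
```

Values near 1 indicate near-perfect agreement beyond chance (as for thematic alignment, κ = 0.875), while values below about 0.4 (as for insightfulness, κ = 0.267) indicate weak agreement.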

Original language: English
Number of pages: 23
Journal: Scientometrics
Publication status: E-pub ahead of print (In Press) - 2025

Keywords

  • AI-Driven Framework
  • ChatGPT
  • GPT-4
  • Large Language Models (LLMs)
  • Research Impact Assessment
