TY - JOUR
T1 - Evaluating ChatGPT for research impact assessment: comparative insights from the UK REF
AU - Arsalan, Mudassar
AU - Mubin, Omar
AU - Al Mahmud, Abdullah
AU - Mehdi, Mohammed Raza
AU - Gazder, Uneb
AU - Khan, Imran Ahmed
PY - 2025
Y1 - 2025
AB - This study assesses the capacity of a custom research evaluation tool built on OpenAI’s GPT-4, implemented through the ChatGPT interface, to replicate the qualitative depth and structure of the United Kingdom’s Research Excellence Framework (REF) case studies. The REF offers a detailed, narrative-based approach to research impact assessment, but is resource-intensive to implement at scale. In this evaluation, we tested whether a GPT-4–powered system could generate comparable impact summaries in a scalable manner. Using a mixed-methods design, summaries produced by the custom tool for a stratified random sample of 100 REF case studies were assessed by four expert raters. Statistical analysis showed high inter-rater reliability for thematic alignment (Fleiss’ κ = 0.875) and relevance (κ = 0.685), but notably lower agreement for comprehensiveness (κ = 0.440) and insightfulness (κ = 0.267). Qualitative analysis indicated that, while the GPT-4–generated outputs were thematically coherent, they consistently omitted specific, verifiable evidence—most notably quantitative metrics and named stakeholders—required in formal REF evaluation. Within the scope of this study, the findings suggest that GPT-4, when configured through ChatGPT, was better suited to prospective, formative tasks such as generating potential impact pathways for grant proposal development, rather than replacing expert judgment in retrospective research assessment. This paper provides an empirical benchmark for GPT-4’s role in qualitative research evaluation and proposes guidelines for its responsible use.
KW - AI-Driven Framework
KW - ChatGPT
KW - GPT-4
KW - Large Language Models (LLMs)
KW - Research Impact Assessment
UR - http://www.scopus.com/inward/record.url?scp=105024091039&partnerID=8YFLogxK
UR - https://doi.org/10.1007/s11192-025-05498-6
U2 - 10.1007/s11192-025-05498-6
DO - 10.1007/s11192-025-05498-6
M3 - Article
AN - SCOPUS:105024091039
SN - 0138-9130
JO - Scientometrics
JF - Scientometrics
ER -