Scoring structured academic documents with large language models: How well can five popular medium-sized LLMs score structured academic documents, using UK Research Excellence Framework (REF) 2021 Impact Case Studies as test material?
Purpose: Academic documents require expert time to evaluate, and Large Language Models (LLMs) might support this through score or decision predictions. For confidential structured academic texts, like grants and Impact Case Studies (ICSs), medium-sized LLMs can be run offline without expensive computing infrastructures, enhancing security.
Design/methodology/approach: This study evaluates for the first time how well medium-sized LLMs can score structured academic documents, using the UK Research Excellence Framework (REF) 2021 ICSs, and whether LLMs can guess scores from individual sections. We obtained score estimates from five recent popular LLMs (DeepSeek R1 32B, Qwen 3 32B, Magistral Small 24B, Gemma 3 27B, and Llama 4 Scout 27B) for 6,010 REF 2021 ICSs and correlated these scores with a proxy quality rating (the departmental average score).
Findings: Scoring the full texts was only moderately effective (in terms of correlations with the proxy quality rating), and Llama 4 failed to score most of the longest ICSs. Surprisingly, all LLMs except Magistral were able to guess ICS scores statistically significantly better than random from each of the individual component sections (summary, underpinning research, references, details of the impact, and sources to support the impact). A logical two-stage approach mimicking the human reviewer instructions did not outperform focusing on impact alone. The best strategy was to score the summary and the details of the impact sections combined with Gemma 3, averaging five scores per ICS. This gave the highest Spearman correlation (0.37) with the departmental average proxy quality scores (0.55 for department-level correlations).
Practical implications: Medium-sized LLMs can be used to score structured academic documents to support research assessments.
Research limitations: This study uses a single large case study with a public, albeit obscured, gold standard.
Originality/value: This study improves on the state of the art despite the additional restrictions, using a much cheaper and potentially private open-weights LLM approach.
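Illustrative sketch: the evaluation described under Findings (averaging five repeated LLM scores per ICS, then taking the Spearman correlation with the departmental average proxy) could be computed roughly as follows. This is a minimal sketch, not the study's code. The `records` data, the function names, and the choice to form the department-level correlation by averaging ICS scores within each department are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data (not from the study):
# (department, five repeated LLM scores for one ICS, departmental average score)
records = [
    ("A", [3, 3, 4, 3, 3], 3.2),
    ("A", [4, 4, 3, 4, 4], 3.2),
    ("B", [2, 3, 2, 2, 3], 2.5),
    ("B", [3, 2, 3, 3, 3], 2.5),
    ("C", [4, 4, 4, 4, 3], 3.8),
    ("C", [3, 4, 4, 4, 4], 3.8),
]

# ICS level: mean of the five scores per ICS against the departmental proxy.
llm_means = [mean(scores) for _, scores, _ in records]
proxies = [proxy for _, _, proxy in records]
ics_level_rho = spearman(llm_means, proxies)

# Department level (assumed aggregation): mean LLM score per department
# against that department's proxy score.
by_dept = defaultdict(list)
for (dept, _, _), ics_mean in zip(records, llm_means):
    by_dept[dept].append(ics_mean)
dept_proxy = {dept: proxy for dept, _, proxy in records}
depts = sorted(by_dept)
dept_level_rho = spearman(
    [mean(by_dept[d]) for d in depts],
    [dept_proxy[d] for d in depts],
)

print(f"ICS-level Spearman: {ics_level_rho:.2f}")
print(f"Department-level Spearman: {dept_level_rho:.2f}")
```

Because every ICS in a department shares the same proxy score, the proxy side of the ICS-level correlation is heavily tied, which is why ranks are tie-averaged here. Aggregating to departments removes within-department noise, which is consistent with the department-level correlation being higher than the ICS-level one.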