Evaluating research quality with Large Language Models: An analysis of ChatGPT’s effectiveness with different settings and inputs

By: Mike Thelwall  
Open Access
Feb 2025

Figures & Tables

Figure 1.

ChatGPT 3.5-turbo score prediction correlations against human scores for 51 information science article full texts (truncated), article titles and abstracts, or just titles. Averages over n iterations and confidence intervals are calculated as in the methods.

Figure 2.

ChatGPT 4o-mini, ChatGPT 3.5-turbo and ChatGPT 4o score prediction correlations against human scores for 51 information science article titles and abstracts. Averages over n iterations and confidence intervals are calculated as in the methods.

Figure 3.

ChatGPT 4o score predictions based on abstracts (average of 30) against human scores (from the author) for 51 information science article titles and abstracts.

Figure 4.

ChatGPT 4o score predictions based on abstracts (average of 30) against human scores (from the author) for 51 information science article titles and abstracts with seven different system prompts. Strategies 1-5 are abbreviations of Strategy 6, the full REF instructions, and Strategy 0 is a brief instruction without a request for justification.

Figure 5.

ChatGPT 4 (web interface) score prediction correlations against human scores for 51 information science article titles and abstracts. Averages over n iterations and confidence intervals are calculated as in the methods (data from: Thelwall, 2024).

Mean Average Deviations (MAD) for direct predictions and predictions with linear regression for each model and input. The Improve column gives the percentage reduction in MAD compared to the baseline strategy of assigning each article the overall human average, 2.75.

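| Model and input | Direct: MAD | Direct: Improve | Regression: Intercept | Regression: Coefficient | Regression: MAD | Regression: Improve |
|---|---|---|---|---|---|---|
| GPT-3.5 turbo: Titles | 0.68 | 6% | -1.16 | 1.57 | 0.63 | 13% |
| GPT-3.5 turbo: Abstracts | 0.60 | 17% | -3.46 | 2.26 | 0.51 | 30% |
| GPT-3.5 turbo: Truncated text | 0.70 | 4% | -7.49 | 3.38 | 0.55 | 24% |
| GPT-4o-mini: Abstracts | 0.63 | 13% | -3.32 | 2.07 | 0.59 | 19% |
| GPT-4o-mini: Truncated text | 0.75 | -3% | -2.44 | 1.61 | 0.60 | 17% |
| GPT-4o: Abstracts | 0.62 | 14% | -3.40 | 2.05 | 0.50 | 31% |
| GPT-4o: Truncated text | 0.69 | 5% | -4.44 | 2.28 | 0.50 | 31% |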
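
The quantities in this table can be illustrated with a minimal Python sketch. It assumes that MAD means the mean absolute difference between predicted and human scores, that the baseline predicts the human average for every article, and that the regression is an ordinary least squares fit of human scores on model scores. The function names and the sample scores are hypothetical, and the sketch fits the line on the same data it evaluates, which may differ from the procedure in the article's methods.

```python
from statistics import mean

def mad(pred, human):
    """Mean absolute difference between predicted and human scores."""
    return mean(abs(p - h) for p, h in zip(pred, human))

def improvement(pred, human):
    """Percentage reduction in MAD versus predicting the human mean for every article."""
    baseline = mad([mean(human)] * len(human), human)
    return 100.0 * (baseline - mad(pred, human)) / baseline

def fit_line(pred, human):
    """Least squares fit of human = intercept + coefficient * pred."""
    mp, mh = mean(pred), mean(human)
    sxy = sum((p - mp) * (h - mh) for p, h in zip(pred, human))
    sxx = sum((p - mp) ** 2 for p in pred)
    coefficient = sxy / sxx
    return mh - coefficient * mp, coefficient

# Hypothetical scores for illustration only (not data from the article).
human = [1, 2, 2, 3, 3, 3, 4, 4]
model = [2.6, 2.7, 2.9, 3.0, 3.1, 3.2, 3.3, 3.5]

intercept, coefficient = fit_line(model, human)
adjusted = [intercept + coefficient * m for m in model]

print("direct MAD:", round(mad(model, human), 2), "improve %:", round(improvement(model, human)))
print("regression MAD:", round(mad(adjusted, human), 2), "improve %:", round(improvement(adjusted, human)))
```

Applied to the reported GPT-4o: Abstracts row (intercept -3.40, coefficient 2.05), a model average of 3.00 would map to -3.40 + 2.05 × 3.00 = 2.75. The negative intercepts and coefficients above 1 suggest that the regression stretches a narrower range of model scores onto the wider human scale.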

Spearman correlations between human scores and model average scores (over 30 iterations) for 51 information science articles. Values above 0.75 are highlighted.

| Spearman correlation | GPT-3.5 turbo: Abstracts | GPT-3.5 turbo: Truncated text | GPT-4o-mini: Abstracts | GPT-4o-mini: Truncated text | GPT-4o: Abstracts | GPT-4o: Truncated text | Human |
|---|---|---|---|---|---|---|---|
| GPT-3.5 turbo: Titles | 0.439 | 0.444 | 0.359 | 0.499 | 0.539 | 0.589 | 0.434 |
| GPT-3.5 turbo: Abstracts | 1.000 | 0.757 | 0.700 | 0.718 | 0.875 | 0.774 | 0.674 |
| GPT-3.5 turbo: Truncated text | | 1.000 | 0.672 | 0.686 | 0.732 | 0.783 | 0.625 |
| GPT-4o-mini: Abstracts | | | 1.000 | 0.608 | 0.729 | 0.653 | 0.571 |
| GPT-4o-mini: Truncated text | | | | 1.000 | 0.813 | 0.801 | 0.506 |
| GPT-4o: Abstracts | | | | | 1.000 | 0.858 | 0.678 |
| GPT-4o: Truncated text | | | | | | 1.000 | 0.675 |
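
As a rough illustration of how a correlation of this kind could be computed, the sketch below averages each article's score over repeated iterations and then takes the Spearman correlation with human scores. It is an assumption-laden example rather than the article's code: the function names are hypothetical, tied scores receive average ranks (a common convention, not confirmed by the source), and the sample data are invented.

```python
from statistics import mean

def average_over_iterations(runs):
    """runs[i][k] is the score for article k in iteration i; returns per-article means."""
    return [mean(scores) for scores in zip(*runs)]

def average_ranks(values):
    """Rank values from 1, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: three iterations of model scores for five articles.
runs = [
    [3, 3, 2, 4, 3],
    [3, 2, 2, 4, 3],
    [2, 3, 3, 4, 4],
]
human = [2, 3, 1, 4, 3]

model_average = average_over_iterations(runs)
print("Spearman:", round(spearman(model_average, human), 3))
```

Averaging before correlating turns repeated whole-number scores into finer-grained values, which reduces ties among the model scores and gives the ranking more resolution.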

Average human scores and model average scores.

| | Human | GPT-3.5 turbo: Titles | GPT-3.5 turbo: Abstracts | GPT-3.5 turbo: Truncated text | GPT-4o-mini: Abstracts | GPT-4o-mini: Truncated text | GPT-4o: Abstracts | GPT-4o: Truncated text |
|---|---|---|---|---|---|---|---|---|
| Mean score | 2.75 | 2.49 | 2.75 | 3.03 | 2.93 | 3.22 | 2.99 | 3.16 |
DOI: https://doi.org/10.2478/jdis-2025-0011 | Journal eISSN: 2543-683X | Journal ISSN: 2096-157X
Language: English
Page range: 7 - 25
Submitted on: Aug 22, 2024
Accepted on: Dec 11, 2024
Published on: Feb 18, 2025
Published by: Chinese Academy of Sciences, National Science Library
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2025 Mike Thelwall, published by Chinese Academy of Sciences, National Science Library
This work is licensed under the Creative Commons Attribution 4.0 License.