Detecting LLM-assisted writing in scientific communication: Are we there yet?

Lazebnik, Teddy; Rosenfeld, Ariel

doi:10.2478/jdis-2024-0020

Figures & Tables

Figure 1. A schematic view of the LLM-Assisted Writing (LAW) detector. The detection process consists of two phases. First, during training, manuscripts are converted into vectors representing the author’s writing style using the technique of Lazebnik & Rosenfeld (2023). The mean and standard deviation of the change in writing style are measured to capture the dynamics of the author’s style. Then, during inference, for each manuscript, we examine whether the change in its author’s writing style is substantial enough to be considered an anomaly and whether this anomaly is aligned with the style of an LLM-generated manuscript with the same title and abstract. If both conditions are met, the manuscript is deemed LLM-assisted.
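The two-phase logic in this caption can be illustrated with a minimal sketch. It is not the authors' implementation. The style vectors are assumed to be given, since the embedding technique of Lazebnik & Rosenfeld (2023) is not reproduced here. The Euclidean distance, the `z_threshold` and `sim_threshold` parameters, and the cosine-based alignment test are all assumptions introduced for illustration.

```python
import math

def euclidean(u, v):
    """Distance between two style vectors (assumed metric, not from the paper)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def fit_style_dynamics(history):
    """Training phase: mean and standard deviation of the change between
    consecutive manuscripts of one author (history is in chronological order)."""
    changes = [euclidean(a, b) for a, b in zip(history, history[1:])]
    mean = sum(changes) / len(changes)
    variance = sum((c - mean) ** 2 for c in changes) / len(changes)
    return mean, math.sqrt(variance)

def is_llm_assisted(history, candidate, llm_reference,
                    z_threshold=2.0, sim_threshold=0.9):
    """Inference phase: flag the candidate only if both conditions hold.
    The thresholds are hypothetical values chosen for this example."""
    mean, std = fit_style_dynamics(history)
    last = history[-1]
    # Condition 1: the style change is anomalously large for this author.
    anomalous = euclidean(last, candidate) > mean + z_threshold * std
    # Condition 2: the change points toward the LLM-generated manuscript's style.
    shift = [c - l for c, l in zip(candidate, last)]
    toward_llm = [r - l for r, l in zip(llm_reference, last)]
    aligned = cosine(shift, toward_llm) >= sim_threshold
    return anomalous and aligned

history = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.05], [0.3, 0.05]]
llm_reference = [1.5, 0.1]
print(is_llm_assisted(history, [1.3, 0.05], llm_reference))  # large shift toward the LLM style
print(is_llm_assisted(history, [0.4, 0.05], llm_reference))  # ordinary drift
```

Requiring both conditions means that a large change in style alone does not trigger a detection. The change must also move toward the style of the LLM-generated reference.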

Pairwise Cohen’s κ values calculated for the five detectors. Each cell contains the results for the assessment set on the left, and the results for the false positive set on the right.

	DetectLLM	ZipPy	ConDA	LAW
LLMDet	0.86 / 0.82	0.68 / 0.74	0.67 / 0.72	0.63 / 0.69
DetectLLM		0.72 / 0.76	0.67 / 0.75	0.59 / 0.62
ZipPy			0.86 / 0.96	0.77 / 0.88
ConDA				0.81 / 0.90
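For readers unfamiliar with the agreement measure in this table, Cohen's κ for two detectors can be computed from their per-manuscript decisions as sketched below. The label lists in the usage lines are made-up illustrations, not data from the study.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected if the raters labelled independently."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical decisions of two detectors (1 = flagged as LLM-assisted).
detector_a = [1, 1, 1, 0]
detector_b = [1, 1, 0, 0]
print(cohen_kappa(detector_a, detector_b))
```

A value of 1 indicates perfect agreement and a value of 0 indicates agreement no better than chance.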

List of manuscripts included in the assessment set.

LLM-assisted Writing	Counterpart
Osterrieder, J., GPTChat, A Primer on Deep Reinforcement Learning for Finance, SSRN (2023)	Finance, F., Osterrieder, J., Generative Adversarial Networks in finance: an overview, arXiv (2021)
Biswas, S., Will ChatGPT take my Job? Replies and Advice by ChatGPT, SSRN (2023)	Biswas, S., Role of Sonography in Ocular Trauma: A Study, ARC Journal of Surgery (2021)
Askr, H., Darwish, A., Hassanien, A.E., ChatGPT, The Future of Metaverse in the Virtual Era and Physical World: Analysis and Applications. Studies in Big Data (2023)	Gad, I., Hassanien, A. E., A wind turbine fault identification using machine learning approach based on pigeon inspired optimizer, Tenth International Conference on Intelligent Computing and Information Systems (2021)
King, M. R., ChatGPT, A Conversation on Artificial Intelligence, Chatbots, and Plagiarism in Higher Education, Cellular and Molecular Bioengineering (2023)	King, M. R., CMBE Moves to the Structured Abstract Format: A Note from the Editor, Cellular and Molecular Bioengineering (2017)
Kung et al., Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models, medRxiv (2022)	Kung, H. K., Host physician perspectives to improve predeparture training for global health electives, Medical Education (2017)
O’Connor S., Open artificial intelligence platforms in nursing education: Tools for academic progress or abuse?, Nurse Education in Practice (2022)	O’Connor S., Exoskeletons in Nursing and Healthcare: A Bionic Future, Clinical Nursing Research (2021)
Rossoni, L., A inteligência artificial e eu: escrevendo o editorial juntamente com o ChatGPT, Revista Eletrônica de Ciência Administrativa (2022)	Rossoni, L., Editorial: A RECADM no Redalyc e o Dilema das Bases e Indexadores, Revista Eletrônica de Ciência Administrativa (2021)
ChatGPT, Zhavoronkov, A., Rapamycin in the context of Pascal’s Wager: generative pre-trained transformer perspective, Oncoscience (2022)	Zhavoronkov, A., The inherent challenges of classifying senescence, Science (2020)
Biswas, S., ChatGPT and the Future of Medical Writing, Radiology (2023)	Biswas, S., Biswas, S., A Study on penile doppler, MedCrave Online Journal of Surgery (2017)
Lazebnik, T., ChatGPT, The Impact of Fruit and Vegetable Consumption and Physical Activity on Diabetes Risk among Adults, arXiv (2022)	Lazebnik, T., Bunimovich-Mendrazitsky, S., The Signature Features of COVID-19 Pandemic in a Hybrid Mathematical Model—Implications for Optimal Work–School Lockdown Policy, Advanced Theory and Simulations (2021)
BaHammam, A. S., Trabelsi, K., Pandi-Perumal, S. R., Jahrami, H., Adapting to the Impact of AI in Scientific Writing: Balancing Benefits and Drawbacks while Developing Policies and Regulations, Journal of Nature and Science of Medicine (2023)	Akhtar, N., Ravi Gupta, S.R. Pandi-Perumal, Ahmed S. BaHammam: Clinical Atlas of Polysomnography: A Book Review, Sleep and Vigilance (2021)

The performance of the examined detectors (columns) on the assessment set (first row) and the false-positive set (second row). The performance is presented as the accuracy with the F1-score in brackets (for the assessment set) and as the false positive rate (for the false-positive set).

Model	LLMDet	DetectLLM	ZipPy	ConDA	LAW
Accuracy	0.546	0.591	0.637	0.637	0.727
F1-score	0.286	0.471	0.600	0.600	0.700
Recall	0.334	0.534	0.627	0.627	0.700
Precision	0.250	0.421	0.575	0.575	0.700
False positive rate	17.2%	13.8%	9.7%	8.8%	3.1%
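The F1-score row is the harmonic mean of the precision and recall rows. The short check below recomputes it from the values reported in the table. The function name `f1_score` is an illustrative choice, and only the four distinct detector columns are listed because ZipPy and ConDA share identical assessment-set values.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the table above.
reported = {
    "LLMDet": (0.250, 0.334, 0.286),
    "DetectLLM": (0.421, 0.534, 0.471),
    "ZipPy": (0.575, 0.627, 0.600),
    "LAW": (0.700, 0.700, 0.700),
}

for name, (precision, recall, f1_reported) in reported.items():
    f1_computed = f1_score(precision, recall)
    # Each recomputed value matches the reported one to three decimals.
    print(f"{name}: computed {f1_computed:.3f}, reported {f1_reported:.3f}")
```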

Pairwise comparison between the five detectors. The results are shown as the p-value with the test statistic in brackets. Each cell contains the results for the assessment set on the left, and the results for the false positive set on the right.

	LLMDet	DetectLLM	ZipPy	ConDA
DetectLLM	0.66(0.19)/< 0.01(10.45)
ZipPy	0.38(0.78)/< 0.01(69.63)	0.66(0.20)/< 0.01(20.96)
ConDA	0.06(3.67)/< 0.01(95.71)	0.66(0.20)/< 0.01(34.21)	1.0(0.0)/0.28(1.13)
LAW	0.01(0.03)/< 0.01(729.19)	0.15(2.06)/< 0.01(34.21)	0.34(0.92)/< 0.01(161.74)	0.34(0.92)/< 0.01(120.46)
