Evaluating the Accuracy and Reliability of AI Content Detectors in Academic Contexts

Abstract

Generative Artificial Intelligence (GenAI) tools capable of producing human-like text have raised considerable concerns regarding academic integrity. In response, AI content detectors such as Turnitin and Originality are increasingly employed in higher education. However, empirical evidence regarding their accuracy, reliability, and fairness, particularly in the context of English as a Foreign Language (EFL) writing, remains limited. This study evaluates the performance of both detectors across variations in text length, genre, and authorship type. A balanced dataset of 192 texts was constructed, comprising authentic EFL student writing, professionally authored human texts, AI-generated outputs, and hybrid compositions. Based on the percentage of AI content identified by each detector, texts were categorized as Human, Hybrid, or AI. Detector performance was assessed against ground-truth labels using precision, recall, specificity, F1 score, and accuracy, and statistical significance was tested using Pearson’s chi-square and Fisher’s Exact Test. Originality outperformed Turnitin in overall accuracy (0.69 vs. 0.61) and macro-average recall (0.60 vs. 0.51). However, both detectors performed poorly on Hybrid texts, with recall scores of 0.31 for Turnitin and 0.02 for Originality. Performance declined significantly with longer texts (p < 0.015 for Turnitin; p < 0.002 for Originality) and varied across genres, with higher accuracy observed in humanities than in science (p < 0.0001 for both detectors). Originality also exhibited a borderline statistically significant bias favoring professionally authored texts over EFL texts (p = 0.058). These findings suggest that neither detector is sufficiently reliable to serve as the sole basis for academic misconduct decisions. Institutions are advised to supplement AI detection tools with human judgment, to incorporate AI literacy into academic curricula, and to encourage detector developers to pursue further research into bias mitigation.
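The categorization and evaluation pipeline the abstract describes can be sketched in a few lines of Python. The thresholds, sample data, and the `categorize` helper below are illustrative assumptions; the abstract does not report the study's exact cut-offs or per-text scores, and standard scikit-learn and SciPy routines stand in for whatever tooling the authors used.

```python
# A minimal sketch of the evaluation pipeline described in the abstract.
# All labels, thresholds, and data below are hypothetical illustrations.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from scipy.stats import chi2_contingency, fisher_exact

LABELS = ["Human", "Hybrid", "AI"]

def categorize(ai_pct, low=20, high=80):
    """Map a detector's reported AI-content percentage to a category.
    The 20/80 thresholds are assumptions, not the study's cut-offs."""
    if ai_pct < low:
        return "Human"
    if ai_pct > high:
        return "AI"
    return "Hybrid"

# Hypothetical ground-truth labels and detector scores for six texts.
truth = ["Human", "Human", "Hybrid", "Hybrid", "AI", "AI"]
scores = [5, 12, 55, 90, 97, 85]   # detector-reported % AI content
pred = [categorize(s) for s in scores]

# Per-class precision/recall/F1, overall accuracy, and macro-average recall.
prec, rec, f1, _ = precision_recall_fscore_support(
    truth, pred, labels=LABELS, zero_division=0)
print("accuracy:", accuracy_score(truth, pred))
print("macro-average recall:", rec.mean())

# Specificity per class from the confusion matrix: TN / (TN + FP).
cm = confusion_matrix(truth, pred, labels=LABELS)
for i, lab in enumerate(LABELS):
    tp = cm[i, i]
    fp = cm[:, i].sum() - tp
    fn = cm[i, :].sum() - tp
    tn = cm.sum() - tp - fp - fn
    print(f"specificity[{lab}]:", tn / (tn + fp))

# Significance tests on a correct/incorrect-by-condition contingency table,
# e.g. short vs. long texts (counts here are invented for illustration).
table = np.array([[30, 18],
                  [22, 26]])
print("chi-square p:", chi2_contingency(table)[1])
print("Fisher's exact p:", fisher_exact(table)[1])
```

Fisher's Exact Test is typically preferred over the chi-square test when the contingency table is 2×2 and expected cell counts are small, which is a plausible reason the study reports both.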
