Evaluating the Accuracy and Reliability of AI Content Detectors in Academic Contexts

Abstract

Generative Artificial Intelligence (GenAI) tools capable of producing human-like text have raised considerable concerns regarding academic integrity. In response, AI content detectors such as Turnitin and Originality are increasingly employed in higher education. However, empirical evidence regarding their accuracy, reliability, and fairness, particularly in the context of English as a Foreign Language (EFL) writing, remains limited. This study evaluates the performance of both detectors across variations in text length, genre, and authorship type. A balanced dataset of 192 texts was constructed, comprising authentic EFL student writing, professionally authored human texts, AI-generated outputs, and hybrid compositions. Based on the percentage of AI content identified by each detector, texts were categorized as Human, Hybrid, or AI. Detector performance was assessed against ground-truth labels using precision, recall, specificity, F1 score, and accuracy, and statistical significance was tested using Pearson’s chi-square and Fisher’s Exact Test. Originality outperformed Turnitin in overall accuracy (0.69 vs. 0.61) and macro-average recall (0.60 vs. 0.51). However, both detectors performed poorly on Hybrid texts, with recall scores of 0.31 for Turnitin and 0.02 for Originality. Performance declined significantly with longer texts (p < 0.015 for Turnitin; p < 0.002 for Originality) and varied across genres, with higher accuracy observed in humanities than in science (p < 0.0001 for both detectors). Originality also exhibited a borderline statistically significant bias favoring professionally authored texts over EFL texts (p = 0.058). These findings suggest that neither detector is sufficiently reliable to serve as the sole basis for academic misconduct decisions. Institutions are advised to supplement AI detection tools with human judgment, to incorporate AI literacy into academic curricula, and to encourage detector developers to pursue further research into bias mitigation.
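The categorization and evaluation pipeline the abstract describes can be sketched in a few lines of Python. The thresholds, sample data, and the `categorize` helper below are illustrative assumptions; the abstract does not report the study's exact cut-offs or per-text scores, and standard scikit-learn and SciPy routines stand in for whatever tooling the authors used.

```python
# A minimal sketch of the evaluation pipeline described in the abstract.
# All labels, thresholds, and data below are hypothetical illustrations.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from scipy.stats import chi2_contingency, fisher_exact

LABELS = ["Human", "Hybrid", "AI"]

def categorize(ai_pct, low=20, high=80):
    """Map a detector's reported AI-content percentage to a category.
    The 20/80 thresholds are assumptions, not the study's cut-offs."""
    if ai_pct < low:
        return "Human"
    if ai_pct > high:
        return "AI"
    return "Hybrid"

# Hypothetical ground-truth labels and detector scores for six texts.
truth = ["Human", "Human", "Hybrid", "Hybrid", "AI", "AI"]
scores = [5, 12, 55, 90, 97, 85]   # detector-reported % AI content
pred = [categorize(s) for s in scores]

# Per-class precision/recall/F1, overall accuracy, and macro-average recall.
prec, rec, f1, _ = precision_recall_fscore_support(
    truth, pred, labels=LABELS, zero_division=0)
print("accuracy:", accuracy_score(truth, pred))
print("macro-average recall:", rec.mean())

# Specificity per class from the confusion matrix: TN / (TN + FP).
cm = confusion_matrix(truth, pred, labels=LABELS)
for i, lab in enumerate(LABELS):
    tp = cm[i, i]
    fp = cm[:, i].sum() - tp
    fn = cm[i, :].sum() - tp
    tn = cm.sum() - tp - fp - fn
    print(f"specificity[{lab}]:", tn / (tn + fp))

# Significance tests on a correct/incorrect-by-condition contingency table,
# e.g. short vs. long texts (counts here are invented for illustration).
table = np.array([[30, 18],
                  [22, 26]])
print("chi-square p:", chi2_contingency(table)[1])
print("Fisher's exact p:", fisher_exact(table)[1])
```

Fisher's Exact Test is typically preferred over the chi-square test when the contingency table is 2×2 and expected cell counts are small, which is a plausible reason the study reports both.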
