Evaluating LingualAI: A Prospective Validation of AI-Based Real-Time Translation Against Certified Human Interpreters

Abstract

Background: Limited English proficiency (LEP) affects >25 million people in the United States and is linked to health disparities in safety, quality, and outcomes. While professional interpreters remain the standard, access is often constrained. Real-time AI translation systems are increasingly available, yet their clinical performance relative to certified interpreters is uncertain. Objective: To evaluate whether an in-house AI application (LingualAI) achieves non-inferior translation quality compared with certified interpreters in English–Spanish otorhinolaryngology encounters. Design, Setting, and Participants: Prospective, within-subject comparison using three standardized outpatient scenarios (33 lines: 18 clinician, 15 patient) enacted by two pairs of native speakers. Each line was translated by LingualAI and by two certified medical interpreters. Nine bilingual clinicians, blinded to source but given scenario context, independently rated anonymized audio clips. Main Measures: Twelve domains rated on 5-point Likert scales: primary (terminology accuracy, adequacy of meaning), secondary (completeness, grammar, vocabulary, cultural appropriateness), voice-related (fluency, clarity, prosody, pacing), and conclusive (overall quality, clinical confidence). The non-inferiority margin was prespecified at 0.30 points (Δ = Human − AI). Analyses used paired tests and mixed-effects models with random intercepts for line; inter-rater reliability was assessed via Krippendorff's α. Results: Across models, LingualAI was non-inferior for adequacy of meaning and terminology accuracy; completeness also met the criterion. Human interpreters scored higher on delivery-related and linguistic-mechanics domains, including clarity/intelligibility (Δ≈0.50), fluency (Δ≈1.1), prosody (Δ≈0.6), pacing (Δ≈0.4), grammar, vocabulary, and cultural appropriateness. Conclusive ratings favored humans for overall quality (Δ≈0.6) and clinical confidence (Δ≈0.6).
Findings were consistent in direction-specific contrasts (English→Spanish clinician lines; Spanish→English patient lines). Inter-rater reliability was modest (α = 0.31), reflecting first-impression scoring. In exploratory system metrics, mean end-to-end translation latency was ~9.7 s, with estimated per-session costs substantially lower than phone or video interpreter services. Conclusions: LingualAI preserves core meaning and terminology at near-interpreter levels but lags in speech naturalness and delivery (fluency, prosody, pacing), leading to lower overall quality and clinical confidence. AI translation may serve as a useful aid when interpreters are unavailable; however, its use today should remain aligned with professional standards and ideally follow an interpreter-in-the-loop model rather than replacement. Continued refinement of voice and delivery features should improve perceived speech naturalness, allowing applications such as LingualAI to approximate human interpreters more closely across all measures over time. Both further technical work and clinical validation are necessary for the safe and effective deployment of such applications in real-world settings (Fig. 1).
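The prespecified non-inferiority criterion described above (Δ = Human − AI, margin 0.30 points) can be illustrated with a minimal sketch. This is not the study's analysis code: it uses a simple paired normal-approximation confidence interval rather than the mixed-effects models reported in the abstract, and the rating values are invented for illustration. AI is declared non-inferior when the upper bound of the 95% CI for the mean paired difference falls below the margin.

```python
# Hypothetical sketch of a paired non-inferiority check, assuming
# per-line mean ratings for each arm. Delta = human - ai; AI is
# non-inferior if the 95% CI upper bound for mean delta is < 0.30.
import math
import statistics

MARGIN = 0.30  # prespecified margin, in points on the 5-point Likert scale

def non_inferior(human, ai, margin=MARGIN, z=1.96):
    """Paired non-inferiority check via a normal-approximation 95% CI.

    Returns (mean_delta, ci_upper, verdict), where verdict is True
    when the CI upper bound lies below the margin.
    """
    deltas = [h - a for h, a in zip(human, ai)]
    mean_d = statistics.mean(deltas)
    se = statistics.stdev(deltas) / math.sqrt(len(deltas))
    ci_upper = mean_d + z * se
    return mean_d, ci_upper, ci_upper < margin

# Illustrative per-line ratings for one domain (e.g., adequacy of meaning):
human = [4.6, 4.4, 4.8, 4.5, 4.7, 4.6, 4.5, 4.4]
ai    = [4.5, 4.3, 4.7, 4.6, 4.6, 4.5, 4.4, 4.5]
mean_d, upper, verdict = non_inferior(human, ai)
```

In practice the study fit mixed-effects models with random intercepts for line, which accounts for repeated ratings of the same line across the nine raters; the sketch above only conveys the decision rule, not the full model.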
