Evaluating the Clinical Competence of Artificial Intelligence Applications in Psychiatry: A Systematic Review
Abstract
Artificial intelligence (AI) is an increasingly promising technology in psychiatry, with the potential to transform mental healthcare. The number of AI applications designed for psychiatric diagnosis and treatment has grown substantially over the last few years; however, clinicians remain concerned about the real-world readiness of these applications in clinical care settings. These concerns have some validity, given recently reported cases of worsening psychosis and suicide attempts associated with AI application use. Although a few studies have examined the performance of AI applications on multiple-choice question banks, the clinical usefulness and practical relevance of that test performance have yet to be assessed. Such an assessment would appraise both the complexity of the cases an application can handle and the appropriateness of its clinical decision-making, offering a more definitive picture of the real-world readiness of AI applications in clinical scenarios and of their performance relative to human physicians.

We conducted a systematic review of peer-reviewed publications from April 2014 to July 2025 evaluating the real-world clinical readiness of publicly available psychiatric AI applications. Following PRISMA and MOOSE guidelines, 68 publications were identified, of which 24 met the inclusion criteria, yielding 24 unique applications. Each AI application was evaluated on the test administered for clinical competence, with case complexity rated using the Amsterdam Clinical Challenge Scale and clinical decision-making assessed with Miller's Pyramid of Clinical Competence. Coverage of the major domains of psychiatric practice (evaluation, diagnosis, psychopharmacology, psychotherapy, and psychosocial intervention) was also considered. Competence was determined from the intersection of case complexity and decision-making level.

Findings revealed substantial heterogeneity in performance across the domains of psychiatry, with no AI application achieving human-level competence in all domains of psychiatric practice. Some psychotherapy-focused tools demonstrated moderate complexity and comparatively higher competence, while most applications remained limited to basic knowledge application (Miller Level 2). Core clinical domains, such as evaluation and diagnosis, were underdeveloped, with more than 70% of tools lacking defined complexity or competence ratings. Importantly, none of the applications underwent board certification-style testing; only 10.5% reported training on board-level material, and just 4.2% disclosed their training datasets.

These results highlight a critical gap between the potential of psychiatric AI and its demonstrated real-world readiness. Standardized, board-style evaluation, transparency in training data, and more rigorous measures of clinical decision-making are essential to building trust and supporting the safe integration of AI into psychiatric care.