Development of research ethics guidelines for healthcare generative artificial intelligence: deriving expert consensus through a Delphi study

Abstract

Background: While generative artificial intelligence (AI) is rapidly proliferating in healthcare research and clinical settings, there is a lack of actionable research ethics standards that reflect the unique technical specificities of generative AI, such as hallucination, agentic autonomy, and decontextualization. Existing AI ethics studies primarily focus on presenting universal principles, and prior Delphi studies report difficulties in deriving expert consensus owing to the excessive broadness of AI concepts. We aim to systematically identify ethical issues in generative AI research within the healthcare domain and to develop practical ethical guidelines and a checklist that researchers can use across the entire research and development lifecycle.

Methods: We applied a three-stage modified Delphi method in accordance with the Conducting and REporting DElphi Studies (CREDES) guidelines. Thirty-five experts spanning the fields of medicine, law/ethics/policy, and AI technology were invited. Round 1 was conducted as an in-person workshop involving 18 experts, while Rounds 2 (n = 32) and 3 (n = 27) were conducted via online surveys rating 56 items on a 7-point Likert scale. Consensus criteria were set at interquartile range ≤ 1.5 and coefficient of variation < 0.5 (high consensus) for Round 2, and a stricter criterion of interquartile range ≤ 1.0 (strengthened consensus) for Round 3.

Results: Across Rounds 2 and 3, 96.4% of all items ultimately entered the acceptable range for guideline adoption, and 60.5% reached strong consensus even under the strengthened consensus criteria. By domain, documentation standards (mean 6.06), safety measures (mean 5.95), and evaluation methods (mean 5.86) recorded the highest importance. Among individual items, explainable AI (mean 6.48), ensuring the diversity of training data (mean 6.44), and human-in-the-loop oversight (mean 6.33) emerged as core items and top-priority strategies. Based on these findings, we developed an ethical framework comprising three domains (data, governance, and design-by-value) and eight value dimensions, alongside a lifecycle checklist categorized into pre-development, development, and post-deployment stages.

Conclusions: We developed a differentiated ethical framework and practical checklist that reflect the technical specificities of generative AI. These outputs can serve as sector-specific sub-guidelines for healthcare under the Framework Act on AI and as criteria for institutional review board reviews.