Identification of primary sclerosing cholangitis: ICD-10 code validation and comparison with a large language model approach
Abstract
Background: Retrospective studies investigating primary sclerosing cholangitis (PSC) have been limited by the absence of a PSC-specific diagnostic code. In 2018, a new PSC-specific ICD-10 code was introduced.

Aims: We aimed to validate the new ICD-10 code and compare it to other methods of identifying patients with PSC.

Methods: All gastroenterology/hepatology and primary clinic notes and discharge summaries were extracted from the UCSF Epic Clarity database, and potential PSC patients were identified using natural language processing (NLP). PSC diagnosis was determined by physician adjudication through chart review. LASSO regression was used to develop and internally validate a PSC prediction model. Separately, we tested the ability of a large language model (LLM) to distinguish PSC from non-PSC patients.

Results: Among 867 patients identified using NLP, 226 (26%) were adjudicated to have a true PSC diagnosis. The LASSO model selected the ICD-10 code, alkaline phosphatase > 120 IU/L, ursodiol use, inflammatory bowel disease, and history of cholangitis. The ICD-10 code alone had a c-statistic of 0.87, sensitivity of 87.6%, and PPV of 68.8%. The LASSO model had a c-statistic of 0.92, sensitivity of 87.4%, and PPV of 70.7%. The LLM had a c-statistic of 0.77, sensitivity of 91.7%, and PPV of 51.0%.

Conclusions: The PSC-specific ICD-10 code had excellent discriminatory capacity for identifying patients with PSC. While an optimized PSC prediction algorithm had slightly improved test characteristics, the ICD-10 code alone was sufficient for identifying patients with PSC, supporting its use in future database studies of PSC. In contrast, the LLM had inferior discrimination compared with either the ICD-10 code or the prediction model.
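For readers less familiar with the test characteristics reported above, the following is a minimal sketch of how sensitivity, PPV, and the c-statistic can be computed for a single binary classifier (such as the presence or absence of a diagnostic code) against an adjudicated reference standard. It is not the authors' analysis code. The function name `binary_test_metrics` and the toy labels are hypothetical and chosen only for illustration; they are not study data. For a binary predictor, the c-statistic reduces to the mean of sensitivity and specificity.

```python
from typing import Dict, Sequence


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_test_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute test characteristics of a binary classifier.

    y_true: reference-standard labels (1 = disease present, 0 = absent).
    y_pred: classifier output (1 = flagged, 0 = not flagged).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    ppv = _ratio(tp, tp + fp)
    # With a single binary predictor the ROC curve has one operating point,
    # so the area under it equals (sensitivity + specificity) / 2.
    c_statistic = (sensitivity + specificity) / 2

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "c_statistic": c_statistic,
    }


# Toy labels for illustration only (NOT data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_test_metrics(y_true, y_pred)
print(metrics)
```

A multivariable model such as the LASSO model outputs a continuous predicted probability, so its c-statistic would instead be computed over all thresholds (for example with a rank-based AUC) rather than with this single-point shortcut.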