Systematic Characterization of the Effectiveness of Alignment in Large Language Models for Categorical Decisions
Abstract
As large language models (LLMs) are increasingly deployed in high-stakes domains like healthcare, understanding how well their decision-making aligns with human preferences and values becomes crucial, especially when we recognize that there is no single gold standard for these preferences. This paper applies a systematic methodology for evaluating preference alignment in LLMs on categorical decision-making, with medical triage as a domain-specific use case. It also measures how effectively an alignment procedure changes the alignment of a specific model. Central to this methodology is a novel, simple measure, the Alignment Compliance Index (ACI), which quantifies how effectively an LLM can be aligned to a given preference function or gold standard. Because the ACI measures the effect rather than the process of alignment, it is applicable to alignment methods beyond the in-context learning used in this study.
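The abstract does not reproduce the ACI formula. As an illustrative sketch only, one plausible reading of a measure that "quantifies how effectively an LLM can be aligned to a given preference function" is the normalized gain in agreement with the gold standard after an alignment attempt. The function names and the headroom normalization below are assumptions, not the paper's definition.

```python
from typing import Sequence


def agreement(decisions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of categorical decisions that match the gold-standard preference function."""
    assert len(decisions) == len(gold)
    return sum(d == g for d, g in zip(decisions, gold)) / len(gold)


def aci(pre: Sequence[str], post: Sequence[str], gold: Sequence[str]) -> float:
    """Hypothetical Alignment Compliance Index: gain in agreement after an
    alignment attempt, normalized by the available headroom.
    1.0 = full compliance with the target preference function;
    negative values = the alignment attempt degraded the model."""
    a_pre, a_post = agreement(pre, gold), agreement(post, gold)
    headroom = 1.0 - a_pre
    return (a_post - a_pre) / headroom if headroom > 0 else 0.0
```

A measure of this shape depends only on the model's outputs before and after alignment, which is consistent with the abstract's claim that the ACI captures the effect of alignment rather than any particular alignment mechanism.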
Using a dataset of simulated patient pairs, three frontier LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini Advanced) were assessed on their ability to make triage decisions consistent with an expert clinician’s preferences. The models’ performance before and after alignment attempts was evaluated using various prompting strategies. The results reveal significant variability in alignment effectiveness across models and alignment approaches. Notably, models that performed well pre-alignment sometimes degraded post-alignment, as measured by the ACI, and small changes in the target preference function led to large shifts in model rankings.
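To make the pre-/post-alignment protocol concrete, the following is a minimal hypothetical harness for eliciting categorical triage decisions on patient pairs. The `ask_model` callable, the prompt wording, and aligning via an in-context prefix of expert examples are all illustrative assumptions; the abstract states only that simulated patient pairs and various prompting strategies were used.

```python
from typing import Callable, Optional, Sequence


def evaluate_pairs(
    ask_model: Callable[[str], str],
    patient_pairs: Sequence[dict],
    alignment_prompt: Optional[str] = None,
) -> list[str]:
    """Return the model's categorical choice ('A' or 'B') for each patient pair,
    optionally prefixing an in-context alignment prompt (e.g., expert examples)."""
    decisions = []
    for pair in patient_pairs:
        prompt = (alignment_prompt + "\n\n") if alignment_prompt else ""
        prompt += (
            "Two patients need triage; only one can be treated first.\n"
            f"Patient A: {pair['a']}\n"
            f"Patient B: {pair['b']}\n"
            "Answer with exactly one letter: A or B."
        )
        decisions.append(ask_model(prompt).strip().upper()[:1])
    return decisions


# Sketch of usage, comparing the same model before and after alignment:
# pre  = evaluate_pairs(ask_model, pairs)
# post = evaluate_pairs(ask_model, pairs, alignment_prompt=expert_examples)
# print(aci(pre, post, gold))
```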
Targeted questioning was also used to explore the implicit ethical principles, as understood by humans, underlying the LLMs’ decisions. These findings highlight the complex, multifaceted nature of decision-making and the challenges of robustly aligning AI systems with human values. They also motivate the near-term use of a practical set of methods, together with the ACI, to understand the correspondence between the variety of human and LLM decision-making values in specific scenarios.