Incorporating LLM-Derived Information into Hypothesis Testing for Genomics Applications
Abstract
We propose strategies for incorporating the information in large language models (LLMs) into statistical hypothesis tests in genomics studies. Using gene embeddings derived from text inputs to OpenAI's GPT-3.5 model, we show that biological signals in a variety of genomics datasets reside near the principal subspace spanned by the embeddings. We then use a frequentist and Bayes (FAB) framework to propose three hypothesis tests that are optimal with respect to prior information based on the gene embedding subspace. In three separate real-world genomics examples, the FAB tests guided by the LLM-derived information achieve greater power than their classical counterparts.
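The abstract's central empirical claim, that per-gene biological signal concentrates near the principal subspace of the gene embeddings, can be illustrated with a small sketch. Everything below is hypothetical: the embeddings are simulated Gaussian matrices rather than real GPT-3.5 outputs, and the "signal" is constructed to lie mostly in the subspace so the projection diagnostic has something to detect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for LLM-derived gene embeddings: one d-dimensional
# embedding per gene (n_genes x d). Real embeddings would come from a text
# embedding model; here we simply simulate them.
n_genes, d, k = 300, 50, 10
E = rng.normal(size=(n_genes, d))

# Top-k principal subspace of the embedding matrix (its left singular
# vectors), which is a k-dimensional subspace of R^{n_genes}.
U, _, _ = np.linalg.svd(E, full_matrices=False)
Uk = U[:, :k]

# A per-gene "signal" vector built to lie mostly inside that subspace,
# plus isotropic noise, mimicking signal that concentrates there.
theta = Uk @ rng.normal(size=k) + 0.1 * rng.normal(size=n_genes)

# Fraction of the signal's energy captured by the principal subspace;
# values near 1 indicate the signal lies close to the subspace.
proj = Uk @ (Uk.T @ theta)
captured = np.linalg.norm(proj) ** 2 / np.linalg.norm(theta) ** 2
print(f"energy captured by top-{k} subspace: {captured:.2f}")
```

A diagnostic like this (projection energy relative to a random-subspace baseline) is one simple way to check whether embedding-derived prior information is worth feeding into a FAB-style test.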