Incorporating LLM-Derived Information into Hypothesis Testing for Genomics Applications


Abstract

We propose strategies for incorporating the information in large language models (LLMs) into statistical hypothesis tests in genomics studies. Using gene embeddings derived from text inputs to OpenAI’s GPT-3.5 model, we show that biological signals in a variety of genomics datasets lie near the principal subspace spanned by the embeddings. We then use a frequentist assisted by Bayes (FAB) framework to propose three hypothesis tests that are optimal with respect to prior information based on the gene embedding subspace. In three separate real-world genomics examples, the FAB tests guided by the LLM-derived information achieve greater power than their classical counterparts.
