Closing the Final Mile in Data-Driven Discovery: Interpreting Uncharted Celestial Sources with Large Language Models Across Multimodal Data


Abstract

For decades, astronomers have anticipated that vast data archives hold the potential for serendipitous discoveries, promising to reveal novel phenomena. However, the rapid growth of these datasets now risks outpacing our capacity for analysis, potentially burying intriguing "known unknowns" and even "unknown unknowns" beneath a deluge of numbers. This challenge has given rise to a data-driven research paradigm, which aims to uncover patterns and anomalies within data that may lead to novel scientific discoveries. Although machine learning algorithms have proven highly effective in identifying patterns and anomalous candidates from these data, the critical final step of determining their physical nature from a diverse array of possibilities remains a formidable bottleneck. The sheer breadth of modern astrophysics exceeds the expertise of any individual, or any single team, especially when the focus of data-driven researchers is always directed at the data and algorithms themselves. The advent of large language models (LLMs), with their unprecedented breadth of knowledge, offers a transformative solution. In this study, we present a novel framework that leverages LLMs to interpret the nature of unusual sources. Our investigation commenced with anomalous sources identified by machine learning algorithms through the analysis of NEOWISE infrared light curves. We provided the models with both the light curves and spectral energy distributions (SEDs) of these objects, tasking them with describing the data and inferring the underlying physical nature and source type. We validated the reasonableness of the models' inferences using a curated set of well-studied rare variable sources. Subsequently, we applied the same methodology to previously unclassified sources absent from the SIMBAD database, successfully identifying dozens of objects with high scientific potential and generating AI-proposed follow-up observation plans. Although the computational cost of running LLMs is often nontrivial, this framework, which focuses on a small subset of algorithmically preselected anomalies, remains computationally tractable. We argue that this methodology is not confined to time-domain and SED data but can be readily extended to other data modalities, including images and spectra, paving the way for accelerated discovery in the era of future large-scale sky surveys.
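
As a rough illustration of the step in which a model receives a light curve and an SED and is asked to describe and interpret them, the sketch below assembles such a request as plain text. It is only a minimal example under stated assumptions: the abstract does not say how the data were presented to the models (they may well have been supplied as plots), and the `Source` class, the `build_prompt` function, the task wording, and all numerical values are hypothetical and not taken from the paper. No model is called here; the output is simply the text one might pass to an LLM of choice.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Source:
    """Hypothetical container for one algorithmically preselected source."""
    name: str
    # (MJD, magnitude) pairs, e.g. from one infrared band
    light_curve: List[Tuple[float, float]]
    # (wavelength in microns, flux in mJy) pairs
    sed: List[Tuple[float, float]]


def build_prompt(src: Source) -> str:
    """Serialize the light curve and SED as text and append the two tasks.

    The text layout is an assumption made for this sketch only.
    """
    lc_rows = "\n".join(f"{t:.1f}, {m:.2f}" for t, m in src.light_curve)
    sed_rows = "\n".join(f"{w:.2f}, {f:.2f}" for w, f in src.sed)
    return (
        f"Source: {src.name}\n"
        "Infrared light curve (MJD, mag):\n"
        f"{lc_rows}\n"
        "Spectral energy distribution (wavelength_um, flux_mJy):\n"
        f"{sed_rows}\n"
        "Task 1: Describe the variability and the SED shape.\n"
        "Task 2: Infer the most likely physical nature and source type, "
        "and state your reasoning."
    )


# Made-up values, for illustration only.
example = Source(
    name="example-source",
    light_curve=[(57000.0, 12.31), (57180.0, 11.05), (57360.0, 12.40)],
    sed=[(3.4, 15.2), (4.6, 22.8)],
)
prompt = build_prompt(example)
print(prompt)
```

Keeping prompt construction separate from any model call means the same text can be sent to several LLMs and their answers compared against the well-studied validation sources.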
