Can LLMs Bridge Domain and Visualization? A Case Study on High-Dimension Data Visualization in Single-Cell Transcriptomics

Qianwen Wang
Xinyi Liu
Nils Gehlenborg

Abstract

While many visualizations are built for domain users (e.g., biologists, machine learning developers), understanding how visualizations are used in the domain has long been a challenging task. Previous research has relied on either interviewing a limited number of domain users or reviewing relevant application papers in the visualization community, neither of which provides comprehensive insight into visualizations "in the wild" in a specific domain. This paper aims to fill this gap by examining the potential of using Large Language Models (LLMs) to analyze visualization usage in domain literature. We use high-dimension (HD) data visualization in single-cell transcriptomics as a test case, analyzing 1,203 papers that describe 2,056 HD visualizations with highly specialized domain terminology (e.g., biomarkers, cell lineage). To facilitate this analysis, we introduce a human-in-the-loop LLM workflow that can effectively analyze a large collection of papers and translate domain-specific terminology into standardized data and task abstractions. Instead of relying solely on LLMs for end-to-end analysis, our workflow enhances analytical quality through 1) integrating image processing and traditional NLP methods to prepare well-structured inputs for three targeted LLM subtasks (i.e., translating domain terminology, summarizing analysis tasks, and performing categorization), and 2) establishing checkpoints for human involvement and validation throughout the process. The analysis results, validated with expert interviews and a test set, revealed three often overlooked aspects of HD visualization: trajectories in HD spaces, inter-cluster relationships, and dimension clustering. This research provides a stepping stone for future studies seeking to use LLMs to bridge the gap between visualization design and domain-specific usage.
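
The three LLM subtasks and the human checkpoints described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation: the names `FigureRecord`, `run_workflow`, `llm`, and `review`, the prompt wording, and the category labels are all hypothetical, and the `llm` argument stands in for whatever model call one would actually use.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class FigureRecord:
    """One HD visualization, with text assumed to be prepared upstream
    (e.g., by image processing and traditional NLP)."""
    caption: str
    translated: str = ""
    task_summary: str = ""
    category: str = ""
    needs_review: bool = False


def run_workflow(
    records: List[FigureRecord],
    llm: LLM,
    categories: List[str],
    review: Callable[[FigureRecord], str],
) -> List[FigureRecord]:
    """Run three targeted LLM subtasks per record, with a human checkpoint.

    Assumption: a record is routed to the human reviewer whenever the
    model's answer is not one of the allowed categories.
    """
    for rec in records:
        # Subtask 1: translate domain terminology into generic terms.
        rec.translated = llm(f"Rewrite in domain-agnostic terms: {rec.caption}")
        # Subtask 2: summarize the analysis task the figure supports.
        rec.task_summary = llm(f"Summarize the analysis task: {rec.translated}")
        # Subtask 3: categorize the task.
        answer = llm(f"Choose one of {categories}: {rec.task_summary}").strip()
        if answer in categories:
            rec.category = answer
        else:
            # Human checkpoint: the reviewer supplies the category.
            rec.needs_review = True
            rec.category = review(rec)
    return records
```

Keeping each subtask as a separate call makes intermediate outputs inspectable, which is what allows a reviewer to step in at a specific stage rather than only at the end.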

Version published to 10.31219/osf.io/qtsak_v2 on OSF Preprints
Apr 2, 2025
Version published to 10.31219/osf.io/qtsak_v1 on OSF Preprints
Apr 3, 2024

Related articles

Geranium: Multimodal Retrieval of Genomics Data Visualizations

This article has 6 authors:
1. Huyen N. Nguyen
2. Sehi L'Yi
3. Thomas Chris Smits
4. Shanghua Gao
5. Marinka Zitnik
6. Nils Gehlenborg
This article has no evaluations. Latest version: Dec 27, 2025

DQVis Dataset: Natural Language to Biomedical Visualization

This article has 5 authors:
1. Devin Lange
2. Pengwei Sui
3. Shanghua Gao
4. Marinka Zitnik
5. Nils Gehlenborg
This article has no evaluations. Latest version: Dec 15, 2025

Integrating Microbiome Data Visualization into FAIRDatabase using Edge Functions

This article has 3 authors:
1. Roman van Eldijk
2. Shivam Kumar
3. Vivek Sheraton M
This article has no evaluations. Latest version: Jan 27, 2026
