From Prompt to Pipeline: Large Language Models for Scientific Workflow Development in Bioinformatics

Abstract

Scientific Workflow Systems (SWSs) such as Galaxy and Nextflow are essential for scalable, reproducible, and automated bioinformatics analyses. However, developing and understanding scientific workflows remains challenging for many domain scientists due to the complexity of tool/module selection, infrastructure requirements, and limited programming expertise. This study explores whether state-of-the-art Large Language Models (LLMs) such as GPT-4o, Gemini 2.5 Flash, and DeepSeek-V3 can assist in generating accurate, complete, and usable bioinformatics workflows. We evaluate a set of representative workflows covering tasks such as RNA-seq, SNP analysis, and DNA methylation across both Galaxy (graphical) and Nextflow (script-based) platforms. To simulate realistic usage, we adopt a tiered prompting strategy: each workflow is first generated using an instruction-only prompt; if the output is incomplete or incorrect, we escalate to a role-based prompt, and finally to chain-of-thought prompting if needed. The generated workflows are evaluated against community-curated baselines from the Galaxy Training Network (GTN) and nf-core, using criteria including correctness, completeness, tool appropriateness, and executability. Results show that LLMs exhibit strong potential in workflow development. Gemini 2.5 Flash produced the most accurate and user-friendly workflows in Galaxy, while DeepSeek-V3 excelled in Nextflow pipeline generation. GPT-4o performed well when given structured prompts. Prompting strategy significantly influenced output quality, with role-based and chain-of-thought prompts enhancing correctness and completeness. Overall, LLMs can reduce the cognitive and technical barriers to workflow development, making SWSs more accessible to both novice and expert users. This work highlights the practical utility of LLMs and provides actionable insights for integrating them into real-world bioinformatics workflow design.
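The tiered prompting strategy described in the abstract can be read as a simple escalation loop. The Python sketch below is illustrative only and is not the authors' implementation: the prompt wording in build_prompt, the generate callable (standing in for any LLM client), and the is_acceptable check (standing in for the comparison against GTN or nf-core baselines) are all hypothetical names introduced here.

```python
from typing import Callable, Tuple

# Escalation order taken from the abstract.
TIERS = ("instruction-only", "role-based", "chain-of-thought")


def build_prompt(tier: str, task: str) -> str:
    """Return a prompt for the given tier.

    The templates are assumptions for illustration; the study's actual
    prompt wording is not given in the abstract.
    """
    if tier == "instruction-only":
        return f"Generate a workflow for the following task: {task}"
    if tier == "role-based":
        return (
            "You are an expert bioinformatics workflow developer. "
            f"Generate a workflow for the following task: {task}"
        )
    if tier == "chain-of-thought":
        return (
            f"Generate a workflow for the following task: {task}\n"
            "Reason step by step: list the required analysis steps, "
            "choose a tool for each step, then assemble the workflow."
        )
    raise ValueError(f"unknown tier: {tier}")


def generate_with_escalation(
    task: str,
    generate: Callable[[str], str],
    is_acceptable: Callable[[str], bool],
) -> Tuple[str, str, bool]:
    """Try each tier in order and stop at the first acceptable output.

    Returns (tier_used, output, accepted). If no tier yields an acceptable
    output, the last tier's output is returned with accepted=False.
    """
    tier, output = TIERS[0], ""
    for tier in TIERS:
        output = generate(build_prompt(tier, task))
        if is_acceptable(output):
            return tier, output, True
    return tier, output, False
```

In practice, generate would wrap a call to a model such as those named in the abstract, and is_acceptable would encode criteria like correctness, completeness, tool appropriateness, and executability. Both are left as parameters because the abstract does not specify how they were implemented.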