OpenSeeSimE: A Large-Scale Benchmark to Assess Vision-Language Model Question Answering Capabilities in Engineering Simulations
Abstract
Engineering simulation interpretation is a major bottleneck in design cycles, requiring expensive domain expertise to validate complex outputs and ensure safety and performance. While modern large language models (LLMs) may assist in interpretation, they face fundamental scalability limitations, as even modest simulations exceed the context windows of best-in-class LLMs. Vision-language models (VLMs), which have demonstrated success across technical visual reasoning domains from medical imaging to materials characterization, represent a promising alternative for processing simulation visualizations as compressed representations. However, their effectiveness for engineering simulation interpretation remains unknown, constrained by the absence of large-scale evaluation frameworks and prohibitive expert annotation costs. We introduce OpenSeeSimE, a large-scale benchmark consisting of 200,000+ question-answer pairs across 10,000 parametrically varied simulations. This 850× increase in scale enables statistically robust evaluation across diverse simulation configurations and question types. Evaluation of ten state-of-the-art VLMs reveals a fundamental finding: models with strong performance on general visual reasoning benchmarks perform at random-chance levels (29-47%) on engineering simulations, with negligible effect sizes, establishing critical baselines for domain-specific model development.