Shared utilities and infrastructure for evaluating vision-language models on the OpenSeeSimE benchmark datasets.
This repository provides standardized tools for prompt construction, response parsing, checkpoint management, and evaluation protocols to ensure reproducible and fair comparison of VLM performance on engineering simulation visualization tasks.
OpenSeeSimE is a large-scale benchmark for evaluating vision-language models on engineering simulation interpretation tasks. The benchmark consists of two comprehensive datasets covering different physics domains:
Available Datasets:
| Dataset | Domain | Examples | HuggingFace Repository |
|---|---|---|---|
| OpenSeeSimE-Structural | Structural Mechanics & FEA | ~103K | 🤗 cmudrc/OpenSeeSimE-Structural |
| OpenSeeSimE-Fluid | Computational Fluid Dynamics | ~98K | 🤗 cmudrc/OpenSeeSimE-Fluid |
What This Repository Provides:
While the full datasets are hosted on HuggingFace, this repository contains the shared utilities for working with both datasets: prompt construction, video processing, response parsing, checkpoint management, and evaluation protocols.
- Dataset Loading: Load and filter OpenSeeSimE datasets by media type (image/video)
- Prompt Construction: Build standardized system and user prompts for consistent evaluation
- Video Processing: Extract frames with middle-frame-centered symmetric sampling
- Response Parsing: Parse and validate model responses with exact-match checking
- Evaluation: Calculate accuracy metrics overall and per question type
- Checkpoint Management: Save and resume evaluation progress automatically
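The evaluation step in the list above (overall and per-question-type accuracy) can be sketched as follows. This is a hedged illustration, not the repository's implementation; the `results` record shape (`question_type`, `is_correct` keys) is an assumption:

```python
from collections import defaultdict

def accuracy_by_question_type(results):
    # results: list of dicts with 'question_type' and 'is_correct' keys (assumed shape)
    overall_correct = 0
    per_type = defaultdict(lambda: [0, 0])  # type -> [correct, total]
    for r in results:
        per_type[r["question_type"]][1] += 1
        per_type[r["question_type"]][0] += r["is_correct"]
        overall_correct += r["is_correct"]
    overall = overall_correct / len(results) if results else 0.0
    return overall, {t: c / n for t, (c, n) in per_type.items()}
```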
```bash
git clone https://github.com/cmudrc/OpenSeeSimE-Full.git
cd OpenSeeSimE-Full
pip install -r requirements.txt
```

Core dependencies include: `datasets`, `transformers`, `torch`, `pillow`, `opencv-python`, `numpy`, `pandas`, `tqdm`
Set your HuggingFace token (required for dataset access):
```bash
export HUGGING_FACE_HUB_TOKEN="hf_..."
```

Or log in via the CLI:

```bash
huggingface-cli login
```

```python
from utils import load_benchmark_dataset

# Load Structural dataset
dataset = load_benchmark_dataset(
    dataset_name="cmudrc/OpenSeeSimE-Structural",
    media_type='image'
)

print(f"Successfully loaded {len(dataset)} examples")
```

```python
from utils import (
    load_benchmark_dataset,
    build_system_prompt,
    build_user_prompt,
    parse_model_response,
    evaluate_response
)

# Load dataset
dataset = load_benchmark_dataset(
    dataset_name="cmudrc/OpenSeeSimE-Structural",
    media_type='image'
)

# Get an example
example = dataset[0]

# Build prompts
system_prompt = build_system_prompt()
user_prompt = build_user_prompt(
    question=example['question'],
    answer_choices=example['answer_choices'],
    is_video=False
)

# Call your model
# model_response = your_model.generate(system_prompt, user_prompt, example['image'])

# Parse and evaluate
model_answer, explanation = parse_model_response(
    model_response,
    example['answer_choices']
)
is_correct = evaluate_response(
    model_answer,
    example['answer'],
    example['answer_choices']
)
```

```python
# Load Structural dataset
dataset = load_benchmark_dataset(dataset_name="cmudrc/OpenSeeSimE-Structural")

# Load Fluid dataset, images only
dataset = load_benchmark_dataset(
    dataset_name="cmudrc/OpenSeeSimE-Fluid",
    media_type='image'
)
```

```python
system_prompt = build_system_prompt()
user_prompt = build_user_prompt(question, answer_choices, is_video=False)
```

```python
# Extract 8 frames with middle frame guaranteed
frames = extract_video_frames(video_path, num_frames=8)
```

```python
# Parse model response
answer, explanation = parse_model_response(response_text, answer_choices)

# Evaluate against ground truth
is_correct = evaluate_response(model_answer, ground_truth, answer_choices)
```

```python
# Load checkpoint to resume
processed_indices, results = load_checkpoint("checkpoint.pkl")

# Save progress
save_checkpoint("checkpoint.pkl", processed_indices, results)

# Clean up after completion
cleanup_checkpoint("checkpoint.pkl")
```

- Use standardized prompts: Always use `build_system_prompt()` and `build_user_prompt()` for consistency
- Validate responses: Use `parse_model_response()` to ensure answers match provided choices
- Enable checkpointing: Save progress frequently for long evaluations
- Use deterministic settings: Set `temperature=0.0` and `do_sample=False` for evaluation
- Middle-frame sampling: For videos, use `middle_frame_guarantee=True` (frame 100 contains maximum deformation/flow development)
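`extract_video_frames` handles the actual decoding; the index selection behind middle-frame-centered symmetric sampling might look like the sketch below. `symmetric_frame_indices` is a hypothetical helper, not the repository's implementation:

```python
def symmetric_frame_indices(total_frames, num_frames=8):
    # Hypothetical sketch: pick frame indices centered on the middle frame,
    # spaced symmetrically toward both ends of the clip.
    middle = total_frames // 2
    step = max(1, total_frames // num_frames)
    half = num_frames // 2
    indices = [middle + (i - half) * step for i in range(num_frames)]
    # Clamp to the valid frame range while preserving order
    return [min(max(idx, 0), total_frames - 1) for idx in indices]
```

For a 200-frame clip with `num_frames=8`, this yields indices spaced 25 frames apart with frame 100 (the middle frame) always included.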
Both datasets share the same structure:
| Field | Type | Description |
|---|---|---|
| `question` | `str` | Question text |
| `answer` | `str` | Ground truth answer |
| `answer_choices` | `List[str]` | Multiple choice options |
| `question_id` | `int` | Question identifier (1-20) |
| `question_type` | `str` | Binary, Multiple Choice, or Spatial |
| `media_type` | `str` | "image" or "video" |
| `image` | `PIL.Image` | Image for image examples |
| `video` | `str` | Video path for video examples |
| `file_name` | `str` | Original file identifier |
| `source_file` | `str` | Source simulation model |
- 5 structural models: Dog Bone, Hip Implant, Pressure Vessel, Thermal Beam, Wall Bracket
- Physics: Stress analysis, deformation patterns, structural mechanics
- Visualizations: Stress contours, displacement fields, strain distributions
- 5 fluid models: Bent Pipe, Converging Nozzle, Mixing Pipe, Heat Sink, Heat Exchanger
- Physics: Turbulent flow, heat transfer, complex flow patterns
- Visualizations: Velocity contours, pressure fields, streamlines, pathlines
For complete dataset details, see the HuggingFace repositories.
The system prompt enforces structured output:
- Line 1: Exact copy of answer from choices
- Line 2+: Brief explanation (10-15 words)
- No paraphrasing or summarizing of the answer
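The repository's `parse_model_response()` is the canonical parser; as a sketch of the exact-match protocol described above, it might behave roughly like this (`parse_first_line_answer` is a hypothetical name):

```python
def parse_first_line_answer(response_text, answer_choices):
    # Hypothetical sketch of the exact-match protocol: the first non-empty
    # line must match one of the provided choices verbatim; the remaining
    # lines are treated as the explanation.
    lines = [ln.strip() for ln in response_text.strip().splitlines()]
    lines = [ln for ln in lines if ln]
    if not lines:
        return None, ""
    answer = lines[0] if lines[0] in answer_choices else None
    explanation = " ".join(lines[1:])
    return answer, explanation
```

A paraphrased first line (e.g. "It is tension.") fails the exact-match check and yields no answer, which is why the system prompt forbids paraphrasing.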
User prompt format:
```
{question}
Answer options:
- {choice_1}
- {choice_2}
...
Instructions:
1. First line: Provide ONLY your answer exactly as it appears in the options above.
2. Second line onwards: Provide a brief summary explaining your reasoning.
Answer:
```
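Filling the template above is straightforward; a hypothetical re-creation (the repository's `build_user_prompt()` is the canonical version, and `format_user_prompt` is an assumed name):

```python
def format_user_prompt(question, answer_choices):
    # Hypothetical sketch that fills the documented user-prompt template;
    # use the repository's build_user_prompt() in real evaluations.
    options = "\n".join(f"- {choice}" for choice in answer_choices)
    return "\n".join([
        question,
        "Answer options:",
        options,
        "Instructions:",
        "1. First line: Provide ONLY your answer exactly as it appears in the options above.",
        "2. Second line onwards: Provide a brief summary explaining your reasoning.",
        "Answer:",
    ])
```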
If you use the OpenSeeSimE benchmark or these utilities, please cite:
```bibtex
@article{ezemba2024opensesime,
  title={OpenSeeSimE: A Large-Scale Benchmark to Assess Vision-Language Model Question Answering Capabilities in Engineering Simulations},
  author={Ezemba, Jessica and Pohl, Jason and Tucker, Conrad and McComb, Christopher},
  year={2025}
}
```

MIT License - See LICENSE file for details.
Authors: Jessica Ezemba ([email protected]), Jason Pohl, Conrad Tucker, Christopher McComb
Institution: Department of Mechanical Engineering, Carnegie Mellon University
For questions or issues, open an issue on GitHub or email [email protected]
Last Updated: December 24, 2025