Thank you for the amazing work!
I have two questions regarding the VSI-Bench results:
- From your HuggingFace eval.log, I saw that the reported vsibench_score is 50.6616 (rounded to 50.7 in the paper). May I ask which lmms-eval git hash or version was used for that evaluation, if you happen to remember?
  When I re-ran the evaluation using your released pretrained model and the same configuration as in your eval.sh script (32 frames, greedy decoding, 0-shot, full test split), I obtained a score of 50.8634. I am wondering whether this small difference could be due to lmms-eval version differences or possible dataset revisions.
- I also trained the model following the same training setup described in your work, in order to reproduce the results. However, under the same evaluation configuration, my reproduced model achieves 49.35 on VSI-Bench. This gap (around 1.3–1.5 points) seems slightly larger than what I would expect from pure training randomness. I am using cross-node distributed training, but otherwise all settings (batch size, learning rate, scheduler, number of frames, seed, etc.) are identical to your released configuration.
  In your experience, are there any factors that are particularly sensitive for this benchmark and might explain such a difference?
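To put the gaps above in context, here is the rough back-of-envelope I used to judge what counts as noise. The question count and the simple per-question-accuracy model are my own assumptions, not VSI-Bench specifics (the benchmark mixes accuracy and numerical MRA metrics), so treat this only as an order-of-magnitude sanity check:

```python
import math

# Back-of-envelope standard error of a benchmark score, treating it as the
# mean of independent per-question scores in [0, 1]. Both numbers below are
# assumptions, not VSI-Bench specifics, so this gives only a rough noise
# scale rather than an exact figure.
n = 5000   # assumed number of test questions
p = 0.50   # observed score, expressed as a proportion

se_points = math.sqrt(p * (1 - p) / n) * 100  # standard error, in score points
print(f"eval-noise standard error ≈ {se_points:.2f} points")
```

Under these assumptions the noise floor is roughly 0.7 points, so the 0.2-point evaluation gap looks well within noise, while the ~1.4-point training gap is about two standard errors, which is why it seems worth asking about.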
Thank you very much for your time and for sharing your work!