Thank you for the amazing work!
I have two questions regarding the VSI-Bench results:
- From your HuggingFace eval.log, I saw that the reported vsibench_score is 50.6616 (rounded to 50.7 in the paper). May I ask which lmms-eval git hash or version was used for that evaluation, if you happen to remember?
  When I re-ran the evaluation using your released pretrained model and the same configuration as in your eval.sh script (32 frames, greedy decoding, 0-shot, full test split), I obtained a score of 50.8634. I am wondering whether this small difference could be due to lmms-eval version differences or possible dataset revisions.
- I also trained the model following the same training setup described in your work, in order to reproduce the results. However, under the same evaluation configuration, my reproduced model achieves 49.35 on VSI-Bench. This gap (around 1.3–1.5 points) seems slightly larger than what I would expect from pure training randomness. I am using cross-node distributed training, but otherwise all settings (batch size, learning rate, scheduler, number of frames, seed, etc.) are identical to your released configuration.
  In your experience, are there any factors that are particularly sensitive for this benchmark and might explain such a difference?
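To put the gaps above in context, here is the rough back-of-envelope I used to judge what counts as noise. The question count and the simple per-question-accuracy model are my own assumptions, not VSI-Bench specifics (the benchmark mixes accuracy and numerical MRA metrics), so treat this only as an order-of-magnitude sanity check:

```python
import math

# Back-of-envelope standard error of a benchmark score, treating it as the
# mean of independent per-question scores in [0, 1]. Both numbers below are
# assumptions, not VSI-Bench specifics, so this gives only a rough noise
# scale rather than an exact figure.
n = 5000   # assumed number of test questions
p = 0.50   # observed score, expressed as a proportion

se_points = math.sqrt(p * (1 - p) / n) * 100  # standard error, in score points
print(f"eval-noise standard error ≈ {se_points:.2f} points")
```

Under these assumptions the noise floor is roughly 0.7 points, so the 0.2-point evaluation gap looks well within noise, while the ~1.4-point training gap is about two standard errors, which is why it seems worth asking about.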
Thank you very much for your time and for sharing your work!