There has been a growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short, natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model, GPT-4o, obtains only ~50% on our text and video scores, a large gap from the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform much worse, scoring mostly at random-chance level. Through this work, we shed light on the fact that temporal reasoning in short videos is a problem yet to be fully solved.
GPT-4o, one of the state-of-the-art Large Multimodal Models (LMMs), is unable to answer a simple question about the order in which two events happened. Not only does its response fail to mention any temporality, but its analyses of both videos are completely wrong.
Above: an example instance from Winoground.
Above: an example instance from Vinoground.
Our work is inspired by Winoground, a challenging counterfactual benchmark
for visio-linguistic compositional reasoning in images.
In Winoground, a model must correctly match two images with their corresponding captions,
where both captions use the same set of words, but are rearranged to describe each image.
Our benchmark's name changes the "W" to a "V" for "video",
and further employs temporal counterfactuals to emphasize this unique element of video data.
We use GPT-4 to generate counterfactual caption pair candidates, then find the corresponding videos using VATEX's captions as the index with the help of a sentence transformer and the FAISS library. If no such video can be found, we search YouTube with the caption in hopes of finding the corresponding video.
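As a rough illustration of this retrieval step, below is a minimal sketch assuming the sentence-transformers and FAISS Python libraries; the encoder name, the top-k value, and the load_vatex_captions helper are illustrative assumptions rather than the exact setup used in the paper.

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

# Hypothetical loader returning a list of (video_id, caption) pairs from the VATEX annotations.
vatex_captions = load_vatex_captions()

# Embed every VATEX caption and build a cosine-similarity index over them.
corpus = encoder.encode([cap for _, cap in vatex_captions], normalize_embeddings=True)
index = faiss.IndexFlatIP(corpus.shape[1])  # inner product == cosine similarity after normalization
index.add(np.asarray(corpus, dtype="float32"))

def find_candidate_videos(query_caption, k=5):
    # Return the k VATEX videos whose captions are closest to a generated counterfactual caption.
    q = encoder.encode([query_caption], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [(vatex_captions[i][0], float(s)) for i, s in zip(ids[0], scores[0])]

Candidates returned this way can then be manually verified, with a YouTube search as the fallback when no VATEX video fits the caption.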
We provide an overview of the seven categories Vinoground encompasses in the flashcards below.
We use text score, video score, and group score as our metrics to evaluate a model's textual, visual and temporal understanding capabilities in a balanced manner.
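To make these metrics concrete, here is a minimal sketch of Winoground-style scoring for one counterfactual pair, assuming a similarity-style score(caption, video) function (as with CLIP-based models; generative LMMs can instead be prompted to make the same binary choices, and the pairing logic is analogous).

# Sketch of counterfactual-pair metrics for captions (c1, c2) and videos (v1, v2),
# where c1 matches v1 and c2 matches v2. `score` is an assumed matching function.
def text_score(score, c1, c2, v1, v2):
    # For each video, the matching caption must outscore the counterfactual caption.
    return score(c1, v1) > score(c2, v1) and score(c2, v2) > score(c1, v2)

def video_score(score, c1, c2, v1, v2):
    # For each caption, the matching video must outscore the counterfactual video.
    return score(c1, v1) > score(c1, v2) and score(c2, v2) > score(c2, v1)

def group_score(score, c1, c2, v1, v2):
    # Both the text and video criteria must hold simultaneously.
    return text_score(score, c1, c2, v1, v2) and video_score(score, c1, c2, v1, v2)

The benchmark-level score for each metric is then the fraction of counterfactual pairs for which the corresponding criterion holds.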
It can be seen that using more frames increases LMM performance on our benchmark. This shows that temporality is indeed needed to perform well on Vinoground, and that we are not suffering from "single-frame bias".
Using too many frames, however, significantly harms performance, indicating that modern LMMs lack the ability to ignore useless visual signals in the input.
On the other hand, humans perform better when given the entire video with audio than when given most models' 32-frame sampling. This indicates that finding ways for models to process more frames at once is an important research direction for temporal reasoning.
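For reference, the N-frame inputs in such evaluations are commonly obtained by uniform temporal sampling; the following is a generic sketch using OpenCV, not necessarily the exact preprocessing each model uses.

import cv2
import numpy as np

def sample_frames(video_path, num_frames=32):
    # Uniformly sample `num_frames` RGB frames across the whole video.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames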
Interestingly, many models perform significantly better on the "viewpoint" and "contextual" categories, which involve drastic frame changes, while being significantly worse on the other categories. This shows that models are much better at analyzing coarse-level information than fine-grained details.
@article{zhang2024vinoground,
title={Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos},
author={Zhang, Jianrui and Cai, Mu and Lee, Yong Jae},
journal={arXiv},
year={2024},
eprint={2410.02763},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.02763},
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the LLaMA team for giving us access to their models, and open-source projects, including Alpaca and Vicuna.
Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of CLIP, LLaMA, and GPT-4. The dataset is CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.