SVBench Leaderboard
Welcome to the SVBench leaderboard!
SVBench is a benchmark designed to evaluate Large Vision-Language Models (LVLMs) on long-context streaming video understanding. It assesses how well models handle streaming video through temporal multi-turn question-answering chains. To support research and development, SVBench provides a detailed leaderboard reporting the results of 14 models on the benchmark. By ranking models on their SVBench performance, users can quickly identify those that excel at specific tasks and use them to guide subsequent research and applications. Detailed information about SVBench and the leaderboard is available at the following link: SVBench Benchmark. The paper is available at: SVBench Paper. The accompanying dataset is hosted on the Hugging Face platform at SVBench Dataset for further experiments and model development. The leaderboard provides a fair basis for comparing current models and a reference point for future model improvements and innovations.
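Since the dataset lives on the Hugging Face Hub, its annotations can be pulled with the `datasets` library. A minimal sketch follows; the repository ID, split name, and field access are placeholders rather than the official identifiers, so substitute the values listed on the SVBench Dataset page.

```python
# Minimal sketch: loading SVBench annotations from the Hugging Face Hub.
# "ORG/SVBench" and the "test" split are placeholders -- substitute the
# identifiers listed on the SVBench Dataset page.
from datasets import load_dataset

svbench = load_dataset("ORG/SVBench", split="test")

# Each record is expected to reference a streaming video and its QA chain;
# the schema below is illustrative, not the dataset's actual field layout.
example = svbench[0]
print(example.keys())
```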
| Model | Type | Size | Frames/FPS | Dialogue_OS | Streaming_OS | Average |
|---|---|---|---|---|---|---|
| LLaVA-NeXT-Video | ImageLLM | 7B | 1 fps | 62.57 | 59.97 | 61.27 |
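In the row above, the Average column matches the arithmetic mean of the two evaluation scores: (62.57 + 59.97) / 2 = 61.27. A minimal sketch of that aggregation, assuming Average is a simple mean of Dialogue_OS and Streaming_OS (the official aggregation may differ):

```python
# Sketch: combine the two SVBench scores into the leaderboard Average.
# Assumes Average = (Dialogue_OS + Streaming_OS) / 2, which matches the
# LLaVA-NeXT-Video row above.
def leaderboard_average(dialogue_os: float, streaming_os: float) -> float:
    return round((dialogue_os + streaming_os) / 2, 2)

print(leaderboard_average(62.57, 59.97))  # 61.27
```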
Despite the significant advancements of Large Vision-Language Models (LVLMs) on established benchmarks, there remains a notable gap in suitable evaluation of their applicability to the emerging domain of long-context streaming video understanding. Current benchmarks for video understanding typically emphasize isolated single-instance text inputs and fail to evaluate the capacity to sustain temporal reasoning throughout the entire duration of video streams. To address these limitations, we introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains specifically designed to thoroughly assess the streaming video understanding capabilities of current LVLMs. We design a semi-automated annotation pipeline to obtain 49,979 Question-Answer (QA) pairs for 1,353 streaming videos, which includes generating QA chains that represent a series of consecutive multi-turn dialogues over video segments and constructing temporal linkages between successive QA chains. Our experimental results, obtained from 14 models in dialogue and streaming evaluations, reveal that while the closed-source GPT-4o outperforms others, most open-source LVLMs struggle with long-context streaming video understanding. We also construct a StreamingChat model, which significantly outperforms open-source LVLMs on our SVBench and achieves comparable performance on diverse vision-language benchmarks. We expect SVBench to advance the research of streaming video understanding by providing a comprehensive and in-depth analysis of current LVLMs.
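The QA-chain structure described above can be pictured as a sequence of multi-turn dialogues over video segments, with each chain linked temporally to the next. The sketch below illustrates one possible representation; the class and field names are assumptions for illustration, not the released dataset schema.

```python
# Illustrative data model for SVBench-style temporal QA chains.
# Names and types are assumptions, not the official schema.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class QAChain:
    segment_start: float  # segment start time in seconds
    segment_end: float    # segment end time in seconds
    turns: List[QAPair] = field(default_factory=list)  # consecutive multi-turn dialogue
    next_chain: Optional["QAChain"] = None              # temporal link to the following chain


# A streaming evaluation would feed the model frames up to segment_end plus the
# dialogue history so far, then score its answer for each turn in the chain.
```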