OmniAssistBench

Assistant-style Interaction Benchmark for Omni-LLMs

Nanjing University · ByteDance Seed

Introduction

OmniAssistBench summarizes critical abilities and tasks from real-world Omni-LLM applications
to provide a quantified evaluation of Omni-LLMs as assistants.

Recent advances in Omni-LLMs are paving the way for real-time video chat applications. However, existing benchmarks fail to systematically evaluate models under these dynamic, interactive settings. OmniAssistBench is the first benchmark specifically designed to evaluate Omni-LLMs in assistant-style, real-time video chat scenarios, organized around two critical questions:

What abilities should be evaluated?
Under the scenario of a video assistant, we systematically summarize and categorize the core capabilities required by common, real-world applications to form a two-tier evaluation taxonomy.
How should we evaluate them?
To mimic genuine real-world interactions, we adopt open-ended questions and seamlessly embed user prompts directly into the source videos as audio or visual elements, rather than providing them as separate text.

OmniAssistBench is highly challenging. According to our scoring rubrics, even state-of-the-art commercial Omni-LLMs only provide partially correct answers, indicating that there is substantial room for improvement before Omni-LLMs can become reliable real-world assistants.

Leaderboard

OmniAssistBench requires candidate models to be capable of processing videos along with their corresponding audio tracks. All models are graded by our LLM-as-a-Judge pipeline on a 0-5 scale.

Please refer to the GitHub page for the detailed evaluation pipeline and example code.
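Under this scheme, a model's leaderboard entries can be derived from its per-sample judge scores. A minimal sketch (function and variable names are our own, not the official pipeline; the Overall column is assumed to be an average on the 0-5 scale, and Percentage its normalization):

```python
# Illustrative only: aggregate per-sample 0-5 LLM-as-a-Judge scores into
# the leaderboard's "Overall (/5.0)" and "Percentage (%)" columns.
def aggregate_scores(sample_scores):
    """sample_scores: list of per-sample judge scores, each in [0, 5]."""
    if not sample_scores:
        raise ValueError("no scores to aggregate")
    overall = sum(sample_scores) / len(sample_scores)  # mean on the 0-5 scale
    percentage = overall / 5.0 * 100.0                 # normalize to percent
    return round(overall, 2), round(percentage, 1)
```

For example, `aggregate_scores([3, 4, 3, 3.5])` yields an overall of 3.38 and 67.5%.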

| # | Model | Size | Frames | Basic Tier | Advanced Tier | Real Cases | Overall (/5.0) | Percentage (%) |
|---|-------|------|--------|------------|---------------|------------|----------------|----------------|
| 1 | Gemini-3-Pro 🥇 | - | - | 3.18 | 3.41 | 3.40 | 3.32 | 66.4 |
| 2 | Gemini-2.5-Pro 🥈 | - | - | 3.27 | 3.32 | 2.24 | 3.23 | 64.6 |
| 3 | MiMo-V2-Omni 🥉 | - | 1 fps | 2.68 | 2.76 | 2.05 | 2.69 | 53.8 |
| 4 | MiniCPM-o-4.5 | 9B | 1 fps | 2.23 | 2.39 | 1.89 | 2.30 | 46.0 |
| 5 | Baichuan-Omni-1.5 | 7B | 32 | 2.25 | 2.09 | 2.24 | 2.16 | 43.2 |
| 6 | Qwen2.5-Omni | 7B | 1 fps | 1.85 | 2.33 | 2.26 | 2.16 | 43.2 |
| 7 | MiniCPM-o-2.6 | 8B | 1 fps | 2.09 | 2.01 | 1.56 | 2.01 | 40.2 |
| 8 | Qwen3-Omni-Instruct | 30B-A3B | 1 fps | 1.42 | 1.82 | 1.76 | 1.68 | 33.6 |
| 9 | VITA-1.5 | 7B | 4 | 1.23 | 1.29 | 0.73 | 1.23 | 24.6 |

Sub-task level performance comparison of evaluated models.

OmniAssistBench Dataset

Task Design

OmniAssistBench decomposes real-world assistant capabilities into two complementary levels.

Basic Tier
This tier focuses on core perception and multimodal instruction-following capabilities that are frequently encountered in real-world interactions but underrepresented in existing video benchmarks.
Advanced Tier
This tier evaluates the higher-level capabilities required of an interactive assistant. QA pairs in this tier are driven by user goals. Consequently, models must go beyond accurate perception: they need to actively extract and align information to fulfill the user's objectives.
Scale & Diversity
There are a total of 687 open-ended question-answer pairs from 300 videos, covering 7 major task types and 16 fine-grained tasks.
Real World Cases
There are also three comprehensive multi-turn real-world interaction case studies filmed by our team to reflect genuine assistant usage.

Task construction of OmniAssistBench.

Number of test samples in each sub-task.

Statistics of video durations. For multi-turn tasks, durations are calculated as the sum of each turn.

Data Format

OmniAssistBench adopts open-ended questions. A typical QA pair contains a video clip embedded with one user question, a Ground Truth (GT) answer, and a set of key points for the GT answer. There are both single-turn and multi-turn samples, depending on the task.
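For concreteness, one sample could be represented roughly as below. This is a hypothetical sketch; the field names are our assumption, not the released schema.

```python
# Hypothetical layout of one OmniAssistBench QA record (field names assumed).
single_turn_sample = {
    "video": "clip_0001.mp4",             # clip with the user question embedded
    "gt_answer": "...",                   # Ground Truth reference answer
    "key_points": ["...", "..."],         # rubric points the judge checks against
}

# Multi-turn samples would bundle several such records, one per turn.
multi_turn_sample = {
    "turns": [
        {"video": "clip_0002_t1.mp4", "gt_answer": "...", "key_points": ["..."]},
        {"video": "clip_0002_t2.mp4", "gt_answer": "...", "key_points": ["..."]},
    ],
}
```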

QA Format
Unlike traditional benchmarks with text-only prompts, all user questions are embedded directly into the videos as realistic audio or typing/handwriting. Only the videos are input into the candidate models.
Single-Turn Example
Offline Simulation of Online Interaction
We adopt a multi-turn interaction format to simulate online interaction in an offline manner. To mimic the temporal causality of online streaming interactions, the user prompt is always embedded at the very end of each video clip, and the clip for the current turn always starts where the previous turn's clip ends.
Multi-Turn Example
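The turn-splitting protocol above can be sketched as follows. This is an illustrative reconstruction under our own naming, not the authors' editing code: each prompt timestamp closes one clip, and the next clip resumes exactly there.

```python
# Illustrative sketch of the offline multi-turn protocol: split one source
# video into contiguous per-turn clips, with each user prompt placed at the
# very end of its clip (function and parameter names are assumptions).
def make_turn_clips(total_duration, prompt_end_times):
    """prompt_end_times: sorted timestamps (s) at which each prompt finishes.
    Returns (start, end) spans; turn i starts where turn i-1 stopped."""
    clips, start = [], 0.0
    for t in prompt_end_times:
        assert start < t <= total_duration, "prompts must be ordered in time"
        clips.append((start, t))  # the prompt sits at the clip's end (time t)
        start = t                 # the next turn resumes exactly here
    return clips
```

For example, `make_turn_clips(90.0, [30.0, 55.0, 90.0])` returns `[(0.0, 30.0), (30.0, 55.0), (55.0, 90.0)]`, so no frame is seen twice and no frame is skipped between turns.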

Annotation Pipeline

OmniAssistBench follows a four-phase annotation pipeline to ensure data quality. Every sample was rigorously annotated by human experts, demanding roughly 4 hours of labor per sample; in total, constructing the whole dataset required more than 900 hours of dedicated labor.


The annotation process of OmniAssistBench.

Step 1: Raw Video Collection Based on Plots
Step 2: QA Design and User Goal Design
Step 3: Video Editing
Step 4: Quality Refinement

Data Examples

Examples of Basic and Advanced Interaction Understanding Tasks

Key Plots of the Real World Cases

Key Insights

According to our scoring rubric, average scores ranging from 2 to 3 suggest that models generally succeed in understanding what to do but struggle to provide accurate and comprehensive answers.
From the evaluation results, we observe the following key bottlenecks in current Omni-LLMs:

Deficiency in Gesture Following
Models struggle significantly with gesture-based commands. This reveals a critical weakness in interpreting visual instructions, as models usually fail to recognize gestures as user prompts.
Long-Context Interaction Disparity
Proprietary models substantially outperform open-source alternatives on multi-turn tasks, exposing a pronounced weakness in long-term memory retention and sustained context management among current open-source systems.
Failure to Keep Silent
Models systematically struggle to determine when a response is unnecessary, frequently producing video captions only loosely related to the user prompt instead of appropriately remaining silent when the visual input provides insufficient evidence.
Over-reliance on Text
Open-source models tend to echo their own replies during multi-turn tasks. This suggests an over-reliance on the text modality (the assistant's past replies) and an inability to dynamically update their context using the complex visual changes in the video.

BibTeX

@article{omniassistbench,
  author    = {},
  title     = {OmniAssistBench: Assistant-style Interaction Benchmark for Omni-LLMs},
  journal   = {},
  year      = {2026},
}