OmniAssistBench

Assistant-style Interaction Benchmark for Omni-LLMs

Project Leader *Equal Contribution
1Nanjing University, 2Nankai University, 3University of Waterloo

Introduction

Recent advances in Omni-LLMs are paving the way for real-time video assistant applications, where models continuously perceive the environment and guide users toward a goal through multi-turn conversations. However, evaluation under these assistant-style interaction scenarios remains challenging. OmniAssistBench addresses this challenge with an annotation pipeline that allows annotators to build test samples from existing Internet videos.

Why is evaluation challenging?
In traditional video understanding evaluation, the model's answer does not change the content of the test video. During interaction with an assistant model, however, users do what the model suggests. The model's unpredictable responses therefore change the subsequent video content dynamically, which static offline datasets cannot accommodate.
How to address this challenge?
During annotation, experts first filter for videos with certain plots and deduce prior knowledge from the video content to enforce a fixed interaction path.
Based on the paths, experts then design detailed interaction turns and edit the videos accordingly to build human-assistant interaction recordings.

Assistant models may guide the user toward the same goal through different paths (e.g., both Response 1 and Response 2 shown in figure (b) are valid, but the test data only contains the video that follows Response 1).
We propose an annotation pipeline to address this problem of path diversity.

OmniAssistBench is highly challenging. According to our scoring rubrics, even state-of-the-art commercial Omni-LLMs only provide partially correct answers, indicating that there is substantial room for improvement before Omni-LLMs can become reliable real-world assistants.

Leaderboard

OmniAssistBench requires candidate models to be capable of processing videos together with the corresponding audio. Models are graded by our LLM-as-a-Judge pipeline on a 0-5 scale. Reported scores are normalized to 0-100.
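The normalization from the 0-5 judge scale to the reported 0-100 scale can be sketched as follows. This is a minimal illustration, not the official evaluation code: the function names are hypothetical, and averaging the raw scores before normalizing is an assumption about the aggregation order.

```python
from statistics import mean

MAX_JUDGE_SCORE = 5  # upper end of the 0-5 judge scale


def normalize_score(raw: float, max_score: float = MAX_JUDGE_SCORE) -> float:
    """Map a raw judge score in [0, max_score] onto a 0-100 scale."""
    if not 0 <= raw <= max_score:
        raise ValueError(f"score {raw} is outside [0, {max_score}]")
    return raw * 100 / max_score


def normalized_mean(raw_scores: list[float]) -> float:
    """Average raw judge scores, then normalize (assumed aggregation order)."""
    if not raw_scores:
        raise ValueError("no scores to aggregate")
    return round(normalize_score(mean(raw_scores)), 1)


if __name__ == "__main__":
    # Four hypothetical judge scores for one model on one task.
    print(normalized_mean([3, 4, 2, 5]))
```

Because the mapping is linear, normalizing each score first and then averaging gives the same result up to rounding.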

Please refer to the GitHub page for the detailed evaluation pipeline and example code.

#    Model                     Size      Frames   Basic Tier   Advanced Tier   Real Cases   Overall (/100)
1    Gemini-3-Pro 🥇           -         -        63.6         68.2            68.0         66.4
2    Gemini-2.5-Pro 🥈         -         -        65.4         66.4            44.8         64.6
3    Doubao-Seed-2.0-lite 🥉   -         1fps     53.2         62.1            35.5         57.3
4    MiMo-V2-Omni              -         1fps     53.6         55.2            41.0         53.8
5    Qwen3.5-Omni-Plus         -         1fps     41.6         57.8            50.6         51.6
6    Qwen3-Omni-Instruct       30B-A3B   1fps     46.4         53.8            53.8         51.2
7    MiniCPM-o-4.5             9B        1fps     44.6         47.8            37.8         46.0
8    Baichuan-Omni-1.5         7B        32       45.0         41.8            44.8         43.2
9    Qwen2.5-Omni              7B        1fps     37.0         46.6            45.2         43.2
10   MiniCPM-o-2.6             8B        1fps     41.8         40.2            31.2         40.2
11   VITA-1.5                  7B        4        24.6         25.8            14.6         24.6

Sub-task level performance comparison of evaluated models.

OmniAssistBench Dataset

Task Design


OmniAssistBench summarizes critical abilities and tasks from real world applications to form a two-tier task framework.

To build a comprehensive benchmark, OmniAssistBench decomposes real-world assistant capabilities into two levels.

Basic Tier
This tier focuses on core perception and multimodal instruction-following capabilities that are frequently encountered in real-world interactions but are underrepresented in existing video benchmarks.
Advanced Tier
This tier evaluates the higher-level capabilities required for an interactive assistant. QA pairs in this tier are user goal-driven. Consequently, models need to go beyond accurate perception; they need to actively extract and align information to fulfill the user's objectives.
Scale & Diversity
The benchmark contains 685 open-ended question-answer pairs, covering 7 major task types and 16 fine-grained tasks. Video domains cover common daily topics such as sports, cooking, lectures, DIY, and talk shows.
Real World Cases
We design and film 3 cases with an average of 15 interaction turns. Each case targets a specific combination of the abilities evaluated in our benchmark to reflect genuine assistant usage.

Task construction of OmniAssistBench.

Number of test samples in each sub-task.

Statistics of video durations. For multi-turn tasks, durations are calculated as the sum of each turn.

Data Format

OmniAssistBench adopts open-ended questions. A typical QA pair contains a video clip with one embedded user question, a Ground Truth (GT) answer, and a set of key points of the GT answer. Samples are single-turn or multi-turn depending on the task.
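One way to picture this format is the following sketch. The class and field names (`Turn`, `Sample`, `video_path`, `gt_answer`, `key_points`) are hypothetical and chosen only for illustration; the released dataset may organize its files differently.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One QA pair: a clip with an embedded question plus its reference answer."""
    video_path: str   # clip with the user question embedded in it
    gt_answer: str    # Ground Truth (GT) answer
    key_points: list[str] = field(default_factory=list)  # key points of the GT answer


@dataclass
class Sample:
    """A test sample made of one turn (single-turn) or several (multi-turn)."""
    task: str         # fine-grained task name (hypothetical label)
    turns: list[Turn]

    @property
    def is_multi_turn(self) -> bool:
        return len(self.turns) > 1

    def model_inputs(self) -> list[str]:
        """Only the videos are given to the candidate model, never the GT fields."""
        return [turn.video_path for turn in self.turns]


if __name__ == "__main__":
    sample = Sample(
        task="example_task",
        turns=[Turn("turn_1.mp4", "Reference answer.", ["first key point"])],
    )
    print(sample.is_multi_turn, sample.model_inputs())
```

Keeping the GT answer and key points separate from the model inputs mirrors the setup described here, where the judge uses them for scoring while the candidate model sees only video.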

QA Format
Unlike traditional benchmarks with text-only prompts, all user questions are embedded directly into the videos as realistic audio or typing/handwriting. Only the videos are input into the candidate models.
Single-Turn Example
Offline Simulation of Online Interaction
We adopt a multi-turn interaction format to simulate online interaction in an offline manner. To mimic the temporal causality of online streaming interactions, the user prompt is always embedded at the very end of each video clip. In addition, the video clip of the current turn always starts where the clip of the previous turn ends.
Multi-Turn Example
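The clip layout described above can be sketched in a few lines. This is an illustrative assumption about how turn boundaries could be derived on the timeline of an edited interaction recording; `split_into_turns` and its arguments are hypothetical names, not part of the benchmark tooling.

```python
def split_into_turns(
    prompt_end_times: list[float], start: float = 0.0
) -> list[tuple[float, float]]:
    """Return (clip_start, clip_end) in seconds for each turn.

    Each clip ends exactly where its embedded user prompt ends, and the next
    clip starts at that same point, so the model never sees future content.
    """
    clips = []
    cursor = start
    for end in prompt_end_times:
        if end <= cursor:
            raise ValueError("prompt end times must be strictly increasing")
        clips.append((cursor, end))
        cursor = end
    return clips


def total_duration(clips: list[tuple[float, float]]) -> float:
    """Duration of a multi-turn sample, computed as the sum over its turns."""
    return sum(end - begin for begin, end in clips)


if __name__ == "__main__":
    # Hypothetical recording with user prompts ending at 12 s, 30.5 s and 41 s.
    clips = split_into_turns([12.0, 30.5, 41.0])
    print(clips, total_duration(clips))
```

Since consecutive clips share their boundary, the summed duration equals the span from the first clip's start to the last prompt's end.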

Annotation Pipeline

OmniAssistBench follows a four-phase annotation pipeline to ensure data quality. Every sample was rigorously annotated by human experts, requiring ~4 hours of labor per sample. In total, constructing the whole dataset took more than 900 hours of dedicated labor.


The annotation process of OmniAssistBench.

Step 1: Raw Video Collection Based on Plots
Step 2: QA Design and User Goal Design
Step 3: Video Editing
Step 4: Quality Refinement

Data Examples

Examples of Basic and Advanced Interaction Understanding Tasks

Key Plots of the Real World Cases

Key Insights

According to our scoring rubric, average scores ranging from 40 to 60 suggest that models generally succeed in understanding what to do but struggle to provide accurate and comprehensive answers.
Apart from basic perception shortcomings, we observe the following key bottlenecks in current Omni-LLMs:

Deficiency in Visual Prompts
Models struggle significantly with gesture-based commands. This reveals a critical weakness in interpreting visual instructions, as models usually fail to recognize gestures as user prompts.
Context Limitation
Video and audio inputs are token-intensive. Without specialized mechanisms for long-term memory, the context length limit is exhausted within minutes. As a result, models quickly forget user goals and fail to provide useful guidance.
Failure in Delayed Response
Models systematically struggle to determine when a response is unnecessary. Instead of appropriately remaining silent when the visual input provides insufficient evidence, they frequently produce video captions that are only loosely related to the user prompt.

BibTeX

@article{omniassistbench,
  author    = {},
  title     = {OmniAssistBench: Assistant-style Interaction Benchmark for Omni-LLMs},
  journal   = {},
  year      = {2026},
}