OmniAssistBench

Assistant-style Interaction Benchmark for Omni-LLMs

Nanjing University · ByteDance Seed

Introduction

OmniAssistBench summarizes critical abilities and tasks from real-world Omni-LLM applications
to provide a quantified evaluation of Omni-LLMs as assistants.

Recent advances in Omni-LLMs are paving the way for real-time video chat applications. However, existing benchmarks fail to systematically evaluate models under these dynamic, interactive settings. OmniAssistBench is the first benchmark specifically designed to evaluate Omni-LLMs in assistant-style, real-time video chat scenarios, organized around two critical questions:

What abilities should be evaluated?
Under the scenario of a video assistant, we systematically summarize and categorize the core capabilities required by common, real-world applications to form a two-tier evaluation taxonomy.
How should we evaluate them?
To mimic genuine real-world interactions, we adopt open-ended questions and seamlessly embed user prompts directly into the source videos as audio or visual elements, rather than providing them as separate text.

OmniAssistBench is highly challenging. According to our scoring rubrics, even state-of-the-art commercial Omni-LLMs only provide partially correct answers, indicating that there is substantial room for improvement before Omni-LLMs can become reliable real-world assistants.

Leaderboard

OmniAssistBench requires candidate models to be capable of processing videos along with their corresponding audio tracks. All models are graded by our LLM-as-a-Judge pipeline on a 0-5 scale.

Please refer to the GitHub page for the detailed evaluation pipeline and example code.
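Under this scheme, a model's leaderboard entries can be derived from its per-sample judge scores. A minimal sketch (function and variable names are our own, not the official pipeline; the Overall column is assumed to be an average on the 0-5 scale, and Percentage its normalization):

```python
# Illustrative only: aggregate per-sample 0-5 LLM-as-a-Judge scores into
# the leaderboard's "Overall (/5.0)" and "Percentage (%)" columns.
def aggregate_scores(sample_scores):
    """sample_scores: list of per-sample judge scores, each in [0, 5]."""
    if not sample_scores:
        raise ValueError("no scores to aggregate")
    overall = sum(sample_scores) / len(sample_scores)  # mean on the 0-5 scale
    percentage = overall / 5.0 * 100.0                 # normalize to percent
    return round(overall, 2), round(percentage, 1)
```

For example, `aggregate_scores([3, 4, 3, 3.5])` yields an overall of 3.38 and 67.5%.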

| # | Model | Size | Frames | Basic Tier | Advanced Tier | Real Cases | Overall (/5.0) | Percentage (%) |
|---|-------|------|--------|------------|---------------|------------|----------------|----------------|
| 1 | Gemini-3-Pro 🥇 | - | - | 3.18 | 3.41 | 3.40 | 3.32 | 66.4 |
| 2 | Gemini-2.5-Pro 🥈 | - | - | 3.27 | 3.32 | 2.24 | 3.23 | 64.6 |
| 3 | MiMo-V2-Omni 🥉 | - | 1 fps | 2.68 | 2.76 | 2.05 | 2.69 | 53.8 |
| 4 | MiniCPM-o-4.5 | 9B | 1 fps | 2.23 | 2.39 | 1.89 | 2.30 | 46.0 |
| 5 | Baichuan-Omni-1.5 | 7B | 32 | 2.25 | 2.09 | 2.24 | 2.16 | 43.2 |
| 6 | Qwen2.5-Omni | 7B | 1 fps | 1.85 | 2.33 | 2.26 | 2.16 | 43.2 |
| 7 | MiniCPM-o-2.6 | 8B | 1 fps | 2.09 | 2.01 | 1.56 | 2.01 | 40.2 |
| 8 | Qwen3-Omni-Instruct | 30B-A3B | 1 fps | 1.42 | 1.82 | 1.76 | 1.68 | 33.6 |
| 9 | VITA-1.5 | 7B | 4 | 1.23 | 1.29 | 0.73 | 1.23 | 24.6 |

Sub-task level performance comparison of evaluated models.

OmniAssistBench Dataset

Task Design

OmniAssistBench decomposes real-world assistant capabilities into two complementary levels.

Basic Tier
This tier focuses on core perception and multimodal instruction-following capabilities that are frequently encountered in real-world interactions but underrepresented in existing video benchmarks.
Advanced Tier
This tier evaluates the higher-level capabilities required of an interactive assistant. QA pairs in this tier are driven by user goals. Consequently, models must go beyond accurate perception: they need to actively extract and align information to fulfill the user's objectives.
Scale & Diversity
There are a total of 687 open-ended question-answer pairs from 300 videos, covering 7 major task types and 16 fine-grained tasks.
Real World Cases
There are also three comprehensive multi-turn real-world interaction case studies filmed by our team to reflect genuine assistant usage.

Task construction of OmniAssistBench.

Number of test samples in each sub-task.

Statistics of video durations. For multi-turn tasks, durations are calculated as the sum of each turn.

Data Format

OmniAssistBench adopts open-ended questions. A typical QA pair contains a video clip embedded with one user question, a Ground Truth (GT) answer, and a set of key points for the GT answer. There are both single-turn and multi-turn samples, depending on the task.
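For concreteness, one sample could be represented roughly as below. This is a hypothetical sketch; the field names are our assumption, not the released schema.

```python
# Hypothetical layout of one OmniAssistBench QA record (field names assumed).
single_turn_sample = {
    "video": "clip_0001.mp4",             # clip with the user question embedded
    "gt_answer": "...",                   # Ground Truth reference answer
    "key_points": ["...", "..."],         # rubric points the judge checks against
}

# Multi-turn samples would bundle several such records, one per turn.
multi_turn_sample = {
    "turns": [
        {"video": "clip_0002_t1.mp4", "gt_answer": "...", "key_points": ["..."]},
        {"video": "clip_0002_t2.mp4", "gt_answer": "...", "key_points": ["..."]},
    ],
}
```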

QA Format
Unlike traditional benchmarks with text-only prompts, all user questions are embedded directly into the videos as realistic audio or typing/handwriting. Only the videos are input into the candidate models.
Single-Turn Example
Offline Simulation of Online Interaction
We adopt a multi-turn interaction format to simulate online interaction in an offline manner. To mimic the temporal causality of online streaming interactions, the user prompt is always embedded at the very end of each video clip, and the clip for the current turn always starts where the previous turn's clip ends.
Multi-Turn Example
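The turn-splitting protocol above can be sketched as follows. This is an illustrative reconstruction under our own naming, not the authors' editing code: each prompt timestamp closes one clip, and the next clip resumes exactly there.

```python
# Illustrative sketch of the offline multi-turn protocol: split one source
# video into contiguous per-turn clips, with each user prompt placed at the
# very end of its clip (function and parameter names are assumptions).
def make_turn_clips(total_duration, prompt_end_times):
    """prompt_end_times: sorted timestamps (s) at which each prompt finishes.
    Returns (start, end) spans; turn i starts where turn i-1 stopped."""
    clips, start = [], 0.0
    for t in prompt_end_times:
        assert start < t <= total_duration, "prompts must be ordered in time"
        clips.append((start, t))  # the prompt sits at the clip's end (time t)
        start = t                 # the next turn resumes exactly here
    return clips
```

For example, `make_turn_clips(90.0, [30.0, 55.0, 90.0])` returns `[(0.0, 30.0), (30.0, 55.0), (55.0, 90.0)]`, so no frame is seen twice and no frame is skipped between turns.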

Annotation Pipeline

OmniAssistBench follows a four-phase annotation pipeline to ensure data quality. Every sample was rigorously annotated by human experts, demanding roughly 4 hours of labor per sample; in total, constructing the whole dataset required more than 900 hours of dedicated labor.


The annotation process of OmniAssistBench.

Step 1: Raw Video Collection Based on Plots
Step 2: QA Design and User Goal Design
Step 3: Video Editing
Step 4: Quality Refinement

Data Examples

Examples of Basic and Advanced Interaction Understanding Tasks

Key Plots of the Real World Cases

Key Insights

According to our scoring rubric, average scores ranging from 2 to 3 suggest that models generally succeed in understanding what to do but struggle to provide accurate and comprehensive answers.
From the evaluation results, we observe the following key bottlenecks in current Omni-LLMs:

Deficiency in Gesture Following
Models struggle significantly with gesture-based commands. This reveals a critical weakness in interpreting visual instructions, as models usually fail to recognize gestures as user prompts.
Long-Context Interaction Disparity
Proprietary models substantially outperform open-source alternatives on multi-turn tasks, exposing a pronounced weakness in long-term memory retention and sustained context management among current open-source systems.
Failure to Keep Silent
Models systematically struggle to determine when a response is unnecessary, frequently producing video captions only loosely related to the user prompt instead of appropriately remaining silent when the visual input provides insufficient evidence.
Over-reliance on Text
Open-source models tend to echo their own replies during multi-turn tasks. This suggests an over-reliance on the text modality (the assistant's past replies) and an inability to dynamically update their context using the complex visual changes in the video.

BibTeX

@article{omniassistbench,
  author    = {},
  title     = {OmniAssistBench: Assistant-style Interaction Benchmark for Omni-LLMs},
  journal   = {},
  year      = {2026},
}