Qwen3-Omni: Everything you need to know about the omnimodal model

Last update: 24th September 2025
  • Native omnimodal model with text, image, audio and video, and real-time streaming.
  • SOTA in 22 of 36 audio/video benchmarks and broad multilingual coverage (119 text / 19 speech-input / 10 speech-output languages).
  • Thinker–Talker architecture with MoE, low latency, and system prompt control.
  • Recommended deployment with vLLM/Transformers, Docker and official utilities.

Qwen3-Omni omnimodal model

The arrival of Qwen3-Omni has shaken up the AI board: a single native model capable of understanding and responding to text, images, audio and video, with on-the-fly responses both written and spoken. We are not talking about multimodal "patches," but about an architecture designed from the ground up to integrate modalities with low latency and fine-grained control over behavior.

At a time when almost everyone is trying out chatbots and assistants, Qwen3-Omni arrives with ambition: it supports 119 languages for text, recognizes speech in 19 and speaks in 10, understands long audio (up to 30 minutes) and posts reference numbers across dozens of benchmarks. In addition, its Thinker–Talker design and a Mixture of Experts approach aim for both speed of response and quality of reasoning in real-world scenarios.


What is Qwen3-Omni and what does it offer?

Qwen3‑Omni is a family of end-to-end "omnimodal", multilingual foundation models, designed to process text, images, audio and video with output in both text and natural-sounding speech. The key is not only the variety of inputs and outputs, but also how they work in streaming, with fluid conversation turns and the ability to respond immediately.

The team has introduced several architectural improvements for performance and efficiency: early "text-first" pretraining combined with mixed multimodal training, and an MoE (Mixture of Experts) design that holds its own in text and image while boosting audio and audiovisual performance. With this, the model achieves overall SOTA in 22 of 36 audio/video benchmarks and open-source SOTA in 32 of 36, with results comparable to Gemini 2.5 Pro in ASR, audio understanding and voice conversation.

Qwen3-Omni multimodal interaction

Key capabilities and modalities

Qwen3‑Omni comes prepared for real-world audio, vision, and audiovisual use cases, with extensive multilingual support: 119 languages for text, 19 languages for speech input, and 10 languages for speech output. Voice input languages include English, Chinese, Korean, Japanese, German, Russian, Italian, French, Spanish, Portuguese, Malay, Dutch, Indonesian, Turkish, Vietnamese, Cantonese, Arabic, and Urdu; output languages include English, Chinese, French, German, Russian, Italian, Spanish, Portuguese, Japanese, and Korean.

The official cookbook suite illustrates the breadth of uses. The audio cookbooks cover multilingual and long-form speech recognition (ASR), speech-to-text and speech-to-speech translation, music analysis (style, rhythm, genre), sound-effect description and captioning of arbitrary audio. They also support mixed analysis of tracks combining voice, music, and ambience.

In vision there is "hard" OCR for complex images, object detection and grounding, QA on images, solving math problems presented as images (where the Thinking model shines), video description, first-person video-based navigation and scene-transition analysis. In audiovisual scenarios, it demonstrates audio-video QA with temporal alignment, guided interaction with AV inputs and dialogues with assistant-like behavior.

As an agent, it stands out for function calling driven by audio, which opens up voice workflows that trigger tools; among the derived tasks there is an Omni‑Captioner for highly detailed audio captioning, which demonstrates the generalization capacity of the foundation model.


Thinker–Talker architecture and MoE design

One of the differentiating ideas is to separate responsibilities: the Thinker generates the text (with variants that perform explicit chain-of-thought reasoning), and the Talker produces audio in real time. This decoupling allows for natural voice conversation while the system retains a high level of understanding and planning in text.

The MoE base distributes the load among experts and relies on AuT pretraining for powerful general representations. In addition, a multi-codebook scheme in the audio channel keeps latency to a minimum, something key for calls or assistants where every hundredth of a second counts.

Performance and benchmarks: text, vision, audio and audiovisual

Qwen3-Omni maintains cutting-edge text and image performance, with no degradation compared to single-modality Qwen models of the same size, while in audio and audiovisual it sets the pace in most tests. Across the battery of 36 audio and audiovisual benchmarks it achieves open-source SOTA in 32 and overall SOTA in 22, beating Gemini 2.5 Pro and GPT‑4o by several points in some of them.

Some highlights in text: in AIME25 the Flash‑Instruct variant scores around 65.9; in ZebraLogic the Instruct reaches 90, and in MultiPL‑E it achieves figures competitive with GPT-4o. In alignment tasks such as IFEval and WritingBench, the Instruct and Thinking models show high, consistent scores.

In audio, the ASR results for Chinese and English are excellent: on WenetSpeech and LibriSpeech it reduces the word error rate significantly, with figures close to 1.22/2.48 on LibriSpeech clean/other, and on multilingual sets such as FLEURS it posts very low rates. On VoiceBench, metrics such as AlpacaEval, CommonEval and WildVoice put Qwen3-Omni on par with closed reference systems, and in audio reasoning it stands out on MMAU v05.15.25.

In audiovisual, the most cited figure is WorldSense ≈ 54.1, above Gemini‑2.5‑Flash; on sets such as DailyOmni and VideoHolmes, the Thinking variant improves on previous open-source SOTAs. In pure vision, it shines on MMMU, MathVista, MathVision and document understanding (AI2D, ChartQA), with very good numbers in counting (CountBench) and in video understanding (Video‑MME, MLVU).

Zero-shot voice generation is also measured: compared to families such as CosyVoice and Seed-TTS, Qwen3-Omni records better content consistency in several languages and high speaker similarity. In the multilingual section, the "Content Consistency" and "Speaker Similarity" tables show Qwen3‑Omni 30B‑A3B to be very competitive in Chinese and English, and solid in German, Italian, Portuguese, Spanish, Japanese, Korean, French, and Russian. In cross-lingual TTS, it achieves better WER/consistency on multiple pairs (e.g., zh→en, ja→en, ko→zh) than CosyVoice 2/3.

Available models and what each one is used for

The Qwen3-Omni line includes three main variants, each designed for a specific type of use: Instruct, Thinking and Captioner. They all come from the same core, but with different capabilities activated or fine-tuned for specific tasks.

Qwen3‑Omni‑30B‑A3B‑Instruct contains both the Thinker and the Talker; it accepts audio, video and text and returns text and audio. It is the choice if you want full interaction and real-time spoken output, and the one recommended for demos with voice or video.

Qwen3‑Omni‑30B‑A3B‑Thinking focuses on the Thinker with chain-of-thought reasoning; it supports audio, video, and text with text-only output. It is useful for in-depth analysis, complex problem solving, image-based math, or workflows where you don't need voice output but do want the best structured reasoning.


Qwen3‑Omni‑30B‑A3B‑Captioner is a fine-tuned derivative for high-precision, low-hallucination audio captioning. It is open source, describes arbitrary audio in great detail, and closes a historical gap in the open-source ecosystem: reliable, rich captions for general audio.

Latency, real time and behavioral control

The system is optimized for instant interaction, with end-to-end latency figures of ≈211 ms for audio and ≈507 ms for audio-video. Beyond streaming, emphasis is placed on naturalness in speaking turns and stability in voice delivery, helped by the clear division of roles between the Thinker (text) and the Talker (voice).

For fine-grained control, you can customize the style with system prompts. In AV scenarios where the video's audio track acts as the query, the team proposes a system prompt that preserves the Thinker's reasoning while producing more readable, conversational text, making it easier for the Talker to vocalize fluently. It is also suggested to keep the use_audio_in_video parameter consistent throughout a multi-turn conversation, as sketched below.
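As an illustration, here is a minimal, hypothetical sketch of what such a conversation could look like. The message schema mirrors the style of the official cookbooks, but the system prompt text and video URLs are placeholders, and the exact preprocessing calls live in the repository:

```python
# Illustrative only: shape of a multi-turn AV conversation where the system prompt
# steers the Thinker toward short, speakable text, and use_audio_in_video stays
# consistent across every turn (placeholder prompt and URLs, not official values).
USE_AUDIO_IN_VIDEO = True  # choose once and reuse for the whole conversation

conversation = [
    {"role": "system", "content": [{"type": "text", "text": (
        "You are a helpful voice assistant. Keep answers short, natural and easy to speak aloud."
    )}]},
    # Turn 1: the audio track inside the video acts as the user's query
    {"role": "user", "content": [{"type": "video", "video": "https://example.com/turn1.mp4"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "(model reply from turn 1)"}]},
    # Turn 2: same flag, so in-video audio is treated the same way as before
    {"role": "user", "content": [{"type": "video", "video": "https://example.com/turn2.mp4"}]},
]

# When encoding and generating, pass use_audio_in_video=USE_AUDIO_IN_VIDEO to the
# multimodal preprocessing helper, the processor and model.generate() alike.
```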

For evaluation, there are specific guidelines: do not set a system prompt, follow each benchmark's ChatML format and, when there is no prompt, use the following defaults: Chinese ASR ("请将这段中文语音转换为纯文本。"), ASR in other languages ("transcribe the audio into text."), S2TT ("Listen to the provided speech …"), and song lyrics ("transcribe the song lyrics" …, with no punctuation and lines separated by line breaks).

Deployment, requirements and tools

For a complete local experience, the team recommends Hugging Face Transformers and following the setup steps in the repository, but with a caveat: being a MoE architecture, inference with HF can be slow; for production or low latency they advise vLLM or the DashScope API, and they even provide a Docker image that bundles environments for both. The Transformers code is already merged, but the PyPI package is not yet published, so you have to install from source.
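A quick way to confirm that a from-source build actually exposes the new model classes is a simple import check; the class name below is assumed from the merged Transformers integration and is worth verifying against the model card:

```python
# Sanity check: does this Transformers build include Qwen3-Omni support?
# (Class name assumed from the merged integration; verify against the model card.)
import transformers

print("transformers version:", transformers.__version__)
try:
    from transformers import Qwen3OmniMoeForConditionalGeneration  # noqa: F401
    print("Qwen3-Omni classes found: this build can load the model.")
except ImportError:
    print("Not found: install Transformers from source (GitHub main), "
          "or switch to vLLM / the DashScope API.")
```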

They provide utilities for handling audio and image/video inputs (base64, URLs, embedded media), and recommend FlashAttention 2 with Transformers to reduce GPU memory, provided you load the model in float16 or bfloat16. With vLLM, FlashAttention 2 is included, and parameters such as limit_mm_per_prompt (which pre-allocates GPU memory per modality) and max_num_seqs control parallelism; in addition, raising tensor_parallel_size enables multi-GPU inference.
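A minimal sketch of what the vLLM side could look like, assuming the standard vLLM Python API; the GPU count, per-modality limits and sampling values are arbitrary examples, and audio/video prompts would still be assembled with the official Qwen utilities:

```python
# Hypothetical vLLM configuration sketch (values are illustrative, not recommendations).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    trust_remote_code=True,
    tensor_parallel_size=4,                                    # shard the MoE weights across 4 GPUs
    limit_mm_per_prompt={"image": 1, "video": 1, "audio": 1},  # pre-allocates GPU memory per modality
    max_num_seqs=8,                                            # concurrency vs. memory trade-off
)

sampling = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=1024)
# Text-only smoke test; multimodal prompts go through the official chat template
# and preprocessing helpers before being passed to llm.generate().
print(llm.generate(["Summarize what Qwen3-Omni can do in one sentence."], sampling)[0].outputs[0].text)
```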

There are useful details for saving resources: if you don't need audio, you can disable the Talker after initialization, saving roughly 10 GB of VRAM; and if you want faster text-only results, pass return_audio=False at generation time. Minimum theoretical memory figures are also provided for BF16 with FlashAttention 2: for example, Instruct 30B‑A3B needs around 78.9 GB with a 15-second video and about 144.8 GB at 120 seconds, while Thinking drops to roughly 68.7 GB and 131.7 GB respectively.
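The two switches mentioned above could look roughly like this; disable_talker() and return_audio follow the pattern of the official Omni examples and should be checked against the release you install:

```python
# Resource-saving sketch: text-only use of the Instruct model (method and argument
# names follow the official Omni examples; verify them against your installed version).
import torch
from transformers import Qwen3OmniMoeForConditionalGeneration

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    torch_dtype=torch.bfloat16,                # FlashAttention 2 requires fp16/bf16 weights
    attn_implementation="flash_attention_2",
    device_map="auto",
)

model.disable_talker()  # drop the audio-generation head (~10 GB of VRAM reportedly saved)

# ...build `inputs` with the processor as usual, then:
# text_ids = model.generate(**inputs, return_audio=False)  # skip audio synthesis for faster text
```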

To spin up a local web demo, they recommend preparing the vLLM environment (or Transformers, which is slower), making sure ffmpeg is installed and using their scripts. They offer GPU-ready Docker images ("qwenllm/qwen3‑omni") with the NVIDIA Container Toolkit, port mapping (e.g., host 8901 → container 80) and the instruction to serve on 0.0.0.0. The container can be re-entered or removed at any time.

Demos, APIs, and ecosystem

If you don't want to deploy locally, you can try the demos on Hugging Face Spaces and ModelScope Studio, with experiences for Qwen3‑Omni‑Realtime, Instruct, Thinking, and the Captioner. Qwen Chat is also available with real-time streaming: just choose the voice/video call option in the interface.


To integrate at scale and with low latency, the recommended route is the DashScope API, which offers the most predictable performance. In addition, the community coordinates through channels such as Discord and WeChat, and publishes cookbooks with real execution logs that make it possible to reproduce results by changing prompts or models.

Roadmap and ongoing improvements

The team is working on additional features such as multi-speaker speech recognition, OCR applied to video, improvements in proactive audiovisual learning and richer agent workflows. They have also indicated that audio output support in vLLM for the Instruct model will arrive shortly, which will close the loop on real-time deployment from that backend.

FAQ: Runtime support and quantization

Some users have reported that they cannot run Qwen3‑Omni even with the "usual suspects" and that they do not see quants on Hugging Face; furthermore, the native 16-bit checkpoint is around 70 GB, a complicated size for modest machines. The project clarifies that the Transformers integration is already merged but has no PyPI package yet, so it must be installed from source, and that vLLM is the preferred choice for inference, although audio output support for Instruct in vLLM will only be released in the short term.

Regarding quantization, there are no official quantized checkpoints listed on Hugging Face yet for Qwen3‑Omni 30B‑A3B, and it is worth remembering that its MoE, multimodal nature complicates immediate compatibility with runtimes like llama.cpp. For those who need to try it now, the official recommendation is to use Docker plus Transformers/vLLM from source, or the API, and to watch the repo for future support and quantized releases when they are ready.

Good evaluation practices and prompts

To reproduce the published numbers, the guidelines are detailed: most benchmarks use greedy decoding for Instruct, without sampling, while Thinking uses the parameters from generation_config.json. Video is also sampled at fps=2 during evaluation, and the user prompt should go after the multimodal data unless the benchmark specifies otherwise.

When a benchmark does not include a prompt, the default prompts can be used (Chinese ASR, ASR in other languages, S2TT, song lyrics). In addition, no system prompt should be set during evaluation, so that results remain comparable across systems and runs, as sketched below.
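Put together, these conventions could be expressed as a small, purely illustrative configuration (file name and token limit are placeholders):

```python
# Illustrative evaluation setup following the guidelines above (placeholders only).
eval_messages = [
    # No system prompt during evaluation, so results stay comparable across systems.
    {"role": "user", "content": [
        {"type": "audio", "audio": "sample_000.wav"},                 # multimodal data first...
        {"type": "text", "text": "transcribe the audio into text."},  # ...default ASR prompt after it
    ]},
]

generation_kwargs = dict(
    do_sample=False,      # greedy decoding for the Instruct model
    max_new_tokens=512,   # placeholder limit
)
# Thinking models instead reuse the parameters shipped in generation_config.json,
# and video benchmarks sample frames at fps=2 before preprocessing.
```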

Qwen3‑Omni positions itself as a true omnimodal platform, with contained latency, wide multilingual coverage, cutting-edge results in audio and audiovisual, and a clear deployment path via Transformers, vLLM, and Docker. For those looking for a single model that reasons well over text and images without losing steam, and that can also hear, speak and understand video, it is a proposal that is hard to match today.