For creators and audio engineers, the latest development in open-source AI represents a shift from passive dictation to active, context-aware listening. Existing models operate like a recording device that only speaks once the input stops, whereas this new system acts as a true conversational partner. It processes continuous streams of sound, distinguishing between meaningful dialogue, background noise, and ambient events in real time. This capability allows makers to build applications where the AI can translate speech, transcribe notes, or react to specific sounds like a glass breaking, all within a single, unified architecture.
A decision every 0.4 seconds
The core innovation lies in how the model processes time. Instead of waiting for a full sentence or a pause, the system slices the audio stream into 0.4-second segments. After analysing each chunk, it decides, via a special token, whether to stay silent or generate a response. This continuous evaluation loop lets the model handle multiple tasks at once, such as translating a conversation while identifying a sudden noise, without the latency associated with traditional pipelines.
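The loop described above can be sketched in a few lines of Python. This is only an illustration, not the project's code: the 16 kHz sample rate, the `<silence>` token name, and the loudness-based `toy_decide` function standing in for the model are all assumptions made for the example.

```python
from typing import Callable, Iterator, List, Optional, Sequence

CHUNK_SECONDS = 0.4    # chunk length reported in the article
SAMPLE_RATE = 16_000   # assumption: the article does not state a sample rate
SILENCE = "<silence>"  # hypothetical name for the special "stay quiet" token


def chunk_stream(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    chunk_seconds: float = CHUNK_SECONDS,
) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-length chunks; the last one may be shorter."""
    size = round(sample_rate * chunk_seconds)  # 6400 samples at 16 kHz
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


def run_loop(
    samples: Sequence[float],
    decide: Callable[[List[Sequence[float]]], str],
) -> List[Optional[str]]:
    """Make one decision per chunk: None for silence, text for a response."""
    history: List[Sequence[float]] = []
    outputs: List[Optional[str]] = []
    for chunk in chunk_stream(samples):
        history.append(chunk)    # the decision can see everything heard so far
        token = decide(history)
        outputs.append(None if token == SILENCE else token)
    return outputs


def toy_decide(history: List[Sequence[float]]) -> str:
    """Stand-in for the model: respond only when the newest chunk is loud."""
    peak = max((abs(x) for x in history[-1]), default=0.0)
    return "loud sound detected" if peak > 0.5 else SILENCE
```

Feeding this loop one quiet chunk, one loud chunk, and a shorter quiet remainder produces one decision per chunk, and only the loud chunk triggers a response.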
Performance and benchmarks
In independent testing, the Audio Interaction model scored 58.15 points on the MMAU audio benchmark. This result narrowly surpassed its base model, Qwen2.5-Omni-3B, and approached the performance of significantly larger 7B parameter models. The system showed marked improvement over the base version specifically in English-Chinese translation tasks.
Training on synthetic reality
To teach the model when to speak and when to wait, the research team from China, Hong Kong, and Singapore constructed a bespoke training environment. Existing datasets often consist of short, isolated clips that fail to capture the nuance of long conversations with sparse response signals. The team instead generated a dataset called StreamAudio-2M.
The creation process involved three stages. First, a language model designed plausible scenarios, such as a kitchen scene in the morning, each containing 3 to 15 sub-events. Second, existing databases were searched for matching audio clips, and missing sounds were generated using tools like AudioX or ElevenLabs. Third, a processing step smoothed out the transitions between generated and real audio. The resulting dataset spans approximately 302,000 hours of audio across seven skill areas and 28 subtasks.
Solving the silence problem
During development, two recurring issues emerged. First, the model struggled to retain information from earlier in long, noisy sequences. The solution was to include questions in the training data that referenced content from much further back in the audio, forcing the system to build robust long-term memory. Second, the model initially triggered responses to irrelevant background sounds. The team addressed this by training with vast amounts of verified silence and ambient audio explicitly marked as non-trigger events.
This approach proved effective on the new ProactiveSound Bench, which features 644 human-curated events. On this benchmark, the model outperformed Gemini 3 Flash, Kimi-Audio-Instruct, and Step-Audio 2.
A queue-based architecture
For practical deployment, the researchers separated incoming audio processing from response generation. Both processes run in parallel, communicating via a data queue. The audio processing side continuously writes new chunks, while the response generation side only reads them when it has nothing to say. This architectural change prevented the system from stalling.
Without this split, the time-to-first-response increased from 392 milliseconds to 831 milliseconds, and the system became stuck 5.2 percent of the time. The chosen 0.4-second chunk size represents a calculated trade-off: reducing it to 0.2 seconds provides insufficient context for dialogue, while increasing it to 0.8 seconds causes latency to climb to 786 milliseconds.
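A minimal sketch of this producer-consumer split, using Python's standard `queue` and `threading` modules, might look like the following. It is not the researchers' implementation: the function names, the `None` sentinel that ends the stream, and the string chunks used in place of audio are assumptions chosen for illustration.

```python
import queue
import threading
from typing import Callable, Iterable, List, Optional


def audio_writer(chunks: Iterable[str], q: "queue.Queue") -> None:
    """Producer: keep writing chunks, never waiting on the responder."""
    for chunk in chunks:
        q.put(chunk)
    q.put(None)  # hypothetical sentinel marking the end of the stream


def responder(
    q: "queue.Queue",
    decide: Callable[[List[str]], Optional[str]],
    responses: List[str],
) -> None:
    """Consumer: read the next chunk only once it has finished speaking."""
    history: List[str] = []
    while True:
        chunk = q.get()  # blocks until the writer has produced something
        if chunk is None:
            break
        history.append(chunk)
        reply = decide(history)  # stand-in for model inference
        if reply is not None:
            responses.append(reply)  # "speaking" happens before the next read


def run_pipeline(
    chunks: Iterable[str],
    decide: Callable[[List[str]], Optional[str]],
) -> List[str]:
    """Run writer and responder in parallel, linked by a single queue."""
    q: "queue.Queue" = queue.Queue()  # unbounded, so the writer never stalls
    responses: List[str] = []
    writer = threading.Thread(target=audio_writer, args=(chunks, q))
    reader = threading.Thread(target=responder, args=(q, decide, responses))
    writer.start()
    reader.start()
    writer.join()
    reader.join()
    return responses
```

The unbounded queue is a simplification: it keeps the writer from ever blocking, but a real deployment would presumably need a policy for chunks that pile up while a long response is being generated.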
The code and instructions for downloading the model weights are available on GitHub under an Apache 2.0 license, with no restrictions on commercial use. The full training dataset is expected to be released later.
Key takeaways
- The model processes audio in 0.4-second segments, deciding every cycle whether to speak or remain silent, enabling true real-time interaction.
- By running audio processing and response generation in parallel, linked by a data queue, the system cuts time-to-first-response from 831 ms to 392 ms and avoids the stalls that occurred 5.2 percent of the time without the split.
- Trained on StreamAudio-2M, a synthetic dataset of roughly 302,000 hours, the model outperforms Gemini 3 Flash, Kimi-Audio-Instruct, and Step-Audio 2 on the ProactiveSound Bench.
- The code and instructions for downloading the model weights are on GitHub under the Apache 2.0 license, which places no restrictions on commercial use.