MOSS Audio

Model family: MOSS

MOSS-Audio is a modular family of audio foundation models aimed at unified understanding of real-world audio, not only transcription. According to the repository, it combines a dedicated audio encoder, a modality adapter, and a Qwen3 language-model backbone, and uses cross-layer feature injection and explicit time-marker insertion to strengthen temporal reasoning. It supports ASR with timestamps, speaker and emotion analysis, sound-scene understanding, music understanding, audio QA, summarization, and complex reasoning. The initial release includes 4B and 8B models, each in Instruct and Thinking variants; the repository's benchmark summary reports the 8B-Thinking model as the strongest open-source model for general audio understanding.
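As a rough illustration of what "explicit time-marker insertion" could look like, the sketch below interleaves textual time markers into a sequence of audio-frame placeholders so that a language model sees where each stretch of audio begins. This is not the MOSS-Audio implementation: the function name insert_time_markers, the marker format <t=...s>, the frame duration, and the marker spacing are all hypothetical choices made only for this example.

```python
from typing import List


def insert_time_markers(
    frames: List[str],
    frame_seconds: float,
    marker_every_seconds: float,
) -> List[str]:
    """Interleave time-marker tokens into a list of audio-frame placeholders.

    Hypothetical sketch: the marker format and spacing are assumptions,
    not the format used by MOSS-Audio.
    """
    if frame_seconds <= 0 or marker_every_seconds <= 0:
        raise ValueError("durations must be positive")

    # Number of audio frames covered by one marker (at least one).
    frames_per_marker = max(1, round(marker_every_seconds / frame_seconds))

    sequence: List[str] = []
    for index, frame in enumerate(frames):
        if index % frames_per_marker == 0:
            # Emit a marker carrying the start time of the upcoming frames.
            sequence.append(f"<t={index * frame_seconds:.1f}s>")
        sequence.append(frame)
    return sequence


if __name__ == "__main__":
    # Five placeholder frames of 0.5 s each, with a marker every 1.0 s.
    print(insert_time_markers(["a0", "a1", "a2", "a3", "a4"], 0.5, 1.0))
```

In a real system the placeholders would be audio embeddings produced by the encoder and adapter, and the markers would be tokens from the language model's vocabulary; the list of strings here only shows the interleaving pattern.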

Overview

MOSS-Audio is an open-source unified audio understanding model family from MOSI.AI, OpenMOSS, and the Shanghai Innovation Institute. It is built to handle speech, environmental sound, music, captioning, time-aware QA, and complex audio reasoning in one system, with 4B and 8B Instruct and Thinking variants.

🗒 Transcription · 🎙️ Voice recognition · 🎧 Audio translation · 🎵 Music analysis

Last updated: July 9, 2026
