VITA-Audio is an open-source, end-to-end large speech model designed for fast, low-latency audio-text token generation, with strong performance on automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) benchmarks.