Qwen2.5-Omni is an end-to-end multimodal model that understands text, audio, images, and video, and generates speech in real time.