MoshiVis is a Vision Speech Model that enables natural, low-latency spoken conversations about images. It builds on the Moshi speech-text model and adds the ability to discuss images through a cross-attention mechanism.
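As a rough illustration of the cross-attention idea, here is a minimal single-head sketch in NumPy. It is not taken from the MoshiVis codebase: the function `cross_attention`, the projection matrices `W_q`, `W_k`, `W_v`, and all shapes are hypothetical and chosen only for this example. Queries come from the speech-text stream, while keys and values come from image features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(speech_tokens, image_tokens, W_q, W_k, W_v):
    """Single-head cross-attention (illustrative, hypothetical names).

    speech_tokens: (T, d_model) hidden states of the speech-text stream
    image_tokens:  (N, d_model) features produced by an image encoder
    Returns the attended output (T, d_head) and attention weights (T, N).
    """
    q = speech_tokens @ W_q  # queries from the speech stream, (T, d_head)
    k = image_tokens @ W_k   # keys from the image features,   (N, d_head)
    v = image_tokens @ W_v   # values from the image features, (N, d_head)

    # Scaled dot-product scores: one row per speech step, one column per image token.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Toy dimensions, assumed for illustration only.
rng = np.random.default_rng(0)
T, N, d_model, d_head = 4, 9, 16, 8
speech = rng.normal(size=(T, d_model))
image = rng.normal(size=(N, d_model))
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

out, weights = cross_attention(speech, image, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

Each row of `weights` sums to 1 and describes how strongly one speech-stream step attends to each image token. In a full model, a layer like this would typically feed its output back into the speech-text stream's hidden states; the exact integration used by MoshiVis is not described here.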