MoshiVis: Audio AI model that understands images

The first speech-vision model built on Kyutai's Moshi

Photo by Richard Horvath on Unsplash

Audio AI models are trending right now, with every other tech company releasing one. First it was Sesame's CSM 1B, then OpenAI FM, and now Kyutai Labs has released MoshiVis, a one-of-a-kind model that can hold a spoken conversation with you about images.

MoshiVis is the first audio AI model that can take images as input.

1. Key Features of the MoshiVis Model

  • Vision-Speech Model (VSM): MoshiVis extends Moshi, a speech-text foundation model, by incorporating visual understanding. It allows users to have natural, real-time spoken conversations about images, making it a multimodal AI system.
  • Low Latency & Conversational Style: Despite integrating vision capabilities, it maintains Moshi’s natural dialogue flow with minimal latency (it is almost real-time), making it suitable for interactive applications like virtual assistants or accessible AI interfaces.
  • Parameter Expansion: MoshiVis adds 206M adapter parameters on top of Moshi’s 7B base model and integrates a pretrained, frozen 400M PaliGemma2 vision encoder, enhancing multimodal reasoning.

2. Architecture & Mechanisms

  • Cross-Attention Mechanism: A key architectural enhancement that injects visual features into the speech token stream, enabling the model to understand and discuss images within spoken dialogues.
  • Gating Mechanism: To preserve Moshi’s original conversational abilities, MoshiVis introduces a gating mechanism that controls the influence of visual information. This ensures that the model does not become overly reliant on image data and can switch between vision-enhanced and pure speech-text modes as needed.
  • Memory Optimization: Unlike traditional models that expand in size when adding new capabilities, MoshiVis shares cross-attention projection weights across layers, reducing memory usage while keeping processing efficient.
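
To make these mechanisms concrete, here is a minimal, hypothetical PyTorch sketch (not Kyutai's actual code) of a gated cross-attention adapter whose vision projection is shared across layers; every dimension and name below is made up for illustration.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Sketch of the adapter idea: speech tokens attend to image features,
    and the result is scaled by a learned gate so the layer can fall back
    to pure speech-text behaviour when the image is irrelevant."""

    def __init__(self, d_model, d_vision, n_heads, shared_vision_proj):
        super().__init__()
        # The vision projection is passed in so the same weights can be
        # reused by every adapter layer (the memory optimization above).
        self.vision_proj = shared_vision_proj
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(d_model))  # gate starts closed

    def forward(self, speech_tokens, image_feats):
        # speech_tokens: (B, T, d_model); image_feats: (B, N_patches, d_vision)
        kv = self.vision_proj(image_feats)              # project into d_model
        attn_out, _ = self.attn(speech_tokens, kv, kv)  # queries come from speech
        return speech_tokens + torch.tanh(self.gate) * attn_out


d_model, d_vision = 1024, 1152                    # illustrative sizes only
shared_proj = nn.Linear(d_vision, d_model)        # shared across all adapters
adapters = [GatedCrossAttention(d_model, d_vision, 8, shared_proj) for _ in range(4)]

x = torch.randn(1, 16, d_model)                   # a chunk of speech tokens
img = torch.randn(1, 196, d_vision)               # visual features from the encoder
for adapter in adapters:
    x = adapter(x, img)
```

Because the gate is initialized at zero, the adapter initially leaves the base model's behaviour untouched and only lets visual information in as training opens the gate, which is exactly the point of the gating mechanism described above.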

How does it work?

1. Inputs: Multimodal Streams

The model takes in three input streams:

  • User’s Audio (spoken input)
  • Assistant’s Previous Audio (for context)
  • Assistant’s Previous Text (for maintaining conversational flow)

These inputs are aggregated (Σ) and passed into the Speech LLM (Transformer Backbone) for processing.
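
As a rough illustration of the Σ step (with assumed shapes and vocabulary sizes rather than the real ones), each stream can be embedded per frame and the embeddings summed into the single sequence the backbone consumes:

```python
import torch
import torch.nn as nn

d_model = 1024
audio_vocab, text_vocab = 2048, 32000   # made-up sizes for the sketch

embed_user_audio = nn.Embedding(audio_vocab, d_model)
embed_asst_audio = nn.Embedding(audio_vocab, d_model)
embed_asst_text = nn.Embedding(text_vocab, d_model)

T = 12                                   # frames in the current chunk
user_audio = torch.randint(0, audio_vocab, (1, T))
asst_audio = torch.randint(0, audio_vocab, (1, T))
asst_text = torch.randint(0, text_vocab, (1, T))

# Σ: the three streams are aligned frame by frame and summed into one sequence.
backbone_input = (
    embed_user_audio(user_audio)
    + embed_asst_audio(asst_audio)
    + embed_asst_text(asst_text)
)                                        # shape: (1, T, d_model)
```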

2. Image Encoder (Vision Input)

  • If an image is provided, the Image Encoder extracts visual features from the image.
  • These features are fed into Cross-Attention (CA) Modules within the Speech LLM, allowing the model to integrate vision with speech processing.
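
For a sketch, any frozen module that maps an image to a sequence of patch features can stand in for this stage; the real model uses the frozen 400M PaliGemma2 encoder, but the stub below is purely illustrative:

```python
import torch
import torch.nn as nn

class FrozenImageEncoder(nn.Module):
    """Stand-in for the frozen vision encoder: patchify the image and
    project each patch to a feature vector (dimensions are made up)."""

    def __init__(self, patch=16, d_vision=1152):
        super().__init__()
        self.proj = nn.Conv2d(3, d_vision, kernel_size=patch, stride=patch)
        for p in self.parameters():
            p.requires_grad = False      # frozen: only the adapters are trained

    def forward(self, image):                       # image: (B, 3, H, W)
        feats = self.proj(image)                    # (B, d_vision, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)     # (B, num_patches, d_vision)

image = torch.randn(1, 3, 224, 224)
visual_feats = FrozenImageEncoder()(image)          # (1, 196, 1152), ready for the CA modules
```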

3. Speech LLM (Transformer Backbone)

  • The core processing unit handles speech and text-based reasoning.
  • Cross-Attention (CA) Modules inject visual features into the speech token stream.
  • Gating Mechanism (⊓) ensures controlled integration of vision into the conversation, preventing excessive dependence on image features.

4. Speech LLM (Audio Depth Transformer)

  • After multimodal processing, outputs are sent to a secondary Speech LLM, which specializes in generating speech responses with correct intonation and timing.
  • This final model ensures the assistant speaks naturally while maintaining low latency.
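
Here is a hypothetical sketch of that per-frame step: a small transformer, conditioned on the backbone's hidden state for the current frame, predicts one codec token per Mimi codebook, which the codec then turns into audio. Greedy sampling and all sizes below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DepthTransformerSketch(nn.Module):
    """Predicts one frame's codec tokens (one per codebook) autoregressively,
    conditioned on the backbone's hidden state for that frame."""

    def __init__(self, d_model=1024, n_codebooks=8, codebook_size=2048):
        super().__init__()
        self.n_codebooks = n_codebooks
        self.start = nn.Parameter(torch.zeros(1, 1, d_model))
        self.token_emb = nn.Embedding(codebook_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, codebook_size)

    @torch.no_grad()
    def forward(self, backbone_state):               # (B, d_model), one frame
        B = backbone_state.shape[0]
        seq = self.start.expand(B, 1, -1) + backbone_state[:, None, :]
        tokens = []
        for _ in range(self.n_codebooks):            # one codebook at a time
            logits = self.head(self.core(seq)[:, -1])          # (B, codebook_size)
            next_tok = logits.argmax(-1)                       # greedy, for the sketch
            tokens.append(next_tok)
            seq = torch.cat([seq, self.token_emb(next_tok)[:, None, :]], dim=1)
        return torch.stack(tokens, dim=-1)            # (B, n_codebooks)

frame_state = torch.randn(1, 1024)                    # one frame from the backbone
codec_tokens = DepthTransformerSketch()(frame_state)  # tokens for the audio decoder
```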

5. Outputs: Assistant’s Response

The model generates two outputs:

  • Textual Response (e.g., “cat.”)
  • Audio Response (spoken answer)

Both are delivered in real-time, making MoshiVis a fully conversational vision-speech model.

3. Model Components

MoshikaVis: A specific MoshiVis variant based on Moshika, a female-voice checkpoint from the open-source Moshi model. This variant ensures diversity in speech output options.

PaliGemma2 Vision Encoder: The model relies on frozen weights from a PaliGemma2 image-text encoder, which allows it to process and interpret images without the need for additional vision model fine-tuning.

Bundled Components: Each MoshiVis checkpoint includes all necessary modules:

  • Vision Adaptation Modules: Additional layers to integrate visual data into speech.
  • Mimi Speech Codec: A specialized speech encoding/decoding system for better voice generation.
  • Helium Text Tokenizer: Optimized text tokenization for efficient speech-to-text conversion.
  • Base Moshi Model: The core conversational AI system.
  • Image Encoder: A pre-trained component responsible for extracting visual embeddings from images.

How to use MoshiVis for free?

A live demo is available at the URL below.

moshi.chat

The model weights should be up on Hugging Face shortly:

kyutai/moshika-vis-pytorch-bf16 · Hugging Face
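
Once the weights are up, the checkpoint can be fetched locally with the standard Hugging Face Hub client (the repo id is the one linked above):

```python
from huggingface_hub import snapshot_download

# Downloads the published checkpoint files to the local HF cache
# and returns the path to the snapshot directory.
local_dir = snapshot_download("kyutai/moshika-vis-pytorch-bf16")
print(local_dir)
```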

Do try out MoshiVis. It’s a unique model and the output is great. Good days for open-source AI.

See you next with something amazing on AI soon.

