JoyAI-VL-Interaction

The first open, vision-driven real-time interaction model: it watches a live video stream and decides on its own when to speak, stay silent, or delegate.

📄 Paper · 🌐 Project Page & Demos · 💻 GitHub · 🤗 Paper Page


Overview

Most large models today are turn-based: they answer only when you ask. But many moments in the real world don't wait for a question: a fire starts on a security feed, someone falls, a product flashes by in a livestream. Once missed, the moment is gone.

JoyAI-VL-Interaction is built for exactly these moments. It is an 8B-scale, vision-first interaction model that continuously watches a live video stream and, every second, decides on its own to take one of three actions:

  • Speak: respond when something is worth saying
  • Stay silent: keep watching when nothing warrants a response (a first-class, trained action)
  • Delegate: hand a hard subtask to a background model/agent, keep watching, and weave the result back in when it returns

The decision of when to act is learned inside the model, from data that is time-aligned second by second plus reinforcement learning (RL), rather than bolted on by an external turn-detector or polling loop. Vision is the first-class driver; speech (ASR/TTS) is treated as pluggable I/O.
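The per-second control flow described above can be illustrated with a small toy sketch. Nothing here is the JoyAI-VL-Interaction API: `Action`, `Decision`, `run_loop`, `toy_policy`, and the `delegate_latency` parameter are hypothetical names, frames are plain strings, and the hand-written `toy_policy` merely stands in for the decision that the real model learns internally. The sketch only shows how the three actions interleave over time, including a delegated result being woven back in while watching continues.

```python
import enum
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


class Action(enum.Enum):
    SPEAK = "speak"
    SILENT = "silent"
    DELEGATE = "delegate"


@dataclass
class Decision:
    action: Action
    text: str = ""  # utterance for SPEAK, task description for DELEGATE


@dataclass
class PendingTask:
    description: str
    ready_at: int  # tick at which the simulated background result returns


def run_loop(
    frames: Sequence[str],
    policy: Callable[[int, str], Decision],
    delegate_latency: int = 2,  # assumed latency of the background agent, in ticks
) -> List[Tuple[int, str]]:
    """Step once per second of video and return a (tick, event) transcript."""
    transcript: List[Tuple[int, str]] = []
    pending: List[PendingTask] = []
    for tick, frame in enumerate(frames):
        # Weave back any delegated results that have returned by this tick.
        still_waiting = []
        for task in pending:
            if task.ready_at <= tick:
                transcript.append((tick, f"result: {task.description}"))
            else:
                still_waiting.append(task)
        pending = still_waiting

        # One decision per tick; in the real system this comes from the model.
        decision = policy(tick, frame)
        if decision.action is Action.SPEAK:
            transcript.append((tick, f"say: {decision.text}"))
        elif decision.action is Action.DELEGATE:
            pending.append(PendingTask(decision.text, tick + delegate_latency))
        # Action.SILENT: emit nothing and keep watching.
    return transcript


def toy_policy(tick: int, frame: str) -> Decision:
    """Hand-written stand-in for the learned decision (illustration only)."""
    if frame == "fire":
        return Decision(Action.SPEAK, "Fire detected")
    if frame == "product":
        return Decision(Action.DELEGATE, "look up product")
    return Decision(Action.SILENT)


if __name__ == "__main__":
    for event in run_loop(["empty", "product", "empty", "fire", "empty"], toy_policy):
        print(event)
```

In this toy run, the loop stays silent on most ticks, delegates at tick 1, and at tick 3 both reports the returned result and speaks about the new event, without ever pausing the stream to wait for the background task.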

To our knowledge, this is the first open, vision-driven interaction model released together with its training recipe, data, and a complete deployable system.

