JoyAI-Echo open-source framework targets stable 5-minute AI long videos
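Original: 5分钟AI长视频不翻车!国产开源框架杀到全球第一梯队 ("5-minute AI long videos that don't fall apart! A Chinese open-source framework storms into the global top tier")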
JoyAI-Echo is an open-source long-video generation framework focused on consistency, latency, editing, and upscaling.
QbitAI reports that JD’s team has open-sourced JoyAI-Echo, a long audio-video generation framework for multi-minute AI videos. It targets character drift, unstable voice, slow inference, and blurry output through cross-modal memory, memory-driven post-training, and lightweight real-time super-resolution. The system also includes a Director Agent for script planning, shot-level generation, localized edits, and iterative video production.
QbitAI reports that a team affiliated with JD.com has open-sourced JoyAI-Echo, a long audio-video generation framework positioned to move AI video from short-clip demonstrations toward a more complete long-form production workflow. The article notes that AI video generation is already relatively mature for clips of a few seconds to tens of seconds. Once extended to several minutes, however, it runs into inconsistent character appearance from frame to frame, drifting speech timbre, broken cross-scene narrative, high inference latency, and excessive post-editing costs.
The first core component is a cross-modal audio-video memory bank. It stores a character's visual features together with their voice features and draws on both during subsequent shot generation, so that identity, appearance, and timbre stay consistent across multi-shot videos of around five minutes.
The second is a memory-driven post-training pipeline that includes SFT, reinforcement learning from human feedback, and DMD distillation. The article claims that DMD-related optimizations can yield roughly a 7.5x inference speedup, bringing long-video generation closer to a usable tool than a one-off demonstration.
The third is lightweight real-time super-resolution. The system first generates 720P video and audio, then outputs 1K or 2K results in a single forward pass, aiming to cut the waiting time and the deviations introduced by the traditional "generate first, then super-resolve offline" approach.
JoyAI-Echo also includes a Director Agent that breaks natural-language requirements down into script, characters, scenes, and shots, and supports regenerating only the problematic shots rather than redoing the entire video.
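The memory bank and shot-level regeneration described above can be illustrated with a toy sketch. This is not JoyAI-Echo's code or API: every name here (CharacterMemory, MemoryBank, generate_shot, regenerate) is hypothetical, the feature vectors are made-up placeholders, and the "generator" is a stand-in that only records which memory entry it was conditioned on.

```python
from dataclasses import dataclass, field


@dataclass
class CharacterMemory:
    """Paired visual and voice features for one character (hypothetical layout)."""
    visual: list
    voice: list


@dataclass
class MemoryBank:
    """Toy cross-modal memory: one entry per character, shared by all shots."""
    entries: dict = field(default_factory=dict)

    def remember(self, name, visual, voice):
        # Store both modalities in one entry so they are always recalled together.
        self.entries[name] = CharacterMemory(list(visual), list(voice))

    def recall(self, name):
        return self.entries[name]


def generate_shot(shot_id, prompt, characters, bank):
    """Stand-in for a real generator: records the memory it conditioned on."""
    conditioning = {name: bank.recall(name) for name in characters}
    return {"shot_id": shot_id, "prompt": prompt, "conditioning": conditioning}


def regenerate(shots, plan, bad_ids, bank):
    """Redo only the flagged shots; every other shot is reused unchanged."""
    by_id = {p["shot_id"]: p for p in plan}
    return [
        generate_shot(bank=bank, **by_id[s["shot_id"]]) if s["shot_id"] in bad_ids else s
        for s in shots
    ]


# Placeholder features and prompts, for illustration only.
bank = MemoryBank()
bank.remember("host", visual=[0.1, 0.2], voice=[0.7, 0.3])
plan = [
    {"shot_id": 0, "prompt": "host greets viewers", "characters": ["host"]},
    {"shot_id": 1, "prompt": "host shows product", "characters": ["host"]},
]
shots = [generate_shot(bank=bank, **p) for p in plan]
fixed = regenerate(shots, plan, bad_ids={1}, bank=bank)
```

In this sketch the regenerated shot 1 is conditioned on the same stored entry as the untouched shot 0, which is the property the article attributes to the memory bank: a redone shot reuses the character's recorded look and voice instead of sampling them afresh.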
The article cites official evaluations claiming that JoyAI-Echo leads in user-preference comparisons covering long-video tasks, audio quality, prompt adherence, and IP consistency. These figures come from the project's own evaluations and still need independent verification by developers and researchers using the open-source release. Overall, this is a noteworthy open-source framework for long AI video, with particular potential for digital humans, brand videos, educational content, short dramas, and interactive narrative creation.
Source: 量子位 QbitAI. Summaries are AI-generated; the original article is authoritative.