JoyAI-Echo open-source framework targets stable 5-minute AI long videos | EveryCorner

QbitAI reports that a team affiliated with JD.com has open-sourced JoyAI-Echo, a long audio-video generation framework intended to move AI video from short-clip demonstrations toward a more complete long-form production workflow. The article notes that AI video generation is already relatively mature for clips lasting a few seconds to tens of seconds, but that extending it to several minutes exposes several problems: character appearance that changes between shots, drifting voice timbre, broken cross-scene narrative, high inference latency, and excessive post-editing costs.
The article describes three core components. The first is a cross-modal audio-video memory bank that stores a character's visual features together with their voice features and draws on both when generating later shots, so that identity, appearance, and timbre stay consistent across multi-shot videos of around five minutes. The second is a memory-driven post-training pipeline comprising SFT, reinforcement learning from human feedback, and DMD distillation; the article claims that the DMD-related optimizations yield roughly a 7.5x inference speedup, bringing long-video generation closer to a usable tool than a one-off demonstration. The third is lightweight real-time super-resolution: the model first generates 720P video and audio, then outputs 1K or 2K results in a single forward pass, which is meant to reduce the waiting time and discrepancies introduced by the traditional "generate first, then super-resolve offline" approach.
JoyAI-Echo also includes a Director Agent that breaks a natural-language request down into script, characters, scenes, and shots, and supports regenerating only the problematic shots rather than redoing the entire video.
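The memory bank and shot-level regeneration described above can be illustrated with a small Python sketch. This is not JoyAI-Echo's code or API: the names `CharacterMemory`, `AudioVideoMemoryBank`, `Shot`, `generate_shot`, and `regenerate_shot` are all hypothetical, the feature vectors are made-up numbers, and the "generator" is a stand-in that only records which memory entries would condition each shot.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CharacterMemory:
    """Hypothetical memory entry: one visual and one voice feature vector."""
    visual: tuple[float, ...]
    voice: tuple[float, ...]


class AudioVideoMemoryBank:
    """Toy cross-modal memory bank keyed by character name (assumed design)."""

    def __init__(self) -> None:
        self._entries: dict[str, CharacterMemory] = {}

    def register(self, name: str, visual: list[float], voice: list[float]) -> None:
        # Store both modalities together so they are always retrieved as a pair.
        self._entries[name] = CharacterMemory(tuple(visual), tuple(voice))

    def lookup(self, name: str) -> CharacterMemory:
        return self._entries[name]


@dataclass
class Shot:
    index: int
    prompt: str
    characters: list[str]
    version: int = 0  # how many times this shot has been generated
    conditioning: dict[str, CharacterMemory] = field(default_factory=dict)


def generate_shot(shot: Shot, bank: AudioVideoMemoryBank) -> Shot:
    """Stand-in for a real generator: only records the memories used for the shot."""
    shot.conditioning = {name: bank.lookup(name) for name in shot.characters}
    shot.version += 1
    return shot


def regenerate_shot(shots: list[Shot], index: int, bank: AudioVideoMemoryBank) -> None:
    """Redo a single problematic shot; all other shots are left untouched."""
    generate_shot(shots[index], bank)


bank = AudioVideoMemoryBank()
bank.register("Lin", visual=[0.1, 0.9], voice=[0.7, 0.2])  # illustrative values

shots = [Shot(i, f"scene {i}", ["Lin"]) for i in range(3)]
for shot in shots:
    generate_shot(shot, bank)

regenerate_shot(shots, 1, bank)  # only shot 1 is redone
print([s.version for s in shots])  # [1, 2, 1]
```

Because every shot reads the same stored entry, the regenerated shot is conditioned on the same visual and voice features as its neighbors, which is the consistency property the article attributes to the memory bank.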
The article cites official evaluations claiming that JoyAI-Echo leads in user-preference comparisons covering long-video tasks, audio quality, prompt adherence, and IP consistency. These results come from the project's own evaluation, however, and still need to be independently verified by developers and researchers using the open-source release. Overall, this is a noteworthy open-source framework for long-form AI video, with particular potential for digital humans, brand videos, educational content, short dramas, and interactive narrative creation.