We propose Long Context Tuning (LCT) for scene-level video generation to bridge the gap between current single-shot generation and real-world narrative video productions. Homepage: Report:
46,813 views • 1 year ago • via X (Twitter)
9 comments

The belief that too much inductive bias might compromise scalability guides us to expand the attention context window from a single shot to multiple shots. By combining interleaved 3D RoPE, asynchronous timesteps, and context-causal attention with a KV-cache, LCT supports efficient auto-regressive sampling.
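As a rough illustration of the two ingredients named above, here is a minimal sketch, not the authors' implementation. It assumes that "context-causal" means tokens attend bidirectionally within their own shot and to all earlier shots, but never to later ones, and that "asynchronous timesteps" means each shot gets its own independently drawn diffusion timestep. The function names and the uniform timestep distribution are hypothetical.

```python
import random


def context_causal_mask(shot_lengths):
    """Return mask[q][k] == True if query token q may attend to key token k.

    Assumption (not confirmed by the thread): attention is bidirectional
    inside a shot and causal across shots, so earlier shots never see later
    ones and their keys/values can be cached during sampling.
    """
    # Label every token with the index of the shot it belongs to.
    shot_ids = [s for s, n in enumerate(shot_lengths) for _ in range(n)]
    return [[q >= k for k in shot_ids] for q in shot_ids]


def sample_async_timesteps(num_shots, num_train_steps, rng):
    """Draw one independent diffusion timestep per shot (uniform, assumed)."""
    return [rng.randrange(num_train_steps) for _ in range(num_shots)]


# Three shots with 2, 3, and 1 tokens respectively.
mask = context_causal_mask([2, 3, 1])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))

timesteps = sample_async_timesteps(3, 1000, random.Random(0))
```

Under this assumption the mask is block lower-triangular: each shot's rows are identical, and the rows of an earlier shot never change when later shots are appended. That property is what would let a KV-cache reuse the keys and values of finished shots.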

Thanks to auto-regressive sampling, LCT also exhibits several emergent abilities without explicit training objectives, such as interactive generation. For example, we can feed in a Sora-generated video as the starting shot and continue producing new shots that follow text prompts.

In addition, through joint training on short single-shot and long multi-shot videos, LCT also enables interactive extension of a single shot.

Remarkably, despite no extra explicit training objective, our model enables compositional generation by accepting separate identity and environment images to synthesize coherent videos that integrate these distinct elements.

Our bidirectional model accepts visual conditions in arbitrary order and location, supporting "scene interpolation" applications. As shown below, given the first and last shots, we can generate intermediate scenes with semantic coherence.

This is the longest video I've generated so far. So is this thread lol. Many thanks to Yuwei, Ziyan, Zhibei, Zhijie, Zhenheng, Dahua and Lu.

Dark mode for this paper for those who read at night 🌚

Cool
