Can we simplify video generation by decomposing it into interleaved text-video co-generation? Would explicit, repeated thinking in language improve generation in pixels? We introduce TV2TV: a unified model that jointly learns
- language modeling (next-token prediction)
- video flow matching (next-frame prediction)

At inference, TV2TV dynamically alternates between textual...

Xiaochuang Han

1,087 subscribers

31,749 views • 7 months ago • via X (Twitter)

Related videos

Our research team just released Flex-Forcing: a video generation method that lets a single model switch between generation methods at inference time. Right now there are two main approaches to video generation. Bidirectional diffusion models attend to every frame at once, holding structure well at the cost of speed. Autoregressive models generate frame by frame, so they stream fast and scale to long clips, but accumulate error and drift over time. Flex-Forcing trains a single model to do both, letting you choose from the range at inference based on your compute budget.

NVIDIA AI

31,412 views • 16 days ago

Make Pixels Dance: High-Dynamic Video Generation paper page: Creating high-dynamic videos such as motion-rich actions and sophisticated visual effects poses a significant challenge in the field of artificial intelligence. Unfortunately, current state-of-the-art video generation methods, primarily focusing on text-to-video generation, tend to produce video clips with minimal motions despite maintaining high fidelity. We argue that relying solely on text instructions is insufficient and suboptimal for video generation. In this paper, we introduce PixelDance, a novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. Comprehensive experimental results demonstrate that PixelDance trained with public data exhibits significantly better proficiency in synthesizing videos with complex scenes and intricate motions, setting a new standard for video generation.

AK

102,444 views • 2 years ago

A breakthrough in real-time video generation. As a research preview developed with NVIDIA and shared at NVIDIA GTC this week, we trained a new real-time video model running on Vera Rubin. HD videos generate instantly, with time-to-first-frame under 100ms. Unlocking an entirely new creative paradigm and bolstering the foundations of our General World Model, GWM-1. Real-time generation opens a fundamentally different design space for video models and world simulation. We're investing in co-designing our models alongside advances in hardware to keep pushing this frontier.

Runway

1,162,805 views • 4 months ago

1/4 🚀We are launching Qwen-Image-2.0, a next-generation foundational image generation model. The key highlights of Qwen-Image-2.0 include:
- Professional Typography Rendering: Supports 1k-token instructions for direct generation of professional infographics, including PPTs, posters, comics, and more.
- Stronger Semantic Adherence: Native 2K resolution support for finely detailed realistic scenes, including people, nature, and architecture.
- Improved Text Rendering: Integrated understanding and generation capabilities, unifying image generation and editing in a single mode.
- Lighter Model Architecture: Smaller model size with faster inference speed.

Tongyi Lab

164,097 views • 5 months ago

🚀 We open-sourced LongLive — interactive, real-time long-video generation. 👥Generates video in real time as users enter text prompts. ⚡️20.7 FPS on a single H100,⏱️up to 240s per clip. 🎬Fine-tunes SOTA short-video models (e.g., Wan) into long-video generators. 🌍One step closer to World Models. All code for training & inference, model weights, demo page, and videos released! Paper: Code: Model: Demo Page: Introduction Video:

Yukang Chen

11,835 views • 10 months ago

Meta announces Movie Gen: A Cast of Media Foundation Models. We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user’s image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models.

AK

62,719 views • 1 year ago

Runway has unveiled a massive breakthrough in AI video generation at NVIDIA's GTC event, showcasing a new model capable of generating high-definition video in genuine real-time. The new model boasts a "time-to-first-frame" of under 100 milliseconds. This means HD video begins generating and playing almost the instant a command is given. This research preview was co-developed with NVIDIA and runs on their newly announced Vera Rubin architecture. This real-time capability is a foundational step for Runway's General World Model (GWM-1). It pushes the technology beyond simple video generation and into the realm of live, interactive environments.

Wes Roth

82,506 views • 4 months ago

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,319 views • 3 years ago

🚨ELON: WE’RE ABOUT TO TRAIN A VIDEO MODEL WITH 100,000 GPUS “We expect to be training a video model with over a hundred thousand GB200s, and to begin that training within the next three or four weeks. So we're confident it's gonna be pretty spectacular in video generation and video understanding.” Source: xAI

Mario Nawfal

236,544 views • 1 year ago

The reception since we launched the first unified audio + text + image to video model in production last week has been incredible! We wanted to take a moment and highlight some firsts:
- the first omnimodal model that jointly operates over text, image, and audio. Fun fact: we shipped our first audio-to-video model Character-1 last June!
- Character-3 is generating 5 minute uncut clips, and capable of temporally stable longform prediction.
- Character-3 handles realistic humans, cartoons, side profiles, dynamic backgrounds, anime, animals, and more.
- Character-3 sets a new standard for fast & efficient video diffusion at HD and SD resolutions, leveraging our advancements in realtime video generation.

We'll have an announcement soon, but you can read more about an earlier version of this line of work in our paper here:

Hedra

180,676 views • 1 year ago

Google presents Still-Moving: Customized Video Generation without Customized Video Data. Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.

AK

40,474 views • 2 years ago

We present MotionStream — real-time, long-duration video generation that you can interactively control just by dragging your mouse. All videos here are raw, real-time screen captures without any post-processing. Model runs on a single H100 at 29 FPS and 0.4s latency.

Xun Huang

98,559 views • 8 months ago

I spent 5 hours today making OpenClaw even better at Motion Graphics. I created the video below in 2 prompts with no asset uploads. It scraped or generated everything from a text prompt. I've added:
- better brand scraping
- image generation
- video generation
- music generation and scraping
- better transitions on first prompt
- basically zero errors
- Always outputs a link to the editor you see below so I can open it from any device... (I can create and edit from telegram)

Still work to do. I will make a video this weekend and release the skill. It's the best remotion skill that exists imo.

Riley Brown

120,252 views • 4 months ago

At Avalon we are building "Real-time creating" - the ability to generate gameplay ready persistent worlds prompted from text. While others are building real-time video world models, Avalon is building real-time world generation inside a fully playable, persistent multiplayer engine. Internally running at 3840×2180 at 60 FPS. Built on Unreal Engine. Multiplayer by default. Persistent by default. Gameplay-ready by default. This is not a video latent replay. Not a simulation of interaction. It is a real 3D world with physics, logic, and authoritative multiplayer state. Avalon is trained on proprietary Avalon interaction data and powered by a hybrid system that combines language understanding, 3D model generation, procedural systems, and structured gameplay logic synthesis. Players can walk through a live world and generate environments, assets, mechanics, and entirely new gameplay modes using natural language. We accomplish this through a combination of 3D model generation, game logic generation based on our proprietary systems, and AI driven world creation. While other players are inside it. Changes persist instantly. State is synchronized in real time. Creation happens inside the world, not outside of it. Describe a biome. Spawn a civilization. Create a survival mode. Build a dungeon crawler. Launch a new game inside the world. Avalon interprets intent and integrates it directly into the live multiplayer environment. This is not a world model predicting video. This is a gameplay engine that understands language. If you can describe it, you can build it. And others can walk into it instantly.

AVALON

62,343 views • 5 months ago

Tencent presents GameGen-O: Open-world Video Game Generation. We introduce GameGen-O, the first diffusion transformer model tailored for the generation of open-world video games. This model facilitates high-quality, open-domain generation by simulating a wide array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, thus allowing for the gameplay simulation. The development of GameGen-O involves a comprehensive data collection and processing effort from scratch. We collect and build the first Open-World Video Game Dataset (OGameData), amassed extensive data from over a hundred of next-generation open-world games, employing a proprietary data pipeline for efficient sorting, scoring, filtering, and decoupled captioning. This robust and extensive OGameData forms the foundation of our model's training process. GameGen-O undergoes a two-stage training process, consisting of foundation model pretraining and instruction tuning. In the first phase, the model is pre-trained on the OGameData via the text-to-video and video continuation, endowing GameGen-O with the capability for open-domain video game generation. In the second phase, the pre-trained model is frozen, and we fine-tuned using a trainable InstructNet, which enables the production of subsequent frames based on multimodal structural instructions. This whole training process imparts the model with the ability to generate and interactively control content. In summary, GameGen-O represents a notable initial step forward in the realm of open-world video game generation via generative models. It underscores the potential of generative models to serve as an alternative to rendering techniques, which can efficiently combine creative generation with interactive capabilities.

AK

367,088 views • 1 year ago

🚨 AI just crossed into reality. A single human eye reflects an entire AI-generated world — futuristic cities, robots, and digital landscapes forming in real time. We are no longer imagining the future. We are actively creating it. Made with AI video generation (Kling model) on Yapper “This is not the future. It’s now.”

Mimu | AI Tools & News

72,897 views • 3 months ago

Robot Learning needs 4D world models! We introduce TesserAct, a 4D embodied world model that can simulate how agents interact with the 3D world over time! We achieve this by simply extending a pre-trained 2D video generation model to jointly predict RGB, depth, and surface normals. It enables: 1️⃣ Much better policy learning in the wild 2️⃣ Temporal + spatial coherence in 4D dynamic prediction 3️⃣ Novel view synthesis for embodied scenes Code: Paper Link: Project page:

Chuang Gan

43,265 views • 1 year ago

training a model that takes a text prompt and generates audio that renders video on an oscilloscope. AgenC agents live inside worlds the model generates.

the pipeline: real videos -> edge detection -> vectorization -> path ordering -> 192kHz 3-channel WAV, where X/Y control beam position and Z controls beam intensity.

3 values per timestep. that's all the model is learning. compare that to video gen models trying to predict millions of pixels per frame. transformers are already great at sequence prediction and that's literally all this is. waveform generation.

the output IS the playback. generate the audio, feed it to a scope, it draws the scene in real-time. there's no rendering step. it's analog so there's no pixel grid. you get continuous curves and effectively infinite resolution.

bootstrapped with procedural data, lissajous curves, wireframe 3D, stick figures, then scaled on real-world video converted to trace format. 90 TB of source video.

the model learns edges, contours, spatial relationships, motion. once it has that, describing a scene it's never seen is novel trajectory through the same learned space. generative geometry.

tetsuo

18,397 views • 5 months ago