You can't 3D reconstruct glass from images... WRONG! Thanks to video diffusion, just about anything is possible now. Introducing Diffusion Knows Transparency (DKT). Transparent and reflective objects usually break robot vision and photogrammetry pipelines because they don't follow the "solid object" rules standard cameras expect. DKT is a new AI model that repurposes the "internal physics engine" found in video generation models to solve this problem. Researchers took a massive video diffusion model (WAN) and fine-tuned it on a custom-built synthetic dataset to turn it into a high-precision depth sensor. To train the AI, they built the first massive synthetic video library of transparent objects: 1.32 million frames of perfectly labeled glass and metal objects in motion. Without ever seeing a "real" labeled video of glass during training, the model (DKT) outperformed all previous specialized systems on real-world benchmarks (ClearPose, DREDS). They also created a "lightweight" 1.3B-parameter version that runs fast enough (0.17s per frame) to be used on actual robot hardware. Two reasons I find this project important: 1. It further proves that synthetic data will be essential for training next-generation vision models. 2. In real-world robotic tests, using DKT's depth maps nearly doubled the success rate of robot arms trying to pick up objects on tricky reflective or translucent surfaces. At-home robots will need to interact with these types of objects on a daily basis. Check out the project page here: Code is LIVE! #Computervision #Robotics #AI
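To put the quoted 0.17s per frame in perspective, here is a small back-of-the-envelope sketch. It only uses the latency figure from the post; the clip length and frame rate are hypothetical values chosen for illustration, and the calculation assumes frames are processed one after another at a constant cost.

```python
def frames_per_second(seconds_per_frame: float = 0.17) -> float:
    """Throughput implied by a fixed per-frame latency."""
    return 1.0 / seconds_per_frame


def seconds_to_process(num_frames: int, seconds_per_frame: float = 0.17) -> float:
    """Total time if every frame costs the same quoted latency."""
    return num_frames * seconds_per_frame


if __name__ == "__main__":
    # Hypothetical clip: 10 seconds recorded at 30 frames per second.
    clip_frames = 10 * 30
    print(f"throughput: {frames_per_second():.2f} frames/s")
    print(f"time for {clip_frames} frames: {seconds_to_process(clip_frames):.1f} s")
```

Under these assumptions the model runs at roughly 6 frames per second, which is below typical camera frame rates but may be enough for grasp planning, where a depth map is needed per decision rather than per camera frame.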

Jonathan Stephens

12,928 subscribers

17,712 views • 6 months ago • via X (Twitter)

Related videos

Depth Any Video with Scalable Synthetic Data. AI physicists and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet). Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.
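The abstract names rotary position encoding as one ingredient for handling sequences of varying length, but does not spell out how it is applied. The sketch below shows the standard rotary formulation on a single vector in plain Python; which axes (time, height, width) the model actually rotates over, and with what base, are not stated in the post, so `base=10000.0` and the function names are assumptions for illustration only.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `position`."""
    d = len(vec)
    if d % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Pair i/2 uses frequency base^(-i/d), as in the standard formulation.
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.25]
    # Both pairs of positions are 2 apart, so the two scores should agree.
    print(dot(rope(q, 3), rope(k, 1)))
    print(dot(rope(q, 12), rope(k, 10)))
```

The property that makes this attractive for mixed-duration training is visible in the usage example: the attention score between two rotated vectors depends only on the offset between their positions, not on where they sit in the sequence.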

MrNeRF

27,428 views • 1 year ago

I'll always root for a team that open-sources its best work, and Robbyant just did it properly. Robbyant, Ant Group's embodied-AI company, released LingBot-Vision, a vision foundation model for robots, and the part I love is the data. They trained it on 161M images, filtered down from 2B raw ones and mostly pulled straight from the open web, with no human labels, no edge detectors, no depth sensors anywhere in the loop. It learns the exact edges of objects from raw pixels. That's roughly a tenth of the data DINOv3 saw, and under a third of the training. And it shows in the results. On depth, working out how far away things are, the 1B model edges out a 7B on NYU-Depth. It also powers LingBot-Depth 2.0, which reads the surfaces cameras usually choke on, glass and mirrors, and halves indoor depth error. LingBot-Vision is fully open. Weights from the 1.1B flagship down to a tiny 21M version, code, and the paper. This is the timeline I want more of. Robbyant

Chubby♨️

48,249 views • 22 days ago

Video diffusion models have strong implicit representations of 3D shape, material, and lighting, but controlling them with language is cumbersome, and control is critical for artists and animators. GenLit connects these implicit representations with a continuous 5D control signal describing the direction and intensity of a point light source. This enables single-image near-field relighting of an image using a video diffusion model. We use a ControlNet-like approach and show that, with a small amount of synthetic data, GenLit generalizes to complex real-world images. Given a single image and the 5D lighting signal, GenLit creates a video of a moving light source that is inside the scene. It moves around and behind scene objects, producing effects such as shading, cast shadows, specularities, and interreflections with a realism that is hard to obtain with traditional inverse rendering methods. GenLit shows that it is possible to get continuous control over implicit physical processes within a video model. I think this is just the beginning and promises to make such models much more practical for creators. Shrisha Bharadwaj will present today at SIGGRAPH Asia, Room S423/S424, Level 4, at 13:50 on 15 Dec.

Michael Black

22,182 views • 7 months ago

Trained on zero real-world data. Learned to walk, pick up boxes, and follow multi-step instructions... in the REAL world. ( 📌 Paper below) Researchers from Amazon FAR, Berkeley, Stanford, and CMU scanned real rooms with an iPhone, rebuilt them as 3D Gaussian Splatting scenes, then generated 48,000 synthetic trajectories of a Unitree G1 walking, grasping, and placing objects inside those virtual replicas. They rendered the robot's first-person camera view from each run and paired it with the matching language instruction and motion data. That's the dataset every humanoid team needs and nobody has: synced egocentric video + language + kinematics, at scale. Instead of collecting it in the real world, they manufactured it. They trained a vision-language-kinematics policy on that synthetic data alone, then deployed it on the physical G1 across five task types: navigation to a named object, lifting boxes of three different sizes with no per-size tuning, chained multi-step tasks, robustness to mid-task layout changes and flickering lights, and multi-minute long-horizon runs. No real-world fine-tuning at any point. Real-world interaction data has been the hard limit on humanoid learning... slow, expensive, and small. If scanning a room once and synthesizing thousands of labeled interactions holds up as a general recipe, that limit moves. Data stops being the bottleneck robotics teams have to solve for. 📌 Paper: Project: ——- Weekly robotics and AI insights. Subscribe free:

Ilir Aliu

12,950 views • 9 days ago

Chop the gradients ✂️! We found that truncating decoder gradients in latent video diffusion to a fixed window allows us to finetune on videos with pixel-wise perceptual losses without running out of memory. Pixel losses have been essential for image generation and reconstruction, but until now, they haven't scaled to long-duration, high-resolution video diffusion due to recursive activation accumulation in causal decoders, leading to OOM during training 💥📉. Project: Video diffusion models can do a lot more 🚀 when you can backprop the decoder! Post-process neural rendered scenes, super-resolve videos, harmonize lighting in controlled synthetic driving scenes, and inpaint videos — all in a single step ⚡ with a quick finetune from a standard diffusion model.
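As a rough intuition for what truncating gradients to a fixed window means, here is a toy analog in plain Python. It is not the authors' decoder: it assumes a hypothetical scalar causal recurrence `h_t = a * h_{t-1} + z_t` with outputs `y_t = h_t` and loss `sum(y_t)`, and computes the gradient with respect to each latent `z_s` while letting every output send gradient back at most `window` steps.

```python
def latent_grads(num_frames, a, window):
    """Gradient of sum_t y_t w.r.t. each z_s, truncated to `window` steps back.

    Toy model (an assumption, not the paper's decoder):
        h_t = a * h_{t-1} + z_t,   y_t = h_t
    so d y_t / d z_s = a ** (t - s) for s <= t.
    """
    grads = [0.0] * num_frames
    for t in range(num_frames):          # each output frame y_t
        oldest = max(0, t - window + 1)  # oldest latent that still receives gradient
        for s in range(oldest, t + 1):
            grads[s] += a ** (t - s)
    return grads


if __name__ == "__main__":
    print("full     :", latent_grads(4, 0.5, window=4))
    print("window=2 :", latent_grads(4, 0.5, window=2))
    print("window=1 :", latent_grads(4, 0.5, window=1))
```

With `window` equal to the number of frames the result matches full backpropagation; smaller windows drop the long-range terms, which in this toy are also the smallest ones. The memory argument in the post corresponds to only needing the last `window` steps of state for the backward pass, rather than the whole sequence.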

Felix Heide

28,399 views • 3 months ago

We’ve seen humanoid robots walk around for a while, but when will they actually help with useful tasks in daily life? The challenge here is the diversity and complexity of real-world scenes. Our new work tackles this problem via 3D visuomotor policy learning. Using data from only 1 scene, our Improved 3D Diffusion Policy (iDP3) enables a full-sized humanoid robot to autonomously pick&place objects, pour water, and wipe tables, in the wild open world. (and all these skills are useful, right?) Web: Fully open-sourced code:

Yanjie Ze

75,271 views • 1 year ago

Check out this Stereo4D paper from Google DeepMind. It's a pretty clever approach to a persistent problem in computer vision -- getting good training data for how things move in 3D. The key insight is using VR180 videos -- those stereo fisheye videos we launched back in 2017 for YouTubeVR. It was always clear that structured stereo datasets would be valuable for computer vision -- and we launched some powerful VR tools with it back in 2017 (link below). But what's the game changer now in 2024 is the scale -- they're providing 110K high quality clips :-) That's the kind of massive, real-world AI dataset that was just a dream back then! They're using it to train this model called DynaDUSt3R that can predict both 3D structure and motion from video frames. Which means it tracks how objects move between frames while simultaneously reconstructing their 3D shape. And given we're dealing with real stereoscopic content, results are notably better than synthetic data, giving you a faithful rendition of the real-world with a diverse set of subject matter. It's one of those through lines when tackling a timeless mission like mapping the world or spatial computing -- VR content created for immersion becoming the foundation for teaching machines to understand how the world moves. Sometimes innovation chains together in unexpected ways! Links to projects below⛓️

Bilawal Sidhu

67,919 views • 1 year ago

NVIDIA AI Released DiffusionRenderer: An AI Model for Editable, Photorealistic 3D Scenes from a Single Video In a groundbreaking new paper, researchers at NVIDIA, University of Toronto, Vector Institute and the University of Illinois Urbana-Champaign have unveiled a framework that directly tackles this challenge. DiffusionRenderer represents a revolutionary leap forward, moving beyond mere generation to offer a unified solution for understanding and manipulating 3D scenes from a single video. It effectively bridges the gap between generation and editing, unlocking the true creative potential of AI-driven content. DiffusionRenderer treats the “what” (the scene’s properties) and the “how” (the rendering) in one unified framework built on the same powerful video diffusion architecture that underpins models like Stable Video Diffusion..... Read full article here: Paper: GitHub Page: NVIDIA NVIDIA AI NVIDIAnewsroom NVIDIA AIDev

Marktechpost AI Dev News ⚡

104,741 views • 1 year ago

AI Is Moving Beyond “Generating Videos” — Toward “Generating Worlds” Over the past two years, AI video models have advanced at an astonishing pace. From Runway and Pika to Sora and Veo, AI-generated videos have become increasingly realistic and more consistent with the physical laws of the real world. Many people believe the next objective is simply to generate videos that are longer, sharper, and more lifelike. But if we take a step back, we can see that the real transformation is not happening in video itself. It is happening in world models. What Is a World Model? In 1943, psychologist Kenneth Craik proposed an idea that would influence artificial intelligence research for decades. He argued that the human brain does not merely react to the outside world. Instead, it maintains an internal model of how the world works. Because we have this internal model, we can predict the outcome of an action before we actually take it. Before crossing a road, we estimate whether a car will pass by. Before catching a ball, we predict its trajectory. These abilities come from continuously simulating the world in our minds, rather than relying entirely on trial and error. This idea later became known by a more formal term: World Model. A world model does not describe a single image or a fixed video clip. It is an internal representation capable of continuously simulating the rules and dynamics of the real world. Why Is AI Research Turning Toward World Models? Because predicting “what comes next” is becoming increasingly central to how AI systems work. Language models predict the next token. Image models predict the next step in the denoising process. Video models predict the next frame. A world model, however, attempts to predict something broader: What should the world look like in the next moment? 
In 2018, David Ha and Jürgen Schmidhuber proposed in their paper World Models that an intelligent agent could first learn a model of the world, and then use that internal model to plan its actions. The Dreamer series later demonstrated that many complex tasks could be learned by training agents inside an “imagined world.” At the same time, the development of video models such as Sora and Veo led researchers to another realization: A model capable of continuously generating video has already learned, at least implicitly, many of the rules governing the real world. As a result, these two research directions have gradually begun to converge. But Video Is Not Yet a World This is where the distinction is often misunderstood. For a world model to support meaningful real-time interaction, it must solve several critical problems. Most video models today are essentially answering one question: What should the next frame look like? A true world model needs to answer much more: What happens if I take one step forward? If I walk behind a building and then return, will the building still be there? If I suddenly change the camera angle, will the entire space remain consistent? If I enter a command such as: “Summon a dragon.” Will the world respond immediately? In other words, a world model must do more than generate content. It must understand space. It must understand time. It must understand causality. And it must understand interaction. Moving from watching to participating is where the real difficulty of world models begins. World Models Are Entering the Interactive Era One of the latest attempts in this direction is Alaya World, recently open-sourced by Alaya World, or Alaya Lab. Instead of generating a fixed video clip, it generates a world that users can explore in real time. Users can begin with text, an image, or a video, enter the generated scene, move freely through it, and introduce new prompts at any moment during generation. The world responds immediately. 
According to the publicly released information, Alaya World provides: Real-time streaming generation at 720p and 24 FPS Stable continuous exploration for more than one minute The ability to switch prompts and trigger skills or events during generation Model weights and inference code released under the Apache 2.0 License Training code and datasets planned for future release What makes these capabilities important is not simply the technical specifications. It is that the generated “world” can now support continuous interaction. The official demo shows that users can genuinely control, transform, and explore the generated environment. AI Is Evolving From a Tool Into an Environment Over the past few years, most discussions around AI have focused on content generation. Generating text. Generating images. Generating videos. But world models raise a fundamentally different question: Can AI generate an environment that people can inhabit, explore, and continuously evolve? If the answer is yes, the impact will extend far beyond video generation. Game development, robotics training, embodied intelligence, digital twins, virtual production, and many other fields could be transformed by the development of world models. World models are still at a very early stage. Yet from Craik’s proposal of an internal mental model more than eighty years ago to the emergence of today’s interactive world-generation systems, a clear evolutionary path is beginning to take shape. Perhaps what AI is ultimately learning has never been limited to images, videos, or language. Perhaps it is learning the world itself. References GitHub: Technical Report:
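The quoted figures (720p, 24 FPS, more than one minute of continuous exploration) translate into a concrete per-frame compute budget. The sketch below works that out; it assumes 720p means 1280×720 pixels, which the post does not state explicitly, and the helper names are made up for illustration.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame at a given frame rate."""
    return 1000.0 / fps


def frames_in(seconds: float, fps: float) -> int:
    """Number of frames generated over a duration at a given frame rate."""
    return int(seconds * fps)


def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Raw pixel throughput the generator has to sustain."""
    return width * height * fps


if __name__ == "__main__":
    print(f"budget per frame: {frame_budget_ms(24):.1f} ms")
    print(f"frames in one minute: {frames_in(60, 24)}")
    # Assumes 720p = 1280x720.
    print(f"pixels per second: {pixels_per_second(1280, 720, 24):,}")
```

Staying consistent over a minute therefore means keeping more than a thousand consecutive frames coherent while producing each one in about 42 ms.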

雪踏乌云

112,114 views • 14 days ago

✨ Made a new mini feature on Photo AI: [ Grab from 3d model ]. The problem is that we're at that stage (typical for AI) where image-to-3d models are not good enough yet but are fun to play with, and we know they'll be good enough in 1-2 years. With [ Make 3d model ] you can already turn any Photo AI pic into a 3d model. It still looks hyper clunky and deformed, but it works! One cool idea I had to make that more useful, and have now built: let people make a 3d model, change the view of it with the 3d viewer, then press [ o ] and it grabs a frame of the 3d model. That image you can then [ Remix ] (img2img), and it becomes a real photo again, which in turn you can turn into a video with [ Make video ]. So that essentially gives you fully freeform camera position control to take photos with. One thing I need to fix is the background/skybox: I kinda need to take the original photo, remove the person, and just keep the background for the 3d model viewer. In this case it should be white, but it's a start!

@levelsio

119,210 views • 1 year ago

Introducing Attio Objects 🚀 We know how hard it is to find a CRM that fits your unique business model. That's why we built Attio Objects – our powerful data model with custom objects that gives you complete flexibility to structure your CRM exactly how you need it. Along with custom objects, we've also introduced new standard objects: - Workspaces and Users objects for PLG businesses. - A robust Deals object for sales-driven companies. This is the culmination of a 4-year effort, with 3 years of work put in even before launching Attio. Since day one, we've been determined to solve the fundamental problem in the CRM space: the trade-off between power and time-to-value. If you wanted power and flexibility, your CRM would take forever to build and not work well with your stack. If you wanted speed, you'd need to use highly opinionated, inflexible software that doesn't really work for your business. That ends today. With Attio, you no longer have to compromise. Build your CRM your way, fast. Iterate as you grow. High-growth startups like Replicate, , and Modal and more are already using Attio's object architecture to perfectly match their businesses and accelerate their growth. To get all the details, check out our blog post 👇

Attio

26,821 views • 2 years ago

AI in robotics gets all the attention right now, but sometimes the most interesting work is very practical. Viet built a small vision system that counts potatoes on a conveyor belt. No giant dataset. No huge model. Just a clear problem and a smart setup. He used Ultralytics’ ObjectCounter, trained a tiny YOLO11 nano model, and because there was no potato dataset, he annotated a single frame with SAM 2 and trained from that. One frame. Still works across the whole video. It is a good reminder that useful AI in industry often looks like this. Focused. Lightweight. Solves a real task. If you work in manufacturing or robotics, these small systems are usually the fastest wins. They save time, reduce errors, and do not need massive infrastructure. Nice work, Viet. His projects:
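The counting step can be illustrated with a minimal sketch. This is not Ultralytics' ObjectCounter and not Viet's code: it assumes a detector and tracker upstream already produce boxes with stable track IDs, and the `LineCounter` class, the line position, and the sample detections are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LineCounter:
    """Counts tracked objects whose centroid crosses the vertical line x = line_x."""
    line_x: float
    count: int = 0
    # Last known centroid x per track ID (IDs are assumed to come from a tracker).
    _last_x: dict = field(default_factory=dict)
    # IDs already counted, so a jittering box is not counted twice.
    _counted: set = field(default_factory=set)

    def update(self, detections):
        """detections: iterable of (track_id, x1, y1, x2, y2) boxes for one frame."""
        for track_id, x1, y1, x2, y2 in detections:
            cx = (x1 + x2) / 2.0
            prev = self._last_x.get(track_id)
            self._last_x[track_id] = cx
            if prev is None or track_id in self._counted:
                continue
            # Count when the centroid moves from the left of the line to the right.
            if prev < self.line_x <= cx:
                self.count += 1
                self._counted.add(track_id)
        return self.count


# Hypothetical detections for three frames of a belt moving left to right.
frames = [
    [(1, 10, 40, 30, 60), (2, 60, 42, 80, 62)],
    [(1, 45, 40, 65, 60), (2, 95, 42, 115, 62)],
    [(1, 90, 40, 110, 60), (2, 130, 42, 150, 62)],
]
counter = LineCounter(line_x=100.0)
for dets in frames:
    counter.update(dets)
print(counter.count)  # 2
```

Keeping the crossing logic separate from the detector is what lets a tiny model work here: the detector only has to find potatoes, and the counter only has to notice when a tracked centroid passes the line.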

Ilir Aliu

1,675,497 views • 8 months ago

doodles AI beta. next week. we're building the tools for a new era of dynamic world-building. it starts with an image model that reimagines anything and everything through the doodles lens. this is the first iteration of many. as the product evolves, we'll introduce the ability to turn your generations into physical objects. video with sound and dialogue, realtime AR, and gaming are all on the roadmap. doodles AI aligns us with the speed and scale of the AI industry at large. our colourful world can now be plugged into new tech as it unfolds. create with us.

burnt toast

61,243 views • 4 months ago

This is some quietly impressive work on making video world models actually controllable in 4D space. VerseCrafter lets you take an input image, use something like Blender to animate the 3D camera path and object trajectories, then uses that to condition generation. Scribbling in 2D feels so crude in comparison. The authors represent everything in a shared 4D world state - static background as a point cloud, moving objects as 3D gaussian trajectories. The gaussians are an interesting choice because they capture position, shape, and orientation probabilistically rather than forcing rigid bounding boxes or category specific models like SMPL-X for human bodies. They bolt this onto frozen Wan2.1 with a lightweight adapter, so they get a strong video prior. They also built a pipeline to auto extract 4D annotations from real world videos to train this puppy. It doesn't look sexy yet, but IMO this is the interface video world models need - actual 3D authoring tools to exert control rather than crude scribbles and prompt incantations.
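To make the "3D gaussian trajectory" idea concrete, here is a rough sketch of how one Gaussian per frame can carry position (the mean), shape (per-axis scales), and orientation (a rotation) at once. It uses the common parameterization covariance = R · diag(s²) · Rᵀ; whether VerseCrafter stores its Gaussians this way is an assumption, and the function names and sample object are hypothetical.

```python
import math


def yaw_rotation(theta):
    """3x3 rotation matrix for a rotation of theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def gaussian_covariance(rotation, scales):
    """Covariance R * diag(scales^2) * R^T of an oriented 3D Gaussian."""
    cov = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            cov[i][j] = sum(
                rotation[i][k] * scales[k] ** 2 * rotation[j][k] for k in range(3)
            )
    return cov


def gaussian_trajectory(positions, yaws, scales):
    """One (mean, covariance) pair per frame for a single moving object."""
    return [
        (pos, gaussian_covariance(yaw_rotation(yaw), scales))
        for pos, yaw in zip(positions, yaws)
    ]


# Hypothetical object: elongated along its own x axis, moving along x,
# then turning 90 degrees in the last frame.
positions = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5), (2.0, 0.0, 0.5)]
yaws = [0.0, 0.0, math.pi / 2]
scales = (2.0, 1.0, 0.5)  # standard deviations along the object's own axes
trajectory = gaussian_trajectory(positions, yaws, scales)
```

After the turn, the large variance moves from the world x axis to the world y axis, which is how a single covariance encodes orientation without a rigid bounding box or a category-specific body model.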

Bilawal Sidhu

25,802 views • 6 months ago

✨ Every week a new AI model comes out and it suddenly makes my half-broken features work a lot better. Yesterday Seedream-4-Edit came out and it made my [ Hold product ] feature on Photo AI a lot better. You can now go from: 🎁 Product photo -> 👱‍♀️ Talking video with your AI model while holding your product. In just a few minutes! Here's a photo I took of the weekly farm box we get in our kitchen. I set it as the product and then with Photo AI made it into a talking video where my trained AI model presents it. It's not perfect, as the objects inside the farm box still move around a bit, but pretty close. If the product is more uniform (like lip gloss, a product box or a book) it does a pretty good job at keeping it exactly the same. This "consistency", as they call it, is quite important for actual real-world use. Product sellers don't want to have an image or video of an AI model if the product doesn't look exactly the same as what they sell. With that, I'm getting pretty close now, and every week with every new model that comes out, a bit closer. And it's interesting cause now I'm finally moving from B2C a bit more to B2B, where businesses can use Photo AI more. Designers and stores already use it for trying on clothes etc., but now they can generate content for real products! 😊 LIVE now on Photo AI

@levelsio

361,558 views • 10 months ago

A Letter to Our Community: The Road Ahead for Robotics To our Community and Partners, As we step into 2026, our mission at Axis is clearer than ever: Constructing the definitive End-to-End Scaling Layer for Robotics. Our goal is to accelerate the transfer of diverse human intelligence into Robotics General Intelligence (RGI). By owning the critical path of intelligence creation, we are turning the physical limitations of robotics into a scalable, software-driven future. Here is our strategic outlook and roadmap for the year ahead. The Core Thesis: Simulation is the Only Way Out The path to RGI is currently blocked by Data Scarcity, Generalization Fragility, and Hardware Fragmentation. At Axis, we believe Simulation is the only way out. Our Simulation Data Platform and Data Augmentation Engine transform raw data into "Synthetic Gold". Backed by academic milestones like Roboverse, Skill Blending, and GraspVLA, we have proven that pure simulation can achieve the generalization required for the real world. We don’t just collect data; we architect it. The Engine: Why Crypto? We believe RGI should come from all, not a few. Crypto is not just a feature; it is the primitive that powers our entire ecosystem flywheel: - Incentive Mechanism: Democratizing contribution and rewarding the trainers and developers. - Assetization: Turning proprietary data and refined models into liquid, ownable assets. - Verifiable Workflow: We are opening the "Black Box" of AI. By bringing total transparency to the Task Generation → Data Collection → Model Training pipeline, we ensure every byte of intelligence is verifiable, traceable, and secure. 2026 Strategic Deliverables This year, we are committed to delivering three foundational pillars: - The World's Largest Training Dataset for Robots: A robot training set—diverse, high-quality interaction data at an unprecedented scale. 
- A Robotics Foundation Model: A universal robotic brain trained on our pure simulation and synthetic data, capable of robust cross-embodiment transfer and open-world adaptability. - Evolvable Robot Hardware: Robots deployed with Axis models that autonomously evolve through continuous interaction, turning every deployment into a self-improving node within our RGI network. The Ultimate Vision We are building more than models; we are architecting the Distributed Machine Economy. A future where every dataset, model, and robotic embodiment is a verifiable asset in a global, autonomous network. Thank you for building the future of intelligence with us✌️📷

Axis Robotics

27,858 views • 6 months ago

Google dropped a new AI paper called LUMIERE. It's remarkably flexible, supporting video inpainting, image-to-video, AND stylized video generation tasks. Say hello to “space-time diffusion” for video generation! Now what the heck does that mean exactly?! 🌐⏳ → TL;DR it utilizes a “Space-Time UNet” architecture that generates the full duration of the video in one pass, rather than generating distant keyframes and interpolating between them like prior works. Because the computation is done in this “compressed space-time representation” to generate the full clip at once, it's far more temporally consistent. → Another benefit of generating the full video at once is that you can “direct” the video generation, making it easier to hand off to other models/tasks without having to stitch together partial solutions. You can condition generations on additional inputs, meaning you get the full stack of AI video capabilities – from video inpainting to image-to-video and beyond. → New SOTA for AI video generation? User study results in the paper suggest human evaluators preferred Lumiere over Runway Gen-2, Pika Labs, and Stable Video Diffusion in terms of quality, text alignment AND motion. But as always, we need to get hands-on with this tech when Google *actually* decides to ship it. → Could this end up inside YouTube? Y’all know i’m obsessed with blending reality and imagination – so it’s the video inpainting tech I'm most excited about. I really hope this model finds its way into YouTube's Generative AI efforts, and based on their prior announcements and the list of acknowledgments in the paper I think it might! 🤞🏽 Links: 🔗Paper: 🔗Project:

Bilawal Sidhu

44,822 views • 2 years ago

Massive performance improvement. This is a bit more of a technical post, but man do I love this stuff! Units navigate the map using a 'Navigation Mesh'. Before, I was using one giant navmesh that spanned the entire map. The more objects that were placed (especially on a large map such as this 'RadarAttack' map designed by Syphotic | Steel Command), the larger the 'lag' would be after placement. You can see here that there is a massive frame drop and the navmesh doesn't update for almost 5 seconds. Now, there are a ton of tiny navmeshes that connect to one another, and together they cover the whole map. When an object is placed, the navmesh updates instantly, because it no longer needs to parse through every object on the map (potentially thousands!!!). It only needs to parse through the objects that exist in the mini navmesh that the object was placed in (probably only 1-5 objects now!). Performance XP Boost +100! Charles Horwood, you might appreciate this one :) #Rts #RTSGame #IndieGame
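The tiling idea can be sketched in a few lines. This is a toy stand-in, not the game's code or any engine's navmesh API: the `TiledNavGrid` class, the tile size, and the object positions are all hypothetical, and the "rebuild" only records how many obstacles it had to look at instead of baking real walkable polygons.

```python
from collections import defaultdict


class TiledNavGrid:
    """Toy stand-in for a tiled navmesh: obstacles are bucketed by square tile."""

    def __init__(self, tile_size):
        self.tile_size = tile_size
        self.obstacles_by_tile = defaultdict(list)
        self.rebuild_log = []  # (tile, number of obstacles parsed) per rebuild

    def tile_of(self, x, y):
        return (int(x // self.tile_size), int(y // self.tile_size))

    def place(self, x, y):
        """Add an obstacle and rebuild only the tile that contains it."""
        tile = self.tile_of(x, y)
        self.obstacles_by_tile[tile].append((x, y))
        self._rebuild(tile)
        return tile

    def _rebuild(self, tile):
        # A real engine would re-bake walkable polygons here; this sketch only
        # records how many obstacles the rebuild had to parse.
        self.rebuild_log.append((tile, len(self.obstacles_by_tile[tile])))


grid = TiledNavGrid(tile_size=10.0)
for i in range(1000):
    # Hypothetical map: 1000 objects spread over a 100 x 100 tile area.
    grid.place((i * 37) % 1000, (i * 91) % 1000)

tile = grid.place(5.0, 5.0)
parsed = grid.rebuild_log[-1][1]
total = sum(len(v) for v in grid.obstacles_by_tile.values())
print(tile, parsed, total)
```

The cost of a placement now scales with the contents of one tile rather than the whole map. One detail this sketch leaves out: an object that overlaps a tile border would also need its neighbouring tiles rebuilt.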

Smitty | Steel Command

60,072 views • 6 months ago

Wonderland: Navigating 3D Scenes from a Single Image Contributions: • First, we introduce a representation for controllable 3D generation by leveraging the generative priors from camera-guided video diffusion models. Unlike image models, video diffusion models are trained on extensive video datasets. This enables them to capture comprehensive spatial relationships within scenes across multiple views and embed a form of "3D awareness" in their latent space, which allows us to maintain 3D consistency in novel view synthesis. • Second, to achieve controllable novel view generation, we empower video models with precise control over specified camera motions. We introduce a novel dual-branch conditioning mechanism that effectively incorporates desired diverse camera trajectories into the video diffusion model. This enables expansion of a single image into a multi-view consistent capture of a 3D scene with precise pose control. • Third, to achieve efficient 3D reconstruction, we directly transform video latents into 3DGS. We propose a novel latent-based large reconstruction model (LaLRM) that lifts video latents to 3D in a feed-forward manner. With this design, during inference, our model directly predicts 3DGS from a single input image, effectively aligning the generation and reconstruction tasks—and bridging image space and 3D space—through the video latent space. Compared with reconstructing scenes from images, the video latent space offers a 256× spatial-temporal reduction while retaining essential and consistent 3D structural details. Such a high degree of compression is crucial, as it allows the LaLRM to handle a wider range of 3D scenes within the reconstruction framework, with the same memory constraints.
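As a back-of-the-envelope check on the 256× figure, the sketch below counts grid positions before and after compression. The per-axis factors (4× in time, 8× along each spatial axis) are an assumption that happens to multiply to 256; the text above only states the overall reduction, and the clip size is hypothetical.

```python
def latent_grid(frames, height, width, t_factor=4, s_factor=8):
    """Latent grid size under assumed per-axis compression factors."""
    return (frames // t_factor, height // s_factor, width // s_factor)


def reduction(frames, height, width, t_factor=4, s_factor=8):
    """Ratio of pixel-space positions to latent-space positions."""
    t, h, w = latent_grid(frames, height, width, t_factor, s_factor)
    return (frames * height * width) / (t * h * w)


# Hypothetical clip: 48 frames at 480 x 720 pixels.
print(latent_grid(48, 480, 720))  # (12, 60, 90)
print(reduction(48, 480, 720))    # 256.0
```

The count ignores channel depth, so it measures how many spatial-temporal positions the reconstruction model has to attend over, which is the quantity that drives its memory use.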

MrNeRF

52,849 views • 1 year ago