Do 3D reconstruction transformers really need a billion parameters, or are most of those layers just doing the same thing over and over? Introducing Déjà View: a single transformer block, looped K times, that matches or beats models 8–10× its size with lower compute. 🧵
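The idea of looping one block K times can be illustrated with a small sketch. This is not the Déjà View implementation: a toy residual MLP block stands in for a full transformer block, and every name here (`make_block`, `apply_block`, `looped_forward`, `param_count`) is hypothetical. It only shows that a weight-tied loop keeps the parameter count fixed while the effective depth grows with K.

```python
import random

def make_block(dim, hidden, seed=0):
    """Weights of one toy residual MLP block (a stand-in, not a real transformer block)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(dim)]
    return {"w1": w1, "w2": w2}

def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def apply_block(block, x):
    """One residual update: x + W2 * relu(W1 * x)."""
    h = [max(0.0, v) for v in matvec(block["w1"], x)]
    out = matvec(block["w2"], h)
    return [xi + oi for xi, oi in zip(x, out)]

def looped_forward(block, x, k):
    """Apply the SAME block k times (weights shared across depth)."""
    for _ in range(k):
        x = apply_block(block, x)
    return x

def stacked_forward(blocks, x):
    """Conventional stack: a different block at every layer."""
    for block in blocks:
        x = apply_block(block, x)
    return x

def param_count(block):
    """Number of scalar weights in one block."""
    return sum(len(row) for w in block.values() for row in w)

dim, hidden, k = 8, 16, 6
shared = make_block(dim, hidden)
stack = [make_block(dim, hidden, seed=i) for i in range(k)]

x = [1.0] * dim
y_loop = looped_forward(shared, x, k)

looped_params = param_count(shared)                  # independent of k
stacked_params = sum(param_count(b) for b in stack)  # grows linearly with k
```

In this toy setup the looped model stores one block's weights regardless of K, while the stacked model stores K times as many. Compute per forward pass is the same for both at equal depth, so any compute saving claimed in the post would have to come from other design choices not shown here.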

Tobias Fischer

1,033 subscribers

92,192 views • 1 month ago • via X (Twitter)

Related videos

What are we doing people, is it just me, or does the foolery need to stop. What’s the difference than going on a field with an actual base and work on the same thing? Are we creating robots or athletes?

Austin Ford Sr.

448,948 views • 11 months ago

How can we use test-time compute for spatial understanding? 🤔 In InterPose, we propose to repeatedly sample generative video models to help two-view pose estimation and reconstruction, by leveraging the video models' keyframe interpolation abilities. A 🧵... (1/8)

Ricardo Martin-Brualla

21,685 views • 1 year ago

Wonderland: Navigating 3D Scenes from a Single Image Contributions: • First, we introduce a representation for controllable 3D generation by leveraging the generative priors from camera-guided video diffusion models. Unlike image models, video diffusion models are trained on extensive video datasets. This enables them to capture comprehensive spatial relationships within scenes across multiple views and embed a form of "3D awareness" in their latent space, which allows us to maintain 3D consistency in novel view synthesis. • Second, to achieve controllable novel view generation, we empower video models with precise control over specified camera motions. We introduce a novel dual-branch conditioning mechanism that effectively incorporates desired diverse camera trajectories into the video diffusion model. This enables expansion of a single image into a multi-view consistent capture of a 3D scene with precise pose control. • Third, to achieve efficient 3D reconstruction, we directly transform video latents into 3DGS. We propose a novel latent-based large reconstruction model (LaLRM) that lifts video latents to 3D in a feed-forward manner. With this design, during inference, our model directly predicts 3DGS from a single input image, effectively aligning the generation and reconstruction tasks—and bridging image space and 3D space—through the video latent space. Compared with reconstructing scenes from images, the video latent space offers a 256× spatial-temporal reduction while retaining essential and consistent 3D structural details. Such a high degree of compression is crucial, as it allows the LaLRM to handle a wider range of 3D scenes within the reconstruction framework, with the same memory constraints.

MrNeRF

52,801 views • 1 year ago

Create a 3D model from a single image, set of images or a text prompt in < 1 minute 😮‍💨 This new AI paper called CAT3D shows us that it’ll keep getting easier to produce 3D models from 2D images — whether it’s a sparser real world 3D scan (a few photos instead of hundreds) or your favorite 2D image generator like Midjourney (just an image). How does this magic work? “This architecture is similar to video diffusion models, but with camera pose embeddings for each image instead of time embeddings. The generated views are passed into a robust 3D reconstruction pipeline to create the 3D representation (Zip-NeRF or 3DGS)”

Bilawal Sidhu

92,792 views • 2 years ago

🚀 Introducing GenLit – Reformulating Single-Image Relighting as Video Generation! We leverage video diffusion models to perform realistic near-field relighting from just a single image—No explicit 3D reconstruction or ray tracing required! No intermediate graphics buffers, directly in the pixel space! 📄 Dive into the paper: 🎥 Project page & demos: 🛠 Code coming soon! #GenerativeAI #ComputerVision #Relighting #DiffusionModels #Graphics 🧵 1/5

Haven Feng

22,442 views • 1 year ago

Platysternon megacephalum (or big headed turtle) is a very odd-shaped turtle with a huge head and a long tail that are almost the same size as its body. [📹 47ruacanhphongthuybmt]

Massimo

16,142,025 views • 2 years ago

Show-o One Single Transformer to Unify Multimodal Understanding and Generation discuss: We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model.

AK

124,048 views • 1 year ago

The US "diet industry" makes over $70 billion a year... Yet Americans are fatter than ever. Why? They sell you useless fad diets and supplements that don't work. Here are the only 8 diet rules you need for fat loss (bookmark this): 🧵 1. Eat 150g+ of protein a day

Chace Chambers

156,764 views • 6 months ago

all the parameters (weights and biases) of the same multilayer perceptron over 200 epochs red/blue/yellow-three layers The top right knob in each layer being the weight

Kat ⊷ the Poet Engineer

15,851 views • 1 year ago

Conducting road safety audits, scheduling maintenance, or planning logistics routes used to require combing through imagery or, worse, driving miles of roads yourself. Not anymore. New AI-powered data layers in Google Earth are changing that. Pulling from billions of Google Street View images, these layers now spot infrastructure assets like stop signs, speed limit signs, and more to map infrastructure for you. By combining these assets into a single project and diving into Street View you can locate and validate assets on Google Earth in seconds. These layers are now available for Professional & Professional Advanced customers on web and Android, with coverage expanding over the coming weeks.

Google Earth

19,433 views • 20 days ago

The US "diet industry" makes over $70 billion a year... Yet Americans are fatter than ever. Why? They sell you useless fad diets and supplements that don't work. Here are the only 8 diet rules you need for fat loss (bookmark this): 🧵 1. Eat 150g+ of protein a day

Chace Chambers

642,492 views • 10 months ago

The rise and fall of popular transformer architectures 📈 One pro of having a lot of transformer architectures implemented in transformers & hosted on the Hugging Face Hub is that one can see the evolution of popularity of different models 👀

Lysandre

31,730 views • 2 years ago

STEYER: "If you keep doing the same thing over and over and expecting a different outcome, that's insanity!" Steve Hilton: "Then don't vote Democrat!"

Daily Caller

101,573 views • 1 month ago

🚨🚨 The natural beauty of Pashtuns, mashaAllah. No makeup, no surgery — just a brother and his sisters from a modest family. They do not need jewelry or cosmetic procedures; those whom God has blessed with beauty are never in need of anything artificial.

GPX

2,389,518 views • 6 months ago

View of destructive storms hitting downtown Indianapolis from my third floor. Not sure if those blue flashes are transformers blowing or not #inwx

Cody Moore

105,737 views • 1 year ago

SHES SO SICK OF YALL ASKING HER TO DO THE SAME THING OVER AND OVER AGAIN LMFAOOO

🐿️💨

279,985 views • 2 months ago

Self-Calibrating Gaussian Splatting for Large Field of View Reconstruction Note: Check below for full video. Abstract (cited): "In this paper, we present a self-calibrating framework that jointly optimizes camera parameters, lens distortion, and 3D Gaussian representations, enabling accurate and efficient scene reconstruction. Our technique is particularly effective for high-quality scene reconstruction from large field-of-view (FOV) imagery taken with wide-angle lenses, allowing the scene to be modeled from a smaller number of images. We introduce a novel method for modeling complex lens distortions using a hybrid network that combines invertible residual networks with explicit grids. This design effectively regularizes the optimization process, achieving greater accuracy than conventional camera models. Additionally, we propose a cubemap-based resampling strategy to support large FOV images without sacrificing resolution or introducing distortion artifacts. Our method is compatible with the fast rasterization of Gaussian Splatting, adaptable to a wide variety of camera lens distortions, and demonstrates state-of-the-art performance on both synthetic and real-world datasets."

MrNeRF

17,206 views • 1 year ago

Playing around with a "tunnel gun" that can open a tunnel in front of you that can be looked into or moved through. This type of thing is really simple in my engine - just spawn an SDF capsule subtraction that is locked to the player's view.

Mike Turitzin

281,995 views • 8 months ago

CNN Legal Analyst Elie Honig: "Pam Bondi is, without a question, qualified to be Attorney General. She's been a prosecutor for 20 years in Florida. For 8 of those, she was the Attorney General of the state. That's a very big, very complicated job, and that level of experience is on par with, or better than most United States Attorneys General that we've seen over the past 50 years or so."

Trump War Room

456,072 views • 1 year ago

New update drop 🎶 Blaster's beats are taking over! Download the latest 2.2 update and join the party! #Transformers #Gaming #Blaster

TRANSFORMERS: Tactical Arena

40,533 views • 1 year ago