FastMap: Revisiting Dense and Scalable Structure from Motion

"FASTMAP, a redesigned SfM framework, achieves fast, high-accuracy dense structure from motion. On large scenes with thousands of images, FASTMAP is up to one to two orders of magnitude faster than GLOMAP and COLMAP. ... Importantly, FASTMAP achieves efficiency improvements while... keeping comparable performance. Extensive experiments on eight datasets demonstrate pose estimation accuracy and novel view synthesis quality close to GLOMAP and COLMAP."

Contributions:
1. For all the iterative nonlinear optimization problems involved, we design algorithms such that the computational complexity of each iteration is only linear in the number of image pairs, not keypoint pairs or 3D points. This includes replacing the traditional bundle adjustment [50] present in previous SfM frameworks with a novel re-weighting epipolar adjustment algorithm, which is much more efficient.
2. Throughout the entire framework, we formulate as many steps as possible as GPU-friendly dense tensor operations. This allows us to implement the entire method in PyTorch [39], which provides seamless GPU acceleration.
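
As background for the "epipolar adjustment" mentioned above, here is a minimal sketch of the textbook epipolar constraint between two views. It is not FastMap's re-weighting algorithm, which the post does not describe. The function names and the example pose are assumptions chosen for illustration.

```python
# Textbook epipolar constraint: for a relative pose x2 = R x1 + t,
# the essential matrix E = [t]_x R satisfies x2^T E x1 = 0 for a
# correct match between normalized homogeneous image points.
# Illustrative only; names and values are hypothetical.

def cross_matrix(t):
    """Skew-symmetric matrix [t]_x, so that [t]_x v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def mat_mul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def essential_matrix(R, t):
    """E = [t]_x R for the relative pose (R, t)."""
    return mat_mul(cross_matrix(t), R)

def epipolar_residual(E, x1, x2):
    """Algebraic residual x2^T E x1; zero for a perfect match."""
    Ex1 = [sum(E[i][k] * x1[k] for k in range(3)) for i in range(3)]
    return sum(x2[i] * Ex1[i] for i in range(3))

if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    E = essential_matrix(identity, [1.0, 0.0, 0.0])  # pure sideways motion
    # The 3D point (0.5, 0.2, 2.0) seen in both cameras, then a wrong match.
    print(epipolar_residual(E, [0.25, 0.1, 1.0], [0.75, 0.1, 1.0]))
    print(epipolar_residual(E, [0.25, 0.1, 1.0], [0.75, 0.3, 1.0]))
```

The residual depends only on the relative pose of an image pair and the matched 2D points, so no 3D points need to be estimated to evaluate it.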

MrNeRF

13,566 subscribers

15,233 views • 1 year ago • via X (Twitter)

Art, Science & Technology, Education

Related videos

Meta releases VGGSfM Visual Geometry Grounded Deep Structure From Motion Structure-from-motion (SfM) is a long-standing problem in the computer vision community, which aims to reconstruct the camera poses and 3D structure of a scene from a set of unconstrained 2D images. Classical frameworks solve this problem in an incremental manner by detecting and matching keypoints, registering images, triangulating 3D points, and conducting bundle adjustment. Recent research efforts have predominantly revolved around harnessing the power of deep learning techniques to enhance specific elements (e.g., keypoint matching), but are still based on the original, non-differentiable pipeline. Instead, we propose a new deep SfM pipeline VGGSfM, where each component is fully differentiable and thus can be trained in an end-to-end manner. To this end, we introduce new mechanisms and simplifications. First, we build on recent advances in deep 2D point tracking to extract reliable pixel-accurate tracks, which eliminates the need for chaining pairwise matches. Furthermore, we recover all cameras simultaneously based on the image and track features instead of gradually registering cameras. Finally, we optimise the cameras and triangulate 3D points via a differentiable bundle adjustment layer. We attain state-of-the-art performance on three popular datasets, CO3D, IMC Phototourism, and ETH3D.

AK

96,527 views • 2 years ago

Wonderland: Navigating 3D Scenes from a Single Image Contributions: • First, we introduce a representation for controllable 3D generation by leveraging the generative priors from camera-guided video diffusion models. Unlike image models, video diffusion models are trained on extensive video datasets. This enables them to capture comprehensive spatial relationships within scenes across multiple views and embed a form of "3D awareness" in their latent space, which allows us to maintain 3D consistency in novel view synthesis. • Second, to achieve controllable novel view generation, we empower video models with precise control over specified camera motions. We introduce a novel dual-branch conditioning mechanism that effectively incorporates desired diverse camera trajectories into the video diffusion model. This enables expansion of a single image into a multi-view consistent capture of a 3D scene with precise pose control. • Third, to achieve efficient 3D reconstruction, we directly transform video latents into 3DGS. We propose a novel latent-based large reconstruction model (LaLRM) that lifts video latents to 3D in a feed-forward manner. With this design, during inference, our model directly predicts 3DGS from a single input image, effectively aligning the generation and reconstruction tasks—and bridging image space and 3D space—through the video latent space. Compared with reconstructing scenes from images, the video latent space offers a 256× spatial-temporal reduction while retaining essential and consistent 3D structural details. Such a high degree of compression is crucial, as it allows the LaLRM to handle a wider range of 3D scenes within the reconstruction framework, with the same memory constraints.

MrNeRF

52,849 views • 1 year ago

Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models Contributions: • We introduce Diffuman4D, a novel diffusion model that generates spatio-temporally consistent and high-resolution (1024p) human videos from sparse-view video inputs. • We propose a sliding iterative denoising mechanism that enhances both the spatial and temporal consistency of generated long-term videos while maintaining efficient inference. • We design a human pose conditioning scheme to enhance the appearance quality and motion accuracy of generated human videos. • We plan to release our processed version of the DNA-Rendering dataset, which we believe will benefit future research in this area.

MrNeRF

24,729 views • 1 year ago

Self-Calibrating Gaussian Splatting for Large Field of View Reconstruction Note: Check below for full video. Abstract (cited): "In this paper, we present a self-calibrating framework that jointly optimizes camera parameters, lens distortion, and 3D Gaussian representations, enabling accurate and efficient scene reconstruction. Our technique is particularly effective for high-quality scene reconstruction from large field-of-view (FOV) imagery taken with wide-angle lenses, allowing the scene to be modeled from a smaller number of images. We introduce a novel method for modeling complex lens distortions using a hybrid network that combines invertible residual networks with explicit grids. This design effectively regularizes the optimization process, achieving greater accuracy than conventional camera models. Additionally, we propose a cubemap-based resampling strategy to support large FOV images without sacrificing resolution or introducing distortion artifacts. Our method is compatible with the fast rasterization of Gaussian Splatting, adaptable to a wide variety of camera lens distortions, and demonstrates state-of-the-art performance on both synthetic and real-world datasets."

MrNeRF

17,206 views • 1 year ago

3D Gaussian Splatting for Real-Time Radiance Field Rendering paper page: Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods inevitably trade off speed for quality. For unbounded and complete scenes (rather than isolated objects) and 1080p resolution rendering, no current method can achieve real-time display rates. We introduce three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and importantly allow high-quality real-time (>= 30 fps) novel-view synthesis at 1080p resolution. First, starting from sparse points produced during camera calibration, we represent the scene with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space; Second, we perform interleaved optimization/density control of the 3D Gaussians, notably optimizing anisotropic covariance to achieve an accurate representation of the scene; Third, we develop a fast visibility-aware rendering algorithm that supports anisotropic splatting and both accelerates training and allows realtime rendering. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets.
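
For reference, the anisotropic covariance mentioned in this abstract is parameterized in the 3D Gaussian Splatting paper by a rotation R and a diagonal scaling S, which keeps it positive semi-definite during optimization. The symbols below follow that paper and do not appear in the post itself: μ is the Gaussian's mean, W the viewing transformation, and J the Jacobian of the affine approximation of the projection.

```latex
\Sigma = R\,S\,S^{\top}R^{\top}, \qquad
G(x) = \exp\!\left(-\tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right), \qquad
\Sigma' = J\,W\,\Sigma\,W^{\top}J^{\top}
```

Here Σ' is the covariance in camera coordinates, from which the 2D footprint used for splatting is taken.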

AK

633,532 views • 3 years ago

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion TL;DR: Create 3/4DGS from Video Diffusion Note: Some first inference code released (not all yet). Contributions (cited): • We present DimensionX, a novel framework for generating photorealistic 3D and 4D scenes from only a single image using controllable video diffusion. • We propose ST-Director, which decouples the spatial and temporal priors in video diffusion models by learning (spatial and temporal) dimension-aware modules with our curated datasets. We further enhance the hybrid-dimension control with a training-free composition approach according to the essence of video diffusion denoising process. • To bridge the gap between video diffusion and real-world scenes, we design a trajectory-aware mechanism for 3D generation and an identity-preserving denoising approach for 4D generation, enabling more realistic and controllable scene synthesis. • Extensive experiments manifest that our DimensionX delivers superior performance in video, 3D, and 4D generation compared with baseline methods.

MrNeRF

17,052 views • 1 year ago

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation paper page: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.

AK

375,123 views • 3 years ago

Is Google taking initial steps to enhance Street View? For some reason, Street View seems stuck in technology that feels outdated. I wonder if we'll see such improvements on the product side. Also, note how much better it performs in all aspects compared to Zip-NeRF in their presented material. It offers more details and fewer artifacts. Great work! "LODGE: Level-of-Detail Large-Scale Gaussian Splatting with Efficient Rendering" Contributions: • We propose a novel LOD representation for 3DGS which, unlike previous methods [27, 28, 17], does not recompute the list of used Gaussians at each frame. This allows for acceleration and compaction, enabling the rendering of large-scale scenes even on mobile devices. • We design a strategy to automatically select optimal hyperparameters for splitting LODs, whereas most other methods require manual tuning of hyperparameters for each 3D scene. • To further accelerate rendering, we split the scene into chunks and pre-compute sets of active Gaussians per chunk. • Finally, we introduce a novel opacity interpolation scheme to produce visually pleasing rendering and eliminate artifacts when transitioning between chunks.

MrNeRF

62,564 views • 1 year ago

First fully ML-framework-free 3D Gaussian Splatting implementation in LichtFeld Studio. I’ve completed the migration of the full training pipeline to a custom CUDA-based tensor library. No PyTorch, no LibTorch, no autograd. Every gradient is implemented by hand, either through CUDA kernels or minimal abstractions on top. This makes it the first full training setup for 3D Gaussian Splatting with zero dependencies on existing ML frameworks. It’s not just about independence, it's about control! We now manage every byte of GPU memory, which opens the door to tighter optimization and finer performance tuning. The framework footprint is minimal, without pulling in gigabytes of ML runtime code that was never designed for real-time or graphics-driven applications. A few modules, such as the metrics and 3DGUT interfaces, are still being ported, and some operations are temporarily naïve, so performance is not yet on par with master. But this refactor lays the groundwork for: - A fully self-contained binary - Fine-grained memory optimization - Easier experimentation without the weight of an ML stack We’re getting close.

MrNeRF

50,548 views • 8 months ago

📢Announcing our 3D head avatar benchmark📢 Two tasks with hidden test sets: - Dynamic Novel View Synthesis on Heads - Monocular FLAME-driven Head Avatar Reconstruction Our goal is to make research on 3D head avatars more comparable and ultimately increase the realism of digital humans. The benchmark studies distinct phenomena of 3D head avatar creation, such as extreme facial expressions, slow motion captures of shaking long hair, or complicated light reflection and refraction patterns of glasses. The two benchmark tasks assess two core desiderata of 3D avatars: While the novel view synthesis challenge focuses on best possible rendering quality of complex moving scenes, the avatar animation challenge is concerned with how well a driving signal is translated into an avatar. Evaluations are light-weight and consist of diverse video recordings from the popular NeRSemble dataset with a hidden test set. Participation in the benchmark is therefore straight-forward and requires only 5 reconstructions per task. Leaderboard and benchmark submission: Benchmark data access and toolkit: Great work by Tobias Kirschstein Simon Giebenhain

Matthias Niessner

28,075 views • 1 year ago

Colmap 4.0 was very recently released, so it inspired me to do some work to better understand it and its new capabilities with Rerun. I want to really understand how Colmap, and in particular, pycolmap, works outside of just calling it via the CLI. So my goal is to use the low-level pycolmap API to log every part of the pipeline. The explicit goal is to have an alternative to the SQLite database that I can utilize. Instead of SQLite, I want to try logging everything directly to rerun and use RRD. This means I can have deep inspectability and still save the features/matches/2D view geometry, but be able to view it directly in rerun. I think this is one of the superpowers that rerun provides; data and visualizations are deeply integrated.

As I'm often working with sequential data (videos), I'm going to specifically focus on four things:
1. Monocular Video Simple: Calls high-level APIs such as pycolmap.extract_features, pycolmap.match_sequential, pycolmap.incremental_mapping. These are basically identical to the CLI options and provide a good baseline.
2. Monocular Video Streamed: Take the above high-level APIs and break them down to their iterator version, logging each component in a streamed manner. This way, I can stream the intermediate features to rerun while the extraction/matching/mapping is happening.
3. Rig with unknown calibration: <- WHAT THE VIDEO SHOWS This is probably the most interesting version and the first one I've been working on. It allows one to set a rig between known sensors, such as in VR/AR devices, leading to much better reconstructions with multiple cameras. This is the case where we don't know the calibration a priori, so we have to run a reconstruction twice: once as a normal Colmap reconstruction with no rig constraints, use this to generate the constraints, and then do it again with the newly found rig.
4. Rig with known calibration: This is the RoboCap example, where we have a pre-calibrated set of sensors, so we don't need to run the two reconstructions and also gain better matching between cameras, both spatially and temporally. Again, this leads to a much better reconstruction!

Along with all this, GLOMAP has become a first-class global mapper, making it super easy to use directly within pycolmap! I'm excited to do more with this and compare it to things like pycuvslam, vipe, and other alternatives.
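
The "generate the constraints" step for a rig with unknown calibration can be sketched as follows. Given the poses of a reference sensor and a second sensor recovered for the same frame by an unconstrained reconstruction, their relative transform is one estimate of the rig extrinsics. This is a plain-Python illustration with hypothetical names and values, not the pycolmap API. A real pipeline would presumably combine estimates over many frames.

```python
# Poses are (R, t) pairs mapping world points into a sensor frame:
# x_sensor = R x_world + t. All names here are illustrative assumptions.

def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]

def mat_vec(R, v):
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def invert_pose(R, t):
    """Inverse rigid transform: x_world = R^T x_sensor - R^T t."""
    Rt = transpose(R)
    return Rt, [-c for c in mat_vec(Rt, t)]

def compose(R_a, t_a, R_b, t_b):
    """Transform that applies (R_b, t_b) first, then (R_a, t_a)."""
    return mat_mul(R_a, R_b), [p + q for p, q in zip(mat_vec(R_a, t_b), t_a)]

def sensor_from_ref(ref_from_world, cam_from_world):
    """Relative pose taking reference-sensor coordinates to the other sensor."""
    R_inv, t_inv = invert_pose(*ref_from_world)
    return compose(*cam_from_world, R_inv, t_inv)

if __name__ == "__main__":
    # Reference sensor: rotated 90 degrees about z, translated by (1, 2, 3).
    ref = ([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], [1.0, 2.0, 3.0])
    # Second sensor mounted 0.1 units along x of the reference sensor.
    cam = (ref[0], [1.1, 2.0, 3.0])
    R_rel, t_rel = sensor_from_ref(ref, cam)
    print(R_rel, t_rel)  # rotation close to identity, translation close to (0.1, 0, 0)
```

Because the rig is rigid, this relative pose should be roughly the same in every frame, which is what lets it serve as a constraint in the second reconstruction.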

Pablo Vela

30,070 views • 4 months ago

Human3R: Everyone Everywhere All at Once Note: I recorded the video from the interactive demo on their project page (linked in the comment below). Abstract (excerpt): Human3R jointly recovers global multi-person SMPL-X bodies ("everyone"), dense 3D scenes ("everywhere"), and camera trajectories in a single forward pass ("all-at-once"). Our method builds upon the 4D online reconstruction model CUT3R and uses parameter-efficient visual prompt tuning to preserve CUT3R's rich spatiotemporal priors while enabling direct readout of multiple SMPL-X bodies. Human3R is a unified model that eliminates heavy dependencies and iterative refinement. After being trained on the relatively small-scale synthetic dataset BEDLAM for just one day on one GPU, it achieves superior performance with remarkable efficiency: it reconstructs multiple humans in a one-shot manner, along with 3D scenes, in one stage, at real-time speed (15 FPS) with a low memory footprint (8 GB).

MrNeRF

35,783 views • 9 months ago

On The Road With Al & Ivy Blog entry for June 3, 2016: First book trailer for the On The Road With Al & Ivy novel "On The Road With Al & Ivy—Book 1: Becoming A Face" will be released in about two weeks. As I said earlier, I was working on this book and "The Quitturz," and this one was the first to be completed. I decided a while back to do this as a novel for various reasons, mainly because it's complicated to write a book based on real life with actual people. Anyone who's read this blog for years knows that this novella, which runs about 45,000 words, will be based on my experiences. The novel format allows me to include a lot of material based on other people's observations and comments, which will make the novel more richly detailed. The format allows me to fictionalize people so that every character resembles no one in real life. The main reason is to protect sources and to take advantage of fiction's flexibility, which allows me to use my imagination. I think that with any work based on real-life experiences, it's essential to be fair to the characters. One of the things I specifically avoided was creating one-dimensional saints and sinners, so to speak, so that every character can be seen as a human being with both virtues and flaws. Doing this as a fictional trilogy also allows me to draw on my interest in past literature and do this as a work. I didn't want to do a documentary-style book. Plenty of those are already available for this book's subject matter. The title refers to a phrase that will be occasionally used in the book. It doesn't have one particular meaning. It starts off meaning one thing, but as the story progresses, you'll find that it's a much more complex concept. Another aspect is the story's structure. The original drafts were in the first person, but I found that once I shifted to a novel format, I could weave a more complex tapestry that included other first-person accounts and third-person narratives. 
I admit that my writing will probably get mixed reviews, as some readers will always criticize that kind of structure, but after 8 years and a lot of thought, it's simply the best way to present the story. There will be more blog entries about the book in the coming two weeks and afterwards. - Al Handa #kindleunlimited #booktwitter #homeless #unhoused #siliconvalley #shitzu #books #Blogs

Boogie Underground

30,383 views • 1 year ago

NeuRBF: A Neural Fields Representation with Adaptive Radial Basis Functions paper page: We present a novel type of neural fields that uses general radial bases for signal representation. State-of-the-art neural fields typically rely on grid-based representations for storing local neural features and N-dimensional linear kernels for interpolating features at continuous query points. The spatial positions of their neural features are fixed on grid nodes and cannot well adapt to target signals. Our method instead builds upon general radial bases with flexible kernel position and shape, which have higher spatial adaptivity and can more closely fit target signals. To further improve the channel-wise capacity of radial basis functions, we propose to compose them with multi-frequency sinusoid functions. This technique extends a radial basis to multiple Fourier radial bases of different frequency bands without requiring extra parameters, facilitating the representation of details. Moreover, by marrying adaptive radial bases with grid-based ones, our hybrid combination inherits both adaptivity and interpolation smoothness. We carefully designed weighting schemes to let radial bases adapt to different types of signals effectively. Our experiments on 2D image and 3D signed distance field representation demonstrate the higher accuracy and compactness of our method than prior arts. When applied to neural radiance field reconstruction, our method achieves state-of-the-art rendering quality, with small model size and comparable training speed.
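One plausible reading of "composing radial bases with multi-frequency sinusoids" can be sketched in a few lines: evaluate a single radial basis at the query point, then pass that scalar through sinusoids of several fixed frequencies to obtain several output channels. The Gaussian kernel, the function names, and the frequency values below are assumptions chosen for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math
from typing import List, Sequence


def gaussian_rbf(x: Sequence[float], center: Sequence[float], scale: float) -> float:
    """Radial basis value in (0, 1]; equals 1 exactly at the kernel center.

    The Gaussian shape is an assumption for this sketch.
    """
    dist_sq = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-dist_sq / (2.0 * scale ** 2))


def fourier_radial_bases(
    x: Sequence[float],
    center: Sequence[float],
    scale: float,
    freqs: Sequence[float],
) -> List[float]:
    """Expand one radial basis into len(freqs) channels via fixed sinusoids.

    The frequencies are constants rather than learned values, so the extra
    channels add no trainable parameters beyond the kernel's center and scale.
    """
    phi = gaussian_rbf(x, center, scale)
    return [math.sin(phi * m) for m in freqs]


if __name__ == "__main__":
    channels = fourier_radial_bases(
        x=(0.2, 0.1), center=(0.0, 0.0), scale=0.5, freqs=(1.0, 2.0, 4.0, 8.0)
    )
    print(channels)
```

Higher frequencies make the output vary faster as the query point moves away from the kernel center, which is the intuition behind using them to represent fine detail.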

AK

194,469 views • 2 years ago

🚀 Introducing EgoExo Forge - built on top of Rerun, Gradio, and Hugging Face hub (I’ll be in San Francisco July 21–29 — if you’re into robotics, egocentric AI, large-scale data collection, or just want to chat, DM me!) In my opinion, large-scale, diverse, and high-quality data is still the largest bottleneck for generalized robotics deployment. I believe that some version of imitation learning from human examples will be the most scalable + clean way to train humanoid robots 🤖 (similar to what Tesla did for Full Self Driving). Teleop is too expensive to collect a large enough dataset in a reasonable manner, so passive collection via egocentric (and in certain cases, exocentric) views feels like the right bet. Over the past few months, I've been trying to build out the scaffolding for this, using Rerun as my underlying infrastructure. The data being collected is time-series data and needs to be easily inspectable, and Rerun provides the right tooling for this. My goal is to first build out a representative ground-truth dataset from already existing open-source data, generate some reasonable baselines, and then go out and collect my own data that adheres to the defined schema. 🔍 Starting with open-source datasets: 1. EgoDex from Apple 2. HOCap from Nvidia and the University of Texas at Dallas 3. Assembly101 from Meta. All these datasets have different sensor configurations + annotations, so my goal with egoexo-forge is to have one consistent labeling scheme + data layout. I built a data pipeline that aligns all of the different datasets to one general schema, assuming the COCO133 keypoint layout, that allows for exo+ego, ego only, or exo only. Since the scaffolding is already there, it becomes MUCH easier to add other datasets. So the next ones that I'll be including are the HD-EPIC kitchens dataset, HOT3D, and finally my own personal iPhone + insta360 go collection method. 
Once I have a diverse variety of datasets, I'll double down on what I believe to be the key algorithms required to make useful data for imitation learning 📊 1. Camera pose estimation via SLAM/SfM for the ego perspective (and automatic calibration for exo) 2. Human pose estimation for both egocentric + exocentric views 3. Metric 3D reconstruction + object tracking. I'll be setting up reasonable open-source baselines for each of these to validate that these datasets work, and then finally try to use the generated datasets for some imitation learning via the pi0-lerobot repo I've been working on. I plan on making a blog post + providing more info on all of this in the near future, so stay tuned.
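As a rough illustration of what "one general schema" across datasets might look like, the sketch below defines a single per-frame record that can hold egocentric keypoints, exocentric keypoints, or both, and rejects anything that does not follow the 133-keypoint whole-body layout. Every name here (`FrameSample`, `timestamp_ns`, the field layout) is hypothetical and invented for this example; it is not the actual egoexo-forge schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# COCO-WholeBody layout: 17 body + 6 feet + 68 face + 42 hand keypoints.
NUM_KEYPOINTS = 17 + 6 + 68 + 42  # = 133

Keypoint = Tuple[float, float, float]  # (x, y, confidence)


@dataclass
class FrameSample:
    """One time step from any source dataset (hypothetical schema)."""

    dataset: str
    timestamp_ns: int
    ego_keypoints: Optional[List[Keypoint]] = None
    exo_keypoints: Optional[List[Keypoint]] = None

    def __post_init__(self) -> None:
        if self.ego_keypoints is None and self.exo_keypoints is None:
            raise ValueError("a sample needs at least one of ego or exo keypoints")
        for name in ("ego_keypoints", "exo_keypoints"):
            keypoints = getattr(self, name)
            if keypoints is not None and len(keypoints) != NUM_KEYPOINTS:
                raise ValueError(f"{name} must contain {NUM_KEYPOINTS} keypoints")

    @property
    def mode(self) -> str:
        """Which of the three supported configurations this sample uses."""
        if self.ego_keypoints is not None and self.exo_keypoints is not None:
            return "exo+ego"
        return "ego only" if self.ego_keypoints is not None else "exo only"


if __name__ == "__main__":
    blank = [(0.0, 0.0, 0.0)] * NUM_KEYPOINTS
    sample = FrameSample(dataset="example", timestamp_ns=0, ego_keypoints=blank)
    print(sample.mode)
```

Validating at construction time means a new dataset adapter fails loudly as soon as it emits a record in the wrong layout, which is the main benefit of agreeing on one schema up front.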

Pablo Vela

32,085 views • 1 year ago

TURBULENCE: “It is not external events themselves that cause us distress, but the way in which we think about them...It is our attitudes and reactions that give us trouble. We cannot choose our external circumstances, but we can always choose how we respond to them.” Epictetus from “The Enchiridion” written by Arrian Welcome to the: complex, high-tech, global knowledge, interconnected, multi-cultural, multi-dimensional, multi-disciplinary, multi-educational, multi-generational, multi-ideological, multi-ability, multi-ethnic, and anticipated TURBULENT future business environment that impacts performance: · positively (opportunity), &/or · negatively (threat); coupled with potential crisis &/or chaos. As could be seen on the video (15 seconds); perception of the TURBULENCE in the environment (stable and/or turbulent) is determined by 4 Attributes, as follows: Unpredictability → Uncertainty which is measured by two attributes: 1. Visibility of future - ranges from unchanged to surprise-full. 2. Speed of change – ranges from slower to faster than speed of response. Changeability → Discontinuity which is measured by two attributes: 3. Novelty of Events - ranges from familiar to not experienced before. 4. Complexity, which contains (orthogonally): 4.1. Scope - ranging from local to global. 4.2. Decision making and/or judgements that are affected by transaction of the entities. Or the extent of importance of each of the following CHANGE(s), as they affect your decision making in the industry (impact of Decision Making on individuals, &/or groups &/or institutions): Economy, Education, environment (natural), Health, Marriage & the Family, Law & Order, Information & Media, Politics, Religion, and Other. Anticipated Future: visibility of future is low, speed of change is high, novelty of change is high, & complexity is high. Anticipated Implications: acceleration, expansion, overlapping, & suction of future levels of environmental TURBULENCE. 
Advice: learn & act concurrently. To address the above, two programs based on the teachings of H. Igor Ansoff (Father of Strategic Management) were launched on 1-1-2023, which are part of my Strategic Management Theme. Titles and links of the two programs are below: 1- Strategic Management Seminars and Workshop (for profit), For any type of firm (for profit and not for profit). Prerequisite, knowledge and practice in management and leadership. Details: 2- Strategic Democracy (not for profit), Customized for managing and leading the people’s business in the USA. Prerequisite, High School or above. Details: Fundamental Question (both programs): How to lead and manage ourselves, each other and the entity, while interacting with turbulence for effective, efficient, and equitable performance (potential formidable advantage)? That is done through answering five questions, as follows: 1. Where are we now? 2. Where do we need to go? 3. How do we get there? 4. How to lead and manage the above? 5. Why not?

Tamer Tamer Salameh

20,444 views • 3 years ago

Our first test flight is just the beginning! Behind the scenes, we are focused on up-scaling and improving our technology. We are excited to announce that we have successfully tested the central subsystem of our Helix 2.0 oxygen-rich staged-combustion engine: the powerpack. We have performed two successful hot-fire tests in which we have shown steady-state operation and cavitation limits. The powerpack incorporates the turbopump and pre-burner(s). It is the most complex as well as the most mechanically and thermally stressed subsystem of a staged-combustion engine. This milestone validated key technological challenges, such as the simultaneous ignition of multiple pre-burners and turbopump cavitation performance. The results are in line with the predictions from our design models. The closed-cycle architecture of Helix allows us to push the performance envelope further: Helix 2.0 is designed to deliver double the thrust (200 kN), while mass, production technology and costs remain comparable to Helix 1.0. The result for our customers: more payload for a lower budget! Excited about this news? Check out our career portal for employment opportunities and help us to elevate our Helix staged-combustion engine technology to the next level!

Rocket Factory Augsburg

33,233 views • 2 months ago

Depth Any Video with Scalable Synthetic Data AI physicists and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet). Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

MrNeRF

27,428 views • 1 year ago

🚨BREAKING: The beta test of Blender Cycles on The Render Network is going great!! With $RENDER, a rendering job by Omid Pakbin took less than 10 minutes instead of 28 hours!! This is the largest #AI / #GPU integration ever seen in the crypto space. No one will ever come close to $RENDER and here's why ✍️ With Blender 🔶 Cycles integrated on the Render Network, millions of artists from the leading open source 3D ecosystem can harness near unlimited high performance decentralized GPU cloud rendering power on Render. The tasks performed by the millions of Blender users require heavy GPU demands. These tasks will result in many $RENDER tokens being burned, as the burning mechanism is tied 1:1 to the GPU usage of the Render Network. There is literally no #AI altcoin that has the partnerships or real utility that $RENDER provides. Forget "The next $RENDER". Once this integration goes fully live, the burning numbers of $RENDER will explode and you will see the biggest FOMO ever seen in crypto.

D0c Crypto ⭕️

15,672 views • 1 year ago

Day 11/90 of Inference Engineering How does vLLM work and how is it used in production? Before we discuss how vLLM works internally, it helps to understand what vLLM is. At a high level, vLLM is an inference engine that is designed to serve LLMs to thousands of concurrent users efficiently while managing scarce compute and memory. The goal for vLLM is to maximize throughput and minimize latency; optimizing for the best inference economics and experience for end users. With every request from the end user, it eventually ends up in the engine core, gets scheduled alongside other requests from other concurrent users, executes on the GPU, and updates the KV cache with the new key and value vectors, and streams the tokens back to the user. The Scheduler decides what requests should execute next while continuously batching requests together to maximize GPU utilization. Continuous batching is an inference optimization that allows new requests to join a running batch as other requests finish generating tokens. This helps with keeping the GPU utilization high instead of letting it sit idle waiting for an entire batch to complete generating. After the scheduler dispatches the selected batch to the Model Executor, the Model Executor prepares the tensors and metadata required for inference, retrieves each request’s block table from KV Cache Manager, launches the optimized transformer forward pass on the GPU, computes the logits, updates the KV cache with the new key and value vectors, and finally returns the results for sampling and streaming. The KV Cache Manager uses the PagedAttention memory layout to allocate fixed-size cache blocks on demand and maintains a Free Block Queue on the CPU that tracks which blocks in the GPU’s Paged KV Cache are currently free. When a request needs additional KV cache space, the KV Cache manager takes a free block from the queue and assigns it to that request, thus avoiding an expensive search through GPU memory for available cache blocks. 
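The free-block idea described above can be sketched as a small allocator: a queue of free block IDs, plus a per-request block table that grows by one block only when the request's current block is full. This is a simplified, CPU-only illustration under stated assumptions; the class and method names (`KVBlockManager`, `append_token`, `free`) are invented for this example and are not vLLM's actual classes or API.

```python
from collections import deque
from typing import Deque, Dict, List


class KVBlockManager:
    """Toy paged KV cache bookkeeping (hypothetical, not vLLM's code)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        # Queue of physical block IDs that are currently unused.
        self.free_blocks: Deque[int] = deque(range(num_blocks))
        # request_id -> ordered list of physical block IDs (the block table).
        self.block_tables: Dict[str, List[int]] = {}
        # request_id -> number of tokens already stored.
        self.num_tokens: Dict[str, int] = {}

    def append_token(self, request_id: str) -> int:
        """Reserve room for one more token; return the block that holds it."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count == len(table) * self.block_size:
            # Every allocated block is full: pop a free one in O(1)
            # instead of scanning memory for available space.
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.popleft())
        self.num_tokens[request_id] = count + 1
        return table[count // self.block_size]

    def free(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the free queue."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


if __name__ == "__main__":
    manager = KVBlockManager(num_blocks=4, block_size=2)
    for _ in range(3):
        manager.append_token("req-a")
    print(manager.block_tables["req-a"], list(manager.free_blocks))
```

Because blocks are fixed-size and handed out on demand, a request never reserves more than one partially filled block, which is where the memory savings of a paged layout come from.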
All of these components form the core of vLLM’s inference engine. The Scheduler determines what requests are executed, the Model Executor determines how those requests are executed, and the KV Cache Manager determines where each request’s KV cache lives using the PagedAttention memory layout. This architecture enables vLLM to serve thousands of concurrent requests with high throughput, low latency, and efficient GPU memory utilization. Here's a little animation that visualizes everything! - I've also completed the forward pass for my mnist.c project. I had a nice chat with shrey birmiwal, such a knowledgeable guy. Excited to learn more about vLLM and implement a tiny-vLLM one day.
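Continuous batching, as described in this post, can be simulated with a short loop: before every decoding step the scheduler tops up the running batch from the waiting queue, and requests leave the batch the moment they finish. The sketch below is a toy model under stated assumptions; `Request`, `tokens_remaining`, and `run_continuous_batching` are names invented for this example, each step generates exactly one token per running request, and none of it is vLLM's actual scheduler.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Request:
    request_id: str
    tokens_remaining: int  # tokens still to generate (hypothetical field)


def run_continuous_batching(
    requests: List[Request], max_batch_size: int
) -> List[List[str]]:
    """Return, for each decoding step, the IDs that shared the batch.

    Note: this mutates tokens_remaining on the given requests.
    """
    waiting: Deque[Request] = deque(requests)
    running: List[Request] = []
    trace: List[List[str]] = []
    while waiting or running:
        # Fill any free batch slots before the next step.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        trace.append([r.request_id for r in running])
        # One forward pass: every running request emits one token.
        for r in running:
            r.tokens_remaining -= 1
        # Finished requests leave immediately, freeing their slots.
        running = [r for r in running if r.tokens_remaining > 0]
    return trace


if __name__ == "__main__":
    steps = run_continuous_batching(
        [Request("A", 1), Request("B", 3), Request("C", 2)], max_batch_size=2
    )
    for number, batch in enumerate(steps, start=1):
        print(number, batch)
```

In this example request C joins as soon as A finishes, so all three requests complete in 3 steps. A static scheduler that waits for the whole first batch (A and B) before starting C would need 5 steps for the same work.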

max fu

69,773 views • 10 days ago