MVP of Multiview Video → Camera parameters + 3D keypoints. Visualized with Rerun.

The basic pipeline as of right now looks like this:

1. Capture 🔴 – Using 4 iPhones and an Insta360 Go. iPhone videos are captured via Final Cut Pro Multicam for easy sync and the exocentric view; the Insta360 Go is used for the egocentric view.
2. Sync 🕒 – Custom Gradio app using two Rerun viewers and callbacks for easily aligning frame timestamps so the ego and exo views are aligned.
3. Calibrate 🎯 – Use VGGT from Jianyuan and AI at Meta to get intrinsics/extrinsics for sparse cameras.
4. Estimate 3D 🕺 – Use RTMLib whole‑body keypoint estimator on each frame, then triangulate in 3D.

What's missing?

1. No temporal coherence: I’m estimating keypoints one frame at a time and one camera at a time. This leads to a lot of jittering. For now, I plan on adding a One Euro Filter to help with jittering. Long term, I'd want to train a multiview keypoint estimator.
2. Kinematic fitting is still missing; this is my next goal. The output will be joint angles, as explored in my previous posts.
3. Missing dense point cloud: VGGT seems to fail for me here. I’m looking to explore using MP‑SFM as a method for generating dense multiview depth maps + normals (plus it has a friendlier license compared to VGGT).
4. Eventually, creation of 4D Gaussian splatting using something akin to DN‑splatter. My long‑term goal is a data engine that provides poses/depths/splats/keypoints/etc.
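For anyone curious what the One Euro Filter mentioned above does, here is a minimal sketch in plain Python. It is not the code from this pipeline: the class name, the default parameters, and the synthetic jitter in `smooth_demo` are assumptions chosen only for illustration. The filter is a low-pass whose cutoff rises with the estimated speed of the signal, so slow motion is smoothed heavily and fast motion lags less.

```python
import math


class OneEuroFilter:
    """Speed-adaptive low-pass filter for one scalar signal (e.g. a keypoint's x)."""

    def __init__(self, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.min_cutoff = min_cutoff  # Hz, cutoff used when the signal is still
        self.beta = beta              # how strongly speed raises the cutoff
        self.d_cutoff = d_cutoff      # Hz, cutoff for the derivative estimate
        self.t_prev = None
        self.x_prev = None
        self.dx_prev = 0.0

    @staticmethod
    def _alpha(cutoff_hz, dt):
        # Smoothing factor of a first-order low-pass at the given cutoff.
        tau = 1.0 / (2.0 * math.pi * cutoff_hz)
        return 1.0 / (1.0 + tau / dt)

    def __call__(self, t, x):
        if self.t_prev is None:  # first sample passes through unchanged
            self.t_prev, self.x_prev = t, x
            return x
        dt = t - self.t_prev
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        # Smooth the derivative, then use its magnitude to pick the cutoff.
        dx = (x - self.x_prev) / dt
        a_d = self._alpha(self.d_cutoff, dt)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, dt)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.t_prev, self.x_prev, self.dx_prev = t, x_hat, dx_hat
        return x_hat


def smooth_demo(n_frames=60, fps=30.0):
    """Filter a synthetic signal that jitters by +/-0.1 around 1.0 (made-up data)."""
    filt = OneEuroFilter(min_cutoff=1.0, beta=0.0)
    raw = [1.0 + 0.1 * (-1) ** i for i in range(n_frames)]
    smoothed = [filt(i / fps, x) for i, x in enumerate(raw)]
    return raw, smoothed
```

In a keypoint pipeline one would typically keep one filter instance per coordinate per joint; tuning `min_cutoff` and `beta` trades jitter against lag.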

Pablo Vela

2,458 subscribers

42,785 views • 1 year ago • via X (Twitter)

Art · Science & Technology · Education

11 comments

Pablo Vela · 1 year ago

Links (still a work in progress, but wanted to share for folks who want to dig in): 1. Saved RRD visualization – < 2. Multicam ego/exo sync app – < 3. 3D person detection + triangulation – <
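For readers who want the gist of the triangulation step before opening the repo, here is a rough sketch. It is not the code behind the links above. It assumes each 2D keypoint detection has already been turned into a world-space ray (camera center plus viewing direction) using the estimated intrinsics/extrinsics, and it finds the 3D point closest to all rays in the least-squares sense. The function names are made up for this example.

```python
import math


def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def _solve3(a, b):
    """Solve the 3x3 system a @ x = b with Cramer's rule."""
    det = _det3(a)
    if abs(det) < 1e-12:
        raise ValueError("rays are (nearly) parallel; the point is not determined")
    solution = []
    for k in range(3):
        # Replace column k of a with b.
        a_k = [[b[r] if c == k else a[r][c] for c in range(3)] for r in range(3)]
        solution.append(_det3(a_k) / det)
    return solution


def triangulate_rays(origins, directions):
    """Return the 3D point minimizing the summed squared distance to all rays.

    Each ray i is origins[i] + s * directions[i]. The normal equations are
    sum_i (I - d_i d_i^T) x = sum_i (I - d_i d_i^T) o_i, with d_i unit length.
    """
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        norm = math.sqrt(sum(c * c for c in d))
        d = [c / norm for c in d]
        for i in range(3):
            for j in range(3):
                m_ij = (1.0 if i == j else 0.0) - d[i] * d[j]
                a[i][j] += m_ij
                b[i] += m_ij * o[j]
    return _solve3(a, b)
```

With noisy detections the rays no longer meet in a single point, which is why a least-squares formulation is used; weighting each ray by the detector's keypoint confidence would be a natural extension.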

Pablo Vela · 1 year ago

Posted the wrong link for the RRD! Updated here <

Jianyuan Wang · 1 year ago

@rerundotio Yeah I guess VGGT cannot predict high-quality depth maps for dynamic multi-view inputs by now.

Pablo Vela · 1 year ago

@rerundotio Yeah, still, it performs better than I expected. Thankfully it produces extremely useful camera parameters. I'm sure it wouldn't be too hard to extend to 4D using something like as an additional dataset

Daniel · 1 year ago

@rerundotio Very cool

Pablo Vela · 1 year ago

@rerundotio Thank you! Slowly but surely 🫡

Neil Nie · 1 year ago

@rerundotio Impressive pipeline! thanks for sharing!

Pablo Vela · 1 year ago

@rerundotio Thanks for the kind words!

StudioGaltMocap · 1 year ago

@rerundotio Very cool!

zeke · 1 year ago

@rerundotio Wow. This has so much potential. Can’t wait to see the progress.

Related videos

More progress! I now have two Dockerized Gradio | Rerun apps. The first one takes as input a "raw" rrd file that consists of the synchronized egocentric and exocentric MP4 files. This runs the pipeline and produces an "annotated" rrd file. This has the camera parameters, 3D joints, and projected 2D joints (with 6DOF mano soon). The second app takes this "annotated" rrd file and allows for manual labeling. This is a crucial step in addressing any major failures in the pipeline. Right now, it is only the ego view that can be modified. But I'll eventually extend to all. This results in a final "gt" rrd file. From here, the plan is to improve quality and start building a data loop. Excited to start really scaling this. I'm basically going all in on keeping my data stored as Rerun rrd files. As always, I want to emphasize how crucial it is to LOOK AT YOUR data! The rrd format makes it incredibly easy to do so. Getting the data out to use is a bit of a hassle right now, but for me, it's well worth the tradeoff.
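The "projected 2D joints" mentioned above come from pushing each 3D joint through a camera's extrinsics and intrinsics. A minimal pinhole sketch follows. It assumes world-to-camera extrinsics (`R`, `t`), no lens distortion, and a camera looking down its +z axis; the function name and argument layout are hypothetical, not taken from the apps described in the post.

```python
def project_joint(p_world, R, t, fx, fy, cx, cy):
    """Project one 3D joint (world frame) to pixel coordinates.

    R is a 3x3 world-to-camera rotation (nested lists), t a 3-vector.
    Returns None when the joint lies behind the camera.
    """
    # World frame -> camera frame: x_cam = R @ p_world + t
    x_cam = [sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3)]
    if x_cam[2] <= 0:
        return None
    # Perspective divide, then apply focal lengths and principal point.
    u = fx * x_cam[0] / x_cam[2] + cx
    v = fy * x_cam[1] / x_cam[2] + cy
    return (u, v)
```

Comparing these reprojections against the detected 2D keypoints is also a quick way to spot frames where calibration or triangulation went wrong.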

Pablo Vela

19,527 views • 9 months ago

Code and data are now online for CameraHMR, our state-of-the-art parametric 3D human pose and shape (HPS) estimation method that will appear at #3DV2025. There are 4 key contributions that make it so accurate and robust:

1. To get accurate 3D shape and pose as well as good alignment to image features, you need to know the focal length of the camera. To solve this, we train HumanFOV to compute the field of view.
2. We introduce CameraHMR, which integrates HumanFOV into HMR2.0 to exploit the estimated focal length.
3. To get accurate pseudo ground truth (pGT) training data, we compute the focal length for images in the 4DHumans dataset and modify SMPLify to take this into account.
4. But SMPLify only uses sparse 2D keypoints, which do not capture body shape. So we train a dense surface keypoint detector, DenseKP, on BEDLAM and run it on 4DHumans, resulting in improved body shape. The resulting method is CamSMPLify.

We iterate training CameraHMR and running CamSMPLify on the training set initialized with CameraHMR. This results in much improved pGT for 4DHumans and a SOTA single-image HMR method.
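Field of view and focal length are two views of the same pinhole quantity, which is why estimating one gives the other. The sketch below shows the standard relation f = (W / 2) / tan(FOV / 2). It is not HumanFOV's code; the function names are made up, and it assumes a horizontal field of view with the principal point at the image center.

```python
import math


def focal_length_px(fov_deg, image_width_px):
    """Focal length in pixels from a horizontal field of view in degrees."""
    half_fov = math.radians(fov_deg) / 2.0
    return (image_width_px / 2.0) / math.tan(half_fov)


def fov_deg(focal_px, image_width_px):
    """Inverse relation: horizontal field of view in degrees from focal length."""
    return math.degrees(2.0 * math.atan(image_width_px / (2.0 * focal_px)))
```

A wider field of view means a shorter focal length for the same image width, which changes how strongly perspective foreshortens a body in the image.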

Michael Black

21,696 views • 1 year ago

Drivable 3D Gaussian Avatars paper page: present Drivable 3D Gaussian Avatars (D3GA), the first 3D controllable model for human bodies rendered with Gaussian splats. Current photorealistic drivable avatars require either accurate 3D registrations during training, dense input images during testing, or both. The ones based on neural radiance fields also tend to be prohibitively slow for telepresence applications. This work uses the recently presented 3D Gaussian Splatting (3DGS) technique to render realistic humans at real-time framerates, using dense calibrated multi-view videos as input. To deform those primitives, we depart from the commonly used point deformation method of linear blend skinning (LBS) and use a classic volumetric deformation method: cage deformations. Given their smaller size, we drive these deformations with joint angles and keypoints, which are more suitable for communication applications. Our experiments on nine subjects with varied body shapes, clothes, and motions obtain higher-quality results than state-of-the-art methods when using the same training and test data.

AK

327,105 views • 2 years ago

✨ Massive Pipeline Refactor → One Framework for Ego + Exo Datasets, Visualized with Rerun 🚀

After a deep refactoring and cleanup, my entire egocentric/exocentric pipeline is now fully modular. One codebase handles different sensor layouts and generates a unified, multimodal timeseries RRD file that you can open instantly in Rerun.

---

The first three datasets that are already supported:

1. Assembly101 – 4 ego Quest‑style fisheye cams + 8 exo pinhole cams
2. HO‑Cap – 1 ego HoloLens pinhole cam + 8 exo pinhole cams
3. EgoDex – 1 ego Apple Vision Pro pinhole cam

Unified geometry: Each frame now logs _both_ camera intrinsics / extrinsics and COCO Whole-Body 133-kp keypoints in the same stream. Everything is canonicalized at import time, so there’s zero OpenCV vs OpenGL guess-work; Rerun reads it all in the correct coordinate system automatically.

---

Why this matters

- Consistent schema ✚ live visuals – Rerun’s deep link between data & rendering means every experiment comes with a built‑in viewer. No more ad‑hoc OpenCV/matplotlib hacks just to sanity‑check a dataset.
- Multi‑terabyte friendly – The next step is to bulk‑ingest these giants into Rerun and wrap them in a Gradio UI for point‑and‑click exploration, as I've already done for EgoDex!
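To make the "OpenCV vs OpenGL" point concrete: the two conventions differ in the direction of the camera's y and z axes (OpenCV: x right, y down, z forward; OpenGL: x right, y up, z backward). A sketch of one common conversion for a 4x4 camera-to-world pose is below. It is only an illustration of the idea, not the import code from this pipeline, and the function names are assumptions.

```python
# Flips the camera's local y and z axes; applying it twice is the identity.
FLIP_YZ = [
    [1, 0, 0, 0],
    [0, -1, 0, 0],
    [0, 0, -1, 0],
    [0, 0, 0, 1],
]


def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def opencv_to_opengl_c2w(c2w_cv):
    """Convert a camera-to-world pose from OpenCV to OpenGL camera axes.

    Right-multiplying changes only the camera's local axes, so the camera
    position (last column) stays where it is in the world.
    """
    return matmul4(c2w_cv, FLIP_YZ)
```

Doing this once at import time, as the post describes, means downstream code never has to ask which convention a given dataset used.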

Pablo Vela

20,836 views • 1 year ago

Using Nano Banana Pro + Sora, I can now create a short video in just about one minute. I’ve been using Kling for a while to make short videos, and while the results can be great, the workflow has always felt a bit heavy for me. Previously, I had to export the video frame by frame as images, then send those keyframes into Kling’s first/last frame mode to generate the in-betweens. It worked, but it took a lot of time and small steps. Recently I realized I don’t necessarily need to do that anymore: now I can use Nano Banana Pro to generate the images, then send them directly to Sora to create the short video—no need to manually extract a specific frame. Another bonus is that Sora also handles basic sound design for you, which saves me from an extra round of editing and makes the whole process feel much lighter. This idea actually came from a post I saw this morning on Twitter by SD | AI Animation Storyteller, which gave me a new way of thinking about my workflow. I’m still experimenting, but so far it’s been a big improvement in efficiency for my use case. I’ve put the detailed steps I’m using in the comments in case anyone wants to try something similar.

underwood

46,400 views • 7 months ago

We have HOT3D! I've started using Claude to port more datasets into Rerun and exoego-forge. I'd been meaning to bring in the HOT3D dataset from Meta for a while, but with Claude, it's way easier.

My goal is to take any egocentric, exocentric, or both datasets and ingest them into a standardized schema. Getting everything into Rerun means we can easily query and transform data via the in-memory OSS server. This lets us generate SQL-like queries such as: "Find me all frames that only contain left hands in the leftmost camera view." Most people think of Rerun as a viewer, but this is the actual superpower.

So far we have:

1. HOT3D
2. Hocap
3. UmeTrack
4. Assembly101
5. EgoDex

Planning to add more, and with every addition, it gets easier as we build up agent skills and better code examples. Hoping to make it almost fully automatic for adding new datasets. The next few I'm looking at are Harmony4D and Aria Pilot Gen2.

After we have enough samples, I'll work on bringing in all the different algorithms I've worked on to transform the data 🙂
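To give a feel for the kind of query quoted above, here is a toy version using Python's built-in `sqlite3`. This is not Rerun's query API. The `detections` table, its columns, and the camera name `cam_left` are all hypothetical stand-ins for whatever the standardized schema actually stores.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of per-frame hand detections (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE detections (frame INTEGER, camera TEXT, hand TEXT)")
    rows = [
        (0, "cam_left", "left"),
        (1, "cam_left", "left"),
        (1, "cam_left", "right"),   # both hands visible, so frame 1 is excluded
        (2, "cam_left", "right"),   # only a right hand, so frame 2 is excluded
        (3, "cam_left", "left"),
        (3, "cam_right", "right"),  # a different camera does not matter
    ]
    conn.executemany("INSERT INTO detections VALUES (?, ?, ?)", rows)
    return conn


def frames_with_only_left_hands(conn, camera):
    """Frames where the given camera has detections and all of them are left hands."""
    query = """
        SELECT frame
        FROM detections
        WHERE camera = ?
        GROUP BY frame
        HAVING SUM(hand != 'left') = 0
        ORDER BY frame
    """
    return [row[0] for row in conn.execute(query, (camera,))]
```

The `HAVING` clause counts non-left detections per frame and keeps only frames where that count is zero, which is one way to express "only contains left hands".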

Pablo Vela

35,662 views • 3 months ago

Excited to share this demo from Over the Reality! Watch our Unitree Go2 robodog navigating our office while reconstructing the 3D space in real-time using the VGGT-based foundation vision model. This is a prime example of machine perception in action, turning raw RGB camera feeds into rich, detailed 3D maps! The robodog's RGB cam generates a dense, textured 3D reconstruction via VGGT from a few photograms, capturing nuances like object shapes and surfaces with impressive fidelity (main view). Compare that to the standard LiDAR system (top right), it's sparser, more point-cloud focused, lacking the visual richness. Vision models are closing the gap fast! What's powering this? VGGT, a cutting-edge foundation model for 3D perception, trained on datasets orders of magnitude smaller than our massive OVER 3D maps dataset. Imagine the leap when we apply OVER's scale to VGGT-like transformer based architectures, denser reconstructions, better generalization, revolutionary for robotics, machine perception & AR! Stay tuned for more breakthroughs at the intersection of AI, robotics, and DePIN. We're building the future of Physical AI and Spatial Computing at Over the Reality 🌐 What do you think, ready for robodogs in your world? Drop your thoughts! 🤖🌐

Over the Reality 🌐

359,717 views • 10 months ago

“This guy will do anything to raise awareness for Covid and Long Covid”. Here are some of the ways I’ve tried. Want to support my efforts to raise awareness and help others? Now is the time. Now is the time to change course. I kicked off fundraising around 19 months ago to... 1. Raise Awareness for Covid and Long Covid. 2. Provide education assistance to families. 3. Provide advocacy assistance. 4. Advocate for education, mitigation, and air filtration. 5. Contact as many political and workforce leaders as I can to request immediate action. If you'd like to donate to my efforts that's great, if you can't that is understandable, and if you just don't want to I would even appreciate a share. I am not raising money to stockpile cash and if we raise a large amount it will go toward setting up a non-profit and hiring a team. That is a far off, long shot goal however so until then I will continue to try to create a job for myself with the help of those who value my work and want to help others. I am not backing down from the goal of becoming a full time advocate for this issue, and I hope to look back on this time as the start of something bigger, and the turning point of when we got enough people to notice and ask for change with us. I will assist people as best I can, I will make as much noise as I can, and I will contact as many policy influencers as I can to help turn the tide, and if you'd like to support the kind of work I'd appreciate it. No obligation friends, if you’re tagged just remove it and I appreciate it if you’ve already donated or even shared. YerAweSum!

Keith

92,638 views • 3 months ago

I left the UK to ILORIN to surprise my family after being away for about 2yrs. Nobody knew I was coming except for Ustaz Awon Hoodlums as I felt it’s normal to let at least one person know about my journey. Each frame of this post means so much to me and my family and in a long time this is the happiest I’ve been. Alhamdulillah. Frame 1 : My mom Frame 2: My elder bro Frame 3 : My grandma Frame 4 : My sister who had dreamt 2 days earlier that I arrived and told her husband about it, she kept saying “I said it” 😂

👳🏾‍♂️Mufti Of Ilorin Online 👳🏾‍♂️

612,254 views • 1 year ago

Boom! Open source LegoGPT is a building AI, and sure it can be used for Legos but it can also be used for Lego-like building of homes. LegoGPT converts meshes to Lego in one step using 1×1, 1×2, 1×4, 1×6, 1×8, 2×2, 2×4, and 2×6 bricks. Then they evaluate the stability of the design. Finally, they render an image and ask GPT-4o to produce captions to go with the image. This is then sent to robots and they complete the real-time build. This can absolutely scale and as with Legos, you can use smaller pieces for higher resolution. Testing it in detail now with robot assembly. Link:

Brian Roemmele

22,363 views • 1 year ago

I created a 3D Gaussian Splat of my kitchen using 1/3rd the images it used to take me by using NVIDIA AI Developer's 3DGRUT! I can now use 180 degree fisheye images and ray tracing to make detailed splats. The only reason the scene isn't sharper is because my input images weren't super sharp - when I took the images back in October, I was still learning to use the lens. I plan to make a "first reactions/overview video". Tutorial after that. For reference, this took 206 images and the ultrawide on my iPhone took 608 images to capture. #3D #AEC #Computervision

Jonathan Stephens

30,663 views • 1 year ago

This is probably the most complex workflow I’ve ever built, only with open-source tools. It took me 4 days. It takes four inputs: author, title, and style; and generates a full visual animated story in one click in ComfyUI. I worked on it for four days. There are still some bugs, but here’s the first preview.

Here’s a quick breakdown:

- The four inputs are sent to LLMs with precise instructions to generate: first, prompts for images and image modifications; second, prompts for animations; third, prompts for generating music.
- All voices are generated from the text and timed precisely, as they determine the length of each animation segment.
- The first image and video are generated to serve as the title, but also as the guide for all other images created for the video.
- Titles and subtitles are also added automatically in Comfy.
- I also developed a lot of custom nodes for minor frame calculations, mostly to match audio and video.
- The full system is a large loop that, for each line of text, generates an image and then a video from that image. The loop was the hardest part to build in this workflow, so it can process either a 20-second video or a 2-minute video with the same input.
- There are multiple combinations of LLMs that try to understand the text in the best way to provide the best prompts for images and video.
- The final video is assembled entirely within ComfyUI.
- The music is generated based on the LLM output and matches the exact timing of the full animation.
- Done!

For reference, this workflow uses a lot of models and only works on an RTX 6000 Pro with plenty of RAM. My goal is not to replace humans. As I’ll try to explain later, this workflow is highly controlled and can be adapted or reworked at any point by real artists! My aim was to create a tool that can animate text in one go, allowing the AI some freedom while keeping a strict flow.

I don’t know yet how I’ll share this workflow with people, I still need to polish it properly, but maybe through Patreon. Anyway, I hope you enjoy my research, and let’s always keep pushing further! :)

Lovis Odin

58,769 views • 10 months ago

we’ll call this one results of the cut cut began back in late November/early December with the goal of revealing the muscle put on in the previous bulk, idea being to bring the best physique to date all roads culminated at the All Elite Wrestling PPV Revolution on March 9th, this video was the last check in I sent to my coach pre match can safely say both the bulk and cut were a success, as I’m leaner than the end of the previous cut by a fair margin and over 10lbs heavier, meaning I added at least 10lbs of lean muscle to my frame so now we get to call this my new “peak physique” myself/my coach have already put the plan in place for what’s next and I am SO EXCITED to get back in the lab and get back to work this pressure is a privilege this is what I live for

AEW TNT Champion “The Protostar” Kyle Fletcher

183,597 views • 1 year ago

This is the first cut of my trailer for my idle Basebuilding RPG with a deckbuilding roguelike adventure, Sector Scavengers! I’m a little over 20 days into development and having a blast! Here’s the set of tools I used for this: 1. Makko AI for gameplay and art 2. ElevenLabs for voiceover 3. Grok for the cut scenes 4. BandM8 for the background music 5. CapCut to bring it all together Starting out as a solo dev at age 40 wasn’t on my dance card but here we are! Thanks to everyone who is following along and supporting!

Sector Scavengers | Wishlist on Steam

48,556 views • 4 months ago

There have been a few cool updates recently. In particular, Rerun 0.33 released headless rendering. This, along with the Fable 5 release, pushed me to work towards making MAMMA realtime! I threw Fable at the problem, and it was able to take the original implementation, which ran at ~12 seconds/frame, and get it all the way down to 40 ms/frame, or nearly a 300x speedup 🏎️ How did I achieve this? TLDR:
- Use Rerun's headless rendering as supervision when optimizing
- Save an RRD file as a test fixture to guide model optimization with /goal
- Create an HTML artifact with headless rendering to provide a detailed breakdown of what it did and how it actually looks in the viewer
There were a few critical bits to make sure that this ACTUALLY worked and that Fable didn't just cheat or delete something and declare victory. The first is that the original version used Rerun. This allowed us to save things to disk as an RRD file, meaning we could query the contents and use this as a sort of test fixture or golden artifact that held EXACTLY what all of the values should be. Then we can use this with /goal as a metric when doing the optimization to ensure there are no regressions. The second bit is the headless rendering. This gave us the ability to check that not only did the test fixture pass, but it also looked visually correct. This made a huge difference, and an awesome side effect of it is that we can use the headless rendering to create an implementations.html file. This gives a visual guide as to what the agent did (I walk through it in the video below). Along with this, we're working on an MCP server for Rerun that allows full interactivity with the Rerun viewer for your agent. So, for example, the agent can click, drag, move views, scroll timelines, etc. I used this to help the agent debug certain parts, such as when the 2D SAM masks didn't line up, or when the triangulated keypoints weren't correctly matching the optimized mesh. The agents could go click into the view, scroll through the timeline, and see where things went wrong. Fable + Headless Rendering + Rerun MCP == 300x speedup in less than a day's work. With these new tools, I'm planning on going back to my Gaussian splatting implementation and cleaning it up + making it fast!
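The golden-artifact idea can be sketched without any Rerun calls. The snippet below assumes the reference values have already been read out of the saved RRD file into a plain dictionary; how that query is done is not shown here, and the key names, tolerances, and function names are hypothetical. It returns a list of mismatches so an optimized implementation cannot pass by silently dropping or changing an output, and it also reproduces the speedup arithmetic from the post.

```python
import math


def compare_to_golden(golden, candidate, rel_tol=1e-6, abs_tol=1e-8):
    """Return human-readable mismatches; an empty list means no regression."""
    problems = []
    for key, expected in golden.items():
        if key not in candidate:
            # Catches the "delete something and declare victory" failure mode.
            problems.append(f"{key}: missing from candidate")
            continue
        actual = candidate[key]
        if len(actual) != len(expected):
            problems.append(f"{key}: length {len(actual)} != {len(expected)}")
            continue
        for index, (e, a) in enumerate(zip(expected, actual)):
            if not math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol):
                problems.append(f"{key}[{index}]: expected {e}, got {a}")
                break  # one report per key keeps the output short
    return problems


def speedup(before_s, after_s):
    """Ratio of per-frame runtimes, both in seconds."""
    return before_s / after_s


if __name__ == "__main__":
    golden = {"keypoints_3d": [0.10, 0.20, 0.30], "mesh_vertices": [1.0, 2.0]}
    fast_version = {"keypoints_3d": [0.10, 0.20, 0.30]}  # mesh output was dropped
    print(compare_to_golden(golden, fast_version))
    print(f"{speedup(12.0, 0.040):.0f}x")  # ~12 s/frame down to 40 ms/frame
```

A numeric check like this only covers values; the visual check described above remains a separate step.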

Pablo Vela

22,880 views • 1 month ago

Colmap 4.0 was very recently released, so it inspired me to do some work to better understand it and its new capabilities with Rerun. I want to really understand how Colmap, and in particular, pycolmap, works outside of just calling it via the CLI. So my goal is to use the low-level pycolmap API to log every part of the pipeline. The explicit goal is to have an alternative to the SQLite database that I can utilize. Instead of SQLite, I want to try logging everything directly to rerun and use RRD. This means I can have deep inspectability and still save the features/matches/2D view geometry, but be able to view it directly in rerun. I think this is one of the superpowers that rerun provides; data and visualizations are deeply integrated. As I'm often working with sequential data (videos), I'm going to specifically focus on four things:
1. Monocular Video Simple: Calls high-level APIs such as pycolmap.extract_features, pycolmap.match_sequential, pycolmap.incremental_mapping. These are basically identical to the CLI options and provide a good baseline.
2. Monocular Video Streamed: Take the above high-level APIs and break them down to their iterator version, logging each component in a streamed manner. This way, I can stream the intermediate features to rerun while the extraction/matching/mapping is happening.
3. Rig with unknown calibration: <- WHAT THE VIDEO SHOWS This is probably the most interesting version and the first one I've been working on. It allows one to set a rig between known sensors, such as in VR/AR devices, leading to much better reconstructions with multiple cameras. This is the case where we don't know the calibration a priori, so we have to run a reconstruction twice: once as a normal Colmap reconstruction with no rig constraints, use this to generate the constraints, and then do it again with the newly found rig.
4. Rig with known calibration: This is the RoboCap example, where we have a pre-calibrated set of sensors, so we don't need to run the two reconstructions and also gain better matching between cameras, both spatially and temporally. Again, this leads to a much better reconstruction!
Along with all this, GLOMAP has become a first-class global mapper, making it super easy to use directly within pycolmap! I'm excited to do more with this and compare it to things like pycuvslam, vipe, and other alternatives.
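For the unknown-calibration rig case, the step of turning a first, unconstrained reconstruction into rig constraints can be illustrated with plain rigid-transform algebra. The sketch below is not pycolmap code and not necessarily how Colmap derives its rig: all function names are hypothetical, poses are (rotation, translation) pairs mapping world points into a camera frame, and only the translation is averaged (a real implementation would also average rotations and reject outliers).

```python
def matmul3(a, b):
    """3x3 matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec3(a, v):
    """3x3 matrix times 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def transpose3(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def invert(pose):
    """Inverse of a rigid transform x -> R x + t."""
    r, t = pose
    r_t = transpose3(r)
    return r_t, [-x for x in matvec3(r_t, t)]


def compose(a, b):
    """Transform that applies b first, then a."""
    r_a, t_a = a
    r_b, t_b = b
    rotated = matvec3(r_a, t_b)
    return matmul3(r_a, r_b), [rotated[i] + t_a[i] for i in range(3)]


def relative_pose(cam2_from_world, cam1_from_world):
    """cam2_from_cam1, which should be constant over time for a rigid rig."""
    return compose(cam2_from_world, invert(cam1_from_world))


def estimate_rig_translation(frames):
    """Average the cam2_from_cam1 translation over all synchronized frames.

    frames: list of (cam1_from_world, cam2_from_world) pairs taken from the
    first reconstruction, which was run without rig constraints.
    """
    translations = [relative_pose(c2, c1)[1] for c1, c2 in frames]
    n = len(translations)
    return [sum(t[i] for t in translations) / n for i in range(3)]
```

If the per-frame relative poses disagree strongly, that is a hint that the first reconstruction or the frame synchronization is off, which is worth checking before running the second, rig-constrained pass.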

Pablo Vela

30,070 views • 4 months ago

I've been on a SLAM/SFM kick. It's one of the more underexplored and lacking areas when it comes to human teleop/data collection, so I've brought over Deep Patch Visual Odometry/SLAM to Rerun and Gradio. With this example, we now have 1. pycuvslam 2. pycolmap/glomap 3. mast3r-slam 4. dpvo/slam all integrated into rerun. The question becomes, which method should be used in which situations? They all make different trade-offs, with different camera requirements and throughput/accuracy. What about when a new method comes out? Now that I have several different methods, I plan to use VSLAM-LAB for evaluation. It uses prefix.dev to isolate all the dependencies of each of these methods and easily compare them against each other. In particular, I'll be converting the data preprocessing, algorithm outputs, and evaluation into rerun recordings (rrd files). This will allow programmatic querying of anything stored in the files (which method had the highest ATE-to-FPS ratio? Which dataset/sequence caused the most difficulty? etc.), along with easy visual inspection using the rerun server to link them all together. Another really important side effect of this is how it impacts agents. As Karpathy said: "LLMs are exceptionally good at looping until they meet specific goals, and this is where most of the "feel the AGI" magic is to be found. Don't tell it what to do, give it success criteria, and watch it go." By having accuracy and throughput metrics deeply tied to human-inspectable artifacts, one can really accelerate agentic development with an actual understanding of how the method/data performs. I think this is another killer use case that I'll be really leaning into to make ingestion of new datasets/methods trivial with an agent. I'm making it my mission for folks to understand that rerun as a visualization tool only scratches the surface of what its true benefit is: deep integration between data and visuals, with powerful query capabilities. I'll be focusing on the SLAM use case first and then bringing this into the full egocentric/exocentric data collection domain!
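To make the ATE-to-FPS question concrete, here is a minimal sketch of the two numbers involved. It assumes the estimated trajectory has already been aligned to the ground truth and has one position per ground-truth pose; real evaluations usually align the trajectories first, which is skipped here. The function names and the example numbers are made up for illustration and do not come from VSLAM-LAB.

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square position error between two pre-aligned trajectories."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and the same length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est_pos, gt_pos))
        for est_pos, gt_pos in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def ate_to_fps(ate_m, num_frames, runtime_s):
    """Trajectory error divided by throughput (frames per second)."""
    return ate_m / (num_frames / runtime_s)


if __name__ == "__main__":
    ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    # Hypothetical results: name -> (estimated positions, frames, runtime in s).
    results = {
        "method_a": ([(3.0, 0.0, 0.0), (1.0, 4.0, 0.0)], 300, 10.0),
        "method_b": ([(0.0, 0.5, 0.0), (1.0, 0.5, 0.0)], 300, 30.0),
    }
    ratios = {
        name: ate_to_fps(ate_rmse(est, ground_truth), frames, runtime)
        for name, (est, frames, runtime) in results.items()
    }
    print(max(ratios, key=ratios.get), ratios)
```

Since a lower ATE is better and a higher FPS is better, the method with the highest ratio is the one that is least accurate for the speed it offers.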

Pablo Vela

40,864 views • 3 months ago

Most people think Rerun is a visualization tool. In reality, it's a database masquerading as a visualizer. I wanted to showcase this functionality by building a full data pipeline consisting of: ingestion → baseline method → eval → finetuning for SLAM on egocentric data. I'll eventually extend this to the rest of my ego/exo datasets, but I wanted to start with a smaller bunch of datasets first. Rerun allows you to expose your saved .rrd files to a catalog where you store datasets. You can query, filter, and join them like any database using DataFusion under the hood. These are the same .rrd files that are automatically generated whenever you visualize anything in Rerun and decide to save it to disk. I brought in 109 VSLAM-LAB sequences across 14 datasets into the Rerun catalog as an example. These include 7Scenes, Euroc, eth3d, and others. Now I can query them with segment_table, filter_segments, and filter_contents instead of parsing CSVs and YAML files. With a strong set of ground-truth datasets for SLAM, baseline additions become nearly automatic with agents like Opus/Codex. This unification of data and visualization is imo the largest missing part for Physical AI. Visualization becomes a natural byproduct of having your data properly structured and queryable. The catalog API is what makes it a database, not just a viewer. I initially focused on VSLAM-LAB data, but I'll migrate all the egoexo data to this format in the coming days to really show just how useful this is.

Pablo Vela

34,937 views • 3 months ago

Congrats to EloShapes.com on their launch! On a similar note, is out of beta and available to the public. A lot of you have been asking me what my plan is now that EloShapes is also doing 3D scans... the answer is: nothing changed! I think that competition is good. I'm a mouse nerd and a LONG TIME EloShapes user, so the more options I get as a user, the better. I really think, though, that my vision and EloShapes' are fundamentally different. I've had the domain for FMM for years, and I have a much larger platform in mind compared to what I've built so far. The 3D scans are a necessary step toward my vision, not the end goal: they never were. I am also very proud of the fact that I've been able to scan at least 5 mice in full color EVERY DAY with my own pipeline, and I think I will easily have almost full coverage of the mice people would want on the website within months. Check the video for the mice I've added in just the last few days, all textured. Apart from some of the features I've mentioned before (like the virtual hand/grip), FMM is meant to be kind of a "MyAnimeList" for mice, so that you can share your profile, like this: and then later on leverage the curated list of mice you build for yourself to be shown even more mice you might like. FMM is also a tool for reviewers and anyone who wants to share their opinions on mice. Links like this: allow you to easily share your specific points about a mouse via annotations, viewpoint sharing, and measurements. And of course, as I've said since the beginning, I will use FMM to centralize all my hard testing of sensors, DPI, and such, so that you'll be able to find EVERYTHING about a mouse, including the raw results of my standardized sensor implementation testing. So, with this said, I will keep scanning as many mice as possible and adding as many features as possible until FMM becomes what I envisioned so long ago. Join my Discord if you want to keep up! I hope you'll all give it a try! Ciao!

bardOZ (Giovanni Laerte Frongia)

25,646 views • 3 months ago

When I saw the mask "Tribes of the Calf" from Kanbas, I knew I had to make it into reality. The jewelry and gold really made it stand out for me. Since Sam Spratt's The Masquerade was revealed, I have been spending time sculpting and dissecting the mask to recreate it in 3D as faithfully as possible. I delved into the creation of this mask for many reasons. I love a good challenge, and this mask surely was one for me. Creating something in 3D from a 2D image is not easy, especially when the source is generative in nature; some details are hard to interpret, but I tried my best to keep the visual integrity of the mask as close to the original as possible. Splitting the whole mask into parts and filling in the missing pieces so I could build the textures was quite a lot of work. I tried to present the mask in my own style, with a slightly different colorway to suit the mask itself. Please enjoy this short animation, and turn on sound🔊 This piece is my statement that I am here to stay. That I have a voice that often feels lost in the void. That I have been creating and posting digital art for over 20 years now and will continue until I'm gone. I have a story to tell and I want to be heard. The space we have here is small, and is shrinking day by day. It doesn't have to be like that. We need to support each other and push ourselves and people here, otherwise we are all doomed. As Kanbas has put it in their observation of the mask: "Inspirational. Emotional. Natural." This is what our space can be, and this is me making a statement with this homage. I will share a 4k still below, as well as a short video showing the 3D GLB interactive model, together with a YT link to the 4k video, since compression here is pretty bad.

shoneec

17,531 views • 1 year ago