How can lightweight drones without depth cameras navigate using monocular images? Check out our paper at ISER 2023! MonoNav: MAV Navigation via Monocular Depth Estimation and Reconstruction arXiv: website: Work led by Nate Simon

Anirudha Majumdar

5,475 subscribers

27,642 views • 2 years ago • via X (Twitter)

Science and Technology

Comments: 9

Anirudha Majumdar · 2 years ago

In this work, we ask the following question: using only a monocular camera, optical odometry, and offboard computation, can we create metrically accurate maps that enable the use of conventional path planning to achieve robust autonomy in unknown environments?

Anirudha Majumdar · 2 years ago

The answer is YES - surprisingly, a monocular system using state-of-the-art depth estimation techniques can perform local 3D reconstruction with sufficient quality to enable fast (0.5 m/s) MAV navigation in unexplored, constrained, indoor environments.

Anirudha Majumdar · 2 years ago

We present MonoNav: a monocular navigation stack that leverages pre-trained transformer-based models for monocular depth estimation (ZoeDepth) in combination with off-the-shelf fusion (Open3D) and conventional planning techniques (motion primitives).
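The perception-to-planning idea in this post can be illustrated with a minimal sketch. It is not the MonoNav code: it skips the ZoeDepth and Open3D steps entirely, treats a single metric depth image as the obstacle set, and uses hypothetical names (`depth_to_points`, `choose_primitive`), a pinhole camera model, and a fixed safety radius, all of which are assumptions made for illustration.

```python
import math

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a metric depth image (rows of metres) into camera-frame
    3D points using an assumed pinhole model (x right, y down, z forward)."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip invalid or missing depth
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

def min_clearance(primitive, points):
    """Smallest distance between any waypoint of the primitive and any point."""
    return min(math.dist(w, p) for w in primitive for p in points)

def choose_primitive(primitives, points, goal, safety_radius):
    """Return the collision-free primitive whose endpoint is closest to the
    goal, or None if every primitive comes within safety_radius of a point."""
    safe = [
        prim for prim in primitives
        if not points or min_clearance(prim, points) > safety_radius
    ]
    if not safe:
        return None
    return min(safe, key=lambda prim: math.dist(prim[-1], goal))
```

Each primitive here is just a list of (x, y, z) waypoints in the camera frame. A real system would fuse many posed depth frames into a map before planning, which is the role the post assigns to Open3D.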

Anirudha Majumdar · 2 years ago

MonoNav is able to reconstruct and navigate in constrained indoor environments. In another example, we see MonoNav navigating a hallway corner at 0.5 m/s.

Anirudha Majumdar · 2 years ago

We compare MonoNav to NoMaD, a state-of-the-art method in monocular navigation. NoMaD uses a transformer encoder and a diffusion policy to directly output action candidates from a series of RGB images (and an optional goal image). Website:

Anirudha Majumdar · 2 years ago

We find NoMaD works well when a clear maneuver is required, e.g., turning to avoid a wall. However, the action candidates are not always diverse and occasionally suggest turning into the wall. In another case, the action candidates are insufficiently evasive to avoid a crash.

Anirudha Majumdar · 2 years ago

In 15 side-by-side experiments in diverse conditions, we find that MonoNav significantly reduces the collision rate (by a factor of 4). This increase in safety comes at the cost of conservatism: a 22% reduction in goal completion.

Anirudha Majumdar · 2 years ago

This performance occurs because MonoNav reasons explicitly about the environment scale, and can self-arrest and land if collision appears imminent. For more information, check out: Video: Paper: Website:
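One way to picture why metric scale matters for self-arrest is a stopping-distance check. The sketch below is an assumption-laden illustration, not the MonoNav logic: the constant-deceleration model, the `margin` parameter, and the function names are all hypothetical.

```python
def stopping_distance(speed, max_decel):
    """Distance (m) needed to stop from `speed` (m/s) at constant `max_decel` (m/s^2)."""
    return speed ** 2 / (2.0 * max_decel)

def next_action(clearances, speed, max_decel, margin):
    """Pick the primitive with the most free distance ahead, or stop and land.

    clearances: free distance in metres along each candidate primitive,
    which is only meaningful if the map is metrically scaled.
    """
    required = stopping_distance(speed, max_decel) + margin
    best = max(range(len(clearances)), key=lambda i: clearances[i], default=None)
    if best is None or clearances[best] < required:
        return ("stop_and_land", None)
    return ("fly", best)
```

With relative (unscaled) depth, the comparison against `required` would have no physical meaning, which is the point the post makes about reasoning explicitly about scale.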

Avik Sarkar · 2 years ago

@Nate___Simon This is pretty amazing work! Thinking very simply, we can still perceive depth with just one eye, so one would think you don't need stereo vision cameras for depth perception...

Related videos

Want to use Depth Anything, but need metric depth rather than relative depth? Thrilled to introduce Prompt Depth Anything, a new paradigm for accurate metric depth estimation with up to 4K resolution. 👉Key Message: Depth foundation models like DA have already internalized rich geometric knowledge of the 3D world but lack a proper way to elicit it. Inspired by the success of prompting in LLMs, we propose prompting Depth Anything with metric cues to produce metric depth. This method proves to be very effective when using a low-cost lidar (e.g., iPhone's LiDAR), which is widely available, as prompts. We believe the prompt can generalize to other forms as long as scale information is provided. Prompt Depth Anything offers 1⃣A series of models for iPhone lidars. 2⃣4D reconstruction from monocular videos (captured with iPhone). 3⃣Improved generalization ability for robot manipulation, e.g. Training on cans but generalizing on glasses. 4⃣More detailed depth annotations for the ScanNet++ dataset. The first author is our excellent intern Haotong Lin. Paper: Huggingface: Project Page: Code:

Bingyi Kang

67,604 views • 1 year ago

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can be transformed into a generative reconstruction model that directly outputs a high-quality 3D Gaussian Splatting representation through self-distillation, without requiring real-world training data. Check out our results in the video (wait for dynamic scenes in the second half!) : Project Page: Code and Models: Paper:

Sherwin Bahmani

66,590 views • 9 months ago

Exploring Andrej Karpathy Nanochat through a Knowledge Graph Love how this project combines simplicity with power. Want to navigate and understand the entire repo? Check it out here: Plus, you can connect it via MCP to any AI coding assistant and work directly on the repository. Powered by Deep Graph MCP AI at Meta and Groq Inc

Daniel San

126,642 views • 9 months ago

Want to create an avatar from a single image? FlexAvatar is a transformer model that creates full 360°, high-quality, and expressive 3D head avatar from just a single portrait image in minutes. Real-time Demo: FlexAvatar's lightweight architecture allows both animation and rendering in real-time, enabling interactive user experiences. To create a new 3D head avatar, only one image is required, e.g., from a webcam. The final avatar is ready after 2 minutes. Architecture: Under the hood, FlexAvatar adopts a transformer-based encoder-decoder design. The encoder maps the input image onto a latent avatar space, while the decoder produces 3D Gaussian attribute maps by incorporating the animation signal via cross-attention. The model learns all facial animations directly from the data without relying on pre-built 3D face models. This equips the avatars with realistic facial expressions. The internal avatar latent space can be conveniently used to integrate additional observations of a person via fitting. This enables use-cases where more than one image of a person is available, e.g., from a phone scan of the person. We train jointly on 2D monocular videos and multi-view data. However, in monocular videos, the animation signal leaks the target viewpoint, causing the model to produce incomplete 3D heads. We call this phenomenon entanglement of driving signal and target viewpoint. To prevent entanglement, we introduce bias sinks. These are learnable tokens that indicate whether a training sample stems from a monocular or a multi-view dataset. During training, the model learns to produce incomplete 3D heads only when the monocular token is present. During inference, FlexAvatar then always uses the multi-view token for which the model has learned to produce complete 3D heads. This simple design allows to combine the generalizability from monocular data with the quality of multi-view data. 
FlexAvatar summary: - Input: Single-image, phone scan, or monocular video - Output: Full 360° head avatar - Expressive animations - Real-time rendering and animation - Generalization to any portrait - Create a new avatar in 2 minutes - Use bias sinks to combine 2D and 3D data 🏠 🌍 🎥 Great work by Tobias Kirschstein and Simon Giebenhain!

Matthias Niessner

95,991 views • 7 months ago

Enabling autonomous vehicles to perceive their environment using only off-the-shelf cameras has been a long-term research objective at Swaayatt Robots. This demo highlights the capabilities of our on-road perception system, which detects obstacles, road boundaries, and lane markers in images and computes depth for complex scenes in the environment. The output shown in this video is the end-to-end raw output of our deep learning system, without any post-processing. The current system, which jointly computes obstacles, lane/road boundaries, and depth, runs at 30 FPS on an embedded GPU in our autonomous vehicle and can achieve a higher frame rate with further optimization, which is research in progress. This system is being scaled up for both day and night operation, and we will showcase its strength in enabling autonomous driving in a mountainous environment with unpaved roads and no delimiters. #deeplearning #autonomousdriving #autonomousvehicles #machinelearning

Sanjeev Sharma

18,478 views • 1 year ago

We are thrilled to share our breakthrough research on "Agile Flight from Pixels without State Estimation," to be presented and live-demonstrated at #RSS2024 next week! You heard well: no state estimation means no explicit visual localization, no SLAM, no VIO, and no IMU! Paper: Video (Narrated): Last year, we demonstrated that #ReinforcementLearning (RL) policies could outperform world-champion drone-racing pilots using the same quadrotor hardware; however, unlike human pilots, these policies continuously estimated an explicit state from known gate positions, the camera feed, and inertial measurements (IMU). In this new work, we tackle the challenge of learning vision-based drone racing using an end-to-end reinforcement learning approach that eliminates the need for IMU data or explicit state estimation. Like professional pilots, we go directly from images to control commands. The training is facilitated by an asymmetric actor-critic with access to privileged information. To overcome the computational complexity during image-based RL training, we use an appropriate sensor representation, which can be efficiently simulated during training without rendering images. We achieve agile flight at speeds up to 40 km/h with accelerations up to 2 g's. Although our demonstration focuses on drone racing, we believe that our method has an impact beyond drone racing and can serve as a foundation for future research into real-world applications in structured environments. Besides the paper presentation, we will also give a live demo next Tuesday and Wednesday between and hrs at TU Delft: Reference: Ismail Geles*, Leonard Bauersfeld*, Angel Romero, Jiaxu Xing, Davide Scaramuzza "Demonstrating Agile Flight from Pixels without State Estimation" Robotics: Science and Systems (RSS), 2024. Kudos to Ismail Geles Leonard Bauersfeld Ángel Romero Jiaxu Xing! University of Zurich UZH Science UZH Space Hub Aerial Core AUTOASSESS European Research Council (ERC)

Davide Scaramuzza

27,917 views • 2 years ago

Wow. Recreating the Shawshank Redemption prison in 3D from a single video, in real time (!) Just read the MASt3R-SLAM paper and it's pretty neat. These folks basically built a real-time dense SLAM system on top of MASt3R, which is a transformer-based neural network that can do 3d reconstruction and localization from uncalibrated image pairs. The cool part is they don't need a fixed camera model -- it just works with arbitrary cameras -- think different focal lengths, sensor sizes, even handling zooming in video (FMV drone video anyone?!). If you've done photogrammetry or played with NeRFs you know that is a HUGE deal. They've solved some tricky problems like efficient point matching and tracking, plus they've figured out how to fuse point clouds and handle loop closures in real-time. Their system runs at about 15 FPS on a 4090 and produces both camera poses and dense geometry. When they know the camera calibration, they get SOTA results across several benchmarks, but even without calibration, they still perform well. What's interesting is the approach -- most recent SLAM work has built on DROID-SLAM's architecture, but these folks went a different direction by leveraging a strong 3D reconstruction prior. Seems to give them more coherent geometry, which makes sense since that's what MASt3R was designed for. For anyone who cares about monocular SLAM and 3D reconstruction, this feels like a significant step toward plug-and-play dense SLAM without calibration headaches -- perfect for drones, robots, AR/VR -- the works!

Bilawal Sidhu

703,816 views • 1 year ago

🚀Announcing NeRSemble 3D Head Avatar Benchmark v2 Version 2 of the NeRSemble 3D Head Avatar Benchmark systematically evaluates several aspects of 3D head avatar creation. Our goal is to drive progress toward more realistic, robust, and generalizable avatar methods. 🔬Benchmark Tasks The NeRSemble Benchmark v2 features three core challenges: - Dynamic Novel View Synthesis - Monocular FLAME-driven Avatar Creation (updated) - Single-view 3D Face Reconstruction (new) 👉Explore the online leaderboard and submission system: 🆕What's new? 1. New Task: Single-view 3D Face Reconstruction Given a single portrait image, reconstruct an accurate 3D mesh either showing the input expression or a fully neutral one. Unlike prior benchmarks, the NeRSemble benchmark emphasizes diverse and challenging facial expressions, better reflecting real scenarios. For technical details, see the Pixel3DMM paper. 2. Updated task: Monocular FLAME-driven Avatar Creation We have improved the FLAME tracking that is used for both avatar creation from the monocular videos and avatar driving on the hidden test sequences. The updated benchmark task has: - more stable torso tracking - more expressive lip closures during speech - Improved mouth tracking for challenging facial expressions We hope that these improvements to the benchmark help drive the field forward. 🏆 CVPR 2026 Workshop & Prizes The NeRSemble benchmark will be featured at the CVPR 2026 Workshop on Photo-realistic 3D Head Avatars. Participants in the new and updated tasks have the opportunity to win: - 🎁RTX 5080 GPUs (sponsored by NVIDIA) - 🎤15-minute oral presentation at the workshop ⏰ Submission Deadline - May 26, 2026 Reach out to the amazing Tobias Kirschstein and Simon Giebenhain for more details :)

Matthias Niessner

29,954 views • 3 months ago

Glad that our work “Inference-Time Enhancement of Generative Robot Policies via Predictive World Modeling”, led by Han Qi, has been accepted to IEEE Robotics and Automation Letters! 🎉 We propose Generative Predictive Control (GPC): sample action proposals from a pretrained diffusion policy (“look back”), roll them out with a diffusion-based action-conditioned video world model (“look forward”), then rank or optimize the actions using either a learned reward model or VLM preferences. Conceptually, this is trajectory optimization / MPC with hybrid sampling + gradient optimization, interpreted through modern diffusion priors and video world models. Interestingly, we first posted the paper on arXiv in Feb 2025, when action-conditioned video world models for planning were still rare—now this direction is rapidly gaining traction. Still many open questions, e.g., • how to avoid local minima in planning • what representations work best for world models • how to balance physics priors vs. data-driven learning Paper:

Heng Yang

18,994 views • 4 months ago

This is how you can get programmatic access to any website. You need two things: 1. The URL of the website 2. A prompt specifying what you want to do on the site Mino is a neat platform that uses browser automation and AI-powered navigation to understand your prompt, open the website, and extract the information you want. This opens the doors to unlimited potential automations you could build using data from everyday websites that don't give you an API. Mino uses AI to go way further than you could get by using a regular web scraping tool: • It can handle dynamic JavaScript content • It can handle login walls • It can navigate through interactive booking flows • It adapts to different interfaces and layout changes • It can fill out forms automatically • It supports stealth browser mode In my experience, running Mino on relatively straightforward sites takes about 30-60 seconds to complete. The more times you run it, the faster it gets. I recorded a quick video to show you the platform. You can check their site here: Thanks to the team for the support, onboarding me on their platform, and the collaboration on this post.

Santiago

31,742 views • 7 months ago

Today, as shared by The New York Times, we’re announcing two things: >Our Series B at a $2.1B valuation led by Sarah Wang at a16z. >Reaching $100M ARR, profitably, with a team of just 50 people. That's $2M ARR per employee. PowerPoint was invented before the first website, before the Game Boy, before the Berlin Wall fell. But Gamma, and our 70 million users, are proof that an AI-native company can disrupt a category everyone assumed was won. 30 million gammas are created every single month, as we fight hard to become a new standard for communication. But we’re not stopping. We’re expanding our plans for businesses. We’re building out a full visual storytelling platform. And today, we’re releasing our API to the general public. So you can plug Gamma into wherever work happens. And to celebrate, we’re also sharing our first ever prompt guide backed by research into how our most successful users use Gamma to automate presentations, websites, and content in minutes.

Grant Lee

4,437,216 views • 8 months ago

We know a lot of you are hyped for Stage 2 of the Megadrop - so here's a quick guide on how to install and connect your Bitget wallet to Matchain! The first step? Installing Bitget Wallet and adding the Matchain network. Make sure you’re using our mini-app by going here: Next, you can use this step-by-step video tutorial to guide you through installing your wallet and connecting to the Matchain network in less than a minute. For more detailed instructions, check out our full guide: Don't miss out on this opportunity to join the Matchain ecosystem and earn exclusive rewards! 💡Remember: 🗓 Event Period: September 25, 17:00 - October 9, 17:00 (UTC+8) 🏆 Complete all tasks to be eligible for rewards 👀 Stay tuned for more in-depth guides!

Matchain

105,517 views • 1 year ago

"Using coding agents well is taking every inch of my 25 years of experience as a software engineer, and it is mentally exhausting. I can fire up four agents in parallel and have them work on four different problems, and by 11am I am wiped out for the day. There is a limit on human cognition. Even if you're not reviewing everything they're doing, how much you can hold in your head at one time. There's a sort of personal skill that we have to learn, which is finding our new limits. What is a responsible way for us to not burn out, and for us to use the time that we have?" Simon Willison

Lenny Rachitsky

1,931,032 views • 3 months ago

| $SGT Latest Update is LIVE! At we're dedicated to building an all-in-one DeFi ecosystem that covers everything from the hottest crypto trends and breaking news to top performers and in-depth analysis of call groups and influencer signals. Our latest update is designed to enhance your user experience, making it easier to navigate through our utilities and access critical insights. Here’s What’s New: 🔍 Search Bar: for tokens mentioned on our platform by name or contract hash, streamlining your access to key data. 📊 Token Profile Pages: with essential metrics such as Market Cap, Volume, Price, and Performance Charts. A detailed table at the bottom shows which influencers mentioned the token and how it performed afterward. 📈 Top 5 Visited Tokens: Check out the most viewed tokens on our platform. This new table highlights the tokens generating the most user interest, helping you stay ahead of the trends. 💻 Telegram App & Bots: With our focus on Telegram, we've added dedicated pages for our app and bots, making it easier to explore their features and utilities as we continue to expand our offerings. ⚡️ UI & Speed Enhancements: The platform has been optimized for a faster, smoother experience, with an improved user interface for better navigation. 📺 Watch the full tutorial: Join our community for more news and updates! 🌐 Visit us at :

ShillGuard $SGT

16,239 views • 1 year ago

We are excited to share our #CORL2024 paper on learning quadrotor obstacle avoidance from the visual stream of a single #eventcamera! Trained entirely in simulation! We demonstrate obstacle avoidance both in the dark and in a forest at up to 5 m/s. PDF: Video: Project page: Event cameras are sensors that output per-pixel intensity changes with microsecond latency; they feature nearly zero motion blur and high dynamic range, but they produce a very large volume of events under significant ego-motion and lack a high-fidelity continuous-time sensor model in simulation, which makes direct #sim2real transfer impossible. By leveraging depth prediction as a pretext task, we pre-train a reactive obstacle avoidance policy with “approximated” simulated events and then fine-tune the perception component with limited events-and-depth real-world data. This technique bridges the sim2real gap for #eventcameras! Since there is currently no continuous-time sensor model for event cameras, we hope that this work can spur future research that leverages simulation to train event-vision-based policies for faster, more agile robots! Kudos to Anish Bhattacharya, @marcocannic, Vijay Kumar Nikolai Matni UZH Science University of Zurich UZH Space Hub UZH IfI European Research Council (ERC) GRASP Laboratory Penn Engineering

Davide Scaramuzza

17,219 views • 1 year ago
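The event-camera post above describes sensors that report per-pixel intensity changes rather than full frames, and mentions training with "approximated" simulated events. As a rough illustration only (this is not the paper's simulator), the sketch below produces events from two grayscale frames by thresholding the change in log intensity. The function name `approximate_events`, the threshold value, and the two-frame difference model are all assumptions made for this example.

```python
import math

def approximate_events(prev_frame, next_frame, threshold=0.2, eps=1e-3):
    """Return (row, col, polarity) tuples for pixels whose log intensity
    changed by at least `threshold` between the two frames.

    Hypothetical frame-difference approximation: a real event camera fires
    asynchronously per pixel, which a two-frame model does not capture.
    """
    events = []
    for r, (prev_row, next_row) in enumerate(zip(prev_frame, next_frame)):
        for c, (p, n) in enumerate(zip(prev_row, next_row)):
            # eps avoids log(0) for fully dark pixels
            delta = math.log(n + eps) - math.log(p + eps)
            if abs(delta) >= threshold:
                events.append((r, c, 1 if delta > 0 else -1))
    return events

# Toy 2x2 frames with intensities in [0, 1] (made-up values)
prev_frame = [[0.5, 0.5], [0.5, 0.5]]
next_frame = [[0.5, 0.9], [0.2, 0.5]]
events = approximate_events(prev_frame, next_frame)
print(events)
```

Only the two pixels whose brightness changed emit events, one with positive and one with negative polarity; unchanged pixels stay silent, which is why a static scene yields no data at all.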

Check out our #PAMI paper with code "Dense Continuous-Time Optical Flow from Event Cameras," where we show how to regress *continuous-time* trajectories of every pixel from event cameras alone or events plus frames! The key idea is to iteratively estimate per-pixel polynomials using a recurrent lookup and update scheme. Paper: Code: DOI: We present a method for estimating dense continuous-time optical flow from event data. Traditional dense optical flow methods compute the pixel displacement between two images. Due to missing information, these approaches cannot recover the pixel trajectories in the blind time between two images. We show that it is possible to compute per-pixel, continuous-time optical flow using events from an event camera. Events provide temporally fine-grained information about movement in pixel space due to their asynchronous nature and microsecond response time. We leverage these benefits to predict pixel trajectories densely in continuous time via parameterized Bézier curves. To achieve this, we build a neural network with strong inductive biases for this task: First, we build multiple sequential correlation volumes in time using event data. Second, we use Bézier curves to index these correlation volumes at multiple timestamps along the trajectory. Third, we use the retrieved correlation to update the Bézier curve representations iteratively. Our method can optionally include image pairs to boost performance further. To train and evaluate our model, we introduce a synthetic dataset (MultiFlow) that features moving objects and ground truth trajectories for every pixel. Our quantitative experiments suggest that our method successfully predicts pixel trajectories in continuous time and is competitive in the traditional two-view pixel displacement metric on MultiFlow and DSEC-Flow. Open source code and datasets are released to the public. Kudos to Mathias Gehrig and Manasi Muglikar

Davide Scaramuzza

12,637 views • 2 years ago
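The optical-flow post above represents each pixel's trajectory as a parameterized Bézier curve that can be queried at multiple timestamps. The sketch below shows only that evaluation step, using De Casteljau's algorithm in plain Python. The control points are made-up values, the name `bezier_point` is hypothetical, and nothing here reproduces the paper's network or its correlation-volume lookup.

```python
def bezier_point(control_points, t):
    """Evaluate a Bézier curve at t in [0, 1] with De Casteljau's algorithm."""
    points = [tuple(p) for p in control_points]
    # Repeatedly interpolate between neighbours until one point remains
    while len(points) > 1:
        points = [
            tuple((1 - t) * a + t * b for a, b in zip(p0, p1))
            for p0, p1 in zip(points, points[1:])
        ]
    return points[0]

# Hypothetical control points: (x, y) displacement of one pixel, in pixels
control_points = [(0.0, 0.0), (2.0, 1.0), (4.0, 0.0)]

# Query the same trajectory at several normalized timestamps
timestamps = [0.0, 0.25, 0.5, 0.75, 1.0]
trajectory = [bezier_point(control_points, t) for t in timestamps]
for t, (dx, dy) in zip(timestamps, trajectory):
    print(f"t={t:.2f}  dx={dx:.3f}  dy={dy:.3f}")
```

Because the curve is defined for every `t`, the displacement can be read off between frames as well as at the endpoints, which is the property the post contrasts with two-view optical flow.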

While frontier closed models like Google’s Nano Banana can autonomously produce rich interleaved content (e.g., illustrated tutorials), open-source models still lag in both task coverage and generation quality. We introduce DuoGen, a dual transformer–diffusion framework that narrows this gap via an efficient decoupled design: a pretrained Multimodal LLM performs semantic reasoning and decides when to generate images, while a Video Diffusion Transformer ensures high-fidelity, consistent visuals—without costly mixed-modality pretraining. Enabled by a new large-scale interleaved instruction-tuning dataset. Code & data will be open-sourced. Paper: Project: Work led by Min Shi 🐝 originating from his summer internship at NVIDIA Research and continuing beyond, in collaboration with Ming-Yu Liu, Xiaohui Zeng, Jiannan Huang, Yin Cui, Jialuo Li, Tsung-Yi Lin, Max Zhaoshuo Li 李赵硕, Francesco Ferroni, Xiao Fu, Yogesh Balaji, Chieh-Yun Chen, and other colleagues 🚀

Humphrey Shi

16,833 views • 5 months ago

Unloading trucks with robots! 📦 Bastian Solutions has developed a mobile robot system that automates high-volume, floor-level unloading of trailers. Let's have a closer look at how it actually works! The robot seamlessly drives itself in and out of trailers and docks without any fixed infrastructure. This system picks up cases via an extendable conveyor, lifts them using a retractable mast, and accurately places them inside the trailer using a dedicated gripper. In the video you can see the reversed process, unloading of a truck. 🚛 It's guided by LiDAR-based navigation and an omnidirectional base, enabling fast, precise, and infrastructure-free operation. Integrates smoothly with existing systems like conveyors, scanners, sorters, and depalletizers. Every time I see a robot loading and unloading a truck, I'm like: we are getting there :) ~~ ♻️ Join the weekly robotics newsletter, and never miss any news →

Lukas Ziegler

64,540 views • 16 days ago

Introducing Digital Red Queen (DRQ): Adversarial Program Evolution in Core War with LLMs Blog: Core War is a programming game where self-replicating assembly programs, called warriors, compete for control of a virtual machine. In this dynamic environment, where there is no distinction between code and data, warriors must crash opponents while defending themselves to survive. In this work, we explore how LLMs can drive open-ended adversarial evolution of these programs within Core War. Our approach is inspired by the Red Queen Hypothesis from evolutionary biology: the principle that species must continually adapt and evolve simply to survive against ever-changing competitors. We found that running our DRQ algorithm for longer durations produces warriors that become more generally robust. Most notably, we observed an emergent pressure towards convergent evolution. Independent runs, starting from completely different initial conditions, evolved toward similar general-purpose behaviors—mirroring how distinct species in nature often evolve similar traits to solve the same problems. Simulating these adversarial dynamics in an isolated sandbox offers a glimpse into the future, where deployed LLM systems might eventually compete against one another for computational or physical resources in the real world. This project is a collaboration between MIT and Sakana AI led by Akarsh Kumar Full Paper (Website): Full Paper (arxiv): Code:

Sakana AI

143,831 views • 6 months ago
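The Core War post above notes that warriors are self-replicating programs living in a memory where code and data are not distinguished. A toy illustration of that idea is the classic "Imp" warrior, commonly written as `MOV 0, 1`, which copies its own instruction into the next cell and then executes it. The simulator below is a heavily simplified sketch with only two opcodes, relative addressing, and an assumed core size of 8; it is not a full Redcode implementation and is unrelated to the DRQ code.

```python
CORE_SIZE = 8  # assumed tiny core, for illustration only

def step(core, pc):
    """Execute one instruction at pc.

    Returns the next program counter, or None if the process dies.
    """
    op, a, b = core[pc]
    if op == "DAT":
        # Executing a data cell kills the process
        return None
    if op == "MOV":
        # Copy the whole instruction at pc+a into the cell at pc+b
        core[(pc + b) % CORE_SIZE] = core[(pc + a) % CORE_SIZE]
        return (pc + 1) % CORE_SIZE
    raise ValueError(f"unsupported opcode: {op}")

# Core filled with DAT cells, plus one Imp at address 0
core = [("DAT", 0, 0)] * CORE_SIZE
core[0] = ("MOV", 0, 1)

pc = 0
for _ in range(3):
    pc = step(core, pc)

print(pc)
print(core)
```

After three steps the Imp has written three copies of itself ahead of its starting cell and is executing the newest one. The same `MOV` that moves "data" is what keeps the program alive, which is the code-as-data property the post refers to.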

Xiaomi Robotics Lab, alongside Tsinghua and HKUST, just figured out how to give humanoids a real "feel" for physics. 🤖🛹 Their new HAIC framework lets robots handle objects they can't even see by using high order proprioception (essentially digital muscle memory). Instead of relying on cameras, the bot analyzes the feedback from its own joints to infer exactly how an object is moving in its blind spot. Here is what this physical intuition actually looks like in practice: ➤ Blind Skateboarding: Hits a 100% success rate on gliding and dismounting without ever looking at the board. ➤ Heavy Loads: Pushes carts weighing up to 70kg (approx. 154 lbs) and pulls up to 20kg (approx. 44 lbs) with total stability. ➤ Obstructed Locomotion: Carries bulky boxes that block its view of the ground while flawlessly navigating stairs and slopes. ➤ Adaptive Brain: Proactively compensates for inertia and weight shifts by predicting velocity and acceleration on the fly. ➤ Generalization: Handles various box sizes and heavy weights without needing any specific retraining for the new dimensions. This is a huge deal for real world work where cameras are often blocked or lighting is too messy for pure vision. Paper: Project: #Robot #Humanoid #Robotics #AI #EmbodiedAI #PhysicalAI #XiaomiRobotics #HAIC #SkateboardingRobot

RoboHub🤖

13,253 views • 4 months ago