FlowRVS - segmentation as a continuous deformation, mapping video latents directly to masks via an ODE. Built on Wan’s T2V. - complex semantic understanding with temporal consistency. - no flickering
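The "continuous deformation via an ODE" idea can be illustrated with a toy sketch: start from a video latent and integrate dz/dt = v(z, t) from t = 0 to t = 1 with fixed-step Euler updates until it reaches a mask latent. Everything here is an assumption made for illustration. The names `euler_integrate` and `make_toy_velocity` are hypothetical, the latents are three-element lists, and the velocity field is a hand-written straight-line flow rather than the learned, text-conditioned network a real model would use. It is not FlowRVS's implementation.

```python
from typing import Callable, List

Vector = List[float]


def euler_integrate(
    velocity: Callable[[Vector, float], Vector],
    z0: Vector,
    steps: int = 10,
) -> Vector:
    """Integrate dz/dt = velocity(z, t) from t=0 to t=1 with fixed Euler steps."""
    z = list(z0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a learned velocity network.

    Points from the current state to the target along a straight line,
    scaled so the state arrives at the target exactly at t=1.
    """
    def velocity(z: Vector, t: float) -> Vector:
        return [(ti - zi) / (1.0 - t) for ti, zi in zip(target, z)]

    return velocity


# Toy "video latent" and "mask latent" (made-up numbers).
video_latent = [0.2, -1.0, 0.7]
mask_latent = [1.0, 0.0, 1.0]

result = euler_integrate(make_toy_velocity(mask_latent), video_latent, steps=8)
print(result)
```

With a straight-line velocity field the Euler steps land on the target latent, which is why such flows can be integrated in few steps. A learned field would generally need a numerical solver and more care.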

Wildminder

8,529 subscribers

26,166 views • 3 months ago • via X (Twitter)

Similar Videos

Tracking Anything with Decoupled Video Segmentation paper page: Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation.

AK

305,560 views • 2 years ago

SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians Contributions: • We propose SuperGSeg: a 3D segmentation method with neural Gaussians, designed to learn hierarchical instance segmentation features from 2D foundation models. • We introduce the concept of Super-Gaussian, a novel representation that integrates hierarchical instance segmentation features, enabling the embedding of high-dimensional language features. This approach addresses previously unfeasible challenges in representing complex scenes with rich semantic details. • Extensive experiments on the LERF-OVS and ScanNet datasets demonstrate the effectiveness of the proposed method, achieving significant improvements in open-vocabulary 3D object-level and scene-level semantic segmentation. It shows particular strength in capturing fine-grained scene details and dense pixel semantic segmentation tasks for the first time.

MrNeRF

13,594 views • 1 year ago

Bussing plates after a healthy meal. Our robot only eats plastic fruits and veggies for now. Video segmentation accomplished using Track-Anything, an architecture combining Meta's SAM to generate a zero-shot segmentation prompt and Xmem to enable long-horizon, temporally consistent video segmentation masks.

Watney Robotics

14,856 views • 2 years ago

NVIDIA just dropped UniRelight, a handy video-relighting model - estimates albedo, relit video in a single pass - based on Cosmos-Predict1-7B - supports complex materials - temporal consistency - beats DiLightNet, NeuralGaffer

Wildminder

29,374 views • 1 month ago

🔥Ultra-Long Video World Model up to 5min🔥 ✨ We introduce #LongVie2, an end-to-end autoregressive video world model that supports continuous video generation lasting up to 5min with: 🕹️ Strong Controllability 📷 Long-term Visual Fidelity 🔒 Temporal Consistency - Project: - Code: - Paper: . Thanks to AK !

Ziwei Liu

82,381 views • 5 months ago

VSTAR Generative Temporal Nursing for Longer Dynamic Video Synthesis Despite tremendous progress in the field of text-to-video (T2V) synthesis, open-sourced T2V diffusion models struggle to generate longer videos with dynamically varying and evolving content. They tend to

AK

36,982 views • 2 years ago

🕹️We are excited to introduce "ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation" ChronoEdit reframes image editing as a video generation task to encourage temporal consistency. It leverages a temporal reasoning stage that denoises with “video reasoning tokens” to "reason" on physically plausible edits. See the attached video for results. Project Page: Arxiv: Code and model are coming.

Huan Ling

36,818 views • 8 months ago

Got tired of boring websites. Built a way to drop 3D agents directly onto standard HTML elements with zero complex math.

nich

135,799 views • 21 days ago

Video Analysis and Generation via a Semantic Progress Function paper:

AK

34,219 views • 1 month ago

Free4D announced on Hugging Face Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

AK

22,878 views • 1 year ago

Introducing Continuous Thought Machines New Blog: Modern AI is powerful, but it’s still distinct from human-like flexible intelligence. We believe neural timing is key. Our Continuous Thought Machine is built from the ground up to use neural dynamics as a powerful representation for intelligence. Thought takes time, and reasoning is a process. Biological brains inspire us with their complex neural activity, where neural timing is critical to intelligence. We’re exploring how to bring that power to AI. The Continuous Thought Machine (CTM) incorporates neuron-level temporal processing and neural synchronization, moving beyond current AI limitations. Our approach has two core innovations: (1) neuron-level temporal processing, where each neuron uses unique parameters to process a history of incoming signals for fine-grained temporal dynamics, and (2) neural synchronization, used as a direct latent representation to modulate data and produce outputs, encoding information directly in the timing of neural activity. Learn more about our approach: Interactive Report: Full Paper: GitHub :

Sakana AI

289,680 views • 1 year ago

One for all, and all for one 🧧 Introducing Ming-flash-omni-2.0: A specialist in every domain, unified as a capable generalist. A gift from Ling =) - Unified Acoustic Synthesis: Speech, audio, and music combined for unbounded creativity; - "Seeing" to "Knowing": Moving beyond input to true deep semantic understanding; - Native Visual Fusion: Seamless generation, editing, and segmentation;

Ant Ling

1,925,421 views • 4 months ago

'Rerender A Video' by Shuai Yang et al. uses a cluster of techniques to remove the flickering and temporal inconsistency that happens when using stable diffusion for video, while still being compatible with LoRA/ControlNet-driven content.

Ben Ferns

726,935 views • 3 years ago

CoDeF: Content Deformation Fields for Temporally Consistent Video Processing abs: paper page: We present the content deformation field CoDeF as a new type of video representation, which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e., rendered from the canonical content field) to each individual frame along the time axis. Given a target video, these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We advisedly introduce some regularizations into the optimization process, urging the canonical content field to inherit semantics (e.g., the object shape) from the video. With such a design, CoDeF naturally supports lifting image algorithms for video processing, in the sense that one can apply an image algorithm to the canonical image and effortlessly propagate the outcomes to the entire video with the aid of the temporal deformation field. We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training. More importantly, thanks to our lifting strategy that deploys the algorithms on only one image, we achieve superior cross-frame consistency in processed videos compared to existing video-to-video translation approaches, and even manage to track non-rigid objects like water and smog.

AK

153,241 views • 2 years ago

Wan2.2 is now natively supported in ComfyUI on Day 0! 🔹 A next-gen video model with MoE (Mixture of Experts) architecture with dual noise experts, under Apache 2.0 license! - Cinematic-level Aesthetic Control - Large-scale Complex Motion - Precise Semantic Compliance 📚 Versions available: - Wan2.2-TI2V-5B: FP16 - Wan2.2-I2V-14B: FP16/FP8 - Wan2.2-T2V-14B: FP16/FP8 💻 Down to 8GB VRAM requirement for the 5B version with ComfyUI auto-offloading.

ComfyUI

82,926 views • 10 months ago

Semantic Chemistry: algorithmic chemistry on semantic graphs as a creative inference control strategy.

Ben Goertzel

20,186 views • 26 days ago

𝗠𝗲𝘁𝗮𝗖𝗮𝗻𝘃𝗮𝘀 🎨 lets MLLMs "𝘥𝘳𝘢𝘧𝘵 𝘢𝘯𝘥 𝘥𝘳𝘢𝘸 𝘰𝘯 𝘢 𝘤𝘢𝘯𝘷𝘢𝘴" to guide diffusion generators. - canvas: spatial & temporal latents - gains on 6 tasks: T2I, T/I2V, image/video editing, in-context vgen. 📄 🌐

Ziqi Huang

11,054 views • 5 months ago

LTX-2.3 Transition LoRA. preserves semantic stability for T2V/I2V without the usual mid-clip identity drift.

Wildminder

10,535 views • 2 months ago

[NeurIPS '24] DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation Abstract (excerpt) We introduce DreamMesh4D, a novel framework that combines mesh representation with sparse-controlled deformation technique to generate high-quality 4D object from a monocular video. To overcome the limitation of classical texture representation, we bind Gaussian splats to the surface of the triangular mesh for differentiable optimization of both the texture and mesh vertices. In particular, DreamMesh4D begins with a coarse mesh provided by a single image based 3D generation method. Sparse points are then uniformly sampled across the surface of the mesh, and are used to build a deformation graph to drive the motion of the 3D object for the sake of computational efficiency and providing additional constraint. For each step, transformations of sparse control points are predicted using a deformation network, and the mesh vertices as well as the bound surface Gaussians are deformed via a geometric skinning algorithm. The skinning algorithm is a hybrid approach combining LBS (linear blending skinning) and DQS (dual-quaternion skinning), mitigating drawbacks associated with both approaches. The static surface Gaussians and mesh vertices as well as the dynamic deformation network are learned via reference view photometric loss, score distillation loss as well as other regularization losses in a two-stage manner. Extensive experiments demonstrate that our method outperforms prior video-to-4D generation methods in terms of rendering quality and spatial-temporal consistency.
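The abstract mentions linear blend skinning (LBS) as one half of its hybrid skinning algorithm. As a rough illustration of LBS alone, the sketch below deforms a single vertex by a weighted average of the positions produced by several control-point transforms. It is a simplified 2D toy under stated assumptions: the function names, the (angle, translation) transform encoding, and the numbers are all hypothetical, and the dual-quaternion part of the hybrid and the paper's deformation network are not modeled.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
# Hypothetical encoding of a rigid 2D transform: (rotation in radians, translation).
Transform = Tuple[float, Point]


def apply_transform(transform: Transform, p: Point) -> Point:
    """Rotate p about the origin by the given angle, then translate it."""
    angle, (tx, ty) = transform
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)


def linear_blend_skinning(
    vertex: Point,
    transforms: Sequence[Transform],
    weights: Sequence[float],
) -> Point:
    """Blend the transformed positions of one vertex using normalized weights."""
    if len(transforms) != len(weights):
        raise ValueError("need exactly one weight per transform")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    x = y = 0.0
    for transform, w in zip(transforms, weights):
        px, py = apply_transform(transform, vertex)
        x += (w / total) * px
        y += (w / total) * py
    return (x, y)


# One vertex influenced equally by two control points:
# the first leaves it in place, the second shifts it by (2, 0).
controls: List[Transform] = [(0.0, (0.0, 0.0)), (0.0, (2.0, 0.0))]
blended = linear_blend_skinning((1.0, 0.0), controls, [0.5, 0.5])
print(blended)
```

Averaging transformed positions like this is cheap but can shrink volume under large rotations, a commonly cited LBS drawback and one reason methods combine it with dual-quaternion skinning.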

MrNeRF

12,323 views • 1 year ago

The Multi-Repo AI Agent by Zencoder is a clear evolution in AI tooling. Automation adapts to complex environments by understanding distributed code, integrating with existing workflows, and supporting continuous delivery. More > Partnership zencoderai

Antonio Grasso

12,887 views • 10 months ago