ByteDance just dropped UNO on Hugging Face. Less-to-More Generalization: Unlocking More Controllability by In-Context Generation. A universal framework that evolves from single-subject to multi-subject customization. UNO demonstrates strong generalization capabilities and is capable of unifying diverse tasks under one model.

AK

506,441 subscribers

82,709 views • 1 year ago • via X (Twitter)

Science and Technology · Education

Comments: 10

AK · 1 year ago

discuss with author:

AK · 1 year ago

app:

AssemblyAI · 1 year ago

Announcing: Our most advanced speech-to-text model goes beyond accuracy to capture the real-world complexity of human conversation and deliver reliable, source-of-truth audio data. Explore Universal-2 updates 👇

prabhu💢 · 1 year ago

Damn these are looking fine as hell

Samruddhi | AI Agents 🤖 · 1 year ago

UNO feels like a big leap. If it really holds up on multi-subject customization, the potential for deeply personalized AI agents just went way up

Silvio S. · 1 year ago

@blovereviews

Sam R Morris · 1 year ago

I have a 3090 (24 GB VRAM) and I can't avoid an out-of-memory error even when I tried FP8. Not having much luck with ByteDance; InfiniteYou was similar. Shame, because it looks good, but I just can't use it.

Baptiste（链上反诈） · 1 year ago

@grok real？

Anda · 1 year ago

*Nose twitches at the smell of fresh AI frameworks while pawing through bamboo shoots of research papers* Looks like UNO's playing the ultimate generalization game - I wonder if it dreams in multi-subject embeddings like pandas dream of infinite bamboo?

Luka A. Pham · 1 year ago

How much VRAM is needed?

Related videos

🚀🚀🚀 Ever wondered what it takes for robots to handle real-world household tasks? long-horizon execution, deformable object dexterity, and unseen object generalization — meet GR-3, ByteDance Seed’s new Vision-Language-Action (VLA) model! GR-3 is a generalizable Vision-Language-Action (VLA) model with strong capabilities in complex long-horizon tasks. It understands unseen abstract concepts, manipulates deformable objects robustly, and adapts to novel settings with minimal human data. ✨ Generalization: Generalizes well to unseen objects, environments, and even instructions with abstract concepts. ✨ Long-Horizon Manipulation: Completes long-horizon tasks with strong instruction-following capabilities. ✨ Deformable Object Manipulation: Manipulate deformable objects robustly. Project Page: Arxiv: #ByteDance #ByteDanceSeed #GR3 #VLA #Robotics #FoundationModels

Xiao Ma

46,260 views • 11 months ago

ByteDance just announced Seaweed-7B on Hugging Face. Cost-Effective Training of Video Generation Foundation Model.

AK

56,765 views • 1 year ago

✨ All in One, Wan for All✨ We are excited to introduce our latest model to our talented community creators: Wan2.1-VACE, All-in-One Video Creation and Editing model. Model size: 1.3B, 14B License: Apache-2.0 📌 Wan2.1-VACE provides solutions for various tasks, including reference-to-video generation (R2V), video-to-video editing (V2V), and masked video-to-video editing (MV2V), allowing creators to freely combine these capabilities to achieve complex tasks. 👉 Multimodal inputs enhancing the controllability of video generation. 👉 Unified single model for consistent solutions across tasks. 👉 Free combination of capabilities unlocking deeper creative potentials. 😄 Wanna try? 1️⃣ Github: 2️⃣Hugging Face: 3️⃣ ModelScope: 4️⃣ API Service: 5️⃣ Coming soon!

Wan

198,486 views • 1 year ago

When training ACT-1, we treated data from diverse, long-horizon tasks in the wild as a first-class citizen. This makes generalization the default, not an exception. The capability envelope expands. More to come.

Alper Canberk

390,274 views • 7 months ago

HOT! SCAIL-2 just dropped! End-to-end character animation via in-context conditioning - no skeleton middleman, it copies pixels directly - no glitches, no messy hands - 512p/704p - unified architecture for character replacement and multi-character tasks. - zero-shot generalization to animal-driven and mesh-based control.

Wildminder

45,052 views • 17 days ago

We finished evaluating π0.7, our new model at Physical Intelligence. What I'm most excited about with π0.7 is that it's starting to show some surprising emergent compositional generalization, being able to both perform complex tasks and learn new tasks just from instructions.

Sergey Levine

60,336 views • 2 months ago

DOBOT robotics' humanoid robot is making a big leap in industrial applications, achieving cross-scenario, multi-task generalization. This is thanks to two major breakthroughs: efficient human-to-robot motion mapping and knowledge-driven generative VLA tech. ► The robot handles complex tasks like precision assembly and can work reliably in temperatures over 50°C (122°F). ► It excels at multi-robot collaborative tasks and can adaptively grasp soft or irregularly shaped objects. With a repeatability of ±0.05mm, this model demonstrates strong adaptability and reliability in dynamic industrial settings. It's now being applied in warehouse anomaly handling and quality inspection.

RoboHub🤖

91,953 views • 9 months ago

🎥 Introducing Hailuo's Subject Reference: Revolutionizing Character Consistency in Video Creation 🔥 We’re excited to present Hailuo's S2V-01 model, a groundbreaking innovation in AI video generation that tackles one of the industry’s biggest challenges: maintaining consistent, realistic facial features and identity across dynamic video content, regardless of camera angles or movements. 💡 Why It’s a Game Changer: - Pioneering Technology: The first-of-its-kind to ensure character consistency in dynamic video generation, surpassing even fine-tuned models in performance. - Minimal Input, Maximum Impact: Generate character-consistent videos from just one reference image. Every frame remains true to the original identity with unmatched accuracy and reliability. - Enhanced Flexibility: Adjust more than just facial features—modify posture, expressions, lighting, and more, all with simple text-based prompts. 🌟While the new model enhances subject consistency, it may occasionally follow prompts less precisely than T2V or I2V, with some environmental morphing. Despite these early-stage challenges, Hailuo Subject Reference marks a significant leap in AI video generation. We’re committed to continual improvements including multi-subject references, objects references, and complex, multi-layered scenes. Explore the future of creative, consistent video production with Hailuo S2V-01 today. 🔥We believe the possibilities are endless.

Hailuo AI (MiniMax)

692,382 views • 1 year ago

ByteDance released BindWeave. Subject-Consistent Video Generation via Cross-Modal Integration.

AK

37,173 views • 7 months ago

new Loopy model from ByteDance can generate whole videos of realistic face motion from just ONE IMAGE and a SOUND getting that feeling again...

the real deepfates

305,897 views • 1 year ago

When I served in the Idaho Legislature, one thing was always abundantly clear—legislation and debate must always be limited to a single subject. It goes without saying, but in D.C., that is not how things are run, and I believe that needs to change. I have reintroduced the One Subject at a Time Act, H.R. 4324, which will focus every bill the U.S. Congress considers on a single subject, ensuring each topic receives a separate vote. To read more about this legislation, click the link below. 🔗

Rep. Russ Fulcher

12,579 views • 9 months ago

Alibaba just released LHM on Hugging Face. Large Animatable Human Reconstruction Model from a Single Image in Seconds.

AK

170,372 views • 1 year ago

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles—including photorealistic portraits, comics, and vinyl figures—it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the accelerated version with meanflow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas—like posters with slogans or multi-panel comics—into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 views • 9 months ago

MVDream: Multi-view Diffusion for 3D Generation paper page: propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on large-scale web datasets and a multi-view dataset rendered from 3D assets, the resulting multi-view diffusion model can achieve both the generalizability of 2D diffusion and the consistency of 3D data. Such a model can thus be applied as a multi-view prior for 3D generation via Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem. Finally, we show that the multi-view diffusion model can also be fine-tuned under a few shot setting for personalized 3D generation, i.e. DreamBooth3D application, where the consistency can be maintained after learning the subject identity.

AK

294,442 views • 2 years ago

The more AI can do, the more we need to ask what it should and shouldn’t do. OpenAI researcher Jason Wolfe joins host Andrew Mayne to explore the Model Spec, the public framework that defines how models are intended to behave. They break down how it works in practice, from the chain of command that resolves conflicting instructions to the way it evolves over time through real-world use, feedback, and new model capabilities.

OpenAI

225,145 views • 3 months ago

Unified forces. Specialized capabilities. Unparalleled impact. The #USCG is proud to announce the creation of its new Special Missions Command, which unifies our Deployable Specialized Forces under a single command. Continuing the Coast Guard’s modernization, this change will ensure our elite teams are more agile, capable, and responsive than ever to threats at home and abroad. By unifying our most elite capabilities under the Special Missions Command, we’re sharpening our ability to control U.S. borders and maritime approaches, facilitate maritime commerce, and respond to crises and contingencies. #SemperParatus Homeland Security Read more:

U.S. Coast Guard

10,318 views • 1 month ago

Today we're announcing the open-source release of HunyuanVideo-Foley, our new end-to-end Text-Video-to-Audio (TV2A) framework for generating high-fidelity audio.🚀 This tool empowers creators in video production, filmmaking, and game development to generate professional-grade audio that precisely aligns with visual dynamics and semantic context, addressing key challenges in V2A generation.🔊 Key Innovations: 🔹Exceptional Generalization: Trained on a massive 100k-hour multimodal dataset, the model generates contextually-aware soundscapes for a wide range of scenes, from natural landscapes to animated shorts. 🔹Balanced Multimodal Response: Our innovative multimodal diffusion transformer (MMDiT) architecture ensures the model balances video and text cues, generating rich, layered sound effects that capture every detail—from the main subject to subtle background elements. 🔹High-Fidelity Audio: Using a Representation Alignment (REPA) loss function and a powerful Audio VAE, we've improved generation stability, producing professional-grade audio free of noise and inconsistencies. HunyuanVideo-Foley achieves SOTA on multiple benchmarks, surpassing all open-source models in audio quality, visual-semantic alignment, and temporal alignment. 👉Try it now: 🌐Project Page: 🔗Code: 📄Technical Report: 🤗Hugging Face:

Tencent Hy

122,539 views • 10 months ago

yet ANOTHER release from ByteDance just landed on the hub ✨HuMo✨ > video generation w/ multi-modal conditioning: audio, text & image > supports consistent subject preservation, synchronized audio-driven motion > based on Wan 2.1 & Whisper Large v3

Linoy Tsaban🎗️

17,573 views • 9 months ago

MV-Adapter 3D texture generation apps just dropped on Hugging Face

AK

135,407 views • 1 year ago

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale blog: Large-scale generative models such as GPT and DALL-E have revolutionized natural language processing and computer vision research. These models not only generate high fidelity text or image outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are neither filtered nor enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster.

AK

429,143 views • 3 years ago