LETS GOO! Parler TTS 🔥 A fully open-source, Apache 2.0 licensed Text-to-speech model focused on providing maximum controllability. Through voice prompts, you can control the pitch, speed, gender, noise levels, emotion characteristics and more! > Trained on 10K hours of permissive data. > Offers control over the generations. >... show more

Vaibhav (VB) Srivastav

41,658 subscribers

156,386 views • 2 years ago • via X (Twitter)

9 comments

Vaibhav (VB) Srivastav • 2 years ago

Try it out in the space directly (& share your generations below)!

Vaibhav (VB) Srivastav • 2 years ago

Check out our inference plus training code base here:

Vaibhav (VB) Srivastav • 2 years ago

You should also be able to use it in a Colab with less than 10 lines of code:

    import torch
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer
    import soundfile as sf

    # Use the GPU when one is available, otherwise fall back to the CPU
    device = "cuda:0" if torch.cuda.is_available() else "cpu"

    model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
    tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

    # prompt is the text to speak; description is the voice prompt that steers how it sounds
    prompt = "Hey, how are you doing today?"
    description = "A female speaker with a slightly low-pitched voice delivers her words quite expressively, in a very confined sounding environment with clear audio quality."

    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    # Generate the waveform and write it to disk at the model's sampling rate
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio_arr = generation.cpu().numpy().squeeze()
    sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
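Since the voice is steered entirely by the free-text description, it can help to assemble that string from a few named attributes. The following is only a minimal sketch: `build_description` is a hypothetical helper, not part of `parler_tts`, and which wordings the model actually responds to is an assumption that would need to be checked by listening to the output.

```python
def build_description(
    gender="female",
    pitch="slightly low-pitched",
    delivery="quite expressively",
    environment="very confined sounding",
    quality="clear",
):
    """Assemble a voice description string (hypothetical helper, not a parler_tts API)."""
    # Pick a possessive pronoun that matches the requested speaker gender
    pronoun = {"female": "her", "male": "his"}.get(gender, "their")
    return (
        f"A {gender} speaker with a {pitch} voice delivers {pronoun} words "
        f"{delivery}, in a {environment} environment with {quality} audio quality."
    )


# The defaults reproduce the description used in the snippet above
description = build_description()

# Changing one attribute at a time makes it easier to hear what each one does
male_variant = build_description(gender="male", pitch="high-pitched")
```

The resulting string would be passed to the tokenizer in place of the hard-coded `description`.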

Dennis Lysenko • 2 years ago

@reach_vb this is awesome -- can we run this on Replicate?

Vaibhav (VB) Srivastav • 2 years ago

Not yet, but you can try it out and use it here:

Javier de la Rosa @versae@mastodon.social • 2 years ago

This is really cool! I've been looking at Parler and Data-Speech and would love to give it a try for low-resource languages. What's the minimum amount of hours needed for this to adapt to another language? And does the audio need to be separated by speaker?

Vaibhav (VB) Srivastav • 2 years ago

We will release fine-tuning support soon. I think for the most part the quality of the dataset matters way more than the quantity. You’d need to have enough diversity to ensure a balance in the voice prompts. Once that is in you should be able to train in any language. That said, we haven’t tried this yet, so this is all a hypothesis at this point.

bitBrain • 2 years ago

@ClementDelangue but can it laugh? nono sorry, I mean Holy shit! nice!

bane • 2 years ago

@huggingface Not bad

Related videos

Introducing Indic-Parler TTS - Trained on 10K hours of data, 938M params, supports 20 Indic languages, emotional synthesis, apache 2.0 licensed! 🔥 A collaboration w/ AI4Bharat & Hugging Face - w/ fully customisable speech and voice personas! Try it out directly below or use the model weights as you want! 🇮🇳/acc

Vaibhav (VB) Srivastav

42,165 views • 1 year ago

Smart Turn v2: open source, native audio turn detection in 14 languages. New checkpoint of the open source, open data, open training code, semantic VAD model on Hugging Face, fal, and Pipecat AI.
- 3x faster inference (12ms on an L40)
- 14 languages (13 more than v1, which was English-only)
- New synthetic data set `chirp_3_all` with ~163k audio samples
- 99% accuracy on held-out `human_5_all` test data

Good turn detection is critical for voice agents. This model "understands" both semantic and audio patterns, and mitigates the voice AI trade-off between unwanted turn latency vs the agent interrupting people before they are finished speaking. Training scripts for both Modal and local training are in the repo. We want to make it as easy as possible to contribute to or customize this model! Here's a demo running the smart-turn model with default settings, aimed at generally hitting 400ms total turn detection time. You can tune things to be faster, too. You can help by contributing data, doing architecture experiments, or cleaning open source data! Keep reading ...

kwindla

42,246 views • 1 year ago

NVIDIA just released a new open source transcription model, Nemotron Speech ASR, designed from the ground up for low-latency use cases like voice agents. Here's a voice agent built with this new model. 24ms transcription finalization and total voice-to-voice inference time under 500ms. This agent actually uses *three* NVIDIA open source models: - Nemotron Speech ASR - Nemotron 3 Nano 30GB in a 4-bit quant (released in December) - A preview checkpoint of the upcoming Magpie text-to-speech model These models are all truly open source: weights, training data, training code, and inference code. This is a big deal! Jensen said in the CES keynote yesterday that he expects open source models to catch up to proprietary models this year in a number of categories. NVIDIA is putting their weight behind making this happen. (As Alan Kay said, the best way to predict the future is to invent it.) The code for this agent is open source too, of course. You can deploy it to production with Modal and Pipecat AI cloud, or run locally on an NVIDIA DGX Spark or RTX 5090.

kwindla

274,474 views • 6 months ago

Introducing Indic Parler-TTS: Open-Source Text-to-Speech for Over a Billion Indic Speakers! 🌏 In collaboration with Hugging Face, we are excited to release Indic Parler-TTS, a state-of-the-art open-source text-to-speech system designed to bring accessible and high-quality speech technology to India’s diverse linguistic community. Supporting 20 of the 22 scheduled Indian languages—and English in various accents (US, British, Indian)—it’s built to serve over a billion speakers and empower companies, developers, researchers, and communities. Why Indic Parler-TTS Stands Out: 1. Open and Accessible: Fully open-source with permissive licensing for unrestricted usage. 2. Wide Language Support: Includes a vast range of Indic languages, with rich diversity in voices. 3. High-Quality Audio: Produces natural, clear, and lifelike speech. 4. Adaptable and Fine-Tunable: Customize it to new languages, accents, or specific applications. 5. State-of-the-Art Performance: Proven through rigorous evaluation. 6. Versatile and Inclusive: Indic Parler-TTS offers 69 unique voices across 18 Indian languages, making it a perfect fit for diverse use cases like audiobooks, virtual assistants, and educational tools. Let's democratize speech technology together and make speech technology more inclusive and accessible for everyone. ▶️ Experience it now: Demo: Model page:

AI4Bharat

28,586 views • 1 year ago

HOLY FUCK! Zyphra just dropped Zonos - Apache 2.0 licensed, Multilingual, Text to Speech model with INSTANT voice cloning! 🔥 > Zero-shot TTS with Voice Cloning: Input text and a 10-30 second speaker sample to generate high-quality text-to-speech output > Audio Prefix Inputs: Enhance speaker matching by adding an audio prefix to the text, enabling behaviors like whispering that are hard to achieve with voice cloning alone > Multilingual Support: Supports English, Japanese, Chinese, French, and German > Audio Quality & Emotion Control: Fine-tune speaking rate, pitch, frequency, audio quality, and emotions (e.g., happiness, anger, sadness, fear) > Fast Performance: Runs at ~2x real-time speed on an RTX 4090 > Available on the Hugging Face Hub 🤗

Vaibhav (VB) Srivastav

298,858 views • 1 year ago

Introducing Fish Speech 1.5 🎉 - Making state-of-the-art TTS accessible to everyone! Highlights: - #2 ranked on TTS-Arena (as "Anonymous Sparkle") - 1M hours of multilingual training data - 13 languages supported, including English, Chinese, Japanese & more - <150ms latency with high-quality instant voice cloning - Pretrained model now open source - Cost-effective self-hosting or cloud options Let's check out the details 🧵⬇️

Fish Audio

101,606 views • 1 year ago

This giant free dataset could make helper robots way smarter, way faster: An open-source robotics stack from Berkeley AI researchers featuring the largest teleoperation dataset released to date with over 3,500 hours of bimanual manipulation data across 200 tasks. The video showcases autonomous bimanual robot performance on dexterous tasks including box folding, Lego sorting, AirPod insertion, t-shirt folding, backpack packing, and box unlocking using learned policies. Sim-to-real correlations, training insights like flow loss predicting real-world success, lightweight infrastructure for DAgger interventions Thank you for sharing, Ritvik Singh, and everyone else who contributed to this! Links to the paper plus dataset at under permissive licensing. ——- Weekly robotics and AI insights. Subscribe free:

Ilir Aliu

13,264 views • 1 month ago

LegoGPT, an LLM-based system that generates physically stable LEGO structures from text prompts, backed by a new 47,000+ sample dataset and physics-aware filtering during inference. → LegoGPT is trained on a custom dataset, StableText2Lego, which includes 47,000+ 3D LEGO models mapped to text, spanning 28,000+ unique objects. → The model predicts LEGO bricks sequentially like tokens, using next-token prediction in a transformer setup. → To ensure physical stability, LegoGPT integrates physics-aware rollback and validity filtering, pruning out structurally invalid brick placements. → The generated designs are aesthetically aligned with prompts, physically buildable, and tested both with human manual assembly and robotic arms. → The team also introduced a text-driven LEGO coloring/texturing pipeline, enabling more expressive and customized outputs. → The dataset, code, and models are all publicly released under an open-access license.

Rohan Paul

75,268 views • 1 year ago

Today we released Meta Spirit LM, our first open source multimodal language model that freely mixes text and speech. Many existing AI voice experiences today use ASR techniques to process speech before synthesizing with an LLM to generate text, but these approaches compromise the expressive aspects of speech. Using phonetic, pitch and tone tokens, Spirit LM models can overcome these limitations for both inputs and outputs to generate more natural sounding speech while also learning new tasks across ASR, TTS and speech classification. We hope that sharing this work will enable the research community to further new approaches for text and speech integration.

AI at Meta

351,739 views • 1 year ago

NEW: Higgs Audio V2 from BosonAI open, unified TTS model w/ voice cloning, beats GPT 4o mini tts and ElevenLabs v2 🔥 > Trained on 10M hours (speech, music, events) > Built on top of Llama 3.2 3B > Works real-time and on edge > Beats GPT-4o-mini-tts, ElevenLabs v2 in prosody & emotion Multi-speaker dialog > Zero-shot voice cloning 🤩 > Available on Hugging Face Kudos to folks at Boson AI for releasing such a brilliant work and all the details around the model! 🤗

Vaibhav (VB) Srivastav

79,585 views • 1 year ago

VoxCPM 2 just dropped by OpenBMB Only 2B-param open-source TTS (Text-to-Speech) model built for production-grade multilingual voice work. Apache-2.0 license, Can run on only 8GB VRAM. • Eliminates the "robotic" feel of traditional TTS, delivering prosody and emotional depth suitable for high-stakes professional environments like filmmaking, gaming, animation, and audiobooks. • 30-language multilingual: no language tag needed, just type in a supported language and generate directly. • Voice design: create a brand-new voice from a text description alone, like age, tone, pace, or emotion. No reference audio required. Describe the desired voice characteristics (gender, age, tone, emotion, pace …) in Control Instruction, and VoxCPM2 will craft a unique voice from your description alone. • Controllable cloning: clone from a short clip, then steer delivery style without losing the speaker’s core voice. • Ultimate cloning: use reference audio + transcript for continuation-style cloning that keeps the tiny vocal details. • 48kHz output: takes 16kHz reference audio and produces studio-quality speech without an external upsampler. • Real-time ready: around 0.3 RTF on RTX 4090, even lower with Nano-VLLM. • Commercial use: Apache-2.0 licensed. Developer-Friendly Infrastructure: - Native Torch Inference: Direct support for PyTorch-based workflows. - Training Flexibility: Supports both full-parameter and LoRA fine-tuning for specific domain adaptation. - Production Readiness: Compatible with voxcpm-nanovllm for large-scale, high-concurrency deployment.
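For readers unfamiliar with the RTF figure quoted above: real-time factor is the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. A minimal sketch of the arithmetic follows; the timings are illustrative numbers chosen to give 0.3, not measurements of VoxCPM 2, and `real_time_factor` is a hypothetical helper name.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative numbers only: 3 seconds of compute for 10 seconds of speech
rtf = real_time_factor(synthesis_seconds=3.0, audio_seconds=10.0)

# The inverse tells how many seconds of audio come out per second of compute
speedup = 1.0 / rtf

print(f"RTF = {rtf:.2f}, about {speedup:.1f}x faster than real time")
```

By the same definition, a model described as running at about 2x real-time speed would have an RTF of roughly 0.5.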

Rohan Paul

13,541 views • 3 months ago

🚨 FASHN VTON v1.5 is now open source. Released in early 2025 and still widely used, this virtual try-on model generates photorealistic images from a person image and a garment image. Key points: • Pixel-space RGB generation, no VAE • Maskless inference, no person segmentation needed • 972M parameters, ~5s on H100, runs on consumer GPUs • Apache 2.0 licensed, first commercially usable open-source VTON Why open source? While the industry moves toward massive generalist models, FASHN VTON v1.5 proves a focused alternative. This is a production-grade virtual try-on model you can train for $5–10k, own, study, and extend. Built for researchers, developers, and fashion tech teams who want more than black-box APIs. More info in the comments.

FASHN AI

19,111 views • 5 months ago

This is a pretty wild model! You can use it to turn an image into a 3D object with texture. The quality is out of this world! I'm not even a designer, and I've been using this nonstop for the last 2 hours. The model is Hunyuan 3D 2.1. It's open source. You'll find model weights, training/inference code, data pipelines, and architecture on their repository. You can even fine-tune it if you want! GitHub Repository: By the way, the model runs on consumer-grade GPUs. You don't need a datacenter for this! I've been using the model from the HuggingFace demo page: To use it, go to the link and upload an image. That's it! Check out the video I recorded for a couple of examples.

Santiago

44,783 views • 1 year ago

We’re excited to announce the release and open-source of HunyuanImage 3.0, the largest and most powerful open-source text-to-image model to date, with over 80 billion total parameters, of which 13 billion are activated per token during inference. The effect is completely comparable to the industry’s flagship closed-source model. 🚀🚀🚀 HunyuanImage 3.0 originates from our internally developed native multimodal large language model, with fine-tuning and post-training focused on text-to-image generation. This unique foundation gives the model a powerful set of capabilities: ✅ Reason with world knowledge ✅ Understand complex, thousand-word prompts ✅ Generate precise text within images Different from traditional DiT architecture image generation models, HunyuanImage 3.0’s MoE architecture uses a Transfusion-based approach to deeply couple Diffusion and LLM training for a single, powerful system. Built on Hunyuan-A13B, HunyuanImage 3.0 was trained on a massive dataset: 5 billion image-text pairs, video frames, interleaved image-text data, and 6 trillion tokens of text corpora. This hybrid training across multimodal generation, understanding, and LLM capabilities allows the model to seamlessly integrate multiple tasks. Whether you're an illustrator, designer, or creator, this is built to slash your workflow from hours to minutes. HunyuanImage 3.0 can generate intricate text, detailed comics, expressive emojis, and lively, engaging illustrations for educational content. The current release focuses solely on text-to-image generation and future updates will include image-to-image, image editing, multi-turn interaction, and more. 👉🏻 Try it now: 🔗 GitHub: 🤗 Hugging Face:

Tencent Hy

412,658 views • 10 months ago

🔥 JUST IN: Open-source robotics dataset from 100% real-world scenarios! 🤯 Chinese robotics company AGIBOT just released AGIBOT WORLD 2026, an open-source dataset systematically covering key embodied AI research directions. Built entirely from real-world environments: commercial spaces, and homes. Collected using AGIBOT G2 robots in free-form collection mode, providing structured, accurately annotated, high-quality data. Digital twin technology creates 1:1 scale replicas in simulation matching the real environments. Both real-world and simulation data are open-sourced. The AGIBOT G2 platform collects multiple data types simultaneously: RGB(D) cameras, tactile sensors, force sensors, LiDAR, IMU, and full-body joint states. Whole-body control coordinates arms, waist, and hands for complex tasks. First-person teleoperation lets operators control the robot from its perspective. The tasks covered are fine-grained manipulation, ultra-long-horizon tasks, spatial navigation, dual-arm coordination, and multi-agent/human-robot collaboration. The dataset includes error-recovery trajectories with annotations. Most datasets only show successful demonstrations. AGIBOT includes failures and how the robot recovers, teaching models how to handle mistakes. After collection, data is tested through policy training and real-robot deployment to ensure quality. Then processed through industrial quality control with multiple screening and cleaning rounds. Making it open-source accelerates embodied AI research by giving researchers access to high-quality real-world robot data at scale. 🇨🇳 Learn more here: ~~ ♻️ Join the weekly robotics newsletter, and never miss any news →

Lukas Ziegler

40,583 views • 3 months ago

Tencent presents GameGen-O Open-world Video Game Generation We introduce GameGen-O, the first diffusion transformer model tailored for the generation of open-world video games. This model facilitates high-quality, open-domain generation by simulating a wide array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, thus allowing for the gameplay simulation. The development of GameGen-O involves a comprehensive data collection and processing effort from scratch. We collect and build the first Open-World Video Game Dataset (OGameData), amassed extensive data from over a hundred of next-generation open-world games, employing a proprietary data pipeline for efficient sorting, scoring, filtering, and decoupled captioning. This robust and extensive OGameData forms the foundation of our model's training process. GameGen-O undergoes a two-stage training process, consisting of foundation model pretraining and instruction tuning. In the first phase, the model is pre-trained on the OGameData via the text-to-video and video continuation, endowing GameGen-O with the capability for open-domain video game generation. In the second phase, the pre-trained model is frozen, and we fine-tuned using a trainable InstructNet, which enables the production of subsequent frames based on multimodal structural instructions. This whole training process imparts the model with the ability to generate and interactively control content. In summary, GameGen-O represents a notable initial step forward in the realm of open-world video game generation via generative models. It underscores the potential of generative models to serve as an alternative to rendering techniques, which can efficiently combine creative generation with interactive capabilities.

AK

367,000 views • 1 year ago

1/5 🚀 Thrilled to open-source OSCAR 🤖 — an action-conditioned world model for robotics, led by the visiting student in my group Zhuoyuan Wu! It generalizes across different robot embodiments with precise action controllability. All trained on a single GH200 GPU, and outperforms existing open-sourced baselines, which have larger model capacity and need more compute. Everything is public, including training data. 📄 Paper: 🌐 Project: 💻 Code: 🤗 Robot data: 🤗 Human data: 🤗 Weights: #Robotics #WorldModels #AI #OpenSource

Jun Gao

104,280 views • 1 month ago

🔥🔥🔥We’ve been listening to your feedback! Our latest world model HY-World 1.5 just got a major upgrade to make world generation more accessible than ever: 🛠️ Open Training Code: Fully customizable code for building and training your own models. ⚡ Accelerated Inference: Turbocharged speed and optimized VRAM for real-time interaction. 📉 Lite 5B Model: A new lightweight model that fits into small-VRAM GPUs. 🙌 Zero Waitlist: Our online app is now fully open to everyone—no application required. This is just the beginning. HY-World is building the future of spatial intelligence—open, accessible, and community-driven. 🕹️ Play now: ⭐ GitHub:

Tencent Hy

20,581 views • 6 months ago

Here is an open-source tool to generate a complete dataset.
1. Describe the data you want
2. An orchestrator agent searches the web
3. Sub-agents run in parallel to fetch the data
4. You get a structured dataset you can download
For example, you can run Bigset with the query "all leica lenses being sold on amazon", or "leica stores in kyoto with their opening hours and ratings". Bigset uses TinyFish's free Search and Fetch APIs in the background. You can configure it to refresh the data on a schedule. You can self-host it with your own keys. Here is the GitHub repository: You can get free TinyFish API keys here: Thanks to the TinyFish team for partnering with me on this post.
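
The four steps above follow a common orchestrator and parallel-worker pattern, which can be sketched roughly as below. This is not Bigset's code and it does not call TinyFish's APIs: `search`, `fetch`, `Row`, and `build_dataset` are hypothetical stand-ins that return placeholder data, so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Row:
    source: str
    title: str


def search(query: str) -> list[str]:
    """Hypothetical stand-in for the orchestrator's web search; returns candidate sources."""
    slug = query.replace(" ", "-")
    return [f"https://example.com/{slug}/{i}" for i in range(3)]


def fetch(url: str) -> Row:
    """Hypothetical stand-in for one sub-agent that fetches and parses a single source."""
    return Row(source=url, title=f"item from {url.rsplit('/', 1)[-1]}")


def build_dataset(query: str, max_workers: int = 4) -> list[Row]:
    """Search once, then fan the fetches out to a pool of workers."""
    urls = search(query)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch concurrently and yields results in input order.
        return list(pool.map(fetch, urls))


rows = build_dataset("leica lenses")
```

Threads suit this kind of I/O-bound fetching; a real implementation would also need error handling, rate limiting, and deduplication, none of which are shown here.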

Santiago

20,756 views • 1 month ago