🔉 Introducing SAM Audio, the first unified model that isolates any sound from complex audio mixtures using text, visual, or span prompts. We’re sharing SAM Audio with the community, along with a perception encoder model, benchmarks and research papers, to empower others to explore new forms of expression and...

AI at Meta

810,664 subscribers

1,248,952 views • 6 months ago • via X (Twitter)

Science & Technology

Related Videos

Introducing SAM Audio: the first unified AI model that allows you to isolate and edit sound from complex audio mixtures. This could mean isolating the guitar in a video of your band, filtering out traffic noises, or removing the sound of a dog barking in your podcast, all with text, visual, and time span prompts.

Meta Newsroom

204,357 views • 6 months ago

1. Meta’s open-sourced multisensory model: Meta is back (again!) with yet another exciting open-source project. Introducing ImageBind, a new AI research model that understands and combines text, audio, visual, movement, thermal, AND depth data.

Rowan Cheung

173,984 views • 3 years ago

This is a big day. Meta is open-sourcing AudioCraft. You can now generate incredible music and sounds with a single prompt. It includes the most performant Generative AI Model (audio) on the market, the "Llama" of Audio. The research framework contains the weights and code of these models:
▸ MusicGen: controllable text-to-music model.
▸ AudioGen: text-to-sound model.
▸ EnCodec: high fidelity neural audio codec.
▸ Multi Band Diffusion: An EnCodec compatible decoder using diffusion.
This is going to tremendously speed up audio research 👏

Lior Alexander

231,597 views • 2 years ago

OSINT Tool: Extract Sounds from Audio Files 🎵🔍 Most audio tools just let you play or cut files. Finding a specific sound inside a long recording can be tedious. AudioGhost AI changes that. It uses a memory-optimized SAM-Audio model to extract specific sounds from audio files using text queries. You can: 🎯 Search for particular sounds in any audio 🖥️ Use a modern, user-friendly interface 💾 Work efficiently even with large files This is perfect for investigations, research, or any project where audio analysis matters. 🔗 Website link: __________ P.S. ♻️ Repost if you found this helpful. If you liked this post and would like to learn more methods and techniques to discover information about people, check out my OSINT Mastery course

CyberSudo

16,095 views • 4 months ago

Meet SAM 3, a unified model that enables detection, segmentation, and tracking of objects across images and videos. SAM 3 introduces some of our most highly requested features like text and exemplar prompts to segment all objects of a target category. Learnings from SAM 3 will help power new features in Instagram Edits and Vibes, bringing advanced segmentation capabilities directly to creators. 🔗 Learn more:

AI at Meta

189,875 views • 7 months ago

As detailed in the Meta Movie Gen technical report, today we’re open sourcing Movie Gen Bench: two new media generation benchmarks that we hope will help to enable the AI research community to progress work on more capable audio and video generation models. Movie Gen Video Bench is the largest and most comprehensive benchmark ever released for evaluating text-to-video generation. It includes a collection of 1,000+ prompts that cover concepts ranging from detailed human activity to animals, physics, unusual subjects and more — with broad coverage across different motion levels. Movie Gen Audio Bench is a first-of-its-kind benchmark aimed at evaluating video-to-audio and (text+video)-to-audio generation. It includes 527 generated videos and associated sound effects and music prompts covering a diverse set of ambient environments and sound effects. To enable fair and easy comparison to our models for future works, these new benchmarks include non cherry-picked generated videos and audio from Movie Gen. In releasing these new benchmarks we hope to promote fair & extensive evaluations in media generation research to enable greater progress in this field.

AI at Meta

156,240 views • 1 year ago

🎥 Today we’re premiering Meta Movie Gen: the most advanced media foundation models to-date. Developed by AI research teams at Meta, Movie Gen delivers state-of-the-art results across a range of capabilities. We’re excited for the potential of this line of research to usher in entirely new possibilities for casual creators and creative professionals alike. More details and examples of what Movie Gen can do ➡️
🛠️ Movie Gen models and capabilities
Movie Gen Video: 30B parameter transformer model that can generate high-quality and high-definition images and videos from a single text prompt.
Movie Gen Audio: A 13B parameter transformer model that can take a video input along with optional text prompts for controllability to generate high-fidelity audio synced to the video. It can generate ambient sound, instrumental background music and foley sound, delivering state-of-the-art results in audio quality, video-to-audio alignment and text-to-audio alignment.
Precise video editing: Using a generated or existing video and accompanying text instructions as an input it can perform localized edits such as adding, removing or replacing elements, or global changes like background or style changes.
Personalized videos: Using an image of a person and a text prompt, the model can generate a video with state-of-the-art results on character preservation and natural movement in video.
We’re continuing to work closely with creative professionals from across the field to integrate their feedback as we work towards a potential release. We look forward to sharing more on this work and the creative possibilities it will enable in the future.

AI at Meta

2,264,113 views • 1 year ago

Meet MoshiVis🎙️🖼️, the first open-source real-time speech model that can talk about images! It sees, understands, and talks about images, naturally and out loud. Voice interaction with a compact model endowed with visual understanding opens up new applications, from audio description for the visually impaired to visual access to information. Try it out 👉 Blog post 👉

kyutai

47,924 views • 1 year ago

Today we’re excited to unveil a new generation of Segment Anything Models: 1️⃣ SAM 3 enables detecting, segmenting and tracking of objects across images and videos, now with short text phrases and exemplar prompts. 🔗 Learn more about SAM 3: 2️⃣ SAM 3D brings the model collection into the 3rd dimension to enable precise reconstruction of 3D objects and people from a single 2D image. 🔗 Learn more about SAM 3D: These models offer innovative capabilities and unique tools for developers and researchers to create, experiment and uplevel media workflows.

AI at Meta

1,087,548 views • 7 months ago

Type a sentence, get any sound - from talking cats to singing saxophones. Brilliant release by NVIDIA ✨
NVIDIA just unveiled Fugatto, a groundbreaking 2.5B parameter audio AI model that can generate and transform any combination of music, voices, and sounds using text prompts and audio inputs. Fugatto could ultimately allow developers and creators to bring sounds to life simply by inputting text prompts.
→ The model demonstrates unique capabilities like creating hybrid sounds (trumpet barking), changing accents/emotions in voices, and allowing fine-grained control over sound transitions. It was trained on millions of audio samples using 32 NVIDIA H100 GPUs.
👨‍🔧 Architecture
Built as a foundational generative transformer model leveraging NVIDIA's previous work in speech modeling and audio understanding. The training process involved creating a specialized blended dataset containing millions of audio samples.
→ ComposableART's Innovation in Audio Control: Introduces a novel technique allowing combination of instructions that were only seen separately during training. Users can blend different audio attributes and control their intensity.
→ Temporal Interpolation Capabilities: Enables generation of evolving soundscapes with precise control over transitions. Can create dynamic audio sequences like rainstorms fading into birdsong at dawn.
→ Processes both text and audio inputs flexibly, enabling tasks like removing instruments from songs or modifying specific audio characteristics while preserving others.
→ Shows capabilities beyond its training data, creating entirely new sound combinations through interaction between different trained abilities.
🔍 Real-world Applications
→ Allows rapid prototyping of musical ideas, style experimentation, and real-time sound creation during studio sessions.
→ Enables dynamic audio asset generation matching gameplay situations, reducing pre-recorded audio requirements.
→ Can modify voice characteristics for language learning applications, allowing content delivery in familiar voices.
NVIDIA AI Developer

Rohan Paul

96,354 views • 1 year ago

Today we’re launching Stable Audio 2.5: The first audio model built for enterprise-grade sound production 🔊 Audio influences brand engagement by 86%, but few enterprises are leveraging audio as an extension of their brand, making customized sound an untapped differentiator. Stable Audio 2.5 is purpose-built for this opportunity to create customizable, high-quality audio at scale, with capabilities that include:
▶️ Improved musical composition: Generate full songs with multi-part structure, meaning a clear intro, middle, and outro.
▶️ Audio inpainting: Input audio, select where the track should start, and the model uses the context to generate the rest of the track.
▶️ Customization: Our team can fine-tune Stable Audio 2.5 to help enterprises create the right sound for their brand.
▶️ Faster inference: The model can generate up to three-minute long tracks in under two seconds on a GPU, outputting in just eight steps (compared to ~50 in the previous model).
You can learn more here 👉

Stability AI

67,491 views • 9 months ago

Along with text, images, video and code, Gemini is able to process raw audio signal end-to-end. 🔊 It can listen to and understand speech, making it not only useful for transcription but a model that has a much more nuanced perception of its environment. ↓

Google DeepMind

140,150 views • 2 years ago

As part of Meta Movie Gen, we trained a 13B parameter audio generation model that can take a video + optional text prompts to generate high quality audio — including ambient sound, foley & instrumental background music — all synced to the video. Details ➡️

AI at Meta

76,692 views • 1 year ago

Today we're announcing the open-source release of HunyuanVideo-Foley, our new end-to-end Text-Video-to-Audio (TV2A) framework for generating high-fidelity audio.🚀 This tool empowers creators in video production, filmmaking, and game development to generate professional-grade audio that precisely aligns with visual dynamics and semantic context, addressing key challenges in V2A generation.🔊
Key Innovations:
🔹 Exceptional Generalization: Trained on a massive 100k-hour multimodal dataset, the model generates contextually-aware soundscapes for a wide range of scenes, from natural landscapes to animated shorts.
🔹 Balanced Multimodal Response: Our innovative multimodal diffusion transformer (MMDiT) architecture ensures the model balances video and text cues, generating rich, layered sound effects that capture every detail, from the main subject to subtle background elements.
🔹 High-Fidelity Audio: Using a Representation Alignment (REPA) loss function and a powerful Audio VAE, we've improved generation stability and now produce professional-grade audio, free of noise and inconsistencies.
HunyuanVideo-Foley achieves SOTA on multiple benchmarks, surpassing all open-source models in audio quality, visual-semantic alignment, and temporal alignment.
👉Try it now: 🌐Project Page: 🔗Code: 📄Technical Report: 🤗Hugging Face:

Tencent Hy

122,539 views • 10 months ago

Starting today you can try our new foundation research model for audio generation. The demo includes Zero shot TTS, Text to sound effects, Infilling and more! Try Audiobox ➡️

AI at Meta

515,618 views • 2 years ago

Introducing Meta Perception Language Model (PLM): an open & reproducible vision-language model tackling challenging visual tasks. Learn more about how PLM can help the open source community build more capable computer vision systems. Read the research paper, and download the code and dataset:

AI at Meta

94,330 views • 1 year ago

Introducing Poe Apps: a new, easy way to create and use visual interfaces into any combination of the 100+ text, image, video, and audio models on Poe. (1/5)

Poe

66,708 views • 1 year ago

Bark Text-to-Audio Model Full Text Input: "Why was six afraid of seven?" Ignore Bark's "I'm done with this input" token and tell Bark to just keep generating more audio anyway.

Jonathan Fly 👾

461,816 views • 3 years ago

How to train a model that actually understands both audio and text like Voxtral from Mistral AI? Here is a quick video walkthrough of the paper.

Sophia Yang, Ph.D.

49,801 views • 11 months ago

Introducing 𝘾𝙡𝙞𝙥𝘼𝙣𝙮𝙩𝙝𝙞𝙣𝙜, the first-ever multimodal AI clipping that lets you clip any moment from any video using visual, audio, and sentiment cues. 🔥 Analyze everything in any video ClipAnything uses state-of-the-art video understanding that analyzes each frame through visual, audio, and sentiment cues, identifying objects, scenes, actions, sounds, emotions, texts, and more. ⭐️ Use natural language prompts to find any moments You can use natural language prompts to find any scene, action, character, event, emotional moment, viral topic, and more. 🌶️ Create viral clips from any video Extending beyond talking-head videos, ClipAnything can clip any type of video, such as vlogs, sports, TV shows, news, music & videos with little to no dialogue. To learn more about it, please go to ClipAnything webpage:

OpusClip

74,286 views • 1 year ago