MusicHiFi: Fast High-Fidelity Stereo Vocoding

Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder.

AK

460,011 subscribers

27,285 views • 2 years ago • via X (Twitter)

Science & Technology, Music

6 Comments

AK • 2 years ago

Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial

AK • 2 years ago

networks (GANs) that convert low-resolution mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth expansion, and upmixes to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture

AK • 2 years ago

and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We

AK • 2 years ago

evaluate our approach using both objective and subjective listening tests and find our approach yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work.
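The "downmix-compatible" property claimed for the mono-to-stereo upmixer can be illustrated with a mid/side construction. This is only a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not say how the constraint is enforced, the function names `upmix_mid_side` and `downmix` are hypothetical, and the `side` values are made-up stand-ins for whatever a learned generator would predict. The point is that if the left and right channels are built as mono plus and minus a side signal, averaging them returns the mono input exactly, whatever the side signal is.

```python
from typing import List, Tuple


def upmix_mid_side(mono: List[float], side: List[float]) -> Tuple[List[float], List[float]]:
    """Build left/right channels from a mono (mid) signal and a side signal.

    Hypothetical helper: left = mono + side, right = mono - side.
    """
    if len(mono) != len(side):
        raise ValueError("mono and side must have the same length")
    left = [m + s for m, s in zip(mono, side)]
    right = [m - s for m, s in zip(mono, side)]
    return left, right


def downmix(left: List[float], right: List[float]) -> List[float]:
    """Average the two channels back to mono."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]


# Toy samples; the side signal is an assumed placeholder for a model output.
mono = [0.0, 0.5, -0.25, 1.0]
side = [0.125, -0.25, 0.5, 0.0]

left, right = upmix_mid_side(mono, side)
recovered = downmix(left, right)  # equals mono for any choice of side
```

In this construction the side signal controls only the stereo width, so scaling it up or down changes spatialization without altering the mono content.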

AK • 2 years ago

paper page:

Nicholas J. Bryan • 2 years ago

Thank you @_akhaliq! Check out more from @__gzhu__ at

Related Videos

This is a big day. Meta is open-sourcing AudioCraft. You can now generate incredible music and sounds with a single prompt. It includes the most performant Generative AI Model (audio) on the market, the "Llama" of Audio. The research framework contains the weights and code of these models: ▸ MusicGen: controllable text-to-music model. ▸ AudioGen: text-to-sound model. ▸ EnCodec: high fidelity neural audio codec. ▸ Multi Band Diffusion: An EnCodec compatible decoder using diffusion. This is going to tremendously speed up audio research 👏

Lior Alexander

231,604 views • 2 years ago

Stable Diffusion Moment of Audio?? Ace-Step Audio Model just dropped and is now natively supported in ComfyUI. ACE-Step is an open-source music generation model jointly developed by ACE Studio and StepFun.

ComfyUI

23,495 views • 1 year ago

As part of Meta Movie Gen, we trained a 13B parameter audio generation model that can take a video + optional text prompts to generate high quality audio — including ambient sound, foley & instrumental background music — all synced to the video. Details ➡️

AI at Meta

76,692 views • 1 year ago

Seed-Music A Unified Framework for High Quality and Controlled Music Generation discuss: We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: controlled music generation and post-production editing. For controlled music generation, our system enables vocal music generation with performance controls from multi-modal inputs, including style descriptions, audio references, musical scores, and voice prompts. For post-production editing, it offers interactive tools for editing lyrics and vocal melodies directly in the generated audio.

AK

32,967 views • 1 year ago

Today we're announcing the open-source release of HunyuanVideo-Foley, our new end-to-end Text-Video-to-Audio (TV2A) framework for generating high-fidelity audio.🚀 This tool empowers creators in video production, filmmaking, and game development to generate professional-grade audio that precisely aligns with visual dynamics and semantic context, addressing key challenges in V2A generation.🔊 Key Innovations: 🔹Exceptional Generalization: Trained on a massive 100k-hour multimodal dataset, the model generates contextually-aware soundscapes for a wide range of scenes, from natural landscapes to animated shorts. 🔹Balanced Multimodal Response: Our innovative multimodal diffusion transformer (MMDiT) architecture ensures the model balances video and text cues, generating rich, layered sound effects that capture every detail—from the main subject to subtle background elements. 🔹High-Fidelity Audio: Using a Representation Alignment (REPA) loss function and a powerful Audio VAE, we've improved generation stability and producing professional-grade audio, free of noise and inconsistencies. HunyuanVideo-Foley achieves SOTA on multiple benchmarks, surpassing all open-source models in audio quality, visual-semantic alignment, and temporal alignment. 👉Try it now: 🌐Project Page: 🔗Code: 📄Technical Report: 🤗Hugging Face:

Tencent Hy

122,539 views • 10 months ago

Introducing ElevenLabs Image & Video - the best audio, image and video models now in one platform. Generate with leading models like Veo, Sora, Kling, Wan and Seedance, then enhance with the highest quality voices, music, and sound effects.

ElevenLabs

149,194 views • 7 months ago

🎥 Today we’re premiering Meta Movie Gen: the most advanced media foundation models to-date. Developed by AI research teams at Meta, Movie Gen delivers state-of-the-art results across a range of capabilities. We’re excited for the potential of this line of research to usher in entirely new possibilities for casual creators and creative professionals alike. More details and examples of what Movie Gen can do ➡️ 🛠️ Movie Gen models and capabilities Movie Gen Video: 30B parameter transformer model that can generate high-quality and high-definition images and videos from a single text prompt. Movie Gen Audio: A 13B parameter transformer model that can take a video input along with optional text prompts for controllability to generate high-fidelity audio synced to the video. It can generate ambient sound, instrumental background music and foley sound — delivering state-of-the-art results in audio quality, video-to-audio alignment and text-to-audio alignment. Precise video editing: Using a generated or existing video and accompanying text instructions as an input it can perform localized edits such as adding, removing or replacing elements — or global changes like background or style changes. Personalized videos: Using an image of a person and a text prompt, the model can generate a video with state-of-the-art results on character preservation and natural movement in video. We’re continuing to work closely with creative professionals from across the field to integrate their feedback as we work towards a potential release. We look forward to sharing more on this work and the creative possibilities it will enable in the future.

AI at Meta

2,264,308 views • 1 year ago

Ovi is out on Hugging Face Twin Backbone Cross-Modal Fusion for Audio-Video Generation Ovi is a veo-3 like, video+audio generation model that simultaneously generates both video and audio content from text or text+image inputs. Video+Audio Generation: Generate synchronized video and audio content simultaneously Flexible Input: Supports text-only or text+image conditioning 5-second Videos: Generates 5-second videos at 24 FPS, area of 720×720, at various aspect ratios (9:16, 16:9, 1:1, etc)

AK

23,082 views • 9 months ago

AudioLDM 2: text to audio/music/speech generation Code/models/paper coming soon

AK

202,455 views • 2 years ago

3DTopia-XL High-Quality 3D PBR Asset Generation via Primitive Diffusion demo: model: 3DTopia-XL scales high-quality 3D asset generation using Diffusion Transformer (DiT) built upon an expressive and efficient 3D representation, PrimX. The denoising process takes 5 seconds to generate a 3D PBR asset from text/image input which is ready for the graphics pipeline to use.

AK

87,086 views • 1 year ago

Google announces MusicLM: a model to generate music from text. Here are some crazy things it can do: 1. Given audio of a melody, it can generate new music inspired by that melody customized by prompts! Here's someone humming bella ciao turned into a cappella chorus, EDM, etc.

bleedingedge.ai

379,914 views • 3 years ago

Loopy Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency paper page: With the introduction of diffusion-based video generation techniques, audio-conditioned human video generation has recently achieved significant breakthroughs in both the naturalness of motion and the synthesis of portrait details. Due to the limited control of audio signals in driving human motion, existing methods often add auxiliary spatial signals to stabilize movements, which may compromise the naturalness and freedom of motion. In this paper, we propose an end-to-end audio-only conditioned video diffusion model named Loopy. Specifically, we designed an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information from the data to learn natural motion patterns and improving audio-portrait movement correlation. This method removes the need for manually specified spatial motion templates used in existing methods to constrain motion during inference. Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models, delivering more lifelike and high-quality results across various scenarios.

AK

128,803 views • 1 year ago

🔥 Google Veo 3 lets you generate videos with high-fidelity camera moves and built-in audio.

Magnific

12,021,532 views • 1 year ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 2 years ago

Can AI truly edit audio, not just generate it? 🎧 Tencent Hy, in collaboration with SJTU, SII, NTU, TJU, ZODA, PKU, FDU, and other collaborators, introduces MMAE. MMAE--A Massive Multitask Audio Editing Benchmark, is the first comprehensive evaluation benchmark for speech and audio "Banana🍌" Instead of simply requiring the AI to "generate" audio, it demands that the AI understand an existing audio clip and precisely modify it according to natural language instructions—altering what needs to be changed while leaving the rest untouched. Current models show an Exact Match Rate (EMR) below 5%, revealing a major gap in reliable audio editing. MMAE includes: ✅ 2,000 high-fidelity samples from real-world scenarios ✅ 17,741 fine-grained rubric evaluation items ✅ 7 modality settings across sound, music, speech and their mixtures ✅ 6 task complexity from basic modifications to multi-hop reasoning and multi-round editing ✅ 8 operation types across local and global granularities How to use: arXiv: GitHub: HuggingFace: Demo:

Tencent Hy

18,207 views • 23 days ago

Using Stable Audio 3 to generate variations of an existing loop. Unconditional generation (no prompt), renoising the latents to 0.5, and just using different seeds seems to generate a nice neighbourhood around the original. Generally keeps the harmonic context and feel.

Tero Parviainen

10,779 views • 1 month ago

Mirage is LIVE Generate energetic, high-converting ads with people that don’t exist — complete with animated body language and micro-expressions — using Mirage, the first foundation model built to generate UGC-style content. Get started with a script or an audio file, then specify the look of your spokesperson, their background, outfit, objects, and even emotion. It’s never been easier to iterate and scale ad production with Mirage, now available in Captions Ad Studio.

Mirage

144,658 views • 1 year ago

Our new Character-3 model makes it easy to generate expressive animated characters with natural movement. Just upload an image, add audio, and bring your scenes to life.

Hedra

218,134 views • 1 year ago

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can be transformed into a generative reconstruction model that directly outputs a high-quality 3D Gaussian Splatting representation through self-distillation, without requiring real-world training data. Check out our results in the video (wait for dynamic scenes in the second half!) : Project Page: Code and Models: Paper:

Sherwin Bahmani

66,523 views • 9 months ago

11k people are using the ElevenLabs MCP server in Claude. It connects to your ElevenLabs account and lets you generate speech, sound effects, and music directly in Claude. Generate audio with ElevenLabs without friction.

ElevenLabs Developers

28,604 views • 3 months ago