🔊New NVIDIA paper: Audio-SDS🔊 We repurpose Score Distillation Sampling (SDS) for audio, turning any pretrained audio diffusion model into a tool for diverse tasks, including source separation, impact synthesis & more. 🎧 Demos, audio examples, paper:
39,375 views • 1 year ago • via X (Twitter)
17 comments

Intuitively, our update moves the audio in a direction that increases its probability given the prompt: we noise and denoise the audio with our diffusion model, then “nudge” our audio towards the denoised result by propagating the update through our differentiable rendering to our audio parameters.
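
The noise, denoise, and nudge loop above can be sketched on a toy problem. This is a minimal illustration, not the paper's implementation: `toy_noise_predictor` is a hypothetical stand-in for a pretrained audio diffusion model (it assumes the "data distribution" is a single known target clip), the renderer is a linear mix of three sinusoids, and the noise schedule and step size are arbitrary choices.

```python
import numpy as np

def make_basis(n_samples=256, n_params=3):
    # Columns are orthonormal sinusoids; the "renderer" mixes them linearly.
    t = np.arange(n_samples) / n_samples
    basis = np.stack([np.sin(2 * np.pi * (k + 1) * t) for k in range(n_params)], axis=1)
    return basis / np.linalg.norm(basis, axis=0)

def render(theta, basis):
    # Differentiable rendering: parameters -> audio. Its Jacobian is `basis`.
    return basis @ theta

def toy_noise_predictor(x_t, alpha, sigma, target):
    # ASSUMPTION: stand-in for a pretrained model. This is the exact noise
    # prediction when the data distribution is a point mass at `target`.
    return (x_t - alpha * target) / sigma

def sds_step(theta, basis, target, rng, lr=0.2):
    x = render(theta, basis)
    t = rng.uniform(0.2, 0.8)                      # random noise level
    alpha, sigma = np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)
    eps = rng.standard_normal(x.shape)
    x_t = alpha * x + sigma * eps                  # noise the rendered audio
    eps_hat = toy_noise_predictor(x_t, alpha, sigma, target)
    grad_audio = eps_hat - eps                     # SDS-style update in audio space
    grad_theta = basis.T @ grad_audio              # propagate through the renderer
    return theta - lr * grad_theta

def optimize(theta_init, basis, target, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.array(theta_init, dtype=float)
    for _ in range(steps):
        theta = sds_step(theta, basis, target, rng)
    return theta
```

Calling `optimize(np.zeros(3), basis, render(theta_star, basis))` with `basis = make_basis()` should drive the parameters toward `theta_star`, since the toy predictor always points back at the target clip.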

We propose three novel audio tasks: ① FM Synthesis, ② Physical Impact Synthesis, and ③ Prompt-Guided Source Separation. This image briefly summarizes the use case, optimizable parameters, rendering function, and parameter update.

① FM Synthesis: A toy setup where we generate settings aligning with prompts like “kick drum, bass, reverb” using sine oscillators modulating each other’s frequency as in a synthesizer. We visualize the final optimized parameters as the dial settings on a synthesizer instrument's user interface.
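
As a rough picture of the kind of renderer involved, here is a two-operator FM voice in NumPy. It is only a sketch: the parameter names (`carrier_hz`, `mod_hz`, `mod_index`) and the defaults are assumptions for illustration, and the modulation is written in the phase-modulation form that many "FM" synthesizers use in practice.

```python
import numpy as np

def fm_render(carrier_hz, mod_hz, mod_index, amp=1.0, duration_s=0.5, sample_rate=16000):
    """Render one sine carrier whose phase is modulated by one sine modulator."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    modulator = np.sin(2 * np.pi * mod_hz * t)
    # mod_index controls how strongly the modulator bends the carrier.
    return amp * np.sin(2 * np.pi * carrier_hz * t + mod_index * modulator)

# Example: a low, slightly gritty tone.
audio = fm_render(carrier_hz=55.0, mod_hz=110.0, mod_index=2.0)
```

Every operation here is differentiable in the three synthesis parameters, which is what lets a prompt-driven update adjust them like dials.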

② Physical Impact Synthesis: We generate impacts consistent with prompts like “hitting pot with wooden spoon” by convolving an impact with learned object and reverb impulses. We learn the parametrized forms of the object and reverb impulses.
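
The convolution chain can be sketched as follows. The specific parametrizations are assumptions chosen for illustration and are not taken from the thread: the object impulse is a sum of exponentially damped sinusoids, and the reverb impulse is a direct path plus exponentially decaying noise.

```python
import numpy as np

def modal_object_ir(freqs_hz, decays_per_s, gains, n_samples, sample_rate):
    # ASSUMPTION: object response modeled as a sum of damped sinusoidal modes.
    t = np.arange(n_samples) / sample_rate
    modes = [g * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
             for f, d, g in zip(freqs_hz, decays_per_s, gains)]
    return np.sum(modes, axis=0)

def noise_reverb_ir(decay_per_s, n_samples, sample_rate, seed=0):
    # ASSUMPTION: reverb modeled as a direct path plus decaying noise tail.
    rng = np.random.default_rng(seed)
    t = np.arange(n_samples) / sample_rate
    ir = 0.1 * rng.standard_normal(n_samples) * np.exp(-decay_per_s * t)
    ir[0] = 1.0
    return ir

def render_impact(impact, object_ir, reverb_ir):
    # Impact excitation -> object resonance -> room reverb.
    return np.convolve(np.convolve(impact, object_ir), reverb_ir)

sr = 16000
impact = np.array([1.0, 0.5, 0.25])  # short excitation burst
obj = modal_object_ir([440.0, 1230.0], [30.0, 60.0], [1.0, 0.4], 2000, sr)
rev = noise_reverb_ir(8.0, 4000, sr)
audio = render_impact(impact, obj, rev)
```

In this sketch, the mode frequencies, decays, gains, and reverb decay are the quantities an optimizer would tune.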

③ Prompt-Guided Source Separation: Prompt-conditioned source separation for a given audio clip, such as separating “sax …” and “cars …” from a music recording made on a road, by applying the Audio-SDS update to each channel while forcing the sum of channels to reconstruct the audio.
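
One way to picture "a per-channel update plus a sum constraint" is a gradient step that combines a prompt-driven term with a reconstruction penalty. This is a sketch under stated assumptions: the constraint is enforced softly with a squared-error penalty (the thread does not say how it is enforced), and `prompt_grads` is a placeholder for whatever per-channel update a diffusion model would supply.

```python
import numpy as np

def separation_step(sources, mixture, prompt_grads, lr=0.1, recon_weight=1.0):
    """One update of K source channels, shape (K, n_samples).

    prompt_grads: hypothetical per-channel prompt-driven gradients, same shape
    as `sources` (zeros in the demo below).
    """
    residual = sources.sum(axis=0) - mixture
    # Gradient of recon_weight * ||sum_k s_k - mixture||^2 w.r.t. each channel;
    # it is identical for every channel, so it broadcasts over axis 0.
    recon_grad = 2.0 * recon_weight * residual
    return sources - lr * (prompt_grads + recon_grad)

rng = np.random.default_rng(0)
mixture = rng.standard_normal(1024)
sources = np.zeros((2, 1024))
for _ in range(50):
    sources = separation_step(sources, mixture, np.zeros_like(sources))
```

With the prompt term switched off, the channels simply split the mixture evenly; the prompt-driven term is what would push each channel toward its own description.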

Modifications to SDS for Audio Diffusion: We use 🅰 an augmented Decoder-SDS in audio space, 🅱 a spectrogram emphasis to better weight transients, and 🅲️ multiple denoising steps to increase fidelity. This image highlights these in red in the detailed overview of our update.
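
To give a feel for why comparing signals in the spectrogram domain helps with transients, here is a multi-resolution STFT magnitude distance in NumPy. It is an assumed, generic formulation for illustration only; the window sizes, hop lengths, and squared-error form are not taken from the thread.

```python
import numpy as np

def stft_magnitude(x, win, hop):
    # Frame the signal, apply a Hann window, and take FFT magnitudes.
    window = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop:i * hop + win] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def multiscale_spec_distance(x, y, wins=(64, 256, 1024)):
    # Short windows localize clicks and onsets in time; long windows resolve pitch.
    total = 0.0
    for win in wins:
        hop = win // 4
        diff = stft_magnitude(x, win, hop) - stft_magnitude(y, win, hop)
        total += np.mean(diff ** 2)
    return total / len(wins)

t = np.arange(4096) / 16000
tone = np.sin(2 * np.pi * 440.0 * t)
with_click = tone.copy()
with_click[2000] += 5.0   # a single-sample transient
distance = multiscale_spec_distance(tone, with_click)
```

Both inputs must be at least as long as the largest window (1024 samples in this sketch).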

Results on Fully-Automatic In-the-Wild Source Separation: We demonstrate a pipeline that takes a video from the internet, captions the audio with a model (like AudioCaps), and provides the caption to an LLM assistant that suggests source decompositions. We run our method on the suggested decompositions.

Results on Tuning FM Synthesizers & Impact Synthesis: We improve CLAP scores over training for prompts, along with qualitative results. Impact synthesis shows improved performance on impact-oriented prompts.
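
For readers unfamiliar with the metric, a CLAP-style score is the cosine similarity between an audio embedding and a text embedding. The sketch below shows only that final step; the embeddings are made-up placeholder vectors, since producing real ones requires a pretrained CLAP model that is not shown here.

```python
import numpy as np

def clap_style_score(audio_embedding, text_embedding):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(audio_embedding, dtype=float)
    t = np.asarray(text_embedding, dtype=float)
    return float(a @ t / (np.linalg.norm(a) * np.linalg.norm(t)))

# Placeholder vectors standing in for model outputs (hypothetical values).
score = clap_style_score([0.2, 0.9, 0.1], [0.1, 1.0, 0.0])
```

A score that rises over training indicates the rendered audio is moving closer to the prompt in the embedding space.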

Results on Prompt-Guided Source Separation: We report an improved SDR to ground-truth sources when available and show improved CLAP scores after training.
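
For reference, the basic signal-to-distortion ratio compares the energy of the ground-truth source with the energy of the estimation error, in decibels. This sketch uses the plain definition; the thread does not say which SDR variant (for example, a scale-invariant one) was reported.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    # eps guards against division by zero for a perfect estimate.
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))

t = np.arange(1000) / 1000
reference = np.sin(2 * np.pi * 5.0 * t)
estimate = 1.1 * reference          # 10% amplitude error
value = sdr_db(reference, estimate)  # about 20 dB
```

Higher is better: every 10 dB corresponds to a tenfold drop in error energy relative to the source.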

This project was led by the great work of @jrichterpowell, along with Antonio Torralba. See more work from the @NVIDIA Spatial Intelligence Lab: Work supported indirectly by @MIT_CSAIL @VectorInst #nvidia #mit

⚠️ Limitations ⚠️ Audio-Model Bias: We rely on Stable Audio Open, so when this struggles, e.g., on rare instruments, speech, audio without silence at the end, or out-of-domain SFX, our method can have difficulties. Other diffusion models can help here. Clip-Length Budget: We optimized on ≤10 s clips; minute-scale audio may have artifacts or blow up memory. A hierarchical/windowed Audio-SDS could help here.

🔭 Next stops for Audio-SDS ① Working with longer, >minute-scale audio ② Non-text conditioning—tempo, spatial information, etc. ③ Leveraging stereo generation ④ New tasks: learning physical parameters, VR SFX, and beyond ⑤ Drop in other pretrained backbones

🚀 Vision of the Future: Content designers easily use one video + audio diffusion backbone with SDS-style updates to nudge any differentiable task—impacts, lighting, cloth, fluids—until the joint model says “looks & sounds right” given powerful user controls, like text.

💡 SDS treats any differentiable parameter set as optimizable from a prompt. Prompt-guided source separation emerged when we brainstormed novel uses. We hope similarly practical tasks (e.g., automatic Foley layering?) will surface as the community experiments.

Our work is inspired by and builds on the SDS update of DreamFusion ( @poolio, @ajayjain, @jon_barron, @BenMildenHall), and related updates (VSD @zhengyiWang, SDI @ottogin1, @ocariz__, @vincesitzmann, SJC @DuXiaodan, @RaymondYeh, many more!)

We find a new set of use-cases for Stable Audio Open (@jordiponsdotme, @StabilityAI, @huggingface) or other pretrained audio models (AudioLDM @LiuHaohe, @ZehuaChenICL, @markplumbley, and more)


