AK

@_akhaliq • 508,843 subscribers

AI research paper tweets, ML @Gradio (acq. by @HuggingFace 🤗). DM for promo; submit papers here: https://t.co/UzmYN5XOCi

Shorts

AI Generative fill with memes

991,509 views

vibe coding AI apps for free has never been easier 100% open source app, DeepSite on Hugging Face

395,040 views

3D Gaussian Splatting for Real-Time Radiance Field Rendering paper page: Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods inevitably trade off speed for quality. For unbounded and complete scenes (rather than isolated objects) and 1080p resolution rendering, no current method can achieve real-time display rates. We introduce three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and importantly allow high-quality real-time (>= 30 fps) novel-view synthesis at 1080p resolution. First, starting from sparse points produced during camera calibration, we represent the scene with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space; Second, we perform interleaved optimization/density control of the 3D Gaussians, notably optimizing anisotropic covariance to achieve an accurate representation of the scene; Third, we develop a fast visibility-aware rendering algorithm that supports anisotropic splatting and both accelerates training and allows realtime rendering. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets.

633,532 views

Stable Diffusion AI Deepfake De-Aged Harrison Ford SD+ControlNet+EbSynth+Fusion reddit thread:

551,588 views

Apples or Hamsters? 🍎🐹, Kling AI

291,522 views

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation paper page: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.

375,123 views

OmniSVG announced on Hugging Face A Unified Scalable Vector Graphics Generation Model

128,540 views

Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity propose Mind-Video that learns spatiotemporal information from continuous fMRI data of the cerebral cortex progressively through masked brain modeling, multimodal contrastive learning with spatiotemporal attention, and co-training with an augmented Stable Diffusion model that incorporates network temporal inflation paper page:

255,231 views

MMaDA-Parallel Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

66,127 views

Can Vision-Language Models Solve the Shell Game? paper:

38,459 views

NeuRBF: A Neural Fields Representation with Adaptive Radial Basis Functions paper page: present a novel type of neural fields that uses general radial bases for signal representation. State-of-the-art neural fields typically rely on grid-based representations for storing local neural features and N-dimensional linear kernels for interpolating features at continuous query points. The spatial positions of their neural features are fixed on grid nodes and cannot well adapt to target signals. Our method instead builds upon general radial bases with flexible kernel position and shape, which have higher spatial adaptivity and can more closely fit target signals. To further improve the channel-wise capacity of radial basis functions, we propose to compose them with multi-frequency sinusoid functions. This technique extends a radial basis to multiple Fourier radial bases of different frequency bands without requiring extra parameters, facilitating the representation of details. Moreover, by marrying adaptive radial bases with grid-based ones, our hybrid combination inherits both adaptivity and interpolation smoothness. We carefully designed weighting schemes to let radial bases adapt to different types of signals effectively. Our experiments on 2D image and 3D signed distance field representation demonstrate the higher accuracy and compactness of our method than prior arts. When applied to neural radiance field reconstruction, our method achieves state-of-the-art rendering quality, with small model size and comparable training speed.

194,469 views

new stealth model Carrot 🥕 now available as default model in anycoder for vibe coding made a working gemma-3-270m chatbot in transformers.js, one shot

75,249 views

Show-o One Single Transformer to Unify Multimodal Understanding and Generation discuss: We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model.

124,048 views

Google presents MobileDiffusion Subsecond Text-to-Image Generation on Mobile Devices paper page: MobileDiffusion achieves a remarkable sub-second inference speed for generating a 512 × 512 image on mobile devices, establishing a new state of the art.

150,538 views

Fin-R1 is out on Hugging Face A Large Language Model for Financial Reasoning through Reinforcement Learning

89,100 views

chat with papers for any arXiv link to HF paper you can now chat using Hugging Chat All Hugging Face Papers now include a built-in assistant, powered by HuggingChat and the Hugging Face MCP server. It helps you quickly understand papers by answering questions, summarizing key ideas, and providing context as you browse the latest research

40,029 views

DMax Aggressive Parallel Decoding for dLLMs paper:

23,503 views

Break-A-Scene: Extracting Multiple Concepts from a Single Image introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method paper page:

154,511 views

STEVE-1: A Generative Model for Text-to-Behavior in Minecraft paper page: Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces an instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1, demonstrating that the unCLIP approach, utilized in DALL-E 2, is also effective for creating instruction-following sequential decision-making agents. STEVE-1 is trained in two steps: adapting the pretrained VPT model to follow commands in MineCLIP's latent space, then training a prior to predict latent codes from text. This allows us to finetune VPT through self-supervised behavioral cloning and hindsight relabeling, bypassing the need for costly human text annotations. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 costs just $60 to train and can follow a wide range of short-horizon open-ended text and visual instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. All resources, including our model weights, training scripts, and evaluation tools are made available for further research.

144,783 views

Videos

grok 3 build a endless runner style game where a hugging face collects GPUs

8,664,439 views • 1 year ago

20 second tutorial on making apps with Grok 3 and deploying on Hugging Face example showing gradio app with halftone effect

4,896,068 views • 1 year ago

Vibe coding with xAI grok 4 and Qwen Image on my phone

3,151,227 views • 11 months ago

Wan-Streamer v0.1 End-to-end Real-time Interactive Foundation Models

117,254 views • 24 days ago

LingBot-Video is out on Hugging Face MoE-based video foundation model built for embodied intelligence 30B params, only 3B active at inference Augmented with 70K hours of embodied data on top of large-scale internet video pretraining

40,680 views • 11 days ago

DeepSeek-R1 write a script for a bouncing yellow ball within a Rhombicosidodecahedron, make sure to handle collision detection properly. make the Rhombicosidodecahedron slowly rotate. make sure ball stays within the Rhombicosidodecahedron. implement it in p5.js

1,478,629 views • 1 year ago

o3-mini prompt: make a app called chatgpt ad maker that takes in a image and does a black and white dotted image effect with sliders to adjust dot size

1,395,023 views • 1 year ago

grok 3 prompt: I’d like to make a p5.js simulation of a sphere made up of ASCII numbers, rotating. The closest numbers should be pure white, and the farthest ones should fade to gray, on a black background

1,325,652 views • 1 year ago

LingBot-World 2.0 (Infinity) is out on Hugging Face interactive world model with: Hour-long generation with zero quality drift Rich actions & events: attack, cast spells, shoot, summon storms Agentic world: a Director Agent drives real-time world evolution 720p/60fps. Playable like a game

28,721 views • 11 days ago

Fixing things with AI

1,884,054 views • 3 years ago

OpenAI o3-mini just one shotted this prompt: write a script for 100 bouncing yellow balls within a sphere, make sure to handle collision detection properly. make the sphere slowly rotate. make sure balls stays within the sphere. implement it in p5.js

814,838 views • 1 year ago

AI is taking over

1,608,164 views • 3 years ago

SpatialLM just dropped on Hugging Face Large Language Model for Spatial Understanding

674,115 views • 1 year ago

Microsoft just released TRELLIS 2

294,647 views • 7 months ago

This is HUGE The AI App store is here Ask anything you want to do with AI With ~400k Apps, this is the best place to find the AI apps you need developers can build apps, users can try them out and find new apps with AI search

662,855 views • 1 year ago

GeoCode: Interpretable Shape Programs abs: project page:

1,428,749 views • 3 years ago

OSWorld2.0 Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

29,125 views • 19 days ago

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model with Gradio demo local demo: This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence. Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion. Despite achieving reasonable results, these approaches face challenges in maintaining temporal consistency throughout the animation due to the lack of temporal modeling and poor preservation of reference identity. In this work, we introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving reference image faithfully, and improving animation fidelity. To achieve this, we first develop a video diffusion model to encode temporal information. Second, to maintain the appearance coherence across frames, we introduce a novel appearance encoder to retain the intricate details of the reference image. Leveraging these two innovations, we further employ a simple video fusion technique to encourage smooth transitions for long video animation. Empirical results demonstrate the superiority of our method over baseline approaches on two benchmarks. Notably, our approach outperforms the strongest baseline by over 38% in terms of video fidelity on the challenging TikTok dancing dataset. Code and model will be made available.

810,578 views • 2 years ago

Training AI to Play Pokemon with Reinforcement Learning by Peter Whidden github: youtube:

837,796 views • 2 years ago