AK's banner
AK's profile picture

AK

@_akhaliq449,077 subscribers

AI research paper tweets, ML @Gradio (acq. by @HuggingFace 🤗) dm for promo ,submit papers here: https://t.co/UzmYN5XOCi

Shorts

AI Generative fill with memes

AI Generative fill with memes

991,509 views

vibe coding AI apps for free has never been easier 100% open source app, DeepSite on Hugging Face

vibe coding AI apps for free has never been easier 100% open source app, DeepSite on Hugging Face

394,983 views

3D Gaussian Splatting for Real-Time Radiance Field Rendering paper page: Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods inevitably trade off speed for quality. For unbounded and complete scenes (rather than isolated objects) and 1080p resolution rendering, no current method can achieve real-time display rates. We introduce three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and importantly allow high-quality real-time (>= 30 fps) novel-view synthesis at 1080p resolution. First, starting from sparse points produced during camera calibration, we represent the scene with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space; Second, we perform interleaved optimization/density control of the 3D Gaussians, notably optimizing anisotropic covariance to achieve an accurate representation of the scene; Third, we develop a fast visibility-aware rendering algorithm that supports anisotropic splatting and both accelerates training and allows realtime rendering. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets.

3D Gaussian Splatting for Real-Time Radiance Field Rendering paper page: Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods inevitably trade off speed for quality. For unbounded and complete scenes (rather than isolated objects) and 1080p resolution rendering, no current method can achieve real-time display rates. We introduce three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and importantly allow high-quality real-time (>= 30 fps) novel-view synthesis at 1080p resolution. First, starting from sparse points produced during camera calibration, we represent the scene with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space; Second, we perform interleaved optimization/density control of the 3D Gaussians, notably optimizing anisotropic covariance to achieve an accurate representation of the scene; Third, we develop a fast visibility-aware rendering algorithm that supports anisotropic splatting and both accelerates training and allows realtime rendering. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets.

633,035 views

Stable Diffusion AI Deepfake De-Aged Harrison Ford SD+ControlNet+EbSynth+Fusion reddit thread:

Stable Diffusion AI Deepfake De-Aged Harrison Ford SD+ControlNet+EbSynth+Fusion reddit thread:

551,491 views

Apples or Hamsters? 🍎🐹, Kling AI

Apples or Hamsters? 🍎🐹, Kling AI

291,116 views

Can Vision-Language Models Solve the Shell Game? paper:

Can Vision-Language Models Solve the Shell Game? paper:

38,440 views

OmniSVG announced on Hugging Face A Unified Scalable Vector Graphics Generation Model

OmniSVG announced on Hugging Face A Unified Scalable Vector Graphics Generation Model

128,474 views

DMax Aggressive Parallel Decoding for dLLMs paper:

DMax Aggressive Parallel Decoding for dLLMs paper:

23,473 views

MMaDA-Parallel Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

MMaDA-Parallel Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

66,029 views

Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity propose Mind-Video that learns spatiotemporal information from continuous fMRI data of the cerebral cortex progressively through masked brain modeling, multimodal contrastive learning with spatiotemporal attention, and co-training with an augmented Stable Diffusion model that incorporates network temporal inflation paper page:

Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity propose Mind-Video that learns spatiotemporal information from continuous fMRI data of the cerebral cortex progressively through masked brain modeling, multimodal contrastive learning with spatiotemporal attention, and co-training with an augmented Stable Diffusion model that incorporates network temporal inflation paper page:

255,195 views

new stealth model Carrot 🥕 now available as default model in anycoder for vibe coding made a working gemma-3-270m chatbot in transformers.js, one shot

new stealth model Carrot 🥕 now available as default model in anycoder for vibe coding made a working gemma-3-270m chatbot in transformers.js, one shot

75,249 views

NeuRBF: A Neural Fields Representation with Adaptive Radial Basis Functions paper page: present a novel type of neural fields that uses general radial bases for signal representation. State-of-the-art neural fields typically rely on grid-based representations for storing local neural features and N-dimensional linear kernels for interpolating features at continuous query points. The spatial positions of their neural features are fixed on grid nodes and cannot well adapt to target signals. Our method instead builds upon general radial bases with flexible kernel position and shape, which have higher spatial adaptivity and can more closely fit target signals. To further improve the channel-wise capacity of radial basis functions, we propose to compose them with multi-frequency sinusoid functions. This technique extends a radial basis to multiple Fourier radial bases of different frequency bands without requiring extra parameters, facilitating the representation of details. Moreover, by marrying adaptive radial bases with grid-based ones, our hybrid combination inherits both adaptivity and interpolation smoothness. We carefully designed weighting schemes to let radial bases adapt to different types of signals effectively. Our experiments on 2D image and 3D signed distance field representation demonstrate the higher accuracy and compactness of our method than prior arts. When applied to neural radiance field reconstruction, our method achieves state-of-the-art rendering quality, with small model size and comparable training speed.

NeuRBF: A Neural Fields Representation with Adaptive Radial Basis Functions paper page: present a novel type of neural fields that uses general radial bases for signal representation. State-of-the-art neural fields typically rely on grid-based representations for storing local neural features and N-dimensional linear kernels for interpolating features at continuous query points. The spatial positions of their neural features are fixed on grid nodes and cannot well adapt to target signals. Our method instead builds upon general radial bases with flexible kernel position and shape, which have higher spatial adaptivity and can more closely fit target signals. To further improve the channel-wise capacity of radial basis functions, we propose to compose them with multi-frequency sinusoid functions. This technique extends a radial basis to multiple Fourier radial bases of different frequency bands without requiring extra parameters, facilitating the representation of details. Moreover, by marrying adaptive radial bases with grid-based ones, our hybrid combination inherits both adaptivity and interpolation smoothness. We carefully designed weighting schemes to let radial bases adapt to different types of signals effectively. Our experiments on 2D image and 3D signed distance field representation demonstrate the higher accuracy and compactness of our method than prior arts. When applied to neural radiance field reconstruction, our method achieves state-of-the-art rendering quality, with small model size and comparable training speed.

194,437 views

chat with papers for any arXiv link to HF paper you can now chat using Hugging Chat All Hugging Face Papers now include a built-in assistant, powered by HuggingChat and the Hugging Face MCP server. It helps you quickly understand papers by answering questions, summarizing key ideas, and providing context as you browse the latest research

chat with papers for any arXiv link to HF paper you can now chat using Hugging Chat All Hugging Face Papers now include a built-in assistant, powered by HuggingChat and the Hugging Face MCP server. It helps you quickly understand papers by answering questions, summarizing key ideas, and providing context as you browse the latest research

39,895 views

Show-o One Single Transformer to Unify Multimodal Understanding and Generation discuss: We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model.

Show-o One Single Transformer to Unify Multimodal Understanding and Generation discuss: We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model.

123,982 views

Fin-R1 is out on Hugging Face A Large Language Model for Financial Reasoning through Reinforcement Learning

Fin-R1 is out on Hugging Face A Large Language Model for Financial Reasoning through Reinforcement Learning

88,985 views

Google presents MobileDiffusion Subsecond Text-to-Image Generation on Mobile Devices paper page: MobileDiffusion achieves a remarkable sub-second inference speed for generating a 512 × 512 image on mobile devices, establishing a new state of the art.

Google presents MobileDiffusion Subsecond Text-to-Image Generation on Mobile Devices paper page: MobileDiffusion achieves a remarkable sub-second inference speed for generating a 512 × 512 image on mobile devices, establishing a new state of the art.

150,538 views

MoCapAnything Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

MoCapAnything Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

36,954 views

Break-A-Scene: Extracting Multiple Concepts from a Single Image introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method paper page:

Break-A-Scene: Extracting Multiple Concepts from a Single Image introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method paper page:

154,507 views

STEVE-1: A Generative Model for Text-to-Behavior in Minecraft paper page: Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces an instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1, demonstrating that the unCLIP approach, utilized in DALL-E 2, is also effective for creating instruction-following sequential decision-making agents. STEVE-1 is trained in two steps: adapting the pretrained VPT model to follow commands in MineCLIP's latent space, then training a prior to predict latent codes from text. This allows us to finetune VPT through self-supervised behavioral cloning and hindsight relabeling, bypassing the need for costly human text annotations. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 costs just $60 to train and can follow a wide range of short-horizon open-ended text and visual instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. All resources, including our model weights, training scripts, and evaluation tools are made available for further research.

STEVE-1: A Generative Model for Text-to-Behavior in Minecraft paper page: Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces an instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1, demonstrating that the unCLIP approach, utilized in DALL-E 2, is also effective for creating instruction-following sequential decision-making agents. STEVE-1 is trained in two steps: adapting the pretrained VPT model to follow commands in MineCLIP's latent space, then training a prior to predict latent codes from text. This allows us to finetune VPT through self-supervised behavioral cloning and hindsight relabeling, bypassing the need for costly human text annotations. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 costs just $60 to train and can follow a wide range of short-horizon open-ended text and visual instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. All resources, including our model weights, training scripts, and evaluation tools are made available for further research.

144,704 views

Videos

Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold paper page:
0:22

Sensitive content

This media may contain sensitive content.

_akhaliq's profile picture

Fixing things with AI

AK

1,883,907 views • 3 years ago

_akhaliq's profile picture

Microsoft just released TRELLIS 2

AK

293,285 views • 5 months ago

_akhaliq's profile picture

AI is taking over

AK

1,608,115 views • 3 years ago