CosmicMan: A Text-to-Image Foundation Model for Humans

We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and

AK

457,510 subscribers

46,778 views • 2 years ago • via X (Twitter)

Science & Technology

9 Comments

AK • 2 years ago

text-image misalignment for humans, CosmicMan enables generating photo-realistic human images with meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions. At the heart of CosmicMan's success are the new reflections and

AK • 2 years ago

perspectives on data and models: (1) We found that data quality and a scalable data production flow are essential for the final results from trained models. Hence, we propose a new data production paradigm, Annotate Anyone, which serves as a perpetual data flywheel to produce

AK • 2 years ago

high-quality data with accurate yet cost-effective annotations over time. Based on this, we constructed a large-scale dataset, CosmicMan-HQ 1.0, with 6 Million high-quality real-world human images in a mean resolution of 1488x1255, and attached with precise text annotations

AK • 2 years ago

deriving from 115 Million attributes in diverse granularities. (2) We argue that a text-to-image foundation model specialized for humans must be pragmatic -- easy to integrate into downstream tasks while effective in producing high-quality human images. Hence, we propose to

AK • 2 years ago

model the relationship between dense text descriptions and image pixels in a decomposed manner, and present the Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly decomposes the cross-attention features in existing text-to-image diffusion

AK • 2 years ago

model, and enforces attention refocusing without adding extra modules. Through Daring, we show that explicitly discretizing continuous text space into several basic groups that align with human body structure is the key to tackling the misalignment problem in a breeze.
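The idea of discretizing text into groups aligned with body structure can be illustrated with a minimal sketch. This is not the Daring implementation: the group names, the keyword lists, and the `group_phrases` helper are all hypothetical, chosen only to show how description phrases might be bucketed by body region before any per-group attention handling.

```python
from collections import defaultdict

# Hypothetical body-structure groups and keywords (assumptions for
# illustration; the abstract does not list the actual groups).
GROUP_KEYWORDS = {
    "head": ("hair", "face", "hat", "glasses"),
    "upper_body": ("shirt", "jacket", "coat", "sweater"),
    "lower_body": ("jeans", "skirt", "trousers", "shorts"),
    "feet": ("shoes", "boots", "sneakers"),
}


def group_phrases(phrases):
    """Assign each phrase to the first group whose keyword appears in it."""
    groups = defaultdict(list)
    for phrase in phrases:
        words = phrase.lower().split()
        for group, keywords in GROUP_KEYWORDS.items():
            if any(keyword in words for keyword in keywords):
                groups[group].append(phrase)
                break
        else:
            # No keyword matched: keep the phrase in a catch-all group.
            groups["other"].append(phrase)
    return dict(groups)


result = group_phrases([
    "long black hair",
    "red wool sweater",
    "blue jeans",
    "white sneakers",
    "standing outdoors",
])
print(result)
```

In a real training setup each group would presumably be tied to its own region of the cross-attention maps; that step is omitted here because the abstract does not specify it.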

AK • 2 years ago

paper page:

AK • 2 years ago

daily papers:

Bobcat • 2 years ago

👀

Related Videos

🔥Text-to-3D Foundation Model🔥 We are excited to announce #3DTopia, a generalist 🧊text-to-3D🧊 foundation model, which produces high-quality 3D assets within 5 minutes - Code: - Video:

Ziwei Liu

62,424 views • 2 years ago

StarVector is out on Hugging Face StarVector is a foundation model for generating Scalable Vector Graphics (SVG) code from images and text. It utilizes a Vision-Language Modeling architecture to understand both visual and textual inputs, enabling high-quality vectorization and text-guided SVG creation.

AK

254,259 views • 1 year ago

Multi-modal #LLMs understand a lot about humans. But do they understand our 3D pose? We train #PoseGPT to estimate, generate, and reason about 3D human pose (#SMPL) in images and text. This is the first true foundation model for understanding 3D humans.

Michael Black

81,398 views • 2 years ago

Introducing StyleDrop, a model that allows a significantly higher level of stylized text-to-image synthesis by using a few style reference images that describe the style for text-to-image generation, bypassing the burden of text prompt engineering. More→

Google AI

80,357 views • 2 years ago

Current 3D generative models are slow and low quality. We present GRM, a large-scale model that reconstructs 3D Gaussians in 0.1s and generates high-quality 3D assets from text or single images in a few seconds. Demo: 1/4

Gordon Wetzstein

19,189 views • 2 years ago

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles (including photorealistic portraits, comics, and vinyl figures), it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the accelerated version with meanflow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas, like posters with slogans or multi-panel comics, into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 views • 9 months ago

Designing an Encoder for Fast Personalization of Text-to-Image Models TL;DR: use an encoder to personalize a text-to-image model to new concepts with a single image and 5-15 tuning steps abs: project page:

AK

165,158 views • 3 years ago

Can a small academic team build a strong text-to-image model using only public datasets? Introducing i1: a simple, fully open recipe for strong text-to-image models

Zhuang Liu

65,784 views • 5 days ago

Humans learn and improve from failures. Similarly, foundation models adapt based on human feedback. Can we leverage this failure understanding to enhance robotics systems that use foundation models? Introducing AHA—a vision-language model for detecting and reasoning over failures in robotic manipulation. Project page: 🧵Thread👇 Aha!

Jiafei Duan

48,739 views • 1 year ago

DiffSplat Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation DiffSplat is a generative framework to synthesize 3D Gaussian Splats from text prompts & single-view images in ⚡️ 1~2 seconds. It is fine-tuned directly from a pretrained text-to-image diffusion model.

AK

38,416 views • 1 year ago

Exciting milestones in our generative AI research: Emu Video, which lets you create high quality videos from a text prompt, and Emu Edit, which enables detailed image editing based on your instructions. These new models are built on Emu, our foundation model for image generation and technology from them will underpin new creative features across our apps next year. Try it out: Emu Video: Emu Edit:

Boz

110,720 views • 2 years ago

The first truly open-source audio-video model. LTX-2 is a DiT-based foundation model with all core video generation capabilities in one unified model. Designed to run locally on consumer GPUs. - text-to-video - image-to-video - and video-to-video modes 100% open-source.

Akshay 🚀

66,012 views • 5 months ago

Today, we are releasing Stable Video Diffusion, our first foundation model for generative AI video based on the image model, Stable Diffusion. As part of this research preview, the code, weights, and research paper are now available. Additionally, today you can sign up for our waitlist to access a new upcoming web experience featuring a Text-To-Video interface. To access the model & sign up for our waitlist, visit our website here:

Stability AI

1,024,438 views • 2 years ago

We’re excited to announce the open-source release of HunyuanImage 3.0, the largest and most powerful open-source text-to-image model to date, with over 80 billion total parameters, of which 13 billion are activated per token during inference. The effect is completely comparable to the industry’s flagship closed-source model.🚀🚀🚀 HunyuanImage 3.0 originates from our internally developed native multimodal large language model, with fine-tuning and post-training focused on text-to-image generation. This unique foundation gives the model a powerful set of capabilities: ✅Reason with world knowledge ✅Understand complex, thousand-word prompts ✅Generate precise text within images Different from traditional DiT architecture image generation models, HunyuanImage 3.0’s MoE architecture uses a Transfusion-based approach to deeply couple Diffusion and LLM training for a single, powerful system. Built on Hunyuan-A13B, HunyuanImage 3.0 was trained on a massive dataset: 5 billion image-text pairs, video frames, interleaved image-text data, and 6 trillion tokens of text corpora. This hybrid training across multimodal generation, understanding, and LLM capabilities allows the model to seamlessly integrate multiple tasks. Whether you're an illustrator, designer, or creator, this is built to slash your workflow from hours to minutes. HunyuanImage 3.0 can generate intricate text, detailed comics, expressive emojis, and lively, engaging illustrations for educational content. The current release focuses solely on text-to-image generation and future updates will include image-to-image, image editing, multi-turn interaction, and more. 👉🏻Try it now: 🔗GitHub: 🤗Hugging Face:

Tencent Hy

412,572 views • 9 months ago

MusicLM: Generating Music From Text Presents MusicLM, a model for generating high-fidelity music from text. MusicLM generates music at 24 kHz that remains consistent over several minutes. proj: abs: data:

Aran Komatsuzaki

163,083 views • 3 years ago

FET/ASI Crypto Drops Bombshell Update: Text to Image Generation ASI-1 Mini🔥 ASI-1 Mini is the first model in the ASI:Train family. Today they introduce Text-to-Image Generation, enabling users to create high-quality images from textual descriptions. Fetch.ai Artificial Superintelligence Alliance

cryptoFOXXY

16,613 views • 1 year ago

Today, we are excited to release FLUX.1 Tools, a suite of models designed to add control and steerability to our base text-to-image model FLUX.1, enabling the modification and re-creation of real and generated images. Learn more in our blogpost:

Black Forest Labs

349,920 views • 1 year ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 2 years ago

Introducing Frames: An image generation model offering unprecedented stylistic control. Frames is our newest foundation model for image generation, marking a big step forward in stylistic control and visual fidelity. With Frames, you can begin to architect worlds that represent very specific points of view and aesthetic characteristics. See below for examples. World 1089: Mise-en-scène (1/11)

Runway

263,330 views • 1 year ago

Physical Intelligence's π₀, a general-purpose robot foundation model that combines Internet-scale vision-language pretraining with robot interaction data to execute tasks. They aim "to develop foundation models that can control any robot to perform any task" Autonomous demos:

The Humanoid Hub

196,055 views • 1 year ago