Video yükleniyor...

Video Yüklenemedi

Bu video yüklenirken bir sorun oluştu. Bu geçici bir ağ sorunundan kaynaklanıyor olabilir veya video kullanılamıyor olabilir.

Ana Sayfaya Dön

[1/n] Do distinct large models admit a simple map that aligns their embedding spaces? We show that across multimodal contrastive models—trained on different data and architectures—an orthogonal map aligns image embeddings. Strikingly, the same map also aligns text embeddings.

Sharut Gupta

2,736 subscribers

37,310 görüntüleme • 5 ay önce •via X (Twitter)

Anya Rossi• Live Now

Private livecam show

0 Yorum

Yorum bulunmuyor

Orijinal gönderinin yorumları burada görünecek

Benzer Videolar

New short course: Building Multimodal Search and RAG", by Weaviate AI Database's Sebastia(N_) Witalec ✊🏽✊🏾✊🏿. Contrastive learning is used to train models to map vectors into an embedding space by pulling similar concepts closer together and pushing dissimilar concepts away from each other. This technique is also used to train multimodal embedding models that capture semantic similarity across different modalities like text, images, and audio. These multimodal embeddings can be used to build multimodal search and RAG systems. In this course, you'll learn how contrastive learning works, and how to add multimodality to RAG – so your models can draw on diverse, relevant context to answer questions. For example, a query about a financial report might synthesize information from text snippets, graphs, tables, and slides. You will also learn how visual instruction tuning lets you integrate image understanding into language models, and build a multi-vector recommender system using Weaviate’s open-source vector database. Please sign up here:

New short course: Building Multimodal Search and RAG", by Weaviate AI Database's Sebastia(N_) Witalec ✊🏽✊🏾✊🏿. Contrastive learning is used to train models to map vectors into an embedding space by pulling similar concepts closer together and pushing dissimilar concepts away from each other. This technique is also used to train multimodal embedding models that capture semantic similarity across different modalities like text, images, and audio. These multimodal embeddings can be used to build multimodal search and RAG systems. In this course, you'll learn how contrastive learning works, and how to add multimodality to RAG – so your models can draw on diverse, relevant context to answer questions. For example, a query about a financial report might synthesize information from text snippets, graphs, tables, and slides. You will also learn how visual instruction tuning lets you integrate image understanding into language models, and build a multi-vector recommender system using Weaviate’s open-source vector database. Please sign up here:

Andrew Ng

104,371 görüntüleme • 2 yıl önce

Consistency models, CTMs, shortcut models, align your flow, mean flow... What's the connection, and how should you learn them in practice? We show they're all different sides of the same coin connected by one central object: the flow map. 🧵(1/n)

Consistency models, CTMs, shortcut models, align your flow, mean flow... What's the connection, and how should you learn them in practice? We show they're all different sides of the same coin connected by one central object: the flow map. 🧵(1/n)

Nicholas Boffi

65,689 görüntüleme • 9 ay önce

I downloaded something like 300GB of open models and wrote a bunch of map-reduce style processing scripts to make this graph. It's plotting the distribution of weight values across a variety of popular open models, to show that models are almost entirely made up of small floats.

I downloaded something like 300GB of open models and wrote a bunch of map-reduce style processing scripts to make this graph. It's plotting the distribution of weight values across a variety of popular open models, to show that models are almost entirely made up of small floats.

Sam Rose

58,141 görüntüleme • 5 ay önce

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal! In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video. You'll build a full multimodal RAG pipeline that can chat about video content: - Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space. - Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions. - Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings. - Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning. Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed. Please sign up here!

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal! In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video. You'll build a full multimodal RAG pipeline that can chat about video content: - Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space. - Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions. - Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings. - Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning. Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed. Please sign up here!

Andrew Ng

107,825 görüntüleme • 1 yıl önce

Meta announces Movie Gen A Cast of Media Foundation Models We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user’s image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models

Meta announces Movie Gen A Cast of Media Foundation Models We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user’s image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models

AK

62,719 görüntüleme • 1 yıl önce

Explore state-of-the-art multimodal prompting in our new short course Large Multimodal Model Prompting with Gemini, taught by Erwin Huizenga in collaboration with Google Cloud. One interesting insight from this course: with multimodal models, prompt structure matters significantly. Placing text inputs, such as a patient's medical history, before image inputs, like an X-ray, can enhance the model's ability to contextualize and interpret visual data effectively. In other contexts, such as image captioning, you may get better results by putting the image first. Multimodal models behave differently than text-only LLMs, and effective prompting for models varies depending on the model you’re using. In this course you’ll learn how to effectively prompt Gemini models. Gemini's multimodal capabilities also enable new approaches in AI application development, for example: - The Gemini library handles various video formats (MP4, MOV, MPEG), streamlining applications using these formats. - Large context window (up to 1 million tokens) enables processing of extensive content, like analyzing multiple 50-minute videos simultaneously. - Function calling feature integrates real-time data (e.g., current exchange rates) into model responses. The course demonstrates building multimodal applications with real-world examples including document analyzers that reason across text and graphs simultaneously, video content extractors that find and timestamp specific information from multiple hours of footage, and automated expense report systems processing receipt images while cross-referencing company policies. Sign up here:

Explore state-of-the-art multimodal prompting in our new short course Large Multimodal Model Prompting with Gemini, taught by Erwin Huizenga in collaboration with Google Cloud. One interesting insight from this course: with multimodal models, prompt structure matters significantly. Placing text inputs, such as a patient's medical history, before image inputs, like an X-ray, can enhance the model's ability to contextualize and interpret visual data effectively. In other contexts, such as image captioning, you may get better results by putting the image first. Multimodal models behave differently than text-only LLMs, and effective prompting for models varies depending on the model you’re using. In this course you’ll learn how to effectively prompt Gemini models. Gemini's multimodal capabilities also enable new approaches in AI application development, for example: - The Gemini library handles various video formats (MP4, MOV, MPEG), streamlining applications using these formats. - Large context window (up to 1 million tokens) enables processing of extensive content, like analyzing multiple 50-minute videos simultaneously. - Function calling feature integrates real-time data (e.g., current exchange rates) into model responses. The course demonstrates building multimodal applications with real-world examples including document analyzers that reason across text and graphs simultaneously, video content extractors that find and timestamp specific information from multiple hours of footage, and automated expense report systems processing receipt images while cross-referencing company policies. Sign up here:

Andrew Ng

74,060 görüntüleme • 1 yıl önce

What if you could search 50M satellite image embeddings with no server, no database, and no API? Part 2 of Caleb Robinson and my series on Compressing Earth Embeddings is live at We binarized the global Clay v1.5 Sentinel-2 embeddings from 183GB → 7GB and built TerraBit — a retrieval demo that runs with zero backend. DuckDB-WASM, cleverly partitioned GeoParquet on S3, and brute-force Hamming in an in-memory index makes the experience feel too good to be true. If you also wondered how Google's TurboQuant fares on compressing earth embeddings -- don't worry, we got you covered in the blog -- hint: TurboQuant int4 is a solid choice

What if you could search 50M satellite image embeddings with no server, no database, and no API? Part 2 of Caleb Robinson and my series on Compressing Earth Embeddings is live at We binarized the global Clay v1.5 Sentinel-2 embeddings from 183GB → 7GB and built TerraBit — a retrieval demo that runs with zero backend. DuckDB-WASM, cleverly partitioned GeoParquet on S3, and brute-force Hamming in an in-memory index makes the experience feel too good to be true. If you also wondered how Google's TurboQuant fares on compressing earth embeddings -- don't worry, we got you covered in the blog -- hint: TurboQuant int4 is a solid choice

Isaac Corley

43,017 görüntüleme • 3 ay önce

Working on multimodal instruction tuning and finding it hard to scale? Building Web/GUI agents but data is too narrow? Introducing 🚀MultiUI: 7.3M multimodal instructions from 1M webpage UIs, offering diverse data to boost text-rich visual understanding. Key takeaways: 🌟WebUI-trained models show major gains in visual web understanding and agent tasks. 💻 🌟Models also generalize well to non-UI tasks like DocVQA/OCR. 📄 How it works: We generate multimodal instructions with a text LLM using structured text from webpage accessibility trees. We then pair them with UI screenshots, to train multimodal models. Homepage: Paper: Dataset: Model: Congrats to the student lead Junpeng Liu and the team Tianyue Ou Yifan Song Yuxiao Qu Chenyan Xiong Wenhu Chen Graham Neubig ! More details are in the following threads ⬇️

Working on multimodal instruction tuning and finding it hard to scale? Building Web/GUI agents but data is too narrow? Introducing 🚀MultiUI: 7.3M multimodal instructions from 1M webpage UIs, offering diverse data to boost text-rich visual understanding. Key takeaways: 🌟WebUI-trained models show major gains in visual web understanding and agent tasks. 💻 🌟Models also generalize well to non-UI tasks like DocVQA/OCR. 📄 How it works: We generate multimodal instructions with a text LLM using structured text from webpage accessibility trees. We then pair them with UI screenshots, to train multimodal models. Homepage: Paper: Dataset: Model: Congrats to the student lead Junpeng Liu and the team Tianyue Ou Yifan Song Yuxiao Qu Chenyan Xiong Wenhu Chen Graham Neubig ! More details are in the following threads ⬇️

Xiang Yue

57,699 görüntüleme • 1 yıl önce

Here's a copy/paste prompt recipe and vid showing exactly how to ask an LLM for an interactive map with satellite/map layers + a georeferencer that lets you see how old maps correspond with modern geography. Today the computer can’t make good print maps (that's your hill to climb ) but it can, with five bucks and twenty minutes, make good interactive maps. No software/GIS knowledge necessary, you just need a few nouns and an LLM. Scroll to the bottom for the repo/live map if you want those. I'm using Claude Code as an extension in VS Code but you can use the Claude CLI, Cursor, whatever. 1) Let's grab an old cadastral map and see who owned big tracts of a city; I found this an 1854 map of Niagara Falls, NY I found in the Library of Congress: , grabbed the .jp2, saved as a jpg from photoshop. 2) Let's ask Claude Code for a map. You can see exactly what I did in the video but my prompt, sans simple "hey it's busted" debugging, is written out in the following paragraphs. I explain the map-specific nouns in brackets. You can likely dump this whole thing in your LLM window and it'll work; I'd try plan mode + skip permissions. THE PROMPT Make an interactive map with MapLibre GL JS [maplibre is a javascript mapping library, a FOSS version of Mapbox GL JS. This lets us display tiled map data and arbitrary images on the map] Add basemap toggles with Esri satellite, Carto Positron, and OSM [these map layers require no API keys for light usage; Carto Positron is a nice road map layer and OSM is ugly but comprehensive] Add a globe/mercator projection toggle [I think the globe looks better at low zooms] Add a layer panel on the left with visibility checkboxes and delete buttons. Add a search box on the map that flies to results, with deletable pin markers [Makes this easy to get to your area of interest] Include an interactive local georeferencer: drop a JPG, pick ground control points on a zoomable/pannable image viewer, place them on the map, watch it warp with a progress bar centered on the map. [The georeferencer uses math ("affine transform"??) to match points on the old map to points on the new map; generally you click road intersections on the old map, match them on the new map, repeat a dozen times and everything aligns] The georeferenced map overlay defaults to 25% opacity with a slider above the control point list. [I want it easy to see the underlying modern geography] Add Export/import control point buttons [this saves the control points as a JSON so you can save and reimport your work] Add a button to export the warped image as a GeoTIFF with a .prj [In case you want to add the georeferenced image to a real GIS program like QGIS] Look up all relevant docs before starting [Claude sometimes uses outdated stuff] Split everything into separate HTML/CSS/JS files [Claude tends to pile everything in index.html, which is hard to read] Use Optima font, base color #FEFAF6 [I just like this style] Let me test with a local server [it serves it on a simple server so you can nav your host to localhost:8000 and try it out] Log all errors [so you don't have to play telephone with the LLM describing what's busted] 3) Once your LLM finishes, test it out in your browser; if it doesn't work, ask the LLM to check logs. Repeat 'til functional. 4) After this works on your computer, you can show it to everyone by hosting it on GitHub: prompt with "write a README explaining what everything does, add it to a new GitHub repo, deploy using GitHub pages, gimme the live URL" Here's what Claude made for me, try it yourself: • Upload the JPG in the repo, which is linked below • "Add GCP" • Click somewhere recognizable on the old map, like the tip of an island or a road intersection • Click the matching point on the new map • Repeat til you have least 3x points • Hit "georeference" • You'll see the old map atop the new map; if you want a better fit, delete bad points or add a dozen new ones, hit georeference again, repeat Repo: Is this map robust? Human-maintainable? Elegant? Performant? Secure? No, but *your* personal web map need not be. It just needs to work for *your* narrow use case, because it’s *your* map.

Here's a copy/paste prompt recipe and vid showing exactly how to ask an LLM for an interactive map with satellite/map layers + a georeferencer that lets you see how old maps correspond with modern geography. Today the computer can’t make good print maps (that's your hill to climb ) but it can, with five bucks and twenty minutes, make good interactive maps. No software/GIS knowledge necessary, you just need a few nouns and an LLM. Scroll to the bottom for the repo/live map if you want those. I'm using Claude Code as an extension in VS Code but you can use the Claude CLI, Cursor, whatever. 1) Let's grab an old cadastral map and see who owned big tracts of a city; I found this an 1854 map of Niagara Falls, NY I found in the Library of Congress: , grabbed the .jp2, saved as a jpg from photoshop. 2) Let's ask Claude Code for a map. You can see exactly what I did in the video but my prompt, sans simple "hey it's busted" debugging, is written out in the following paragraphs. I explain the map-specific nouns in brackets. You can likely dump this whole thing in your LLM window and it'll work; I'd try plan mode + skip permissions. THE PROMPT Make an interactive map with MapLibre GL JS [maplibre is a javascript mapping library, a FOSS version of Mapbox GL JS. This lets us display tiled map data and arbitrary images on the map] Add basemap toggles with Esri satellite, Carto Positron, and OSM [these map layers require no API keys for light usage; Carto Positron is a nice road map layer and OSM is ugly but comprehensive] Add a globe/mercator projection toggle [I think the globe looks better at low zooms] Add a layer panel on the left with visibility checkboxes and delete buttons. Add a search box on the map that flies to results, with deletable pin markers [Makes this easy to get to your area of interest] Include an interactive local georeferencer: drop a JPG, pick ground control points on a zoomable/pannable image viewer, place them on the map, watch it warp with a progress bar centered on the map. [The georeferencer uses math ("affine transform"??) to match points on the old map to points on the new map; generally you click road intersections on the old map, match them on the new map, repeat a dozen times and everything aligns] The georeferenced map overlay defaults to 25% opacity with a slider above the control point list. [I want it easy to see the underlying modern geography] Add Export/import control point buttons [this saves the control points as a JSON so you can save and reimport your work] Add a button to export the warped image as a GeoTIFF with a .prj [In case you want to add the georeferenced image to a real GIS program like QGIS] Look up all relevant docs before starting [Claude sometimes uses outdated stuff] Split everything into separate HTML/CSS/JS files [Claude tends to pile everything in index.html, which is hard to read] Use Optima font, base color #FEFAF6 [I just like this style] Let me test with a local server [it serves it on a simple server so you can nav your host to localhost:8000 and try it out] Log all errors [so you don't have to play telephone with the LLM describing what's busted] 3) Once your LLM finishes, test it out in your browser; if it doesn't work, ask the LLM to check logs. Repeat 'til functional. 4) After this works on your computer, you can show it to everyone by hosting it on GitHub: prompt with "write a README explaining what everything does, add it to a new GitHub repo, deploy using GitHub pages, gimme the live URL" Here's what Claude made for me, try it yourself: • Upload the JPG in the repo, which is linked below • "Add GCP" • Click somewhere recognizable on the old map, like the tip of an island or a road intersection • Click the matching point on the new map • Repeat til you have least 3x points • Hit "georeference" • You'll see the old map atop the new map; if you want a better fit, delete bad points or add a dozen new ones, hit georeference again, repeat Repo: Is this map robust? Human-maintainable? Elegant? Performant? Secure? No, but your personal web map need not be. It just needs to work for your narrow use case, because it’s your map.

Evan Applegate

15,772 görüntüleme • 4 ay önce

Google just proved that bigger isn't always better. Their 308M parameter model is outperforming models 2x its size. Google just released 𝗘𝗺𝗯𝗲𝗱𝗱𝗶𝗻𝗴𝗚𝗲𝗺𝗺𝗮, and it's proving that lightweight embedding models can punch way above their weight class. At just 308M parameters (578MB), it's the new state-of-the-art for models under 500M parameters across MTEB multilingual, English, and code benchmarks. But the really impressive part is that it ranks 8th overall on MTEB(Multilingual, v2) - that's 𝟭𝟳 𝗽𝗹𝗮𝗰𝗲𝘀 above the second-best sub-500M model, and it's delivering performance 𝗰𝗼𝗺𝗽𝗮𝗿𝗮𝗯𝗹𝗲 𝘁𝗼 𝗺𝗼𝗱𝗲𝗹𝘀 𝗻𝗲𝗮𝗿𝗹𝘆 𝗱𝗼𝘂𝗯𝗹𝗲 𝗶𝘁𝘀 𝘀𝗶𝘇𝗲. There are three key parts of their training recipe that sets it apart: 𝟭. 𝗘𝗻𝗰𝗼𝗱𝗲𝗿-𝗗𝗲𝗰𝗼𝗱𝗲𝗿 𝗜𝗻𝗶𝘁𝗶𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 Instead of starting from a decoder-only Gemma 3 model, they first adapted it to encoder-decoder, then used just the encoder. By basing EmbeddingGemma off an LLM that already has world and language understanding, it gives it a stronger starting point. 𝟮. 𝗧𝗵𝗿𝗲𝗲-𝗟𝗼𝘀𝘀 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴 They combine three different loss functions, instead of just having one: • Contrastive loss (NCE) with in-batch negatives and hardness weighting • Spread-out regularization to ensure embeddings utilize the full space (for quantization and ANN retrieval) • Embedding matching distillation from Gemini Embedding - not just learning from relevance scores, but directly aligning the embedding space with the teacher model 𝟯. 𝗠𝗼𝗱𝗲𝗹 𝗦𝗼𝘂𝗽𝗶𝗻𝗴 Rather than just averaging checkpoints from the same training run, they use optimization techniques to find multiple specialized training mixtures. Each mixture creates an "expert" model in different domains, and averaging all their parameters creates a final model that's actually better than individual models. Extras: • Matryoshka embeddings supporting 768, 512, 256, and 128 dimensions • Quantization-aware training - maintains quality even at int4 precision • 100+ languages from Gemma 3 pretraining • Exceptional performance on low-resource languages (check their XTREME-UP results) Is it the absolute best embedding model? No - Gemini Embedding still leads overall. But that's not really the point. EmbeddingGemma proves you can achieve state-of-the-art performance in a small package that's actually deployable on-device, in low-latency applications, and in resource-constrained environments. This makes good embeddings accessible for use cases that I'm seeing more and more: offline applications, privacy-sensitive deployments, and high-throughput scenarios where inference cost actually matters. Full paper: Shoutout to the EmbeddingGemma team at Google DeepMind for this awesome open source work 💙 and to Daniel Williams for helping me with this video! 🫶

Google just proved that bigger isn't always better. Their 308M parameter model is outperforming models 2x its size. Google just released 𝗘𝗺𝗯𝗲𝗱𝗱𝗶𝗻𝗴𝗚𝗲𝗺𝗺𝗮, and it's proving that lightweight embedding models can punch way above their weight class. At just 308M parameters (578MB), it's the new state-of-the-art for models under 500M parameters across MTEB multilingual, English, and code benchmarks. But the really impressive part is that it ranks 8th overall on MTEB(Multilingual, v2) - that's 𝟭𝟳 𝗽𝗹𝗮𝗰𝗲𝘀 above the second-best sub-500M model, and it's delivering performance 𝗰𝗼𝗺𝗽𝗮𝗿𝗮𝗯𝗹𝗲 𝘁𝗼 𝗺𝗼𝗱𝗲𝗹𝘀 𝗻𝗲𝗮𝗿𝗹𝘆 𝗱𝗼𝘂𝗯𝗹𝗲 𝗶𝘁𝘀 𝘀𝗶𝘇𝗲. There are three key parts of their training recipe that sets it apart: 𝟭. 𝗘𝗻𝗰𝗼𝗱𝗲𝗿-𝗗𝗲𝗰𝗼𝗱𝗲𝗿 𝗜𝗻𝗶𝘁𝗶𝗮𝗹𝗶𝘇𝗮𝘁𝗶𝗼𝗻 Instead of starting from a decoder-only Gemma 3 model, they first adapted it to encoder-decoder, then used just the encoder. By basing EmbeddingGemma off an LLM that already has world and language understanding, it gives it a stronger starting point. 𝟮. 𝗧𝗵𝗿𝗲𝗲-𝗟𝗼𝘀𝘀 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴 They combine three different loss functions, instead of just having one: • Contrastive loss (NCE) with in-batch negatives and hardness weighting • Spread-out regularization to ensure embeddings utilize the full space (for quantization and ANN retrieval) • Embedding matching distillation from Gemini Embedding - not just learning from relevance scores, but directly aligning the embedding space with the teacher model 𝟯. 𝗠𝗼𝗱𝗲𝗹 𝗦𝗼𝘂𝗽𝗶𝗻𝗴 Rather than just averaging checkpoints from the same training run, they use optimization techniques to find multiple specialized training mixtures. Each mixture creates an "expert" model in different domains, and averaging all their parameters creates a final model that's actually better than individual models. Extras: • Matryoshka embeddings supporting 768, 512, 256, and 128 dimensions • Quantization-aware training - maintains quality even at int4 precision • 100+ languages from Gemma 3 pretraining • Exceptional performance on low-resource languages (check their XTREME-UP results) Is it the absolute best embedding model? No - Gemini Embedding still leads overall. But that's not really the point. EmbeddingGemma proves you can achieve state-of-the-art performance in a small package that's actually deployable on-device, in low-latency applications, and in resource-constrained environments. This makes good embeddings accessible for use cases that I'm seeing more and more: offline applications, privacy-sensitive deployments, and high-throughput scenarios where inference cost actually matters. Full paper: Shoutout to the EmbeddingGemma team at Google DeepMind for this awesome open source work 💙 and to Daniel Williams for helping me with this video! 🫶

Victoria Slocum

21,592 görüntüleme • 8 ay önce

I'm excited to release Latent Scope 1.0! Thanks to Fable clearing 2 years of backlog, you can now point your workhorse model at a directory of images, a text corpus or even existing RAG db and get back an interactive map for looking at your data through the eyes of a transformer. Latent Scope started as a set of scripts with a frontend to try and make it less tedious to run the process of projecting & clustering embeddings. Now we let agents do the tedium and the library is essentially a skill that reliably outputs data maps that you can explore. In this release we enabled a ton of stuff, starting with performance increases by switching the backend to LanceDB and enabling cuML GPU acceleration. We added support for images, and late interaction (ColBERT models) as well as 3D UMAP support. We also added Leland McInnes new EVoC clustering method. All that is topped off with a skill definition and an Impeccable powered UI revamp. See the thread for more videos and demos!

I'm excited to release Latent Scope 1.0! Thanks to Fable clearing 2 years of backlog, you can now point your workhorse model at a directory of images, a text corpus or even existing RAG db and get back an interactive map for looking at your data through the eyes of a transformer. Latent Scope started as a set of scripts with a frontend to try and make it less tedious to run the process of projecting & clustering embeddings. Now we let agents do the tedium and the library is essentially a skill that reliably outputs data maps that you can explore. In this release we enabled a ton of stuff, starting with performance increases by switching the backend to LanceDB and enabling cuML GPU acceleration. We added support for images, and late interaction (ColBERT models) as well as 3D UMAP support. We also added Leland McInnes new EVoC clustering method. All that is topped off with a skill definition and an Impeccable powered UI revamp. See the thread for more videos and demos!

Ian Johnson 🔬🤖

17,648 görüntüleme • 16 gün önce

Over the last few months, we’ve been thinking about how to learn from “off-domain” data - data from non-robot sources like video or simulation. These data sources are not quite good enough to learn policies (even monolithic VLA models) directly, but they still contain lots of information that can be useful for generalizable robot control. How can we develop robot learning models that are able to make use of this type of data for generalizable control? In new work, that we call HAMSTER, we show that VLMs can be useful for enabling robotic learning from off-domain data, but specifically when used through hierarchical VLA architectures. We show that this class of models can learn generalizable robot policies for the real world from large-scale, off-domain data. A 🧵 (1/10)

Over the last few months, we’ve been thinking about how to learn from “off-domain” data - data from non-robot sources like video or simulation. These data sources are not quite good enough to learn policies (even monolithic VLA models) directly, but they still contain lots of information that can be useful for generalizable robot control. How can we develop robot learning models that are able to make use of this type of data for generalizable control? In new work, that we call HAMSTER, we show that VLMs can be useful for enabling robotic learning from off-domain data, but specifically when used through hierarchical VLA architectures. We show that this class of models can learn generalizable robot policies for the real world from large-scale, off-domain data. A 🧵 (1/10)

Abhishek Gupta

11,994 görüntüleme • 1 yıl önce

[CLIP] by Hand ✍️ The CLIP (Contrastive Language–Image Pre-training) model, a groundbreaking work by OpenAI, redefines the intersection of computer vision and natural language processing. It is the basis of all the multi-modal foundation models we see today. How does CLIP work? Goal: 🟨 Learn a shared embedding space for text and image [1] Given ↳ A mini batch of 3 text-image pairs ↳ OpenAI used 400 million text-image pairs to train its original CLIP model. Process 1st pair: "big table" [2] 🟪 Text → 2 Vectors (3D) ↳ Look up word embedding vectors using word2vec. [3] 🟩 Image → 2 Vectors (4D) ↳ Divide the image into two patches. ↳ Flatten each patch [4] Process other pairs ↳ Repeat [2]-[3] [5] 🟪 Text Encoder & 🟩 Image Encoder ↳ Encode input vectors into feature vectors ↳ Here, both encoders are simple one layer perceptron (linear + ReLU) ↳ In practice, the encoders are usually transformer models. [6] 🟪 🟩 Mean Pooling: 2 → 1 vector ↳ Average 2 feature vectors into a single vector by averaging across the columns ↳ The goal is to have one vector to represent each image or text [7] 🟪 🟩 -> 🟨 Projection ↳ Note that the text and image feature vectors from the encoders have different dimensions (3D vs. 4D). ↳ Use a linear layer to project image and text vectors to a 2D shared embedding space. 🏋️ Contrastive Pre-training 🏋️ [8] Prepare for MatMul ↳ Copy text vectors (T1,T2,T3) ↳ Copy the transpose of image vectors (I1,I2,I3) ↳ They are all in the 2D shared embedding space. [9] 🟦 MatMul ↳ Multiply T and I matrices. ↳ This is equivalent to taking dot product between every pair of image and text vectors. ↳ The purpose is to use dot product to estimate the similarity between a pair of image-text. [10] 🟦 Softmax: e^x ↳ Raise e to the power of the number in each cell ↳ To simplify hand calculation, we approximate e^□ with 3^□. [11] 🟦 Softmax: ∑ ↳ Sum each row for 🟩 image→🟪 text ↳ Sum each column for 🟪 text→ 🟩 image [12] 🟦 Softmax: 1 / sum ↳ Divide each element by the column sum to obtain a similarity matrix for 🟪 text→🟩 image ↳ Divide each element by the row sum to obtain a similarity matrix for 🟩 image→🟪 text [13] 🟥 Loss Gradients ↳ The "Targets" for the similarity matrices are Identity Matrices. ↳ Why? If I and T come from the same pair (i=j), we want the highest value, which is 1, and 0 otherwise. ↳ Apply the simple equation of [Similarity - Target] to compute gradients of for both directions. ↳ Why so simple? Because when Softmax and Cross-Entropy Loss are used together, the math magically works out that way. ↳ These gradients kick off the backpropagation process to update weights and biases of the encoders and projection layers (red borders).

[CLIP] by Hand ✍️ The CLIP (Contrastive Language–Image Pre-training) model, a groundbreaking work by OpenAI, redefines the intersection of computer vision and natural language processing. It is the basis of all the multi-modal foundation models we see today. How does CLIP work? Goal: 🟨 Learn a shared embedding space for text and image [1] Given ↳ A mini batch of 3 text-image pairs ↳ OpenAI used 400 million text-image pairs to train its original CLIP model. Process 1st pair: "big table" [2] 🟪 Text → 2 Vectors (3D) ↳ Look up word embedding vectors using word2vec. [3] 🟩 Image → 2 Vectors (4D) ↳ Divide the image into two patches. ↳ Flatten each patch [4] Process other pairs ↳ Repeat [2]-[3] [5] 🟪 Text Encoder & 🟩 Image Encoder ↳ Encode input vectors into feature vectors ↳ Here, both encoders are simple one layer perceptron (linear + ReLU) ↳ In practice, the encoders are usually transformer models. [6] 🟪 🟩 Mean Pooling: 2 → 1 vector ↳ Average 2 feature vectors into a single vector by averaging across the columns ↳ The goal is to have one vector to represent each image or text [7] 🟪 🟩 -> 🟨 Projection ↳ Note that the text and image feature vectors from the encoders have different dimensions (3D vs. 4D). ↳ Use a linear layer to project image and text vectors to a 2D shared embedding space. 🏋️ Contrastive Pre-training 🏋️ [8] Prepare for MatMul ↳ Copy text vectors (T1,T2,T3) ↳ Copy the transpose of image vectors (I1,I2,I3) ↳ They are all in the 2D shared embedding space. [9] 🟦 MatMul ↳ Multiply T and I matrices. ↳ This is equivalent to taking dot product between every pair of image and text vectors. ↳ The purpose is to use dot product to estimate the similarity between a pair of image-text. [10] 🟦 Softmax: e^x ↳ Raise e to the power of the number in each cell ↳ To simplify hand calculation, we approximate e^□ with 3^□. [11] 🟦 Softmax: ∑ ↳ Sum each row for 🟩 image→🟪 text ↳ Sum each column for 🟪 text→ 🟩 image [12] 🟦 Softmax: 1 / sum ↳ Divide each element by the column sum to obtain a similarity matrix for 🟪 text→🟩 image ↳ Divide each element by the row sum to obtain a similarity matrix for 🟩 image→🟪 text [13] 🟥 Loss Gradients ↳ The "Targets" for the similarity matrices are Identity Matrices. ↳ Why? If I and T come from the same pair (i=j), we want the highest value, which is 1, and 0 otherwise. ↳ Apply the simple equation of [Similarity - Target] to compute gradients of for both directions. ↳ Why so simple? Because when Softmax and Cross-Entropy Loss are used together, the math magically works out that way. ↳ These gradients kick off the backpropagation process to update weights and biases of the encoders and projection layers (red borders).

Tom Yeh

67,834 görüntüleme • 2 yıl önce

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 görüntüleme • 3 yıl önce

New short course: Open Source Models with Hugging Face 🤗, taught by Maria Khalusova, Marc Sun, and Younes Belkada! Hugging Face has been a game changer by letting you quickly grab any of hundreds of thousands of already-trained open source models to assemble into new applications. This course teaches you best practices for building this way, including how to search and choose among models. You’ll learn to use the Transformers library and walk through multiple models for text, audio, and image processing, including zero-shot image segmentation, zero-shot audio classification, and speech recognition. You'll also learn to use multimodal models for visual question answering, image search, and image captioning. Finally, you’ll learn how to demo what you build locally, on the cloud, or via an API using Gradio and Hugging Face Spaces. You can sign up here:

New short course: Open Source Models with Hugging Face 🤗, taught by Maria Khalusova, Marc Sun, and Younes Belkada! Hugging Face has been a game changer by letting you quickly grab any of hundreds of thousands of already-trained open source models to assemble into new applications. This course teaches you best practices for building this way, including how to search and choose among models. You’ll learn to use the Transformers library and walk through multiple models for text, audio, and image processing, including zero-shot image segmentation, zero-shot audio classification, and speech recognition. You'll also learn to use multimodal models for visual question answering, image search, and image captioning. Finally, you’ll learn how to demo what you build locally, on the cloud, or via an API using Gradio and Hugging Face Spaces. You can sign up here:

Andrew Ng

224,538 görüntüleme • 2 yıl önce

Super excited to share the last paper of my PhD: "Hallucination in World Models is Predictable and Preventable"✨ We train a 350M-param generative world model on a large dataset w/ 210 tasks and show that we can predict *when* hallucination happens and use that to fix it! 🧵1/n

Super excited to share the last paper of my PhD: "Hallucination in World Models is Predictable and Preventable"✨ We train a 350M-param generative world model on a large dataset w/ 210 tasks and show that we can predict when hallucination happens and use that to fix it! 🧵1/n

Nicklas Hansen

54,108 görüntüleme • 1 ay önce

JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models paper page: Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite and lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans will be ultimately dispatched to the goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. In our experiments, JARVIS-1 exhibits nearly perfect performances across over 200 varying tasks from the Minecraft Universe Benchmark, ranging from entry to intermediate levels. JARVIS-1 has achieved a completion rate of 12.5% in the long-horizon diamond pickaxe task. This represents a significant increase up to 5 times compared to previous records. Furthermore, we show that JARVIS-1 is able to self-improve following a life-long learning paradigm thanks to multimodal memory, sparking a more general intelligence and improved autonomy.

JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models paper page: Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite and lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans will be ultimately dispatched to the goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. In our experiments, JARVIS-1 exhibits nearly perfect performances across over 200 varying tasks from the Minecraft Universe Benchmark, ranging from entry to intermediate levels. JARVIS-1 has achieved a completion rate of 12.5% in the long-horizon diamond pickaxe task. This represents a significant increase up to 5 times compared to previous records. Furthermore, we show that JARVIS-1 is able to self-improve following a life-long learning paradigm thanks to multimodal memory, sparking a more general intelligence and improved autonomy.

AK

141,440 görüntüleme • 2 yıl önce

A map from 1531 reveals something that could rewrite all of human history. “The map maker tells us he uncovered material previously hidden in darkness … ” If you’ve never heard of the Orontius Finaeus map, this will blow your mind: Graham Hancock: “The Orontius Finaeus map shows Antarctica.” “This map was drawn in 1531.” “The problem is that our civilization didn’t discover Antarctica until 1820.” This map is a key piece of evidence that Hancock raises to support his theory of a lost, ancient civilization. “When I say a lost civilization, I do not mean a civilization like ours.” “I think they were very different civilization from ours.” “But they had conquered a number of peaks.” “One of those peaks was navigation and ocean seafaring.” “Hence the survival of maps, which show the world as it looked during the Ice Age.” “Another was astronomy.” “Another really important breakthrough evidenced by the ancient maps … is accurate relative longitudes.” “Particularly when we know that the map was based on older source maps.” “Had somebody found Antarctica long before we did and mapped it with extremely accurate relative longitudes?” “Our civilization didn’t crack the longitude problem until the mid-18th century.” “We didn’t get that until 1750, 1760s, there abouts, with Harrison’s chronometer.” “So finding good longitudes on very ancient maps is another puzzle that I don’t think archeology solved.” Graham Hancock Steven Bartlett

A map from 1531 reveals something that could rewrite all of human history. “The map maker tells us he uncovered material previously hidden in darkness … ” If you’ve never heard of the Orontius Finaeus map, this will blow your mind: Graham Hancock: “The Orontius Finaeus map shows Antarctica.” “This map was drawn in 1531.” “The problem is that our civilization didn’t discover Antarctica until 1820.” This map is a key piece of evidence that Hancock raises to support his theory of a lost, ancient civilization. “When I say a lost civilization, I do not mean a civilization like ours.” “I think they were very different civilization from ours.” “But they had conquered a number of peaks.” “One of those peaks was navigation and ocean seafaring.” “Hence the survival of maps, which show the world as it looked during the Ice Age.” “Another was astronomy.” “Another really important breakthrough evidenced by the ancient maps … is accurate relative longitudes.” “Particularly when we know that the map was based on older source maps.” “Had somebody found Antarctica long before we did and mapped it with extremely accurate relative longitudes?” “Our civilization didn’t crack the longitude problem until the mid-18th century.” “We didn’t get that until 1750, 1760s, there abouts, with Harrison’s chronometer.” “So finding good longitudes on very ancient maps is another puzzle that I don’t think archeology solved.” Graham Hancock Steven Bartlett

Holden Culotta

68,587 görüntüleme • 1 ay önce

In collaboration with Intel, our Depth Fusion showcases the power of our LDM3D diffusion model in generating 360° views from text prompts provided by the user. The LDM3D diffusion model generates a 2D RGB image and its corresponding relative depth map providing a complete RGBD representation corresponding to the text prompt. The LDM 3D model is a specialized version of the stable diffusion V 1.4 model that has been modified to fit both image and depth map data.The model was then fine tuned on a subset of the Laion400M data set - large scale image caption data set. The depth maps used to fine tune our model were generated by the DPTBeiT large 512 depth estimation model that provides highly accurate relative depth estimates for each pixel. We take the generated 2D RGB image and depth map and use them to compute a 360° projection using touchdesigner. Touchdesigner is a versatile platform that allows for the creation of immersive and interactive multimedia experiences. Our application harnesses the power of touchdesigner to bring the generated 360° views to life, providing users with a unique and engaging way to experience their text prompts, whether it’s a description of a tranquil forest, a noisy cityscape or a futuristic sci fi world. Our depth fusion can bring these concepts to life in a vivid and immersive detail. - Scottie Fox, VP Engineering Blockade Labs ScottieFox #AI #VR #3D #gamedev #stablediffusion

In collaboration with Intel, our Depth Fusion showcases the power of our LDM3D diffusion model in generating 360° views from text prompts provided by the user. The LDM3D diffusion model generates a 2D RGB image and its corresponding relative depth map providing a complete RGBD representation corresponding to the text prompt. The LDM 3D model is a specialized version of the stable diffusion V 1.4 model that has been modified to fit both image and depth map data.The model was then fine tuned on a subset of the Laion400M data set - large scale image caption data set. The depth maps used to fine tune our model were generated by the DPTBeiT large 512 depth estimation model that provides highly accurate relative depth estimates for each pixel. We take the generated 2D RGB image and depth map and use them to compute a 360° projection using touchdesigner. Touchdesigner is a versatile platform that allows for the creation of immersive and interactive multimedia experiences. Our application harnesses the power of touchdesigner to bring the generated 360° views to life, providing users with a unique and engaging way to experience their text prompts, whether it’s a description of a tranquil forest, a noisy cityscape or a futuristic sci fi world. Our depth fusion can bring these concepts to life in a vivid and immersive detail. - Scottie Fox, VP Engineering Blockade Labs ScottieFox #AI #VR #3D #gamedev #stablediffusion

Blockade Labs

11,439 görüntüleme • 3 yıl önce

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 görüntüleme • 3 yıl önce