[1/n] Do distinct large models admit a simple map that aligns their embedding spaces? We show that across multimodal contrastive models—trained on different data and architectures—an orthogonal map aligns image embeddings. Strikingly, the same map also aligns text embeddings.

Sharut Gupta

2,695 subscribers

36,956 views • 3 months ago •via X (Twitter)
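
The post does not say how the orthogonal map is fit. A common way to find one is orthogonal Procrustes, sketched below under that assumption. The embeddings are synthetic stand-ins rather than real model outputs, and the function names (`fit_orthogonal_map`, `mean_cosine`) are hypothetical. The synthetic data shares one rotation across both modalities by construction, so this only illustrates the procedure (fit on paired image embeddings, then reuse the same map on text embeddings). It is not evidence for the finding itself.

```python
import numpy as np


def fit_orthogonal_map(source, target):
    """Return the orthogonal Q minimizing ||source @ Q - target||_F (orthogonal Procrustes)."""
    # SVD of the cross-covariance matrix; the minimizer is Q = U @ Vt.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def mean_cosine(a, b):
    """Average cosine similarity between corresponding rows of a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))


rng = np.random.default_rng(0)
dim, n = 16, 200

# ASSUMPTION: synthetic stand-ins for two models' embeddings of the same inputs.
# Model B is built as a rotated, slightly noisy copy of model A.
true_q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
images_a = rng.normal(size=(n, dim))
texts_a = rng.normal(size=(n, dim))
images_b = images_a @ true_q + 0.01 * rng.normal(size=(n, dim))
texts_b = texts_a @ true_q + 0.01 * rng.normal(size=(n, dim))

# Fit the map on image embeddings only.
q = fit_orthogonal_map(images_a, images_b)

# Apply the same map to both modalities and measure agreement with model B.
image_score = mean_cosine(images_a @ q, images_b)
text_score = mean_cosine(texts_a @ q, texts_b)
print(f"image cosine: {image_score:.4f}, text cosine: {text_score:.4f}")
```

With real models, `images_a` and `images_b` would be the two models' embeddings of the same images, and the text score would be computed on held-out captions that played no part in fitting `q`.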

Related Videos

New short course: "Building Multimodal Search and RAG", by Weaviate AI Database's Sebastia(N_) Witalec ✊🏽✊🏾✊🏿. Contrastive learning is used to train models to map vectors into an embedding space by pulling similar concepts closer together and pushing dissimilar concepts away from each other. This technique is also used to train multimodal embedding models that capture semantic similarity across different modalities like text, images, and audio. These multimodal embeddings can be used to build multimodal search and RAG systems. In this course, you'll learn how contrastive learning works, and how to add multimodality to RAG – so your models can draw on diverse, relevant context to answer questions. For example, a query about a financial report might synthesize information from text snippets, graphs, tables, and slides. You will also learn how visual instruction tuning lets you integrate image understanding into language models, and build a multi-vector recommender system using Weaviate’s open-source vector database. Please sign up here:

Andrew Ng

104,371 views • 2 years ago

Consistency models, CTMs, shortcut models, align your flow, mean flow... What's the connection, and how should you learn them in practice? We show they're all different sides of the same coin connected by one central object: the flow map. 🧵(1/n)

Nicholas Boffi

65,667 views • 8 months ago

Embeddings (and how to create them) are, perhaps, the most interesting idea behind Large Language Models. I built a simple model to help you understand embeddings from scratch. Here is a step-by-step video explanation:

Santiago

44,107 views • 1 year ago

🌎⚠️ Doomsday map alert! Peggy Bolton's research shows Bill Gates' farmland aligns with dry land on the map. Is he preparing for a catastrophe? Meanwhile, Jeff Bezos is buying up farmland too. And Hawaii stays safe! How convenient for Mark Zuckerberg. 🤔

Red Pill USA

1,420,071 views • 2 years ago

Introducing — Hosted Embedding Marketplace 💈 We’re building a single destination to discover, evaluate, and access relevant embeddings. Move from large expensive models to leaner open-source models without reducing accuracy. Comment 👋 for early access

vishal ✦

160,334 views • 3 years ago

.Tkay had to pull out the "mom-cam" to prove to Methodz & Zoomaa he's just different Tkay on the series (ELIMS): Map 1 - 58/29 Map 2 - 16/2 Map 3 - N/A Map 4 - 44/24 HIM 😤

Torn

181,408 views • 1 month ago

I downloaded something like 300GB of open models and wrote a bunch of map-reduce style processing scripts to make this graph. It's plotting the distribution of weight values across a variety of popular open models, to show that models are almost entirely made up of small floats.

Sam Rose

58,141 views • 4 months ago

I recorded this introduction to embeddings (perhaps the most essential concept behind how LLMs work). But this is different from everything else you've seen: We will generate embeddings from scratch using a Siamese Network and a contrastive loss. This video is for 5 year olds.

Santiago

68,976 views • 2 years ago

How do we build multimodal systems that work effectively across the globe? 🌍 Today we release the Aya Vision Technical Report, the detailed recipe behind Aya Vision models, unifying state-of-the-art multilingual capabilities in multimodal and text tasks across 23 languages!

Cohere Labs

15,561 views • 1 year ago

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal! In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video. You'll build a full multimodal RAG pipeline that can chat about video content:
- Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space.
- Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions.
- Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings.
- Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning.
Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed. Please sign up here!

Andrew Ng

107,548 views • 1 year ago

Sturge: We stand ready. Our personnel are highly trained and our doctrine aligns with yours. We require assets in the interim.

Kejan Haynes

21,204 views • 3 months ago

A civil engineer is working on building a Roller Coaster that travels across the map

Rooster

653,612 views • 6 months ago

I Created The FIRST Ever Simple Edit Practice Map! 🤯 Map Code: 8942-4322-3496 or Search "Crosshair Training" Tag someone below that NEEDS this map ⬇️

Xen Quinn

38,008 views • 11 months ago

New Feature: Ping System Showcase Players can now ping on the mini-map to place a pin on the map. Pings placed on the map will also appear on the mini map. #callofdutymobile #codm #codmobile

Leakers On Duty

42,248 views • 1 year ago

Putin says that he intends to change out Ukraine's leadership. Interesting that this ALSO aligns with Trump's goals.

Jay in Kyiv

76,164 views • 1 year ago

Launching the 1st Arena for Embedding Models: MTEB Arena🏟️ Vote @ ⚔️ 15 Models: OpenAI Google Cohere Voyage_AI_ Jina AI Salesforce AI Research Nomic E5 GritLM BGE.. 3 Tasks: Retrieval/Clustering/STS Deep dive with me on embeddings & the arena👇 🧵1/13

Niklas Muennighoff

58,947 views • 1 year ago

interpolating image embeddings across arbitrary numbers of vertices. using polygons as lenses through n-dimensional latent space.

harley turan

23,036 views • 9 months ago

A model that aligns perfectly what we see in reality 👀 Something you will never find on a globe 🌎💦

Rufus_2688

359,231 views • 2 years ago

Meta announces Movie Gen: A Cast of Media Foundation Models. We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user’s image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models.

AK

62,719 views • 1 year ago

Ever had trouble navigating your city because of outdated map data? Meet Hivemapper: a global, real-time, crowdsourced map that you can build and own. STFU on Solana 🦾

Superteam

138,788 views • 2 years ago