Загрузка видео...

Не удалось загрузить видео

На главную

103,750 просмотров • 5 месяцев назад •via X (Twitter)

Комментарии: 0

Нет доступных комментариев

Здесь появятся комментарии из оригинального поста

Похожие видео

Vector Database by Hand ✍️ Vector databases are revolutionizing how we search and analyze complex data. They have become the backbone of Retrieval Augmented Generation (#RAG). How do vector databases work? [1] Given ↳ A dataset of three sentences, each has 3 words (or tokens) ↳ In practice, a dataset may contain millions or billions of sentences. The max number of tokens may be tens of thousands (e.g., 32,768 mistral-7b). Process "how are you" [2] 🟨 Word Embeddings ↳ For each word, look up corresponding word embedding vector from a table of 22 vectors, where 22 is the vocabulary size. ↳ In practice, the vocabulary size can be tens of thousands. The word embedding dimensions are in the thousands (e.g., 1024, 4096) [3] 🟩 Encoding ↳ Feed the sequence of word embeddings to an encoder to obtain a sequence of feature vectors, one per word. ↳ Here, the encoder is a simple one layer perceptron (linear layer + ReLU) ↳ In practice, the encoder is a transformer or one of its many variants. [4] 🟩 Mean Pooling ↳ Merge the sequence of feature vectors into a single vector using "mean pooling" which is to average across the columns. ↳ The result is a single vector. We often call it "text embeddings" or "sentence embeddings." ↳ Other pooling techniques are possible, such as CLS. But mean pooling is the most common. [5] 🟦 Indexing ↳ Reduce the dimensions of the text embedding vector by a projection matrix. The reduction rate is 50% (4->2). ↳ In practice, the values in this projection matrix is much more random. ↳ The purpose is similar to that of hashing, which is to obtain a short representation to allow faster comparison and retrieval. ↳ The resulting dimension-reduced index vector is saved in the vector storage. [6] Process "who are you" ↳ Repeat [2]-[5] [7] Process "who am I" ↳ Repeat [2]-[5] Now we have indexed our dataset in the vector database. [8] 🟥 Query: "am I you" ↳ Repeat [2]-[5] ↳ The result is a 2-d query vector. [9] 🟥 Dot Products ↳ Take dot product between the query vector and database vectors. They are all 2-d. ↳ The purpose is to use dot product to estimate similarity. ↳ By transposing the query vector, this step becomes a matrix multiplication. [10] 🟥 Nearest Neighbor ↳ Find the largest dot product by linear scan. ↳ The sentence with the highest dot product is "who am I" ↳ In practice, because scanning billions of vectors is slow, we use an Approximate Nearest Neighbor (ANN) algorithm like the Hierarchical Navigable Small Worlds (HNSW).

Tom Yeh

191,919 просмотров • 2 лет назад

Before the week ends, let's acknowledge one of the most INSANE week ever for open AI, with 25+ notable open-weight drops across every modality: 🧠 LLMs → NVIDIA Nemotron 3 Ultra: 550B hybrid Mamba-MoE, only 55B active, 1M context, MMLU 89.1. NVFP4 variant claims ~5x throughput on Blackwell. First openly-weighted 550B hybrid Mamba-Transformer, closing the gap with frontier closed models. → Google Gemma 4 12B: fully open dense any-to-any (text/image/audio/video), 256k context, encoder-free, 140+ languages, AIME 2026 at 77.5. Shipped with a 23-checkpoint QAT wave (mobile ONNX + MLX). Most deployable model of the week. → StepFun Step-3.7-Flash: 198B sparse MoE VLM, ~11B active, SWE-Bench PRO 56.3. Apache 2.0. → Liquid AI LFM2.5-8B-A1B: edge MoE, just 1.5B active, 128k ctx, MATH500 88.8, MLX-ready. Best on-device option this week. → JetBrains Mellum2-12B-A2.5B-Thinking: their first open MoE, near-Qwen3-14B coding at 2.5B active. Apache 2.0. 🎨 Image gen (the surprise of the week) → Ideogram 4: their FIRST-EVER open weights. 9.3B flow-matching DiT trained from scratch. #2 overall behind GPT Image 2, top open-weight model on Design Arena + LMArena. Strongest open checkpoint for text-rich images, full stop. It has taste. Still can't believe this is open weights. 🔊 Audio & Speech (a breakout week for open TTS, 4 labs shipped) → Boson Higgs Audio v3 4B: 102 languages, 21 emotions, singing/whispering/shouting, sub-second TTFA. → RedNote dots.tts: the only fully continuous (no codec) open TTS pipeline, Apache 2.0. → Google Magenta RealTime 2: real-time music gen, <200ms latency, text+audio+MIDI. multimodalart ported it to PyTorch within hours with live ZeroGPU demos. → NVIDIA Nemotron-3.5 ASR: 600M streaming, 17x more concurrent streams vs Parakeet RNNT 1.1B. 👁️ Vision & VLMs → PaddleOCR-VL-1.6: SOTA document parsing at 1B params, Apache 2.0. → Baidu NAVA: 6.3B joint audio-video gen, best-in-class A/V sync, Apache 2.0. 🎬 Video, 3D & World Models → NVIDIA Cosmos3-Super: 64B omnimodal world model coupling action trajectories with video+audio gen, for Physical AI. → JD JoyAI-Echo: up to 5-min multi-shot text-to-video on LTX-2.3. → ByteDance Bernini-R + VAST TripoSplat (single-image-to-3D Gaussian splats, MIT).

Victor M

518,855 просмотров • 4 дней назад

🚀 The Segment Anything Model (SAM) has been upgraded to SAM2, featuring an efficient image encoder for segmenting images and videos. But does SAM2 outperform SAM1 in medical image and video segmentation? We're thrilled to present our paper "Segment Anything in Medical Images and Videos: Benchmark and Deployment"! We comprehensively benchmark SAM2 across 11 medical image modalities and videos. 📄 Paper: 💻 Code: **Highlights:** 1. SAM2 doesn’t always outperform SAM1 in 2D medical images, but excels in video segmentation, making it more accurate and efficient for 3D images, such as CT and MR scans. 2. MedSAM still outperforms SAM2 on most 2D modalities, but SAM2 surpasses MedSAM for 3D image segmentation in a slice-by-slice approach. 3. Segmentation performance varies with model size; sometimes the smallest model outperforms larger ones. 4. Fine-tuning SAM2 significantly boosts its performance for medical image segmentation. While SAM2 may struggle with challenging objects that have unclear boundaries or low contrast, it excels in generating good initial segmentation masks for common medical images and videos. However, the official interface doesn’t support medical data formats and has limitations on video length. To address this, we've developed a 3D Slicer Plugin and Gradio API for efficient 3D medical image and video segmentation. We invite you to try them out and provide feedback! 🔧 Deployment: - 3D Slicer Plugin: - Gradio API: (Note: Due to GPU limitations, the online API is available for only 12 hours and may be slow. We highly recommend deploying the Gradio API with your own computing resources: A big shoutout to Jun Ma (JunMa) who recently joined our UHN AI hub (UHN AI Hub) as Machine Learning Lead, and kudos to all co-authors: Sumin Kim, Feifei Li, Mohammed Baharoon (Mohammed Baharoon), Reza Asakereh, and Hongwei Lyu! This is true teamwork! Looking forward to collaborating with the community to advance 3D medical image and video segmentation foundation models! University Health Network U of T Department of Computer Science Department of Laboratory Medicine & Pathobiology Temerty Centre for AI in Medicine (T-CAIREM) Vector Institute #MedTech #AIinHealthcare #DeepLearning #MedicalImaging #SAM2 #MedSAM #AIResearch

Bo Wang

178,419 просмотров • 1 год назад