Uploaded: 2025-06-12T01:25:12.000Z
Duration: PT18.777S
Channel: Sakana AI

We’re excited to introduce Text-to-LoRA: a Hypernetwork that generates task-specific LLM adapters (LoRAs) based on a text description of the task. Catch our presentation at #ICML2025! Paper: Code: Biological systems are capable of rapid adaptation, given limited sensory cues. For example, our human visual system can quickly adapt and tune its light sensitivity to our surroundings. While modern LLMs exhibit a wide variety of capabilities and knowledge, they remain rigid when adding task-specific capabilities. Traditionally, customizing these models requires gathering large datasets and performing often expensive, time-consuming fine-tuning for specific applications. To bypass these limitations, Text-to-LoRA (T2L) meta-learns a “hypernetwork” that takes in a text description of a desired task, as a prompt, and generates a task-specific LoRA that performs well on the task. In our experiments, we show that T2L can encode hundreds of existing LoRA adapters. While the compression is lossy, T2L maintains the performance of task-specifically tuned LoRA adapters. We also show that T2L can even generalize to unseen tasks given a natural language description of the tasks. Importantly, Text-to-LoRA is parameter-efficient. It generates LoRAs in a single, inexpensive step, based solely on a simple text description of the task. This approach is a step towards dramatically lowering the technical and computational barriers, allowing non-technical users to specialize foundation models using plain language, rather than needing deep technical expertise or large compute resources.

Sakana AI

402,987 views • 1 year ago
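
The core idea in the Text-to-LoRA post, a hypernetwork that turns a task description into low-rank adapter weights, can be illustrated with a toy sketch. This is not Sakana AI's implementation: `ToyHyperNetwork`, `embed_task`, the dimensions, and the random (untrained) weights are all assumptions made for illustration, and a real system would use a learned text encoder and a hypernetwork trained on existing LoRAs.

```python
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Random matrix as a list of rows (stands in for learned weights)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Matrix-matrix product on plain Python lists."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def embed_task(description, emb_dim):
    """Hypothetical stand-in for a text encoder: a deterministic
    pseudo-embedding seeded by the description string."""
    rng = random.Random(description)
    return [rng.gauss(0.0, 1.0) for _ in range(emb_dim)]


class ToyHyperNetwork:
    """Maps a task embedding to the two LoRA factors A (rank x d_in)
    and B (d_out x rank). Weights are random here, not meta-learned."""

    def __init__(self, emb_dim, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        self.head_a = make_matrix(rank * d_in, emb_dim, rng)
        self.head_b = make_matrix(d_out * rank, emb_dim, rng)

    def generate(self, task_embedding):
        """Single forward pass: embedding in, LoRA factors out."""
        flat_a = matvec(self.head_a, task_embedding)
        flat_b = matvec(self.head_b, task_embedding)
        a = [flat_a[i * self.d_in:(i + 1) * self.d_in] for i in range(self.rank)]
        b = [flat_b[i * self.rank:(i + 1) * self.rank] for i in range(self.d_out)]
        return a, b


hyper = ToyHyperNetwork(emb_dim=8, d_in=6, d_out=4, rank=2)
task = "solve grade-school math word problems"
lora_a, lora_b = hyper.generate(embed_task(task, 8))

# The adapter's weight update is the low-rank product B @ A (d_out x d_in),
# which would be added to a frozen base weight matrix of the same shape.
delta_w = matmul(lora_b, lora_a)
```

The point of the sketch is the data flow: one cheap forward pass per task description, with no per-task fine-tuning loop.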

Text-to-image diffusion transformer models learn to align text and...

Alec Helbling

94,095 views • 6 months ago

Alright, now that we know *what* an agent is, how does it actually work? When you ask for help on a task, the agent plans a series of steps and executes them directly in the application on your behalf, using the tools it has access to. Say you are booking a local service or trying to organize your inbox (which typically takes multiple steps): the AI model first plans how to achieve the task using its existing knowledge and then interacts with your inbox to execute the task. The agent will continue until it is confident the task has been successfully completed.

Google AI

22,487 views • 6 months ago
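
The plan, execute, and check cycle described in the agent post can be sketched in a few lines. This is a toy illustration only, not how any Google product works: the `Inbox` class, the `[newsletter]` tagging rule, and the `plan`, `execute`, and `is_done` helpers are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Inbox:
    """Toy application state the agent acts on."""
    messages: list
    archived: list = field(default_factory=list)


def plan(inbox):
    """Hypothetical planner: one 'archive' step per newsletter message."""
    return [("archive", m) for m in inbox.messages if m.startswith("[newsletter]")]


def execute(step, inbox):
    """Hypothetical tool call: apply a single planned step to the inbox."""
    action, message = step
    if action == "archive" and message in inbox.messages:
        inbox.messages.remove(message)
        inbox.archived.append(message)


def is_done(inbox):
    """Completion check: no newsletters remain in the inbox."""
    return not any(m.startswith("[newsletter]") for m in inbox.messages)


def run_agent(inbox, max_rounds=5):
    """Plan, execute, and re-check until the task is complete
    or the round budget runs out. Returns the number of rounds used."""
    rounds = 0
    while not is_done(inbox) and rounds < max_rounds:
        for step in plan(inbox):
            execute(step, inbox)
        rounds += 1
    return rounds


inbox = Inbox(["[newsletter] Weekly digest", "Invoice #12", "[newsletter] Sale"])
rounds_used = run_agent(inbox)
```

The `max_rounds` cap is a design choice for the sketch: an agent that loops "until it is confident" still needs a budget so a failing plan cannot run forever.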

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers paper page: Recent advances in generative AI have significantly enhanced image and video editing, particularly in the context of text prompt control. State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks. However, the computational demands of diffusion-based methods are substantial, often necessitating large-scale paired datasets for training, and therefore challenging the deployment in practical applications. This study addresses this challenge by breaking down the text-based video editing process into two separate stages. In the first stage, we leverage an existing text-to-image diffusion model to simultaneously edit a few keyframes without additional fine-tuning. In the second stage, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers and specializes in frame interpolation between the keyframes, benefiting from structural guidance provided by intermediate frames. Our comprehensive set of experiments illustrates the efficacy and efficiency of MaskINT when compared to other diffusion-based methodologies. This research offers a practical solution for text-based video editing and showcases the potential of non-autoregressive masked generative transformers in this domain.

AK

25,449 views • 2 years ago

Microsoft presents Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale discuss: Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on order of magnitude of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena.

AK

19,684 views • 1 year ago

We are rolling out access to Runway Aleph, our...

Runway

99,594 views • 10 months ago

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation paper page: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.

AK

375,090 views • 3 years ago

Visualizer of our MultiAgentRouter 🤖 The MultiAgentRouter is an...

swarms

32,313 views • 1 year ago

“We know today that we are but a single...

Chad Crowley

14,005 views • 2 months ago

Boom! Grok Tasks Make It One Of The Most POWERFUL Real-Time AI Systems In The World.
My How to Use Grok Tasks With Hidden Tools For Powerful Daily Output.
Grok Tasks are customizable AI workflows that integrate a variety of tools to streamline daily activities, from research and analysis to creative planning and problem-solving. I have been using them for quite some time, and because of the vital heartbeat of news and first-person data on X, it is the most powerful AI platform available. By combining Tasks with tools like web searches, X platform interactions, code execution, and media viewers, you can build efficient, automated processes. These tasks work by prompting Grok with a clear description of what you want to achieve, and Grok will intelligently call the necessary tools in sequence or parallel to deliver results. Here's a step-by-step guide to creating and using Grok Tasks:
Step 1: Define Your Task
Start by clearly outlining the daily activity or goal. Consider what inputs you have (e.g., a URL, a query, or an attachment) and what output you need (e.g., a summary, calculation, or visual analysis). Break it down into subtasks to identify tool needs. For example, if your task involves researching current events, note that you'll need search and browsing capabilities.
Step 2: Review Available Tools
Familiarize yourself with the tools Grok can access. Here's a quick overview:
- Code Execution: Run Python code for calculations, data processing, or simulations using libraries like numpy, pandas, or sympy.
- Browse Page: Fetch and summarize content from any website URL with custom instructions.
- Web Search: Perform general internet searches, returning results with optional operators like site:.
- Web Search With Snippets: Get quick, detailed excerpts from search results for fact-checking.
- X Keyword Search: Advanced search for X posts using operators like from:, since:, or filter:.
- X Semantic Search: Find semantically related X posts based on a query, with filters for dates or users.
- X User Search: Locate X users by name or handle.
- X Thread Fetch: Retrieve a full X post thread, including context like replies and parents.
- View Image: Analyze an image from a URL or conversation ID.
- View X Video: Extract frames and subtitles from an X-hosted video.
- Search PDF Attachment: Query a PDF file for relevant pages using keyword or regex modes.
- Browse PDF Attachment: View specific pages of a PDF with text and screenshots.
Select tools that align with your task. Aim for a mix to handle data gathering, processing, and visualization.
Step 3: Craft Your Prompt
Write a detailed prompt to Grok describing the task. Include:
- The overall goal.
- Specific steps or subtasks.
- References to tools if you want to guide the process (e.g., "Use web_search to find sources, then code_execution to analyze data").
- Any constraints, like dates or limits.
Example prompt: "Create a Grok Task for my morning routine: Search recent X posts about tech news using x_keyword_search, fetch a key thread with x_thread_fetch, and summarize with browse_page on linked articles."
Step 4: Submit and Interact
Send your prompt to Grok. It will process the task by calling tools as needed, often in parallel for efficiency. Review the output and refine with follow-up prompts if required (e.g., "Expand on that using view_image for visuals"). Iterate to fine-tune the workflow for reuse.
Step 5: Save and Reuse
Once refined, note the prompt as a template for future use. You can adapt it for similar tasks, making Grok Tasks a habitual part of your day.
Finding Grok Tasks
To discover existing Grok Tasks or inspiration for new ones, use X searches with tools like x_keyword_search or x_semantic_search (e.g., query: "Grok Tasks examples" with mode: Latest). Browse community-shared threads via x_thread_fetch, or web_search for tutorials on xAI features. Prompt Grok directly: "Show me popular Grok Tasks for productivity."
1 of 3

Brian Roemmele

152,242 views • 5 months ago
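
The prompt structure from Step 3 of the Grok Tasks post (goal, steps that name tools, constraints) can be kept as a reusable template, as Step 5 suggests. The sketch below only assembles a text prompt; `build_task_prompt` is a hypothetical helper, not an xAI API, and the tool names are the ones quoted in the post.

```python
def build_task_prompt(goal, steps, constraints=()):
    """Assemble a task prompt from a goal, (tool, action) steps,
    and optional constraints. Purely string formatting."""
    lines = [f"Goal: {goal}", "Steps:"]
    lines += [
        f"{number}. Use {tool} to {action}."
        for number, (tool, action) in enumerate(steps, 1)
    ]
    if constraints:
        lines.append("Constraints: " + "; ".join(constraints))
    return "\n".join(lines)


morning_prompt = build_task_prompt(
    goal="Summarize this morning's tech news",
    steps=[
        ("x_keyword_search", "find recent X posts about tech news"),
        ("x_thread_fetch", "retrieve one key thread"),
        ("browse_page", "summarize the linked articles"),
    ],
    constraints=["only posts from the last 24 hours"],
)
print(morning_prompt)
```

Keeping the steps as data makes it easy to swap tools or reorder subtasks when adapting the template to a similar task.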

Robotics keeps hitting the same wall. Single task RL works, but... it does not scale to hundreds of tasks or new embodiments. This new paper looks like a real step toward fixing that. The team introduces MMBench, a benchmark with 200 tasks across many domains and robots, and Newt, a language conditioned world model trained online across all 200 tasks at once.
The simple idea behind Newt:
- The model learns from demos to get the right priors
- It trains across many tasks through online interaction
- It uses language to ground the goal
- It adapts fast when a new task shows up
What stood out to me:
✅ One model trained on 200 tasks at the same time
✅ Language conditioned control for both states and RGB
✅ Better data efficiency than strong baselines
✅ Strong open loop control
✅ Fast adaptation to new tasks and embodiments
✅ Full release of 200 checkpoints, 4000 demos, code, and benchmark
This is a good push toward general control instead of one model per task. If you want the full paper: Project page:
Weekly robotics and AI insights. Subscribe free:

Ilir Aliu

70,090 views • 6 months ago

Break-A-Scene: Extracting Multiple Concepts from a Single Image introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method paper page:

AK

154,507 views • 3 years ago

I trained this LoRA exclusively on real images extracted...

Lovis Odin

12,160 views • 1 year ago

Fine-tune DeepSeek-OCR on your own language! (100% local) DeepSeek-OCR is a 3B-parameter vision model that achieves 97% precision while using 10× fewer vision tokens than text-based LLMs. It handles tables, papers, and handwriting without killing your GPU or budget. Why it matters: Most vision models treat documents as massive sequences of tokens, making long-context processing expensive and slow. DeepSeek-OCR uses context optical compression to convert 2D layouts into vision tokens, enabling efficient processing of complex documents. The best part? You can easily fine-tune it for your specific use case on a single GPU. I used Unsloth to run this experiment on Persian text and saw an 88.26% improvement in character error rate.
↳ Base model: 149% character error rate (CER)
↳ Fine-tuned model: 60% CER (57% more accurate)
↳ Training time: 60 steps on a single GPU
Persian was just the test case. You can swap in your own dataset for any language, document type, or specific domain you're working with. I've shared the complete guide in the next tweet - all the code, notebooks, and environment setup ready to run with a single click. Everything is 100% open-source!

Akshay 🚀

126,036 views • 7 months ago
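
Character error rate, the metric quoted in the DeepSeek-OCR post, is commonly defined as the character-level edit distance between the model output and the reference, divided by the reference length. Because insertions count as errors, it can exceed 100%, which is how a 149% baseline is possible. The sketch below uses that common definition; it is an assumption that the post's numbers were computed the same way, and the helper names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, 1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, 1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate as a fraction of the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# A hypothesis much longer than the reference pushes CER above 1.0 (100%).
over_100 = cer("ab", "abcdef")

# Rounded figures from the post: 149% baseline, 60% after fine-tuning.
baseline, tuned = 1.49, 0.60
absolute_drop = baseline - tuned              # in CER points
relative_drop = (baseline - tuned) / baseline  # fraction of the baseline error
```

With the rounded figures, the absolute drop is 89 percentage points and the relative reduction is roughly 60%. The 88.26% and 57% figures in the post presumably come from unrounded or differently computed values, which the post does not show.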

Getting close to a ‘subtle upscale’ version of the...

Ingi Erlingsson 🪄

33,460 views • 2 months ago

Qualia has been selected for the Google DeepMind Robotics Program. We train embodied models that put a robot on a real manual task and make it work, on the floor, not in a demo. Foundation models and reasoning are where robotics is heading, and doing that work alongside DeepMind, who are pushing this frontier, is exactly where we want to be. If you are a company looking to see how a new generation of robots can help your manual tasks, contact us at [email protected] More soon

Qualia

85,925 views • 12 days ago

We are investing in the frontiers of agentic capabilities...

Sundar Pichai

218,968 views • 1 year ago

According to German Vice-Chancellor Robert Habeck, regulating X "is...

Wide Awake Media

221,588 views • 1 year ago

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation discuss: We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model.

AK

124,042 views • 1 year ago

GraphRAG-UI GraphRAG-UI is a user-friendly interface for GraphRAG, a...

AK

27,249 views • 1 year ago

Live Cam