TimeFusion is built on a new multimodal encoding and decoding framework based on Universal Tokens. This approach can easily extend to a wide range of modalities beyond time-series data and language. Here’s a preview of our ongoing work using Universal Tokens to unify images, sensor data, and language in...

Ivan Poupyrev

4,919 subscribers

36,361 views • 7 months ago • via X (Twitter)

Education, Science and Technology

Related videos

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,319 views • 3 years ago

VITA Towards Open-Source Interactive Omni Multimodal LLM discuss: The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to close-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

AK

23,958 views • 1 year ago

Our vision is to encode the entire physical world using a variety of sensor data, to go beyond text and images to radars, accelerometers, and other sensors. This will allow us to transcend the limitations of human perception and help humanity to make sense of the world around us.

Archetype AI

10,396 views • 2 years ago

mapping out the visual language of film using a multimodal llm: i fed frames of a short film to a vision-language model and mapped out its ratings of surrealism and presence of human figure in each moment along the timeline. the result is an interactive playback interface based on these 2 dimensions:

Kat ⊷ the Poet Engineer

94,701 views • 1 year ago

New short course: "Building Multimodal Search and RAG", by Weaviate AI Database's Sebastian Witalec. Contrastive learning is used to train models to map vectors into an embedding space by pulling similar concepts closer together and pushing dissimilar concepts away from each other. This technique is also used to train multimodal embedding models that capture semantic similarity across different modalities like text, images, and audio. These multimodal embeddings can be used to build multimodal search and RAG systems. In this course, you'll learn how contrastive learning works, and how to add multimodality to RAG – so your models can draw on diverse, relevant context to answer questions. For example, a query about a financial report might synthesize information from text snippets, graphs, tables, and slides. You will also learn how visual instruction tuning lets you integrate image understanding into language models, and build a multi-vector recommender system using Weaviate’s open-source vector database. Please sign up here:

Andrew Ng

104,371 views • 2 years ago

This week, Co-Founder and Head of Research @neurmality dives into the core primitives powering smarter, more dynamic agent-to-agent interactions on Theoriq's Protocol. ⚡️ He demos a live example of a “Flex Agent” that adapts in real time based on natural language, subscribing to statistical pulses, processing data & responding with a customized volatility status. This flexibility showcases how agents can easily communicate, interact, and adapt—key traits for powering the Agentic Economy. More below. 👇

Theoriq

63,308 views • 1 year ago

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 views • 3 years ago

Our newest model, π0.7, has some interesting emergent capabilities: it can control a new robot to fold shirts for which we had no shirt folding data, figure out how to use an appliance with language-based coaching, and perform a wide range of dexterous tasks all in one model!

Physical Intelligence

463,918 views • 3 months ago

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal! In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video. You'll build a full multimodal RAG pipeline that can chat about video content:
- Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space.
- Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions.
- Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings.
- Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning.
Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed. Please sign up here!

Andrew Ng

107,825 views • 1 year ago

Tony Blair and Oracle co-founder Larry Ellison plan to use digital ID to "unify" all data on each country's citizens "so it can be consumed and used by" their AI models. "We have to take all of this data... and move it into a single... unified data platform." "When we want to ask a question, we've provided that AI model with all the data they need to understand our country." "We need to unify all of the national data, put it into a database where it's easily consumable by the AI model, and then ask whatever question you like."

Wide Awake Media

54,084 views • 8 months ago

VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams paper page: Neural Radiance Fields (NeRFs) excel in photorealistically rendering static scenes. However, rendering dynamic, long-duration radiance fields on ubiquitous devices remains challenging, due to data storage and computational constraints. In this paper, we introduce VideoRF, the first approach to enable real-time streaming and rendering of dynamic radiance fields on mobile platforms. At the core is a serialized 2D feature image stream representing the 4D radiance field all in one. We introduce a tailored training scheme directly applied to this 2D domain to impose the temporal and spatial redundancy of the feature image stream. By leveraging the redundancy, we show that the feature image stream can be efficiently compressed by 2D video codecs, which allows us to exploit video hardware accelerators to achieve real-time decoding. On the other hand, based on the feature image stream, we propose a novel rendering pipeline for VideoRF, which has specialized space mappings to query radiance properties efficiently. Paired with a deferred shading model, VideoRF has the capability of real-time rendering on mobile devices thanks to its efficiency. We have developed a real-time interactive player that enables online streaming and rendering of dynamic scenes, offering a seamless and immersive free-viewpoint experience across a range of devices, from desktops to mobile phones.

AK

38,686 views • 2 years ago

“Math is code. And code is math.” Carina Hong CEO Axiom on why that idea matters for AI: “For centuries, mathematicians reasoned in natural language. But because of the Curry–Howard correspondence, a proof can also be a program.” “You can translate natural-language math into formal code using systems like Lean. Lean is both a theorem-proving language and a programming language.” “That means you can generate solutions as code and then formally verify them. You can even use the same system to verify programs themselves.” “This creates a flywheel of generation and verification, where AI can both write solutions and prove that they’re correct.”

Forward Future

24,729 views • 5 months ago

Tomorrow, we kick off the Fisherman's Art Workshop and Civic Engagement in collaboration with Coster Ojwang, watendawili and Willis Raburu. This will be a civic space where we use art, as a mirror of our society, to deliberate on not just where our country is, but which direction we want it to head in, and how we can all collectively ensure we achieve that vision. We welcome all talented, passionate Kenyans in and around Bondo to join this space and take part in this conversation. This is the start of civic education done differently; through the universal language of humanity.

Faith Odhiambo

31,775 views • 1 month ago

Larry Ellison—owner of Oracle, CBS, and now TikTok—tells Tony Blair about his plan to use digital ID to "unify" all data on each country's citizens "so it can be consumed and used by" his AI models. "We have to take all of this data... and move it into a single, if you will, unified data platform." "When we want to ask a question, we've provided that AI model with all the data they need to understand our country." "We need to unify all of the national data, put it into a database where it's easily consumable by the AI model, and then ask whatever question you like."

Wide Awake Media

144,324 views • 6 months ago

AI + robotics research is starting to pick up steam. We can now instruct Spot in natural language using Language-guided Skill Coordination (LSC). A user provides a natural language instruction: "Bring me the chocolates box, cereal box, and pill bottle, and put them on the bedroom table" and the robot navigates to the location of the target objects and places them on the room table. Current versions of Spot rely on a fixed-set vocabulary that cannot generalize to diverse instructions. In this demo, the researchers presented a method that uses large language models to receive a free-form natural language instruction for object rearrangement, which it then executes. It combines:
- A voice-to-text model that processes the instructions into text
- An LLM that converts natural language instruction to call a library of skills
- A perception module that provides ground truth locations and visual object detection
And what you're left with is a glimpse into what our future will hold: robots executing tasks from human voice prompts.

AI Breakfast

33,722 views • 3 years ago

Built a small prototype using #GPT3 - every app working with text should make language styling as accessible as text formatting. This improves accessibility for non-native speakers like myself, and allows to focus on the content while generating auxiliary data like subject etc.

Sash Zats

199,352 views • 3 years ago

just watched this lady’s video where she argues that the language surrounding marketing roles is going through a rebrand/process of masculinisation in order to make them sound new/data driven and “serious” for men

yammi

724,126 views • 3 months ago

This is the future of interacting with information stored in databases. I want this on every interface that accesses data. Imagine giving a smart agent access to your data. No explanation or documentation needed. The agent can immediately answer any questions in natural language. It doesn't get simpler than that! This is the work of mindshub. I've been collaborating with them for a long time, and their product gives us a glimpse into what the future looks like. Go to to try this yourself.

Santiago

83,750 views • 1 year ago

one of my favorite aspects of this new work is that it is generative, not based on an arbitrary random number, but based on the chaos of real life. the physical sculpture is constantly pulling in news and real-time data from all over the world and the NFT is updated daily as well with this hectic collage of information overload... i believe in the future more work will be dynamic like this allowing us to reimagine real-time data in new ways based on different events happening in the world...

beeple

155,750 views • 1 year ago