
Martin Ziqiao Ma

@ziqiao_ma • 4,520 subscribers

technical staff @thinkymachines; less technical stuff @aclmentorship; phd @umich; views are my own

Shorts

The term "continual learning" has become overloaded if you see it as an ML problem.

One classic thread is about memorization: regularization-based continual learning methods, such as EWC, MAS, and SI, estimate which parameters mattered for previous tasks and resist changing them too much. One modern thread is about adaptation: test-time training and inference-time learning methods, such as TTT, adapt part of the model on the incoming test stream before making predictions.

These are sometimes discussed as separate threads. But in modern scalable architectures, I think they are better seen as complementary constraints: a model that learns quickly at test time also benefits from a mechanism for deciding what not to forget.

In our #ECCV2026 paper, we study this in large-scale 4D reconstruction: how to build fast spatial memory that can adapt over long observation streams while reducing collapse and forgetting. Instead of using fully plastic test-time updates, we stabilize fast-weight adaptation with an elastic prior that balances adaptation and memory.

Key ideas:
- Elastic Test-Time Training: Fisher-weighted consolidation for fast-weight updates
- EMA anchor weights that provide a moving reference for stability
- Chunk-by-chunk inference for long 3D/4D observation streams

We show that this scales across large 3D/4D pretraining settings, including both LRM-style and LVSM-style models, and improves reconstruction across benchmarks including Stereo4D, NVIDIA, and DL3DV-140.

We release model checkpoints across different design choices: resolution, post-training curriculum, and whether the model uses an explicit 4DGS intermediate representation.

- Homepage:
- Paper:
- Code:
- Models:

This work is co-led with Xueyang Yu, with contributions from Haoyu Zhen and Yuncong Yang, and advised by Michigan SLED Lab and Chuang Gan.
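To make the "elastic prior" idea concrete, here is a minimal sketch of how a Fisher-weighted penalty and an EMA anchor could fit together in one update step. It is an EWC-style illustration under my own assumptions, not the paper's implementation: the function names, the flat list of weights, and the hyperparameters `lr`, `lam`, and `ema` are all hypothetical.

```python
def diagonal_fisher(grad_samples):
    """Estimate a diagonal Fisher as the mean squared gradient per weight."""
    n = len(grad_samples)
    width = len(grad_samples[0])
    return [sum(g[i] ** 2 for g in grad_samples) / n for i in range(width)]


def elastic_step(fast, anchor, fisher, grad, lr=0.05, lam=1.0, ema=0.9):
    """One hypothetical elastic update on a flat list of fast weights.

    The penalty 0.5 * lam * fisher * (w - anchor)^2 pulls important weights
    back toward the anchor, so its gradient lam * fisher * (w - anchor) is
    added to the task gradient before the step.
    """
    new_fast = []
    for w, a, f, g in zip(fast, anchor, fisher, grad):
        total_grad = g + lam * f * (w - a)
        new_fast.append(w - lr * total_grad)
    # The anchor trails the fast weights as an exponential moving average,
    # giving a moving rather than frozen reference point.
    new_anchor = [ema * a + (1.0 - ema) * w for a, w in zip(anchor, new_fast)]
    return new_fast, new_anchor


# Assumed toy setup: the second weight mattered more in earlier chunks.
fisher = diagonal_fisher([[0.0, 3.0], [0.0, 1.0]])
fast, anchor = elastic_step([1.0, 1.0], [0.0, 0.0], fisher, grad=[1.0, 1.0])
```

In this toy setup the weight with zero Fisher moves freely along the task gradient, while the weight with high Fisher is pulled harder toward the anchor. Running one such step per chunk is one way to read the chunk-by-chunk inference point above.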


33,047 views

Videos


Do Vision-Language Models represent space, and how?

Spatial terms like "left" or "right" may not be enough to match images with spatial descriptions, as we often overlook the different frames of reference (FoR) used by speakers and listeners. See Figure 1 for examples!

Introducing the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol to assess the spatial reasoning capabilities of VLMs. COMFORT includes systematically designed datasets and metrics that evaluate both model performance and deeper linguistic competence, specifically the spatial knowledge encoded in the models' internal representations. Find out more in the video teaser!

Almost all VLMs prefer the egocentric relative FoR with reflected transform, similar to English. Yet, we reveal significant shortcomings of VLMs: notably, the models (1) exhibit poor robustness and consistency, (2) lack the flexibility to accommodate multiple FoRs, and (3) fail to adhere to language-specific or culture-specific conventions in cross-lingual tests, as English tends to dominate other languages.

A shortened version will appear in the Pluralistic Alignment Workshop at #NeurIPS2024. It seems that the arXiv moderators put it on hold and are eager to give it a thorough read first🤣! So here is the Paper/Code/Data:

This collaboration turned out to be amazing, jointly led by Brian Zheyuan Zhang, @Hu_FY_, and Jayjun Lee, with so many contributions and insights from Freda Shi, Parisa Kordjamshidi, and Michigan SLED Lab.

With a growing effort to align vision-language models with human cognitive intuitions, we call for more attention to the ambiguous nature and cross-cultural diversity of spatial reasoning!
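A toy illustration of why "left" is ambiguous without a frame of reference. This is a sketch under my own assumptions, not part of the COMFORT protocol: the scene is a top-down 2D plane with the viewer at the origin looking along +y, and both function names are hypothetical. It covers only left/right, not the front/back reversal that the reflected transform implies.

```python
def left_of_relative(target, reference):
    """Egocentric relative FoR: 'left' is the viewer's left.

    Assumes the viewer sits at the origin looking along +y, with x
    increasing toward the viewer's right.
    """
    return target[0] < reference[0]


def left_of_intrinsic(target, reference, facing):
    """Intrinsic FoR: 'left' is defined by the reference object's facing."""
    # Rotating the facing direction 90 degrees counterclockwise gives the
    # direction of the reference object's own left side.
    left_dir = (-facing[1], facing[0])
    offset = (target[0] - reference[0], target[1] - reference[1])
    return offset[0] * left_dir[0] + offset[1] * left_dir[1] > 0


# A ball sits to the viewer's left of a chair that faces the viewer.
ball, chair = (-1.0, 5.0), (0.0, 5.0)
relative = left_of_relative(ball, chair)
intrinsic = left_of_intrinsic(ball, chair, facing=(0.0, -1.0))
```

For this scene the two frames disagree: the ball is on the viewer's left of the chair but on the chair's own right, so the same image can make "the ball is to the left of the chair" both true and false.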


Martin Ziqiao Ma

35,565 views • 1 year ago

Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at any time to any view at any other time?

Introducing 4D-LRM: a Large Space-Time Reconstruction Model that ...
🔹 Predicts 4D Gaussian primitives directly from multi-view tokens (no motion vectors, no HexPlane);
🔹 Uses a clean, minimal Transformer backbone;
🔹 Generalizes with fast, high-quality feedforward rendering at any view and infinite frame rate.

Check out more interactive demos and scaling behaviors on our homepage/paper.
👉Website:
👉Paper:
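One way to see why a 4D Gaussian can be rendered at any timestamp: conditioning a joint space-time Gaussian on a chosen time yields an ordinary 3D Gaussian, with no frame interpolation needed. The sketch below is standard Gaussian conditioning and is only an assumption about how such primitives are typically sliced; it is not confirmed here as 4D-LRM's rendering path, and the function and parameter names are hypothetical.

```python
import math


def slice_at_time(mean_xyz, mean_t, cov_xyz, cov_xyz_t, var_t, t):
    """Condition a 4D (x, y, z, t) Gaussian on time t.

    mean_xyz  : spatial mean, length 3
    mean_t    : temporal mean
    cov_xyz   : 3x3 spatial covariance block
    cov_xyz_t : length-3 covariance between each spatial axis and time
    var_t     : temporal variance (must be positive)
    Returns the conditional 3D mean, 3D covariance, and a temporal weight
    that fades the primitive out away from its temporal mean.
    """
    dt = t - mean_t
    mean = [m + c * dt / var_t for m, c in zip(mean_xyz, cov_xyz_t)]
    cov = [
        [cov_xyz[i][j] - cov_xyz_t[i] * cov_xyz_t[j] / var_t for j in range(3)]
        for i in range(3)
    ]
    weight = math.exp(-0.5 * dt * dt / var_t)
    return mean, cov, weight


# Assumed toy primitive that drifts along x as time advances.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mean, cov, weight = slice_at_time(
    [0.0, 0.0, 0.0], 0.5, identity, [0.1, 0.0, 0.0], 0.04, t=0.7
)
```

Because `t` is a continuous input, the same primitive can be sliced at arbitrarily fine time steps, which is one reading of the "infinite frame rate" claim.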


Martin Ziqiao Ma

21,811 views • 1 year ago

Vision-Language Models (VLMs) can describe the environment, but can they refer within it?

Our findings reveal a critical gap: VLMs fall short of pragmatic optimality. We identify 3 key failures of pragmatic competence in referring expression generation with VLMs: they (1) cannot uniquely refer to the referent, (2) include excessive or irrelevant information, and (3) misalign with human pragmatic preferences.

We introduce RefOI, a new dataset of 1.5k objects, each with 3 written and 2 spoken human-produced referring expressions. We also release RefOI-TLHF, a large dataset of token-level human feedback for 10.6k referring expressions.

👀
📄

Excited to co-lead this project with the amazing JaneDing, and huge thanks to the dream team Xuejun Zhang Dezhi Luo Diiikee @6SihanXu Yuchen Huang Roihn Run Peng 彭润 Michigan SLED Lab.


Martin Ziqiao Ma

20,636 views • 1 year ago
