💫It's fascinating that a single feed-forward pass through an LLM can replace a complex rendering pipeline, like Blender! Just feed it 3D shapes, xyz positions, and poses as tokens, and it spits out the image token-by-token. The dual, aka scene reconstruction, is also possible! 👇

44,553 views • 1 year ago • via X (Twitter)

Comments: 7

Georgia Gkioxari • 1 year ago

The dual: Image goes in as tokens, and 3D shapes, xyz positions and poses come out token-by-token.

Georgia Gkioxari • 1 year ago

Read more here:

Mike Roberts • 1 year ago

This is exciting!! Congrats Georgia and team 🥳 Silly naive question: How should I think about this work in relation to that recent RenderFormer paper from a few weeks ago?

Georgia Gkioxari • 1 year ago

RenderFormer is awesome! Very, very similar in spirit -- use a neural net to predict the rendered image of an input 3D asset. Many differences in algorithm and in scope:

* Our model is fully autoregressive and performs next-token prediction. The next token can be an image token, a shape token, or a text token. RenderFormer is vision-transformer style.
* We can do 3D-to-image (rendering), image-to-3D (reconstruction, recognition), and image + 3D-to-image + 3D (instruction-following), all within the same framework, enabled by the unified token-wise model. It's why we love tokens!
* For the rendering task, we emphasize compositionality (scenes composed of many objects) with control over the locations, poses and object types/shapes -- all specified in the input.

Our model in the end is dirt simple, just an LLM, but we found some things to be very critical: (1) how best to encode the numbers that specify 3D locations and poses, (2) how to discretize/tokenize 3D shapes, which are inherently continuous, and (3) how to fuse the modalities.
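As a rough illustration of point (1) in the comment above, here is a minimal Python sketch of one common way to turn continuous xyz coordinates into discrete tokens: uniform binning. This is an assumption made for illustration, not the scheme the authors actually use. The bin count, the coordinate range, the token strings, and the `SceneObject` type are all hypothetical, and the shape token ids stand in for the output of some shape tokenizer that is not shown.

```python
from dataclasses import dataclass

NUM_BINS = 256              # assumed resolution, not taken from the source
COORD_RANGE = (-1.0, 1.0)   # assumed normalized scene bounds


def quantize(value: float, lo: float = COORD_RANGE[0],
             hi: float = COORD_RANGE[1], bins: int = NUM_BINS) -> int:
    """Map a continuous value in [lo, hi] to a bin index in [0, bins - 1]."""
    clipped = min(max(value, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)  # the upper bound hi falls into the last bin


def dequantize(idx: int, lo: float = COORD_RANGE[0],
               hi: float = COORD_RANGE[1], bins: int = NUM_BINS) -> float:
    """Return the center of bin idx (the inverse direction, up to bin width)."""
    return lo + (idx + 0.5) * (hi - lo) / bins


@dataclass
class SceneObject:
    shape_tokens: list[int]                  # hypothetical shape-tokenizer ids
    position: tuple[float, float, float]     # xyz in COORD_RANGE


def encode_object(obj: SceneObject) -> list[str]:
    """Serialize one object as a flat token list an LLM could consume."""
    tokens = ["<obj>"]
    tokens += [f"<shape_{t}>" for t in obj.shape_tokens]
    tokens += [f"<{axis}_{quantize(v)}>" for axis, v in zip("xyz", obj.position)]
    tokens.append("</obj>")
    return tokens
```

Under these assumptions, a scene is just the concatenation of `encode_object` outputs, and pose angles could be binned the same way over their own range. Reading positions back out of predicted tokens goes through `dequantize`, which recovers each coordinate to within half a bin width.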

Bisheshwor Neupane • 1 year ago

Is it faster?

Krish Mehta • 1 year ago

Wow, feels like it could be applied to better part segmentation too?

honasu-san • 1 year ago

According to Apple, it must be an illusion.
