💫It's fascinating that a single feed-forward pass through an LLM can replace a complex rendering pipeline, like Blender! Just feed it 3D shapes, xyz positions, and poses as tokens, and it spits out the image token-by-token. The dual, aka scene reconstruction, is also possible! 👇
44,553 views • 1 year ago • via X (Twitter)
Comments: 7

The dual: Image goes in as tokens, and 3D shapes, xyz positions and poses come out token-by-token.

Read more here:

This is exciting!! Congrats Georgia and team 🥳 Silly naive question: How should I think about this work in relation to that recent RenderFormer paper from a few weeks ago?

RenderFormer is awesome! Very, very similar in spirit -- use a neural net to predict the rendered image of an input 3D asset. Many differences in algorithm and in scope:

* Our model is fully autoregressive and performs next-token prediction. This token can be an image token, shape token, or text token. RenderFormer is vision-transformer style.
* We can do 3D-to-image (rendering), image-to-3D (reconstruction, recognition), and image + 3D-to-image + 3D (instruction-following), all with the same framework, enabled by the unified token-wise model. It's why we love tokens!
* For the rendering task, we emphasize compositionality (scenes composed of many objects) with control over the locations, poses and object types/shapes -- all specified in the input.

Our model at the end is dirt simple, just an LLM, but we found some things to be very critical: (1) how to best encode numbers to specific 3D locations and poses, (2) how to discretize/tokenize 3D shapes, which are inherently continuous, and (3) how to fuse the modalities.
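
To make point (1) concrete, here is a minimal sketch of one common way to turn continuous xyz positions into discrete tokens: uniform binning. This is an illustration only, not the encoding the authors describe. The coordinate range, the bin count, the token format `<x_128>`, and all function names are assumptions made up for this example.

```python
def quantize(value, lo=-1.0, hi=1.0, bins=256):
    """Map a continuous value in [lo, hi] to an integer bin index (assumed scheme)."""
    clipped = min(max(value, lo), hi)            # clamp out-of-range values
    idx = int((clipped - lo) / (hi - lo) * bins)
    return min(idx, bins - 1)                    # value == hi falls in the last bin


def dequantize(idx, lo=-1.0, hi=1.0, bins=256):
    """Map a bin index back to the center of its bin."""
    width = (hi - lo) / bins
    return lo + (idx + 0.5) * width


def position_to_tokens(xyz, **kw):
    """Encode an (x, y, z) position as three hypothetical vocabulary tokens."""
    return [f"<{axis}_{quantize(v, **kw)}>" for axis, v in zip("xyz", xyz)]


def tokens_to_position(tokens, **kw):
    """Decode tokens produced by position_to_tokens back to approximate coordinates."""
    return tuple(dequantize(int(t.strip("<>").split("_")[1]), **kw) for t in tokens)


tokens = position_to_tokens((0.0, -1.0, 1.0))
approx = tokens_to_position(tokens)
```

The round trip is lossy: each decoded coordinate lands on a bin center, so the error is at most half a bin width. More bins mean finer placement but a larger vocabulary, which is one reason the choice of number encoding matters.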

Is it faster?

Wow, feels like it could be applied to better part segmentation too?

According to Apple it must be an illusion.
