Loading video...
Video Failed to Load
Idefics3-Llama is out! 💥 It's a multimodal model based on Llama 3.1 that accepts arbitrary number of interleaved images with text with a huge context window (10k tokens!) 😍 Link to demo and model in the next one 😏
28,014 views • 1 year ago •via X (Twitter)
10 Comments

Link to model: Try the demo right away: Use the model with @huggingface transformers 🤗

I will release fine-tuning scripts and quantized versions tomorrow, don't fret 😄

Wow merve, how cool is this 😍

@mervenoyann The model seems great! More of a general question: when finetuning, any efficient strategies you can recommend that will preserve the original capabilities of the model?

currently I'm trying to finetune but there's a small bug we're trying to fix 🥲 I feel like if you want to preserve original model a low rank adapter would work better than fully finetuning

Do you have teaining scripts for lora finetuning it?

I will release sometime tomorrow 😊 along with quantized checkpoints

Does it accept only one image per user input (which will be resized to 384x384)?

no I think you can provide multiple images, but provide as many image tokens explicitly @HugoLaurencon knows better in this demo this isn't the case though

wow this looks amazing for image captioning it gave a really good caption did you investigate which prompt for this task?

