first step was getting llama cpp to play nice with electron js so that we can run the model I fine tuned on the client, a couple errors but eventually got it wired up with node-llama-cpp bindings. this way the model + app can be shipped to the user... show more

anton

47,937 subscribers

16,499 views • 2 years ago •via X (Twitter)

Education Science & Technology

10 Comments

anton2 years ago

the nice thing about llama cpp is the user will be able to run inference on CPU or GPU (cuda + metal for mac) in case they have either
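
The CPU/GPU fallback described above can be sketched as a small selection function. This is only an illustration: `HostInfo`, `pickBackend`, and the rule that Apple-silicon Macs use Metal are assumptions made for the example, not part of llama.cpp or the node-llama-cpp bindings, which handle backend selection themselves.

```typescript
// Hypothetical helper: every name here is illustrative and not taken from
// llama.cpp or node-llama-cpp.
type Backend = "cuda" | "metal" | "cpu";

interface HostInfo {
  platform: "darwin" | "win32" | "linux"; // common values of Node's process.platform
  arch: "arm64" | "x64";                  // common values of Node's process.arch
  hasNvidiaGpu: boolean;                  // assumed to be detected elsewhere
}

// Prefer a GPU backend when one is plausibly available, otherwise use the CPU.
function pickBackend(host: HostInfo): Backend {
  // Assumption: Apple-silicon Macs get Metal.
  if (host.platform === "darwin" && host.arch === "arm64") return "metal";
  // Assumption: an NVIDIA GPU on Windows or Linux means CUDA is usable.
  if (host.platform !== "darwin" && host.hasNvidiaGpu) return "cuda";
  // Everything else runs inference on the CPU.
  return "cpu";
}
```

In a real app the result would only decide which build or runtime option to request; a machine with no supported GPU still gets a working CPU path.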

anton2 years ago

stack is electron-vite, react, llama-cpp using the node-llama-cpp bindings and model is still tbd but currently working with a fine tuned qwen2 500M

Stocko 👊🤖2 years ago

wow, that’s amazingly fast

Alloy🐍🍀2 years ago

Isn't this going to be a massive download or are you downloading the model within the client app and then working "offline"?

anton2 years ago

the app will ship without the model, which will be downloaded after you install it. how big is the app (w/o the model file)? it is 227mb (will work on bundle size later honestly)
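
A minimal sketch of the "install first, fetch the model afterwards" flow, assuming the app knows the expected model size in bytes. `planModelFetch` and `ModelPlan` are hypothetical names, and resuming a partial download is an assumption added for the example, not something the post describes.

```typescript
// Hypothetical first-launch check: decide whether the model file still
// needs to be downloaded before the app can run offline.
type ModelPlan =
  | { kind: "ready" }                        // file is complete, no network needed
  | { kind: "download"; fromByte: number };  // fetch starting at this byte offset

function planModelFetch(bytesOnDisk: number | null, expectedBytes: number): ModelPlan {
  // A complete file means the app can start inference immediately.
  if (bytesOnDisk === expectedBytes) return { kind: "ready" };
  // A shorter file is treated as an interrupted download and resumed.
  if (bytesOnDisk !== null && bytesOnDisk > 0 && bytesOnDisk < expectedBytes) {
    return { kind: "download", fromByte: bytesOnDisk };
  }
  // Missing, empty, or oversized files are downloaded from scratch.
  return { kind: "download", fromByte: 0 };
}
```

A size comparison is a weak integrity check; a checksum over the finished file would be a sturdier choice before loading the model.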

Yam Peleg2 years ago

very nice work! an integration like this done well has amazing potential

nigh8w0lf2 years ago

Looks nice! Llamafile but with a cleaner JS interface.

Caleb2 years ago

I’m really intrigued at using transformers js to do code autocomplete or something in the browser. Excited to follow along on this

el2 years ago

what machine is this on?

Ravi Chandra Veeramachaneni2 years ago

@abacaj Have you tried or considered swift for the purpose. Lately been seeing lots of apps coming out of the swift native and its cross platform bindings.

Related Videos

You can now try Llama 3.1 405B for free (link below)! This is the largest open-source model out there, and for the first time, an open model is competitive with closed models. This time around, Meta did something new: Llama 3.1 has a license that allows developers to use it to enhance other models. For the first time, you can distill Llama 3.1 405B's capabilities into a smaller, more practical model for your use case. First, here is the link where you can play with Llama 3.1 for free: The model is hosted in Tune Studio, an end-to-end platform for developing applications using Large Language Models. They are sponsoring this post. Take a look at the attached video. It will show you how you can fine-tune a simple model using Llama 3.1 without leaving the platform: 1. You can create an empty dataset 2. Use the playground to generate and record interactions with Llama 3.1 3. Modify the dataset directly using the playground 4. Export the data and fine-tune a smaller model Fast and easy! As long as you have a web browser, you can start experimenting with fine-tuning and Llama 3.1. That's all it takes!

Santiago

55,609 views • 1 year ago

I got Llama 3 running in my browser using only my GPU with my Wi-Fi switched OFF completely client-side WebGPU is a new feature in browsers where JS can use the GPU of the device and apparently you can run LLMs on it too, and it's fast! My end goal for 🧠 is a locally run client side 100% private conversation where nothing gets sent to, processed in or stored in the cloud This will be helpful because I prefer making it web based instead of coding for 4 different native platforms (iOS, Android, MacOS, Windows) Two challenges still: 1) the user has to download the LLM model first and the smallest model is still ~3GB, but they only have to download it once 2) not all devices have fast GPUs yet for LLMs, but it's starting to become a default in most devices e.g.s smartphones

@levelsio

261,352 views • 2 years ago

train YOLOv9 on your dataset tutorial - run inference with a pre-trained COCO model - fine-tune model on custom dataset - evaluate the trained model - run inference with a fine-tuned model blogpost: ↓ read more

SkalskiP

111,792 views • 2 years ago

I just added the new Llama 3.2 1B and 3B models to LitGPT, the open-source LLM library I help develop (focused on efficiency and code readability). LitGPT allows you to fine-tune and use these models on the cloud or a laptop. So, if you are looking for something to play with this weekend:

    # 1) Finetune the model
    litgpt finetune_lora meta-llama/Llama-3.2-1B \
      --data JSON \
      --data.json_path my_custom_dataset.json \
      --train.epochs 1 \
      --out_dir out/llama-3.2-finetuned \
      --precision bf16-true

    # 2) Chat with the model
    litgpt chat out/llama-3.2-finetuned/final

    # 3) Serve the model via an API endpoint
    litgpt serve out/llama-3.2-finetuned/final

Sebastian Raschka

65,529 views • 1 year ago

Our universe is a model with twenty or so carefully fine-tuned parameters that generate all the content inside. Using these parameters and a reduced-scale model you can simulate the history of the cosmos. With the full-scale model, you get to be part of the simulation.

Andrew Côté

90,707 views • 2 years ago

1/ From the very first day, we were promised that an #Ethereum node could run on any consumer device. In 2024, you can run on a $200 ARM64 board using ~10w: - 1 Archive node + a validator client - A Supernode (L1+L2 on the same device) This is called decentralization. +info👇

Ethereum on ARM (and RISC-V) 🦇🔊🐼👉👈🐼

89,523 views • 2 years ago

Wild. Kimi K2 Thinking just released and it's insane. It's an AI model that can run by itself for hours on end and make HUNDREDS of tool calls It's the 1st model I think that can replace humans In this video I show why it's so special and how to use it to build your first app

Alex Finn

70,004 views • 7 months ago

Fuck it! You can now run any GGUF on the Hugging Face Hub directly with ollama 🔥 This has been a constant ask from the community, starting today you can point to any of the 45,000 GGUF repos on the Hub* *Without any changes whatsoever! ⚡

All you need to do is:

    ollama run hf.co/{username}/{reponame}:latest

For example, to run the Llama 3.2 1B, you can run:

    ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:latest

If you want to run a specific quant, all you need to do is specify the Quant type:

    ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q8_0

That's it! We'll work closely with Ollama to continue developing this further! ⚡

Vaibhav (VB) Srivastav

317,372 views • 1 year ago

In LiveView, we keep the state in one place - the server - and it simplifies a lot. But sometimes the state on the client is preferred, or even necessary, and we can't use LiveView for that part. Unless we have a LiveView that runs on the client 👀 Using Popcorn, we managed to run LiveView fully in the browser – here's a POC 🎉 The approach is to have a fully client-side LiveView rendered by a server-side one, and they can communicate with each other. Try it out:

Elixir by Software Mansion

13,509 views • 6 months ago

ReZero: A small model that learns to search - it never gives up 🔥 ReZero trains with synthetic search engines that force the model to retry, refine, and persist until it finds a better answer (never give up 💪). It's built on Meta's Llama 3.2B. Instead of optimizing for recall or speed, we train the model to retry when it's wrong - using reinforcement learning to build persistence into the search process. - Model: - Code: Thanks to AI at Meta for the Llama 3.2B base, Unsloth AI for AutoDidact (the framework we built on), and Colin Kealty for quantizing the model!

Menlo Research

13,948 views • 1 year ago

If you have a mac, it’s super easy to run any LLM on it now. Basically your own private jail broken chatgpt. Runs offline. Literally a 2 step process. 1) go to and download the app 2) open terminal -> run the model 3) there's no step 3

sphinx

49,162 views • 2 years ago

This nice older couple that runs a YouTube channel called “Bronco Mustang Lifestyle” celebrated their 45th anniversary by buying their first EV: a 2026 Tesla Model Y RWD with FSD (Supervised). "We thought this would be a cool car that we can drive around and let it drive us. We are newbys to the EV world and are watching lots of videos to get us up to speed. In a way, it’s like getting a new computer or phone. I was surprised at how big and roomy the Model Y is." The full video of them picking up the car is linked below:

Sawyer Merritt

237,814 views • 4 months ago

Apple built a large foundation model and fine-tuned it on multiple tasks. But they are doing something very clever: They load a single model in memory and use different adapters to specialize the model on the fly. I recorded a video to show you how to write the code to do the same thing Apple is doing. I explain everything step by step. Here is what I'll show you in the video: 1. We'll load two datasets 2. Then load a large model 3. Then, we'll fine-tune the model on both datasets I'll use LoRA to fine-tune the model. This process creates two small adapters, each specializing in solving one of the datasets. The base model's original parameters will remain unchanged. From here: 4. We'll generate a list of tasks 5. We'll load the correct adapter to solve each task The large model I'm using needs 346 MB of memory, but I only need to load it once. Each adapter is only 2.7 MB. I only need to load the base model once and pair it with any of the fine-tuned adapters. Minimum memory footprint and I can solve multiple tasks. Hope this helps!

Santiago

84,747 views • 1 year ago

First time i show a model like this Tenna 3D Model made for Jakeneutron 🔜 MAGfest2026 i had so much fun doing this model and i am super happy with the results, it was rushed but i did my best to work just fine for the animators!

Lukasz 🇧🇷

205,009 views • 8 months ago

"Somebody got one of the small versions of Llama to run on Windows 98...We could've been talking to our computers in English for the last 30 years" - Marc Andreessen 🇺🇸 It was me! I got Llama running on a Pentium II machine with 128MB RAM running Windows 98. Details below.

Alex Cheema

765,804 views • 1 year ago

My Latest AI short was made with Seedance 2.0 the model that has been firing up the internet over the past couple of days. I made this in two and half days and it's an original short that is based on HP Lovecraft's work which is now Public Domain (No Copyright infringement here!). If you are not into period romances then bare with it as it gets a more little action packed in the second half! The model is brilliant at both performance and action, you can now make convincing dialogue and action sequences for AI cinema. With this model we have now reached the stage where we can tell any story we want with AI, the tools are capable. This is only 720p at the moment but I suspect it will be upgraded to 1080p very soon. We stand at the threshold of having broadcast quality tools capable of creating anything we can imagine available to everyone, so lets do new an interesting things with it. Thank you to CapCut for whitelisting me and to Dreamina AI AI for giving me access as part of their CPP. Thanks for watching! 👀

Uncanny Harry AI

76,694 views • 4 months ago

Eagles WR A.J. Brown: "Me personally, I truly believe we've got so many good players on this team and at times you can feel like we're being conservative and I don't think it should be like that...Let your killers do their thing and play fast and play aggressive. Not saying that we haven't been, but me personally that's what I would like. Obviously, we're going to run the ball and we're going to set up the run off the pass and the pass off the run, but we have a lot of good players and we should just let them go."

SPORTSRADIO 94WIP

909,010 views • 9 months ago

You can now fine-tune Llama 3 without writing a single line of code! We are moving at breakneck speed. I recorded a video to show you how to fine-tune any open-source model in a few minutes. I'm using a GPT capable of taking a problem and turning it into a fine-tuned model that will solve it. You don't have to write any code. You only need to explain to a GPT what problem you want to solve and tell it you want to use Llama 3. For example, "fine-tune Llama 3" or "deploy zephyr." It feels magic. The system will recommend a dataset and fine-tune the model for you. I'm using Monster API, a platform that specializes in making fine-tuning and deploying open-source models easy and fast. Their stack is well-optimized to maximize fine-tuning efficiency using techniques like Q-Lora and vLLM. They are behind the GPT. Here is what you need to do: 1. Create an account at 2. Load the GPT with the link below This is as simple as it gets. When you are done, you can click a button to deploy the model and start using it. I have 10,000 free credits for anyone using the code "SANTIAGO" in the dashboard. You can use these credits to access, fine-tune, and deploy these open-source models. You can also keep up with their latest updates, and get free credits and special offers on their Discord server:

Santiago

324,578 views • 2 years ago