Fully local Code Assistant running on an NVIDIA GPU! In this tutorial, I'll show you how to run Llama3 using TensorRT and NVIDIA's Triton Inference Server to use it as a Code Assistant in VSCode. In this thread 🧵, I'll walk you through the integration process, explaining each step simply...

Daniel San

32,935 subscribers

42,154 views • 2 years ago • via X (Twitter)

11 Comments

Daniel San • 2 years ago

To get started, we need an @nvidia GPU 🤩 In this case, we will use the following hardware 💻

Daniel San • 2 years ago

We need to have Docker and CUDA installed. Follow the guides below to install both tools. Docker: CUDA: Then run the following commands to confirm everything is set up correctly.
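
The commands from the original post did not survive on this page. As a stand-in, here is a minimal Python sketch that only checks whether the expected executables are on the PATH. It assumes `docker` and `nvidia-smi` are the tools you would look for after installing Docker and the NVIDIA driver; the `check_tools` helper is hypothetical and not part of the thread.

```python
import shutil


def check_tools(tools=("docker", "nvidia-smi")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    # Tool names are an assumption; the thread's own commands are not shown here.
    return {name: shutil.which(name) for name in tools}


if __name__ == "__main__":
    for name, path in check_tools().items():
        # A missing path means the tool is not installed or not on PATH.
        print(f"{name}: {path if path else 'NOT FOUND'}")
```

A `None` entry only tells you the executable was not found; it does not verify that the GPU is visible from inside a container.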

Daniel San • 2 years ago

Download the llama3-8B model from @huggingface

Daniel San • 2 years ago

Now run TensorRT to compile the model using the Docker container. Clone the TensorRT repository and move the model folder.

Daniel San • 2 years ago

You should now be able to test the compiled model

Daniel San • 2 years ago

Perfect! We have the model now. Let's deploy it on Triton Inference Server.

Daniel San • 2 years ago

The server is up and ready to connect with CodeGPT via the custom connection. Open CodeGPT in VSCode, select Custom as the provider, and enter "ensemble" for the model.
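
To check the server outside of VSCode, a request can be sent to it directly. This is a minimal sketch under several assumptions that the thread does not confirm: Triton's HTTP endpoint is on `localhost` at its default port 8000, the generate extension is available, and the `ensemble` model accepts `text_input` and `max_tokens` fields as in the TensorRT-LLM backend examples. `build_generate_request` and `generate` are hypothetical helpers.

```python
import json
import urllib.request


def build_generate_request(prompt, model="ensemble",
                           host="http://localhost:8000", max_tokens=64):
    """Build a POST request for Triton's generate endpoint (host/port assumed)."""
    url = f"{host}/v2/models/{model}/generate"
    # Field names are an assumption based on the TensorRT-LLM ensemble model.
    payload = {"text_input": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the parsed JSON response."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `generate("def fibonacci(n):")` against a running server should return the parsed JSON reply. The name of the field that carries the generated text depends on your model configuration, so inspect the response before relying on it.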

Daniel San • 2 years ago

That's all! I'm sharing the link to the full article with all the details of the tutorial.

Alexander Mia • 1 year ago

INTRODUCING: Agentic Security - LLM Security Scanner! 🔍 🔑 Features: Scans for prompt injections, jailbreaking & more. Provides detailed reports & options to customize attack rules. 🔗 Access the GitHub link ↓

₣rancisco Trillo • 2 years ago

Or just use Continue and Ollama with whatever brand GPU 🤷‍♂️ that’s open source

Daniel San • 2 years ago

You can also use CodeGPT with Ollama. Check this link:
