$TETSUO update: Set up a Flask embeddings service using a local model for the vector DBs that the reinforcement learning component uses to measure responses against its previous storyline 🚀

25,536 views • 1 year ago • via X (Twitter)
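The scoring step described in the post can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the Flask HTTP layer and the real local model are left out, `embed` is a toy hashed bag-of-words stand-in for a local embedding model, and `StorylineStore`, `DIM`, and `score` are hypothetical names. It only shows how a new response might be measured against stored storyline entries using cosine similarity.

```python
import hashlib
import math

DIM = 64  # hypothetical embedding size, chosen only for this sketch


def embed(text: str) -> list[float]:
    """Toy stand-in for a local embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; inputs are already unit length (or all zeros)."""
    return sum(x * y for x, y in zip(a, b))


class StorylineStore:
    """In-memory stand-in for the vector DB holding previous storyline entries."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def score(self, response: str) -> float:
        """Highest similarity between the response and any stored entry (0.0 if empty)."""
        if not self._items:
            return 0.0
        query = embed(response)
        return max(cosine(query, vec) for _, vec in self._items)
```

In a setup like the one the post describes, a value such as `store.score(response)` could serve as one signal for the reinforcement learning component; how the real system combines it with other signals is not stated in the post.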

9 comments

tetsuo.ai • 1 year ago

LMFAO 😭

hitaro • 1 year ago

What a wonderful dev, TITSUUUUOO! 🇨🇦🇺🇸

tetsuo.ai • 1 year ago

TITSUUUUOO!!!

Paschamo • 1 year ago

Need also AI Agent for my physical Art studio :) lazy Artist here 😂🎨

tetsuo.ai • 1 year ago

lol, lazy dev here. nice to meet you.

makintosh • 1 year ago

$TETSUO ready to make history

Jia Zhen • 1 year ago

TITS UP TETSUOOOOOOO

tetsuo.ai • 1 year ago

😂

RPS_Crypto • 1 year ago

$TETSUO 💎

Similar videos

New Course: Reinforcement Fine-Tuning LLMs with GRPO!

Learn to use reinforcement learning to improve your LLM performance in this short course, built in collaboration with Predibase by Rubrik, and taught by Travis Addair, its Co-Founder and CTO, and Arnav Garg, its Senior Engineer and Machine Learning Lead.

Reasoning models have been one of the most important developments in LLMs. Reinforcement Fine-Tuning (RFT) uses rewards to encourage LLMs to find solutions to multi-step reasoning tasks such as solving math problems and debugging code - without needing pre-existing training examples like in traditional supervised fine-tuning.

Group Relative Policy Optimization (GRPO) is a reinforcement fine-tuning algorithm gaining rapid adoption. Developed by the DeepSeek team and used to train the R1 reasoning model, GRPO uses reward functions that you can write in Python to assign rewards to model responses. It’s beneficial for tasks with verifiable outcomes and can work well even with fewer than 100 training examples. It can also significantly improve the reasoning ability of smaller LLMs, making applications faster and more cost effective.

In this course, you’ll take a technical deep dive into RFT with GRPO. You’ll learn to build reward functions that you can use in the GRPO training process to guide an LLM toward better performance on multi-step reasoning tasks. In detail, you’ll:

- Learn when reinforcement fine-tuning is a better fit than supervised fine-tuning, especially for tasks involving multi-step reasoning or limited labeled data.
- Understand how GRPO uses programmable reward functions as a more scalable alternative to the human feedback required for other reinforcement learning algorithms, such as RLHF and DPO.
- Frame the Wordle game as a reinforcement fine-tuning problem and see how an LLM can learn to plan, analyze feedback, and improve its strategy over time.
- Design reward functions that power the reinforcement fine-tuning process.
- Learn techniques for evaluating more subjective tasks, such as rating the quality of a text summary, using an LLM as a judge.
- Understand why reward hacking happens and how to avoid it by adding penalty functions to discourage undesirable behaviors.
- Learn the four key components of the loss calculation in the GRPO algorithm: token probability distribution ratios, advantages, clipping, and KL-divergence.
- Launch reinforcement fine-tuning jobs using Predibase’s hosted training services.

By the end of this course, you’ll be able to build and fine-tune LLMs using reinforcement learning to improve reasoning without relying on large labeled datasets or subjective human feedback.

Please sign up here:
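The four loss components named in the course description (probability ratios, advantages, clipping, and KL-divergence) can be sketched for a single token as below. This is a simplified illustration under stated assumptions, not the course's or Predibase's code: all function names are hypothetical, the defaults `clip_eps=0.2` and `beta=0.04` are placeholder values, the advantage normalisation uses the population standard deviation, and the KL term uses the `exp(d) - d - 1` estimator as I understand it from the GRPO formulation.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Advantage of each sampled response relative to its own group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(logp_new: float, logp_old: float, advantage: float,
                 clip_eps: float = 0.2) -> float:
    """PPO-style clipped objective for one token."""
    ratio = math.exp(logp_new - logp_old)  # token probability ratio
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


def kl_estimate(logp_new: float, logp_ref: float) -> float:
    """Non-negative per-token estimate of KL(new || reference)."""
    d = logp_ref - logp_new
    return math.exp(d) - d - 1.0


def grpo_token_loss(logp_new: float, logp_old: float, logp_ref: float,
                    advantage: float, clip_eps: float = 0.2,
                    beta: float = 0.04) -> float:
    """Loss to minimise: negative clipped objective plus a KL penalty."""
    objective = clipped_term(logp_new, logp_old, advantage, clip_eps)
    return -(objective - beta * kl_estimate(logp_new, logp_ref))
```

For example, `group_advantages([0.0, 1.0])` gives values very close to `-1` and `+1`: the response scoring above the group mean is encouraged and the one below is discouraged, without needing a separate value model.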

Andrew Ng

86,457 views • 1 year ago

[RLHF] by Hand ✍️

Yesterday, Jan Leike announced he is joining #Anthropic to lead their "super-alignment" mission. He is the co-inventor of Reinforcement Learning with Human Feedback (#RLHF).

How does RLHF work?

[1] Given
↳ Reward Model (RM)
↳ Large Language Model (LLM)
↳ Two (Prompt, Next) Pairs

🟪 TRAIN RM
Goal: Learn to give higher rewards to winners

[2] Preferences
↳ A human reviews the two pairs and picks a "winner"
↳ (doc is, him) Embeddings
↳ This prompt has never received human feedback directly
↳ [S] is the special start symbol

[11] Transformer
↳ Attention (yellow)
↳ Feed Forward (4x2 weight and bias matrix)
↳ Output: 3 "transformed" feature vectors, one per position
↳ More details in my previous post 8. Transformer

[12] Output Probabilities
↳ Apply a linear layer to map each transformed feature vector to a probability distribution over the vocabulary.

[13] Sample
↳ Apply the greedy method, which is to pick the word with the highest score
↳ For outputs 1 and 2, the model accurately predicts the next word
↳ For the 3rd output position, the model predicts "him"

[14] Reward Model
↳ The new pair (CEO is, him) is fed to the reward model
↳ The process is the same as [3]-[6]
↳ Output: Reward = 3

[15] Loss Gradient
↳ We set the loss as the negative of the reward.
↳ The loss gradient is simply a constant -1.
↳ Run backpropagation and gradient descent to update the LLM's weights and biases (red border)
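Steps [12], [13], and [15] can be sketched in a few lines. This is only an illustration: the vocabulary, weights, and feature vector are made-up numbers, not the ones from the post's diagram, and the function names are hypothetical. The reward value 3 is the one stated in step [14].

```python
import math


def linear(features: list[float], weights: list[list[float]],
           bias: list[float]) -> list[float]:
    """Map one feature vector to one score per vocabulary word."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(scores: list[float]) -> list[float]:
    """Turn scores into a probability distribution over the vocabulary."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(probs: list[float], vocab: list[str]) -> str:
    """Greedy method: choose the word with the highest probability."""
    return vocab[max(range(len(probs)), key=probs.__getitem__)]


def loss_from_reward(reward: float) -> float:
    """Loss is the negative reward, so d(loss)/d(reward) is the constant -1."""
    return -reward


# Hypothetical numbers for the 3rd output position.
vocab = ["CEO", "is", "him", "them"]
weights = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [2.0, 0.0]]
bias = [0.0, 0.0, 0.0, 0.0]
probs = softmax(linear([1.0, 0.0], weights, bias))
next_word = greedy_pick(probs, vocab)
loss = loss_from_reward(3.0)
```

Because the loss is just the negated reward, gradient descent on it pushes the LLM toward outputs the reward model scores more highly.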

Tom Yeh

79,758 views • 2 years ago