#anthropic

ProfTomYeh's profile picture

[RLHF] by Hand ✍️ Yesterday, Jan Leike (Jan Leike) announced he is joining #Anthropic to lead their "super-alignment" mission. He is the co-inventor of Reinforcement Learning with Human Feedback (#RLHF). How does RLHF work? [1] Given ↳ Reward Model (RM) ↳ Large Language Model (LLM) ↳ Two (Prompt, Next) Pairs 🟪 TRAIN RM Goal: Learn to give higher rewards to winners [2] Preferences ↳ A human reviews the two pairs and picks a "winner" ↳ (doc is, him) Embeddings ↳ This prompt has never received human feedback directly ↳ [S] is the special start symbol [11] Transformer ↳ Attention (yellow) ↳ Feed Forward (4x2 weight and bias matrix) ↳ Output: 3 "transformed" feature vector, one per position ↳ More details in my previous post 8. Transformer [] [12] Output Probabilities ↳ Apply a linear layer to map each transformed feature vector to a probability distribution over the vocabulary. [13] Sample ↳ Apply the greedy method, which is to pick the word with the highest score ↳ For output 1 and 2, the model accurately predicts the next word ↳ For 3rd output position, the model's predicts "him" [14] Reward Model ↳ The new pair (CEO is, him) is fed to the reward model ↳ The process is same as [3]-[6] ↳ Output: Reward = 3 [15] Loss Gradient ↳ We set the loss as the negative of the reward. ↳ The loss gradient is simply a constant -1. ↳ Run backpropagation and gradient descent to update LLM's weights and biases (red border)

Tom Yeh

79,758 Aufrufe • vor 2 Jahren

Keine weiteren Inhalte verfügbar