An Interactive Introduction To Gradient Descent (How LLMs Learn)

fun

Motivating Gradient Descent by hand.

Author

Shon Czinner

Published

July 4, 2026

At the root of Large Language Models, deep learning, and neural networks is “Gradient Descent”. These models are trained by minimizing the error between their predictions and known outputs. Computers do this using derivatives from calculus, but in this post I just want to illustrate the idea by hand. The goal is to build some intuition about what is “descending”.

To start, let’s just minimize a function. Use the slider to move the point to the lowest value of the given function. The plot beneath shows the value of the function as you move the slider.

Move slider to minimize f(x):
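For comparison, here is a minimal sketch of what a computer does in place of the slider. The function `f` is a hypothetical bowl-shaped example (the function behind the widget is not given here), and the learning rate and step count are arbitrary choices. At each step the code estimates the slope at the current point and moves a little in the downhill direction.

```python
def f(x):
    # Hypothetical bowl-shaped function; the post's actual f(x) is not given.
    return (x - 2.0) ** 2 + 1.0


def slope(x, h=1e-6):
    # Central finite difference: a numerical estimate of the derivative of f at x.
    return (f(x + h) - f(x - h)) / (2 * h)


def descend(x, learning_rate=0.1, steps=100):
    # Repeatedly step against the slope, i.e. downhill.
    for _ in range(steps):
        x -= learning_rate * slope(x)
    return x


x_min = descend(x=-3.0)  # start far from the bottom of the bowl
print(round(x_min, 3), round(f(x_min), 3))
```

The starting point plays the role of the slider's initial position; each loop iteration is one small nudge of the slider.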

Next we can illustrate fitting a line to points (in this case, estimating home price from home size). Importantly, we need to quantify how good a line is. In our case, we’ll add up the vertical distances between each point and the line, indicated by the vertical bars.

Now we’re finding the best model by minimizing a function which represents the error between a model’s predictions and known outputs.

Move slider to minimize the sum of absolute errors:

slope (m):

intercept (b):
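The quantity being minimized can be sketched in a few lines. The data below is made up for illustration (it is not the data behind the plot), and the coarse grid of slider positions is just one simple way to try many settings; it is not gradient descent yet.

```python
# Hypothetical data: home size (1000s of sq ft) and price ($1000s).
sizes = [1.0, 1.5, 2.0, 2.5, 3.0]
prices = [200.0, 270.0, 310.0, 390.0, 440.0]


def total_abs_error(m, b):
    # Sum of vertical distances between each point and the line y = m*x + b.
    return sum(abs(y - (m * x + b)) for x, y in zip(sizes, prices))


# Two slider settings: the first line hugs the points more closely.
print(total_abs_error(120, 80))   # 30.0
print(total_abs_error(100, 100))  # 110.0

# Try every slider position on a coarse grid and keep the best one.
candidates = [(m, b) for m in range(0, 201, 5) for b in range(0, 201, 5)]
best_m, best_b = min(candidates, key=lambda mb: total_abs_error(*mb))
print(best_m, best_b, total_abs_error(best_m, best_b))
```

Trying every position works for two sliders, but the number of combinations explodes as sliders are added, which is why a smarter search is needed.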

Instead of a home price, LLMs predict a probability for every possible next token (tokens are words or, more often, fragments of words and short runs of characters). For example, given the context “I pet the…”, an LLM has learned to output a higher value for “cat” than for “plane”, indicating a higher probability of “cat” and a lower probability of “plane”. The model then selects the token with the highest value, in this case “cat”, and then reruns with “I pet the cat”.

In practice, the chosen token is not always the one with the highest value, especially when several tokens have similar values. For example, “cat” might have a value of 1.0 and “dog” might have 0.98. While “cat” is more likely, sometimes “dog” is chosen instead.
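One common way to get from values to a choice is to turn the values into probabilities with a softmax and then sample from them. The sketch below assumes that approach; the “cat” and “dog” values come from the example above, while the “plane” value and the three-token vocabulary are made up for illustration.

```python
import math
import random


def softmax(scores):
    # Turn raw values into probabilities that sum to 1.
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - top) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# "cat" and "dog" values are from the text; "plane" is a hypothetical low value.
scores = {"cat": 1.0, "dog": 0.98, "plane": -2.0}
probs = softmax(scores)

# Always taking the highest value picks "cat" every time.
greedy = max(probs, key=probs.get)

# Sampling picks "cat" most often, but "dog" nearly as often.
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(list(probs), weights=list(probs.values()), k=1000)
print(greedy, samples.count("cat"), samples.count("dog"), samples.count("plane"))
```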

When the context is “I flew the…”, the LLM with the same weights needs to also learn to output a higher value for “plane” than for “cat”.

For a taste, try moving these four sliders to give “cat” a value of 1.0 and “plane” a value of 0 for context “I pet the…” and to give “plane” a value of 1.0 and “cat” a value of 0 for context “I flew the…”. Once again, you can see the error update as you move the sliders.

Move slider to minimize total dataset error:

w₁₁ (Feature 1 → Cat):

w₁₂ (Feature 2 → Cat):

w₂₁ (Feature 1 → Plane):

w₂₂ (Feature 2 → Plane):

Training Input (Prompt)	Target Output	Model Prediction	Error
“I pet the…”	Cat (1.0), Plane (0.0)	Cat: , Plane:	0.0000
“I flew the…”	Cat (0.0), Plane (1.0)	Cat: , Plane:	0.0000
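To make the table concrete, here is a sketch of how such a tiny model could compute its predictions and error. Everything specific is an assumption: the feature values for each prompt are made up (the post does not say what Feature 1 and Feature 2 are), the model is taken to be a plain weighted sum, and the error is taken to be the squared difference from the target.

```python
# Hypothetical (Feature 1, Feature 2) values for each prompt.
FEATURES = {
    "pet": (1.0, 0.5),   # "I pet the..."
    "flew": (0.5, 1.0),  # "I flew the..."
}
# Target values as (cat, plane), taken from the table.
TARGETS = {
    "pet": (1.0, 0.0),
    "flew": (0.0, 1.0),
}


def predict(w, features):
    # Assumed model: each output is a weighted sum of the two features.
    w11, w12, w21, w22 = w
    f1, f2 = features
    cat = w11 * f1 + w12 * f2
    plane = w21 * f1 + w22 * f2
    return cat, plane


def total_error(w):
    # Assumed error: squared difference from the target, summed over both prompts.
    error = 0.0
    for prompt, features in FEATURES.items():
        cat, plane = predict(w, features)
        t_cat, t_plane = TARGETS[prompt]
        error += (cat - t_cat) ** 2 + (plane - t_plane) ** 2
    return error


print(total_error((0.0, 0.0, 0.0, 0.0)))  # 2.0
print(total_error((1.0, 0.0, 0.0, 1.0)))  # 0.5
```

Because both prompts share the same four weights, a setting that looks right for one prompt can still leave error on the other, which is what makes the sliders tricky to tune by hand.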

Instead of a few sliders, LLMs have billions or trillions of weights and are trained on billions of samples of text from books, Wikipedia, social media sites, and code repositories.

For a given prompt and conversation history, the model uses these weights to calculate a value for tens of thousands of different tokens, choosing the next one based on which values are highest.

That is where Gradient Descent comes in. Humans can’t manually tune a trillion sliders to get those token values right. But using calculus, the computer can calculate which direction to nudge every slider at once to make the error a tiny bit smaller. That set of directions is the gradient, and stepping against it is the “descent”. Repeat that process many, many times, and you get an AI that can write essays, code software, and chat with you much like a human.
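As a closing sketch, here is gradient descent on a four-slider toy model. The setup is assumed for illustration: the feature values are made up, the model is a plain weighted sum, and the error is squared difference from the targets. The derivative of that error with respect to each weight is written out by hand, and all four weights are nudged together on every step.

```python
# Each sample: (hypothetical features, targets for (cat, plane)).
SAMPLES = [
    ((1.0, 0.5), (1.0, 0.0)),  # "I pet the..."
    ((0.5, 1.0), (0.0, 1.0)),  # "I flew the..."
]


def error_and_gradient(w):
    # w is [w11, w12, w21, w22]. Returns the total squared error and its
    # derivative with respect to each of the four weights.
    error = 0.0
    grad = [0.0, 0.0, 0.0, 0.0]
    for (f1, f2), (t_cat, t_plane) in SAMPLES:
        d_cat = (w[0] * f1 + w[1] * f2) - t_cat
        d_plane = (w[2] * f1 + w[3] * f2) - t_plane
        error += d_cat ** 2 + d_plane ** 2
        grad[0] += 2 * d_cat * f1
        grad[1] += 2 * d_cat * f2
        grad[2] += 2 * d_plane * f1
        grad[3] += 2 * d_plane * f2
    return error, grad


def train(w, learning_rate=0.1, steps=500):
    for _ in range(steps):
        _, grad = error_and_gradient(w)
        # Nudge every weight at once, each against its own derivative.
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad)]
    return w


start_error, _ = error_and_gradient([0.0, 0.0, 0.0, 0.0])
weights = train([0.0, 0.0, 0.0, 0.0])
final_error, _ = error_and_gradient(weights)
print([round(wi, 3) for wi in weights], start_error, final_error)
```

Real training differs in scale and detail (the derivatives are computed automatically, and each step typically uses only a small batch of samples), but the loop has the same shape: measure the error, compute the gradient, nudge the weights, repeat.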