Filed April 22, 2026 · 1 min read · By The Editor

Gradient Descent, By Hand

Working a single step of gradient descent on paper, with the dignity it deserves.

optimization · intuition

Gradient descent is the workhorse of modern machine learning, but the visualization that gets shown — a marble rolling down a smooth bowl — does it a disservice. The marble is doing physics. Gradient descent is doing arithmetic. Let us do the arithmetic, once, by hand.

The setup

Suppose we have a single data point, $x = 2$, $y = 5$, and a model $\hat{y} = w \cdot x$. We start with $w = 1$. The squared loss is:

$$L(w) = (y - w \cdot x)^2 = (5 - 2w)^2$$

At $w = 1$, the loss is $(5 - 2)^2 = 9$.
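The setup is small enough to check in a few lines of Python. This is only a sketch: the names `X`, `Y`, `W_START`, and `loss` are labels chosen here for illustration and do not come from any library.

```python
# The single data point and starting weight from the setup above.
X, Y = 2.0, 5.0
W_START = 1.0


def loss(w, x=X, y=Y):
    """Squared loss (y - w*x)^2 for the one-parameter model y_hat = w*x."""
    return (y - w * x) ** 2


print(loss(W_START))  # 9.0
```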

One step

By the chain rule, the derivative is $\frac{dL}{dw} = -2x(y - wx) = -4 \cdot (5 - 2w)$. At $w = 1$, that's $-4 \cdot 3 = -12$. With learning rate $\eta = 0.1$:

$$w_{\text{new}} = w - \eta \cdot \frac{dL}{dw} = 1 - 0.1 \cdot (-12) = 2.2$$

The new loss: $(5 - 2 \cdot 2.2)^2 = 0.6^2 = 0.36$. We dropped from 9 to 0.36 in one step. Not because the math is magic, but because the gradient at that point is steep, the optimum at $w = 2.5$ is close, and a learning rate of 0.1 happens to suit the curvature of this particular loss.
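The same step can be replayed in code, which also gives a cheap way to check the derivative against a finite difference. This is a minimal sketch under the essay's setup; the helper names `grad` and `step` and the step size `h` are assumptions made for illustration.

```python
import math

X, Y = 2.0, 5.0


def loss(w):
    """Squared loss for the single data point (X, Y)."""
    return (Y - w * X) ** 2


def grad(w):
    """Analytic derivative dL/dw = -2x(y - wx)."""
    return -2.0 * X * (Y - w * X)


def step(w, lr):
    """One gradient descent update: w - lr * dL/dw."""
    return w - lr * grad(w)


w = 1.0
eta = 0.1

# Sanity-check the analytic derivative with a central finite difference.
h = 1e-6
numeric = (loss(w + h) - loss(w - h)) / (2 * h)
assert math.isclose(numeric, grad(w), rel_tol=1e-6)

w_new = step(w, eta)
print(f"gradient at w=1: {grad(w):.2f}")      # -12.00
print(f"updated weight:  {w_new:.2f}")        # 2.20
print(f"new loss:        {loss(w_new):.2f}")  # 0.36
```

The values are printed to two decimals because floating-point arithmetic will not reproduce 2.2 and 0.36 to the last bit.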

What this teaches

The gradient knows the direction; the learning rate is a guess.
Linear models with squared loss are a best case. The gradient is exact, the loss is convex, and with a sensible learning rate one step gets you most of the way there.
Real models are non-convex. The intuition is the same; the patience is different.
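To see why the learning rate is a guess, one can take the same single step from $w = 1$ with several different values of $\eta$. The rates below are arbitrary picks for illustration, not recommendations, and `step` is a hypothetical helper name.

```python
X, Y = 2.0, 5.0


def loss(w):
    """Squared loss for the single data point (X, Y)."""
    return (Y - w * X) ** 2


def step(w, lr):
    """One update using dL/dw = -2x(y - wx)."""
    return w - lr * (-2.0 * X * (Y - w * X))


w0 = 1.0
for lr in (0.01, 0.1, 0.125, 0.25, 0.3):
    w1 = step(w0, lr)
    print(f"lr={lr:<5}  w: {w0:.2f} -> {w1:.3f}   "
          f"loss: {loss(w0):.2f} -> {loss(w1):.4f}")
```

On this loss the second derivative is a constant 8, so $\eta = 1/8 = 0.125$ lands exactly on the optimum $w = 2.5$. At $\eta = 0.25$ the step overshoots to $w = 4$ and the loss is back at 9, and anything larger makes the loss worse than where it started. The gradient pointed the right way every time; only the guess changed.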

That's the whole idea. Everything else — momentum, Adam, schedules — is a refinement of these three lines.
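Repeating the hand-worked step is all that plain gradient descent does. As a sketch, assuming the same data point and learning rate as above (the function name `gradient_descent` and the choice of five steps are made up for this example):

```python
X, Y = 2.0, 5.0


def loss(w):
    """Squared loss for the single data point (X, Y)."""
    return (Y - w * X) ** 2


def gradient_descent(w, lr, n_steps):
    """Repeat the single update n_steps times and record (w, loss) pairs."""
    history = [(w, loss(w))]
    for _ in range(n_steps):
        grad = -2.0 * X * (Y - w * X)   # compute the gradient
        w = w - lr * grad               # take the step
        history.append((w, loss(w)))    # re-evaluate the loss
    return history


for k, (w, current_loss) in enumerate(gradient_descent(w=1.0, lr=0.1, n_steps=5)):
    print(f"step {k}: w = {w:.5f}, loss = {current_loss:.6f}")
```

With $\eta = 0.1$ the distance to the optimum $w = 2.5$ shrinks by a factor of $1 - 8\eta = 0.2$ at every step, so the first step does most of the work and the rest is cleanup.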
