Filed April 22, 2026 · 1 min read · By The Editor

Gradient Descent, By Hand

Working a single step of gradient descent on paper, with the dignity it deserves.

Tags: optimization, intuition

Gradient descent is the workhorse of modern machine learning, but the usual visualization, a marble rolling down a smooth bowl, does it a disservice. The marble is doing physics. Gradient descent is doing arithmetic. Let us do the arithmetic, once, by hand.

The setup

Suppose we have a single data point, $x = 2$, $y = 5$, and a model $\hat{y} = w \cdot x$. We start with $w = 1$. The squared loss is:

$$L(w) = (y - w \cdot x)^2 = (5 - 2w)^2$$

At $w = 1$, the loss is $(5 - 2)^2 = 9$.
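For readers who prefer to check the arithmetic by machine, here is a minimal Python sketch of the setup. The names `X`, `Y`, and `loss` are illustrative choices, not part of the essay, and the probe weights are arbitrary. It also shows where the loss bottoms out: at $w = 2.5$, since $5 - 2 \cdot 2.5 = 0$.

```python
# The essay's single data point (x = 2, y = 5).
X, Y = 2.0, 5.0


def loss(w: float) -> float:
    """Squared error of the model y_hat = w * x on the single data point."""
    return (Y - w * X) ** 2


# Evaluate the loss at a few weights; the probe values are arbitrary.
for w in (1.0, 2.0, 2.5, 3.0):
    print(f"w = {w:.1f}  loss = {loss(w):.2f}")
```

The loss is 9 at the starting weight, 1 on either side of the minimum at $w = 2$ and $w = 3$, and 0 at $w = 2.5$.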

One step

The derivative is $\frac{dL}{dw} = -2x(y - wx) = -4 \cdot (5 - 2w)$. At $w = 1$, that's $-12$. With learning rate $\eta = 0.1$:

$$w_{\text{new}} = w - \eta \cdot \frac{dL}{dw} = 1 - 0.1 \cdot (-12) = 2.2$$

The new loss: $(5 - 2 \cdot 2.2)^2 = 0.36$. We dropped from 9 to 0.36 in one step. Not because the math is magic, but because the gradient at $w = 1$ happens to be steep and the optimum, $w = 2.5$, is close: a step of 1.2 covers most of the 1.5 gap.
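The same step can be sketched in a few lines of Python. This is a check on the hand arithmetic rather than a general implementation: the function names are illustrative, and the finite-difference helper, with its arbitrary step `h`, is an addition used only to confirm the analytic derivative.

```python
# The essay's data point and learning rate.
X, Y = 2.0, 5.0
ETA = 0.1


def loss(w: float) -> float:
    """Squared error (y - w*x)^2 on the single data point."""
    return (Y - w * X) ** 2


def grad(w: float) -> float:
    """Analytic derivative dL/dw = -2x(y - wx)."""
    return -2.0 * X * (Y - w * X)


def numeric_grad(w: float, h: float = 1e-6) -> float:
    """Central finite difference; a sanity check on grad(), nothing more."""
    return (loss(w + h) - loss(w - h)) / (2.0 * h)


def step(w: float, eta: float = ETA) -> float:
    """One gradient descent update: w - eta * dL/dw."""
    return w - eta * grad(w)


w0 = 1.0
w1 = step(w0)
print(f"gradient at w0: analytic {grad(w0):.4f}, numeric {numeric_grad(w0):.4f}")
print(f"w: {w0:.2f} -> {w1:.2f}")
print(f"loss: {loss(w0):.2f} -> {loss(w1):.2f}")
```

Floating-point arithmetic may leave the last digits slightly off from the hand-computed values, which is why the printout rounds.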

What this teaches

  • The gradient knows the direction; the learning rate is a guess.
  • Linear models with squared loss are a best case. The gradient is exact, the loss is convex, and with a reasonable learning rate one step gets you most of the way there.
  • Real models are non-convex. The intuition is the same; the patience is different.
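To see why the learning rate is a guess, one can rerun the same problem with a few different values of $\eta$. The sketch below is illustrative: the function names and the particular rates are assumptions, chosen because they behave differently on this loss. Substituting the gradient into the update gives $w_{\text{new}} - 2.5 = (1 - 8\eta)(w - 2.5)$, so each step scales the distance to the optimum by $1 - 8\eta$.

```python
# The essay's single data point.
X, Y = 2.0, 5.0


def grad(w: float) -> float:
    """Analytic derivative of (y - w*x)^2 with respect to w."""
    return -2.0 * X * (Y - w * X)


def run(eta: float, steps: int = 5, w: float = 1.0) -> list[float]:
    """Return every weight visited, starting weight included."""
    path = [w]
    for _ in range(steps):
        w = w - eta * grad(w)
        path.append(w)
    return path


# The learning rates are arbitrary probes, not recommendations.
for eta in (0.01, 0.1, 0.125, 0.25, 0.3):
    path = run(eta)
    print(f"eta = {eta:<5}  " + "  ".join(f"{w:.3f}" for w in path))
```

On this loss, $\eta = 0.125$ lands on the optimum in one step, $\eta = 0.25$ bounces between $w = 1$ and $w = 4$ forever, and anything larger moves further away with each step. None of that is visible from the gradient alone.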

That's the whole idea. Everything else (momentum, Adam, schedules) is a refinement of three lines: the loss, its derivative, and the update.
