Gradient Descent, By Hand
Working a single step of gradient descent on paper, with the dignity it deserves.
Gradient descent is the workhorse of modern machine learning, but the visualization that gets shown — a marble rolling down a smooth bowl — does it a disservice. The marble is doing physics. Gradient descent is doing arithmetic. Let us do the arithmetic, once, by hand.
The setup
Suppose we have a single data point, $x = 2$, $y = 5$, and a model $\hat{y} = w \cdot x$. We start with $w = 1$. The squared loss is:
$$L(w) = (y - w \cdot x)^2 = (5 - 2w)^2$$
At $w = 1$, the loss is $(5 - 2)^2 = 9$.
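The same arithmetic can be checked in a few lines of Python. This is only a sketch of the setup above: the function name `loss` and the grid of weights are illustrative choices, not part of the original setup.

```python
# Data point and model from the text: y_hat = w * x
x, y = 2.0, 5.0


def loss(w):
    """Squared loss L(w) = (y - w*x)^2 for the single data point."""
    return (y - w * x) ** 2


# Evaluate the loss at a few weights (the grid is an arbitrary choice).
for w in [1.0, 2.0, 2.5, 3.0]:
    print(f"w = {w:.1f}  ->  L(w) = {loss(w):.2f}")
```

The printed losses are 9, 1, 0 and 1: the loss reaches zero at $w = 2.5$, where the model reproduces the data point exactly, and rises again on the far side.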
One step
By the chain rule, the derivative is $\frac{dL}{dw} = -2x(y - wx) = -4(5 - 2w)$. At $w = 1$, that's $-4 \cdot 3 = -12$. With learning rate $\eta = 0.1$:
$$w_{\text{new}} = w - \eta \cdot \frac{dL}{dw} = 1 - 0.1 \cdot (-12) = 2.2$$
The new loss: $(5 - 2 \cdot 2.2)^2 = 0.36$. We dropped from 9 to 0.36 in one step. Not because the math is magic, but because the step of 1.2 covered most of the 1.5 separating $w = 1$ from the optimum at $w = 2.5$: the gradient at that point is steep, and the learning rate happens to suit it.
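For readers who would rather let a machine check the hand calculation, here is a minimal sketch of that single step. The helper names (`loss`, `grad`, `numeric_grad`) are illustrative, and the finite-difference check is an addition used only to confirm the analytic derivative.

```python
# Same data point, starting weight, and learning rate as in the text.
x, y = 2.0, 5.0
eta = 0.1


def loss(w):
    """Squared loss L(w) = (y - w*x)^2."""
    return (y - w * x) ** 2


def grad(w):
    """Analytic derivative dL/dw = -2x(y - wx)."""
    return -2.0 * x * (y - w * x)


def numeric_grad(w, h=1e-6):
    """Central-difference estimate of dL/dw, used only as a sanity check."""
    return (loss(w + h) - loss(w - h)) / (2.0 * h)


w = 1.0
g = grad(w)
w_new = w - eta * g  # one gradient descent step

print(f"gradient: analytic {g:.4f}, numeric {numeric_grad(w):.4f}")
print(f"w: {w:.2f} -> {w_new:.2f}")
print(f"loss: {loss(w):.2f} -> {loss(w_new):.2f}")
```

The numbers are compared after formatting because floating-point arithmetic may not produce exactly 2.2 or 0.36; the hand calculation is the exact one.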
What this teaches
- The gradient knows the direction; the learning rate is a guess.
- Linear models with squared loss are a best case. The gradient is exact, the loss is convex, and with a well-chosen learning rate one step gets you most of the way there.
- Real models are non-convex. The intuition is the same; the patience is different.
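To see how much of a guess the learning rate is, the sketch below repeats the same step with several rates on the same one-point problem. The specific rates, the step count, and the helper `run` are assumptions chosen for illustration.

```python
# Same one-point problem as above: L(w) = (5 - 2w)^2, starting at w = 1.
x, y = 2.0, 5.0


def loss(w):
    """Squared loss L(w) = (y - w*x)^2."""
    return (y - w * x) ** 2


def grad(w):
    """Analytic derivative dL/dw = -2x(y - wx)."""
    return -2.0 * x * (y - w * x)


def run(eta, steps=5, w=1.0):
    """Apply `steps` gradient descent updates and return the final weight."""
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


# Illustrative learning rates: too small, the one from the text,
# the one that lands on the optimum in a single step, and too large.
for eta in [0.01, 0.1, 0.125, 0.3]:
    w_final = run(eta)
    print(f"eta = {eta:<5}  w = {w_final:8.4f}  loss = {loss(w_final):10.4f}")
```

For this particular loss, each step multiplies the distance to the optimum $w = 2.5$ by $(1 - 8\eta)$. So $\eta = 0.125$ lands exactly on it, $\eta = 0.1$ closes 80% of the gap per step, and anything above $\eta = 0.25$ overshoots by more than it gained and diverges.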
That's the whole idea. Everything else (momentum, Adam, schedules) is a refinement of the same three moves: compute the loss, take its derivative, step against it.