Compare how SGD, Momentum, RMSProp, and Adam navigate loss surfaces. Switch between 2D view (1 weight) and 3D view (2 weights) to see how difficulty scales with dimensionality.
1 weight, 1D search. The optimizer moves a single value w along the curve. Every update is w ← w − lr(t) × ∂L/∂w, where lr(t) is the learning rate at step t. Clean and easy to follow step by step.
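The 1D update rule can be sketched in a few lines. This is a minimal illustration, not the demo's own code; the loss L(w) = (w − 3)² and the starting point are assumptions chosen so the minimum sits at w = 3.

```python
# Assumed toy loss: L(w) = (w - 3)^2, minimized at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)  # dL/dw

w, lr = 0.0, 0.1            # start away from the minimum
for t in range(100):
    w -= lr * grad(w)       # w <- w - lr * dL/dw

print(round(w, 4))          # converges to the minimum at 3.0
```

Each step multiplies the distance to the minimum by (1 − 2·lr), so with lr = 0.1 the error shrinks by a factor of 0.8 per update.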
2 weights, 2D search. Now the optimizer must navigate a full surface. Gradients interact across both dimensions — ridges, saddle points, and curved valleys all appear. Naive SGD can zigzag badly, while Adam cuts more directly. Real networks have millions of dimensions; this demo shows just two.
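The zigzag-versus-direct behavior can be reproduced on a toy surface. This sketch assumes an elongated quadratic valley, L(w) = ½(w₁² + 25·w₂²), which is not from the demo itself; the steep w₂ direction makes plain SGD oscillate, while Adam rescales each dimension by its gradient history.

```python
import numpy as np

# Assumed toy loss: L(w) = 0.5 * (w1^2 + 25 * w2^2), minimum at the origin.
scales = np.array([1.0, 25.0])
def grad(w):
    return scales * w  # gradient of the quadratic, per dimension

w_sgd = np.array([2.0, 1.0])
w_adam = w_sgd.copy()
lr = 0.05
m, v = np.zeros(2), np.zeros(2)            # Adam's first/second moment estimates
b1, b2, eps = 0.9, 0.999, 1e-8

for t in range(1, 201):
    # Plain SGD: same step size in every direction -> zigzags along w2.
    w_sgd -= lr * grad(w_sgd)

    # Adam: per-dimension step normalized by the gradient's running magnitude.
    g = grad(w_adam)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    mhat = m / (1 - b1 ** t)               # bias correction
    vhat = v / (1 - b2 ** t)
    w_adam -= lr * mhat / (np.sqrt(vhat) + eps)

print(np.linalg.norm(w_sgd), np.linalg.norm(w_adam))
```

Printing the intermediate iterates (rather than just the endpoints) shows SGD flipping sign along w₂ each step while Adam takes nearly uniform steps toward the origin.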