Gradient Descent
Overview
Imagine standing on a foggy mountain and wanting to reach the lowest valley. Without seeing the full landscape, you feel the slope beneath your feet and take a step in the steepest downhill direction. You repeat this process until the ground levels out. This intuitive concept, formalized mathematically in 1847 by Cauchy, is gradient descent.
In machine learning, the "mountain" is a high-dimensional loss surface, and the "valley" represents optimal model parameters that minimize prediction error. While conceptually simple, its practical implementation requires careful tuning, variant selection, and architectural awareness.
Mathematical Foundation
Given a differentiable objective function $J(\theta)$, where $\theta$ represents the model parameters, gradient descent updates parameters iteratively using the rule:
- θt: Parameters at iteration $t$
- α: Learning rate (step size)
- ∇θ J(θt): Gradient (vector of partial derivatives)
Standard & Stochastic Variants
The choice of variant depends on dataset size, computational resources, and convergence requirements.
| Variant | Data Used | Speed | Convergence Quality |
|---|---|---|---|
| Batch GD | Full dataset | Slow | Stable, precise |
| Stochastic (SGD) | 1 sample | Fast | Noisy, may oscillate |
| Mini-batch | Batches (32–512) | Balanced | Efficient & stable |
Pseudocode: Mini-batch Gradient Descent
Adaptive Methods (Adam, RMSProp)
Standard gradient descent treats all parameters equally. Adaptive algorithms maintain per-parameter learning rates based on historical gradient magnitudes, significantly improving convergence in sparse or ill-conditioned landscapes.
Adam (Adaptive Moment Estimation) combines ideas from RMSProp and Momentum. It tracks exponential moving averages of both gradients (first moment) and squared gradients (second moment), enabling robust performance across diverse architectures without extensive hyperparameter tuning.
vt = β₂vt-1 + (1-β₂)gt²
θt+1 = θt − α·(m̂t / √v̂t + ε)
Hyperparameters
Two factors dominate optimization behavior:
- Learning Rate (α): Too high → divergence. Too low → slow convergence or trapping in local minima. Common practice: cosine annealing or warmup + decay.
- Batch Size: Larger batches provide stable gradients but require more memory and may generalize worse (the "generalization gap").
Challenges & Mitigations
| Challenge | Description | Mitigation |
|---|---|---|
| Local Minima | Convergence to suboptimal points | Momentum, stochasticity, basin-hopping |
| Saddle Points | Flat regions with zero gradient | Adaptive methods, Hessian-free optimization |
| Vanishing Gradients | Signals disappear in deep networks | Residual connections, proper initialization |
| Learning Rate Sensitivity | Highly architecture-dependent | Learning rate schedulers, gradient clipping |
Applications
Gradient descent is foundational to virtually all trainable AI systems:
- Training deep neural networks (CNNs, Transformers, RNNs)
- Linear & logistic regression optimization
- Recommendation systems & matrix factorization
- Reinforcement learning policy updates
- Generative models (GANs, Diffusion models)
References
- [1] Cauchy, A.-L. (1847). "Méthode générale pour la résolution des systèmes d'équations simultanées". Comptes rendus hebdomadaires des séances de l'Académie des Sciences, 25, 536-538.
- [2] Robbins, H., & Monro, S. (1951). "A stochastic approximation method". The Annals of Mathematical Statistics, 22(3), 400-407.
- [3] Kingma, D. P., & Ba, J. (2014). "Adam: A Method for Stochastic Optimization". ICLR 2015.
- [4] Bottou, L., Curtis, F. E., & Nocedal, J. (2018). "Optimization Methods for Large-Scale Machine Learning". SIAM Review, 60(2), 223-311.