16. Stochastic Gradient Descent (SGD)#
Stochastic Gradient Descent (SGD) is an optimization algorithm used to minimize an objective function in machine learning and deep learning. It is a variant of the gradient descent algorithm but differs in the way it computes the gradient.
In standard Gradient Descent (GD), the gradient of the objective function is calculated using the entire dataset, which can be computationally expensive when the dataset is large. On the other hand, in Stochastic Gradient Descent, the gradient is computed using a single data point (or a small batch of data points) at each iteration, which significantly reduces the computational cost and makes it more efficient for large datasets.
16.1 Stochastic Gradient Descent (SGD) Algorithm#
The basic update rule for SGD is:

[
\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla f_i(\mathbf{x}_k)
]

Where:

- \(\mathbf{x}_k\) is the parameter vector at iteration \(k\).
- \(\eta\) is the learning rate (step size).
- \(f_i\) is the loss contributed by the \(i\)-th data point.
- \(\nabla f_i(\mathbf{x}_k)\) is the gradient of that single-sample loss at \(\mathbf{x}_k\).

Instead of computing the gradient using the entire dataset, SGD computes the gradient for a single data point \(i\), chosen at random at each iteration.
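As a sketch of the rule above, here is a minimal SGD loop for least-squares linear regression; the dataset, learning rate, and epoch count are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

def sgd_linreg(X, y, eta=0.01, epochs=50, seed=0):
    """Minimize the per-sample squared error (x_i . w - y_i)**2 with SGD."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):          # one random sample per update
            grad = 2.0 * (X[i] @ w - y[i]) * X[i]  # gradient of that sample's loss
            w -= eta * grad                        # SGD step: w <- w - eta * grad
    return w

# Hypothetical noiseless toy data generated from y = 2*x0 - 3*x1
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0])
w = sgd_linreg(X, y)   # w approaches the true weights [2, -3]
```

Because each update touches only one sample, the cost per step is independent of the dataset size, which is exactly the efficiency argument made below.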
16.2 Advantages of SGD#
- Efficiency: SGD is much faster than regular gradient descent when the dataset is large, as it updates the parameters after looking at only one sample at a time.
- Online Learning: SGD can be used for online learning, where the model is trained incrementally as new data comes in.
- Convergence: While SGD can be noisy due to the randomness introduced by using a single data point at a time, this noise can sometimes help escape local minima in non-convex optimization problems (e.g., in neural networks).
- Memory Efficiency: Since SGD only needs one data point at a time, it requires less memory than batch gradient descent, which needs access to the entire dataset for each gradient calculation.
16.3 Challenges with SGD#
- Noisy Convergence: Because of the randomness, the algorithm tends to oscillate around the optimal solution and may not converge smoothly to the global minimum. However, this can be mitigated by using techniques like learning rate schedules or momentum.
- Choosing a Proper Learning Rate: A large learning rate can cause the algorithm to overshoot the minimum, while a small learning rate may result in slow convergence. Learning rate decay is often used to adapt the learning rate over time.
- Local Minima: SGD can get stuck in local minima, especially in highly non-convex objective functions like those encountered in deep learning models.
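As a small illustration of the learning rate decay mentioned above, one common choice is an inverse-time schedule; the function name and constants here are illustrative, not from the text:

```python
def inverse_time_decay(eta0, k, decay_rate=0.01):
    """Inverse-time schedule: eta_k = eta0 / (1 + decay_rate * k)."""
    return eta0 / (1.0 + decay_rate * k)

# Step sizes start large (fast progress) and shrink steadily,
# damping the oscillation around the minimum in later iterations.
etas = [inverse_time_decay(0.1, k) for k in range(1000)]
```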
16.4 Variants of SGD#
- Mini-batch Gradient Descent: A compromise between batch gradient descent and stochastic gradient descent. Instead of computing the gradient for a single data point, the gradient is computed for a small batch of data points. This often leads to faster convergence while still retaining the benefits of stochastic updates.
- Momentum: Momentum helps accelerate SGD in the relevant direction and dampens oscillations. The update rule with momentum is:
[
\mathbf{v}_{k+1} = \beta \mathbf{v}_k + (1 - \beta) \nabla f(\mathbf{x}_k)
]

[
\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \mathbf{v}_{k+1}
]

Where \(\mathbf{v}_k\) is the velocity (an exponentially weighted average of past gradients) and \(\beta \in [0, 1)\) is the momentum coefficient (commonly around 0.9).
- Adam (Adaptive Moment Estimation): Adam combines the advantages of both momentum and adaptive learning rate methods. It computes adaptive learning rates for each parameter by considering both the first and second moments of the gradients. The update rule is:

[
\mathbf{m}_{k+1} = \beta_1 \mathbf{m}_k + (1 - \beta_1) \nabla f(\mathbf{x}_k)
]

[
\mathbf{v}_{k+1} = \beta_2 \mathbf{v}_k + (1 - \beta_2) \left( \nabla f(\mathbf{x}_k) \right)^2
]

[
\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \frac{\mathbf{m}_{k+1}}{\sqrt{\mathbf{v}_{k+1}} + \epsilon}
]

Where \(\mathbf{m}_k\) and \(\mathbf{v}_k\) are estimates of the first and second moments of the gradient, \(\beta_1\) and \(\beta_2\) are their exponential decay rates (commonly 0.9 and 0.999), the square, square root, and division are applied element-wise, and \(\epsilon\) is a small constant for numerical stability. (In practice, Adam also applies a bias correction to \(\mathbf{m}\) and \(\mathbf{v}\) in early iterations.)
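As an illustration, the momentum and Adam updates can be sketched with plain NumPy on a toy quadratic; the test function, step sizes, and decay rates are illustrative choices, and full gradients are used instead of stochastic ones for clarity:

```python
import numpy as np

def grad(x):
    """Gradient of the toy function f(x) = x[0]**2 + 10 * x[1]**2."""
    return np.array([2.0 * x[0], 20.0 * x[1]])

# --- SGD with momentum ---
x = np.array([1.0, 1.0])
v = np.zeros(2)
beta, eta = 0.9, 0.05                      # illustrative momentum and step size
for _ in range(200):
    v = beta * v + (1 - beta) * grad(x)    # exponentially weighted gradient average
    x = x - eta * v                        # step along the smoothed direction
x_momentum = x

# --- Adam (as written above, without bias correction) ---
x = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
beta1, beta2, eta, eps = 0.9, 0.999, 0.01, 1e-8
for _ in range(500):
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2   # second-moment estimate (element-wise)
    x = x - eta * m / (np.sqrt(v) + eps)   # per-parameter adaptive step
x_adam = x
```

Both runs drive the iterate toward the minimizer at the origin; the elongated quadratic (curvature 2 vs. 20) is what makes momentum's damping of oscillations and Adam's per-parameter scaling visible.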
16.5 Example of SGD: Minimizing a Simple Function#
Let’s consider a simple example of minimizing the quadratic function \(f(x) = x^2\).

- Objective Function: \(f(x) = x^2\)
- Gradient: \(\nabla f(x) = 2x\)
- Learning Rate: \(\eta = 0.1\)
- Initial Guess: \(x_0 = 2\)

Iteration 1:

- Compute the gradient: \(\nabla f(x_0) = 2 \cdot 2 = 4\)
- Update: \(x_1 = x_0 - \eta \nabla f(x_0) = 2 - 0.1 \cdot 4 = 1.6\)

Iteration 2:

- Compute the gradient: \(\nabla f(x_1) = 2 \cdot 1.6 = 3.2\)
- Update: \(x_2 = x_1 - \eta \nabla f(x_1) = 1.6 - 0.1 \cdot 3.2 = 1.28\)

And so on. After several iterations, \(x_k\) approaches the minimizer \(x^\star = 0\).
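Assuming the quadratic \(f(x) = x^2\) with learning rate \(\eta = 0.1\) and initial guess \(x_0 = 2\) (illustrative choices), the iteration can be checked with a short script:

```python
def grad_f(x):
    """Gradient of f(x) = x**2."""
    return 2.0 * x

x, eta = 2.0, 0.1            # illustrative initial guess and learning rate
history = [x]
for _ in range(25):
    x = x - eta * grad_f(x)  # x <- x - eta * f'(x), i.e. x <- 0.8 * x
    history.append(x)
# history[1] ≈ 1.6, history[2] ≈ 1.28, and x decays geometrically toward 0
```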
16.6 Practice Problem: SGD#
- Use SGD to minimize a simple quadratic function, for example \(f(x) = (x - 3)^2\), starting from \(x_0 = 0\) with a learning rate of \(\eta = 0.1\).
17. Dimensionality Reduction and PCA (Principal Component Analysis)#
Dimensionality reduction refers to techniques used to reduce the number of features in a dataset while retaining as much information as possible. It's especially useful for visualizing high-dimensional data or speeding up machine learning algorithms.
Principal Component Analysis (PCA) is one of the most commonly used dimensionality reduction techniques. PCA is a linear transformation that projects data onto a smaller subspace (lower dimension), maximizing the variance in the data.
17.1 PCA Overview#
PCA works by finding a new set of orthogonal axes (principal components) in the data space. The goal is to project the data onto these axes, such that the first principal component captures the most variance in the data, the second captures the second most, and so on.
- Covariance Matrix: The first step in PCA is to compute the covariance matrix of the dataset, which tells us how the features of the data vary together.
[
\mathbf{C} = \frac{1}{n-1} \sum_{i=1}^{n} (\mathbf{x}_i - \mathbf{\mu})(\mathbf{x}_i - \mathbf{\mu})^T
]
Where \(\mathbf{x}_i\) are the data points, \(\mathbf{\mu}\) is the mean vector of the data, and \(n\) is the number of samples.
- Eigenvalues and Eigenvectors: The next step is to compute the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors represent the principal components, and the eigenvalues represent the amount of variance captured by each principal component.
- Projection: Finally, the data is projected onto the top principal components (those with the largest eigenvalues), reducing its dimensionality.
17.2 Example of PCA#
Let’s consider a dataset with two features (dimensions). The PCA procedure is as follows:
- Compute the covariance matrix.
- Compute the eigenvalues and eigenvectors of the covariance matrix.
- Project the data onto the first principal component.
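The three steps above can be sketched with NumPy; the small two-feature dataset below is an illustrative assumption, not from the text:

```python
import numpy as np

# Illustrative dataset: n = 8 samples, 2 features
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# Step 1: covariance matrix of the centered data
mu = X.mean(axis=0)
Xc = X - mu
C = (Xc.T @ Xc) / (len(X) - 1)        # same result as np.cov(X, rowvar=False)

# Step 2: eigenvalues and eigenvectors (eigh: C is symmetric)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]     # sort by captured variance, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 3: project onto the first principal component
Z = Xc @ eigvecs[:, :1]               # n x 1 reduced representation
```

The variance of the projected data `Z` equals the largest eigenvalue, which is exactly the "first component captures the most variance" property described above.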
17.3 Practice Problem: PCA#
- Perform PCA on a dataset \(\mathbf{X}\) with three features and reduce it to 2 dimensions.