

12. Gradient Descent#

Gradient Descent is one of the most widely used optimization algorithms in machine learning and deep learning. It is a first-order iterative optimization technique used to minimize a function by iteratively moving towards the steepest descent direction (i.e., the direction of the negative gradient).

12.1 Basic Idea#

The main idea behind gradient descent is to update the parameters of a function in such a way that the function’s value decreases with each iteration. The update is made in the direction of the negative gradient of the function, as the gradient indicates the direction of the steepest ascent.

For a function f(x), the update rule in gradient descent is:
[
\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla f(\mathbf{x}_k)
]
Where:
- x_k is the current parameter (or point),
- ∇f(x_k) is the gradient of f at x_k,
- η is the learning rate (a scalar that controls the step size),
- x_{k+1} is the updated parameter.

12.2 Steps in Gradient Descent#

  1. Initialize Parameters: Start with an initial guess for the parameters x_0.
  2. Compute Gradient: Calculate the gradient ∇f(x_k) at the current parameter x_k.
  3. Update Parameters: Move in the direction of the negative gradient by updating the parameters using the formula x_{k+1} = x_k - η∇f(x_k).
  4. Repeat: Repeat the process until convergence, meaning the change in function value or parameters becomes negligible.

12.3 Learning Rate (Step Size)#

The learning rate η determines the size of the steps taken in the gradient descent process. Choosing an appropriate learning rate is crucial:
- If η is too large, the algorithm might overshoot the minimum, causing divergence.
- If η is too small, convergence will be slow, and the algorithm may stall in flat regions or get trapped in poor local minima.

In practice, adaptive methods like Adam, RMSProp, and AdaGrad adjust the learning rate during training for better convergence.

12.4 Convergence Criteria#

The algorithm stops when:
- The change in the function value |f(x_{k+1}) - f(x_k)| is smaller than a predefined tolerance.
- The change in parameters ‖x_{k+1} - x_k‖ is smaller than a tolerance.

Alternatively, a maximum number of iterations can be set.
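As an illustration, here is a minimal NumPy sketch of the procedure above (initialize, compute the gradient, update, and stop when the change in parameters falls below a tolerance). The quadratic test function, the learning rate, and the tolerance values are illustrative choices, not prescribed by the text.

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, tol=1e-6, max_iter=1000):
    """Generic gradient descent: x_{k+1} = x_k - eta * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        x_new = x - eta * grad(x)
        # Stop when the change in parameters is below the tolerance.
        if np.linalg.norm(x_new - x) < tol:
            return x_new, k + 1
        x = x_new
    return x, max_iter

# Illustrative use: f(x, y) = x^2 + 2y^2 with gradient (2x, 4y); minimum at (0, 0).
grad_f = lambda v: np.array([2.0 * v[0], 4.0 * v[1]])
x_min, iters = gradient_descent(grad_f, [3.0, -2.0])
print(x_min, iters)
```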


12.5 Types of Gradient Descent#

  1. Batch Gradient Descent:
     • In batch gradient descent, the entire dataset is used to compute the gradient at each iteration.
     • It has high computational cost, especially for large datasets, as it requires computing the gradient for the entire dataset before making an update.
     • For convex functions it is guaranteed to converge to the global minimum, provided the learning rate is chosen appropriately.

Update Rule:
[
\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \frac{1}{m} \sum_{i=1}^{m} \nabla f(\mathbf{x}_k, \mathbf{x}_i)
]
Where m is the number of data points.

  2. Stochastic Gradient Descent (SGD):
     • In stochastic gradient descent, only one data point is used to compute the gradient at each iteration.
     • It’s faster than batch gradient descent but introduces noise due to the randomness in the choice of data points.
     • It can escape local minima but may not converge smoothly.

Update Rule:
[
\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla f(\mathbf{x}_k, \mathbf{x_i})
]
Where x_i is a randomly selected data point.

  3. Mini-Batch Gradient Descent:
     • A compromise between batch and stochastic gradient descent, where a subset of the data (a mini-batch) is used to compute the gradient at each iteration.
     • It is more efficient than batch gradient descent and reduces the variance of parameter updates compared to SGD.

Update Rule:
[
\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \frac{1}{b} \sum_{i=1}^{b} \nabla f(\mathbf{x}_k, \mathbf{x}_i)
]
Where b is the mini-batch size.
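To make the three variants concrete, here is a small sketch of how each one forms the gradient for a least-squares loss on synthetic data. The data, the batch size b = 16, and the learning rate are arbitrary illustrative choices, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # 100 samples, 3 features (synthetic)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def grad_sample(w, xi, yi):
    """Gradient of the squared error 0.5 * (xi @ w - yi)^2 for one sample."""
    return (xi @ w - yi) * xi

w, eta = np.zeros(3), 0.05

# Batch: average the gradient over all m samples before a single update.
g_batch = np.mean([grad_sample(w, xi, yi) for xi, yi in zip(X, y)], axis=0)
w_batch = w - eta * g_batch

# Stochastic: one randomly chosen sample per update.
i = rng.integers(len(y))
w_sgd = w - eta * grad_sample(w, X[i], y[i])

# Mini-batch: average over a small random subset (here b = 16).
idx = rng.choice(len(y), size=16, replace=False)
g_mini = np.mean([grad_sample(w, xi, yi) for xi, yi in zip(X[idx], y[idx])], axis=0)
w_mini = w - eta * g_mini

print(w_batch, w_sgd, w_mini)
```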


12.6 Example: Gradient Descent for a Quadratic Function#

Let’s consider a simple quadratic function:
[
f(x) = x^2 + 2x + 1
]
The derivative (gradient) of this function is:
[
\nabla f(x) = 2x + 2
]
Now, apply gradient descent with a learning rate of η = 0.1, starting from an initial guess x_0 = 3.

  1. Iteration 1:
    [
    x_1 = x_0 - \eta \nabla f(x_0) = 3 - 0.1 \cdot (2 \cdot 3 + 2) = 3 - 0.1 \cdot 8 = 2.2
    ]

  2. Iteration 2:
    [
    x_2 = x_1 - \eta \nabla f(x_1) = 2.2 - 0.1 \cdot (2 \cdot 2.2 + 2) = 2.2 - 0.1 \cdot 6.4 = 1.56
    ]

  3. Iteration 3:
    [
    x_3 = x_2 - \eta \nabla f(x_2) = 1.56 - 0.1 \cdot (2 \cdot 1.56 + 2) = 1.56 - 0.1 \cdot 5.12 = 1.048
    ]

The updates continue, and you will eventually converge to the minimum point x = -1, which is the global minimum of the function.
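The same iteration can be reproduced in a few lines of Python; printing the first three steps gives exactly the values computed above, and continuing the loop shows the iterates approaching x = -1.

```python
def grad_f(x):          # f(x) = x^2 + 2x + 1, so f'(x) = 2x + 2
    return 2 * x + 2

x, eta = 3.0, 0.1
for k in range(1, 26):
    x = x - eta * grad_f(x)
    if k <= 3:
        print(f"x_{k} = {x:.4f}")           # 2.2000, 1.5600, 1.0480
print(f"after 25 iterations: x = {x:.4f}")  # close to the minimum at x = -1
```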


12.7 Applications of Gradient Descent#

Gradient descent is a foundational optimization technique, widely used in various fields such as:
- Machine Learning: For training models, including linear regression, logistic regression, and deep neural networks.
- Deep Learning: Used in backpropagation for optimizing weights in neural networks.
- Computer Vision: In image reconstruction, segmentation, and optimization tasks.
- Natural Language Processing: For training word embeddings, transformers, etc.


12.8 Practice Problems#

  1. Implement gradient descent to minimize the function f(x) = 2x² - 3x + 4, starting from an initial guess x_0 = 3 with a learning rate of 0.1.
  2. Given the function f(x, y) = (x² + y²) + (2x - 3y), perform one iteration of gradient descent starting at (x_0, y_0) = (1, 1) with a learning rate η = 0.1.
  3. Explain why stochastic gradient descent is preferred over batch gradient descent when dealing with large datasets.



13. Unconstrained Optimization#

Unconstrained optimization refers to the process of minimizing or maximizing a function without any restrictions or constraints on the decision variables. The objective is to find the values of the variables that optimize (minimize or maximize) the given function.

In the context of machine learning, unconstrained optimization is commonly used to optimize loss functions (or cost functions) during model training.

13.1 Problem Setup#

The general unconstrained optimization problem can be stated as:

[
\min_{\mathbf{x}} f(\mathbf{x})
]
Where:
- x = (x_1, x_2, …, x_n) is the vector of decision variables (parameters),
- f(x) is the objective function that we want to minimize.

We aim to find the minimum of the function, or more formally, the value of x that satisfies:

[
\nabla f(\mathbf{x}) = 0
]

Where ∇f(x) is the gradient of f(x). The points where ∇f(x) = 0 are called critical points.

13.2 First-Order Optimality Conditions#

To solve unconstrained optimization problems, we use first-order optimality conditions, which state that for a function f(x) to have a local minimum at a point x*, the gradient must vanish at that point:
[
\nabla f(\mathbf{x}^*) = 0
]
This means that the point x* is a **stationary point** of the function.

For unconstrained optimization, this is the first step in finding the optimal solution. The second step is to verify if the point is a minimum or not, which can be done using the second-order conditions (involving the Hessian matrix).

13.3 Second-Order Optimality Conditions#

For a stationary point x* to be a local minimum of the function f(x), the following second-order condition must be satisfied:
[
H(\mathbf{x}^*) = \nabla^2 f(\mathbf{x}^*) \succ 0
]
Where H(x*) is the Hessian matrix (the matrix of second derivatives of f) and ≻ 0 denotes that the Hessian is positive definite. A positive definite Hessian ensures that the function curves upward locally around x* (and hence has a minimum at x*).

  • If H(x*) is indefinite (it has both positive and negative eigenvalues), x* is a saddle point (neither a minimum nor a maximum); if H(x*) is only positive semi-definite, the second-order test is inconclusive.
  • If H(x*) is negative definite, x* is a local maximum.
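Numerically, the second-order test amounts to inspecting the eigenvalues of the Hessian at a stationary point. Below is a small NumPy sketch; the quadratic test function f(x, y) = x² + 3y² - 2xy (whose gradient vanishes at the origin) is an illustrative choice, not taken from the text.

```python
import numpy as np

# Hessian of f(x, y) = x^2 + 3y^2 - 2xy (constant, since f is quadratic).
H = np.array([[2.0, -2.0],
              [-2.0, 6.0]])

eigvals = np.linalg.eigvalsh(H)          # eigenvalues of the symmetric Hessian
if np.all(eigvals > 0):
    print("positive definite -> local minimum", eigvals)
elif np.all(eigvals < 0):
    print("negative definite -> local maximum", eigvals)
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("indefinite -> saddle point", eigvals)
else:
    print("semi-definite -> test inconclusive", eigvals)
```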

13.4 Methods for Unconstrained Optimization#

Several algorithms are used to solve unconstrained optimization problems. Some of the most common methods are:

  1. Gradient Descent:
     • As discussed earlier, gradient descent is a first-order optimization method that uses the gradient of the function to iteratively move towards the minimum.
     • For smooth convex functions, gradient descent guarantees convergence to a global minimum, given an appropriate learning rate.

  2. Newton’s Method:
     • Newton’s method is a second-order optimization method that uses both the gradient and the Hessian to update the parameters more efficiently than gradient descent.
     • The update rule for Newton's method is:
     [
     \mathbf{x}_{k+1} = \mathbf{x}_k - H^{-1}(\mathbf{x}_k) \nabla f(\mathbf{x}_k)
     ]
     Where H(x_k) is the Hessian matrix at x_k.
     • While this method often converges faster than gradient descent, it requires computing the Hessian matrix at each step, which can be computationally expensive.

  3. Quasi-Newton Methods:
     • Quasi-Newton methods are a family of methods that approximate the Hessian matrix to avoid the costly computation of second-order derivatives.
     • One of the most well-known quasi-Newton methods is BFGS (Broyden-Fletcher-Goldfarb-Shanno), which updates an approximation to the inverse Hessian matrix at each iteration.

  4. Conjugate Gradient Method:
     • The conjugate gradient method is used primarily for large-scale optimization problems where computing the Hessian is infeasible.
     • It is an iterative method that minimizes a quadratic approximation of the function, and it is particularly useful when dealing with sparse data and large systems. (A short sketch of Newton’s method and the related SciPy solver calls follows this list.)
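As a rough illustration of the first three methods, the sketch below implements Newton's method directly and then calls SciPy's BFGS (quasi-Newton) and nonlinear conjugate-gradient solvers on the same problem. The smooth convex test function f(x, y) = exp(x) + exp(y) + x² + y² is an illustrative choice, not taken from the text.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative smooth convex test function and its derivatives.
f    = lambda v: np.exp(v[0]) + np.exp(v[1]) + v[0]**2 + v[1]**2
grad = lambda v: np.array([np.exp(v[0]) + 2 * v[0], np.exp(v[1]) + 2 * v[1]])
hess = lambda v: np.diag([np.exp(v[0]) + 2.0, np.exp(v[1]) + 2.0])

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: x_{k+1} = x_k - H(x_k)^{-1} grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), grad(x))  # solve H p = grad rather than inverting H
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

x0 = np.array([1.0, 1.0])
print(newton(grad, hess, x0))                        # Newton iterates
print(minimize(f, x0, method="BFGS", jac=grad).x)    # quasi-Newton (BFGS)
print(minimize(f, x0, method="CG", jac=grad).x)      # nonlinear conjugate gradient
```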

13.5 Example: Minimizing a Quadratic Function#

Consider the quadratic function:
[
f(x) = x^2 + 4x + 4
]

  1. Step 1: Compute the gradient and Hessian:
    [
    \nabla f(x) = 2x + 4
    ]
    [
    \nabla^2 f(x) = 2
    ]
    The gradient is linear, and the Hessian is a constant.

  2. Step 2: Solve for the minimum:
    Set the gradient to zero:
    [
    2x + 4 = 0 \quad \Rightarrow \quad x = -2
    ]
    The second derivative (Hessian) is positive (∇²f(x) = 2), so x = -2 is a local minimum.

  3. Step 3: Check the second-order condition:
    Since the Hessian is positive everywhere (the function is convex), x = -2 is the global minimum.

13.6 Steepest Descent vs. Newton’s Method#

  • Steepest Descent:
    • Based on first-order derivatives (gradients).
    • Simpler and cheaper per iteration, but may require more iterations to converge.
    • Does not use the curvature of the function.

  • Newton’s Method:
    • Based on both first- and second-order derivatives (gradients and Hessians).
    • Typically converges in fewer iterations than steepest descent but requires computing the Hessian at each step, which can be computationally expensive.
    • Takes into account the curvature of the function for faster convergence.

13.7 Practice Problems#

  1. Minimization of a Quadratic Function:
    Minimize the function f(x) = 3x² + 4x + 5 using gradient descent. Start with an initial guess x_0 = 2 and a learning rate η = 0.1. Compute the first few iterations.

  2. Second-Order Conditions:
    Given the function f(x, y) = x² + y² + 2xy - 4x - 6y + 5, find the critical points and classify them using the second-order optimality conditions.

  3. Newton’s Method:
    Apply Newton’s method to minimize f(x) = x⁴ - 4x³ + 6x². Start with an initial guess of x_0 = 1, and compute the first few iterations.




14. Constrained Optimization#

Constrained optimization refers to the process of minimizing or maximizing an objective function subject to one or more constraints. These constraints are typically expressed as equalities or inequalities that restrict the set of possible solutions.

The general form of a constrained optimization problem is:

[
\min_{\mathbf{x}} f(\mathbf{x})
]
subject to:
[
g_i(\mathbf{x}) \leq 0, \quad i = 1, 2, \dots, m \quad \text{(inequality constraints)}
]
[
h_j(\mathbf{x}) = 0, \quad j = 1, 2, \dots, p \quad \text{(equality constraints)}
]

Where:
- f(x) is the objective function we want to minimize,
- g_i(x) are inequality constraints,
- h_j(x) are equality constraints,
- x is the vector of decision variables.

In many cases, constrained optimization problems are more challenging to solve than unconstrained ones because we need to account for the effects of the constraints.


14.1 Types of Constraints#

  1. Equality Constraints: These constraints are of the form h_j(x) = 0. The solution must satisfy these constraints exactly.

  2. Inequality Constraints: These constraints are of the form g_i(x) ≤ 0. The solution must lie within the feasible region defined by these inequalities.


14.2 Lagrange Multipliers Method (for Equality Constraints)#

The Lagrange multipliers method is a powerful tool for solving constrained optimization problems where the constraints are equalities. The method introduces a new set of variables, called Lagrange multipliers, that allow us to incorporate the constraints into the objective function.

14.2.1 Problem Setup#

Given the optimization problem:
[
\min_{\mathbf{x}} f(\mathbf{x})
]
subject to:
[
h_j(\mathbf{x}) = 0, \quad j = 1, 2, \dots, p
]

We form the Lagrangian function by adding the constraints multiplied by their corresponding Lagrange multipliers λ_j:
[
\mathcal{L}(\mathbf{x}, \lambda_1, \lambda_2, \dots, \lambda_p) = f(\mathbf{x}) + \sum_{j=1}^{p} \lambda_j h_j(\mathbf{x})
]

The Lagrange multipliers λ_j represent the sensitivity of the objective function to changes in the constraint h_j(x).

14.2.2 First-Order Conditions#

To find the optimal solution, we compute the gradients of the Lagrangian with respect to x and the Lagrange multipliers λ_j, and set them equal to zero:

  1. Gradient with respect to x:
    [
    \nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \lambda_1, \lambda_2, \dots, \lambda_p) = 0
    ]

  2. Gradient with respect to λ_j:
    [
    h_j(\mathbf{x}) = 0 \quad \text{for each} \quad j = 1, 2, \dots, p
    ]

14.2.3 Steps to Solve:#

  1. Set up the Lagrangian function.
  2. Compute the gradients and set them to zero.
  3. Solve the system of equations to find the values of x and λ_j.
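These three steps can also be carried out symbolically. Below is a minimal SymPy sketch for the illustrative problem of minimizing x² + 2y² subject to x + y = 1 (a problem chosen here only to demonstrate the mechanics, not taken from the text).

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)

# Illustrative problem: minimize x^2 + 2y^2 subject to x + y = 1.
f = x**2 + 2*y**2
h = x + y - 1

# Step 1: form the Lagrangian L = f + lam * h.
L = f + lam * h

# Step 2: gradients of L with respect to x, y, and lam, set to zero.
eqs = [sp.diff(L, v) for v in (x, y, lam)]

# Step 3: solve the resulting system.
print(sp.solve(eqs, [x, y, lam], dict=True))   # [{x: 2/3, y: 1/3, lam: -4/3}]
```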

14.3 KKT Conditions (Karush-Kuhn-Tucker) for Inequality Constraints#

For constrained optimization problems with inequality constraints, the KKT conditions provide a set of necessary conditions for a solution to be optimal.

Given the optimization problem:
[
\min_{\mathbf{x}} f(\mathbf{x})
]
subject to:
[
g_i(\mathbf{x}) \leq 0, \quad i = 1, 2, \dots, m
]
[
h_j(\mathbf{x}) = 0, \quad j = 1, 2, \dots, p
]

The KKT conditions are as follows:

  1. Stationarity:
    [
    \nabla f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i \nabla g_i(\mathbf{x}) + \sum_{j=1}^{p} \mu_j \nabla h_j(\mathbf{x}) = 0
    ]
    Where λ_i are the Lagrange multipliers for the inequality constraints and μ_j are the Lagrange multipliers for the equality constraints.

  2. Primal feasibility:
    [
    g_i(\mathbf{x}) \leq 0, \quad i = 1, 2, \dots, m
    ]
    [
    h_j(\mathbf{x}) = 0, \quad j = 1, 2, \dots, p
    ]

  3. Dual feasibility:
    [
    \lambda_i \geq 0, \quad i = 1, 2, \dots, m
    ]
    This condition ensures that the Lagrange multipliers corresponding to the inequality constraints are non-negative.

  4. Complementary slackness:
    [
    \lambda_i g_i(\mathbf{x}) = 0, \quad i = 1, 2, \dots, m
    ]
    This condition ensures that, for each i, either the constraint is active (g_i(x) = 0), or the constraint is inactive (g_i(x) < 0) and the multiplier λ_i is zero; a strictly positive multiplier can therefore occur only for an active constraint.
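The KKT machinery is what constrained solvers such as SciPy's SLSQP implement internally. A minimal sketch for the illustrative problem of minimizing x² + y² subject to x + y ≥ 1 is shown below (note SciPy's convention that an 'ineq' constraint means fun(x) ≥ 0, so a constraint of the form g(x) ≤ 0 is passed as -g).

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative problem: minimize x^2 + y^2 subject to x + y >= 1.
f = lambda v: v[0]**2 + v[1]**2
# SciPy convention: an 'ineq' constraint requires fun(v) >= 0.
cons = [{"type": "ineq", "fun": lambda v: v[0] + v[1] - 1.0}]

res = minimize(f, x0=np.array([2.0, 0.0]), method="SLSQP", constraints=cons)
print(res.x, res.fun)              # approximately [0.5, 0.5] and 0.5
# Complementary slackness: the constraint is active (g = 0) at the solution.
print(res.x[0] + res.x[1] - 1.0)   # approximately 0
```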


14.4 Example: Constrained Optimization with Lagrange Multipliers#

Consider the problem of minimizing the function f(x, y) = x² + y², subject to the constraint x + y = 1.

  1. Step 1: Form the Lagrangian:
    [
    \mathcal{L}(x, y, \lambda) = x^2 + y^2 + \lambda (x + y - 1)
    ]

  2. Step 2: Compute the gradients:
    [
    \nabla_{x} \mathcal{L}(x, y, \lambda) = 2x + \lambda = 0
    ]
    [
    \nabla_{y} \mathcal{L}(x, y, \lambda) = 2y + \lambda = 0
    ]
    [
    \nabla_{\lambda} \mathcal{L}(x, y, \lambda) = x + y - 1 = 0
    ]

  3. Step 3: Solve the system of equations:
    From the first two equations:
    [
    2x + \lambda = 0 \quad \Rightarrow \quad \lambda = -2x
    ]
    [
    2y + \lambda = 0 \quad \Rightarrow \quad \lambda = -2y
    ]
    Equating the two expressions for λ, we get:
    [
    -2x = -2y \quad \Rightarrow \quad x = y
    ]

  4. Substituting x = y into the constraint x + y = 1:
    [
    2x = 1 \quad \Rightarrow \quad x = \frac{1}{2}, \quad y = \frac{1}{2}
    ]

Thus, the solution is x = 1/2 and y = 1/2, and the minimum value of the objective function is:
[
f\left( \frac{1}{2}, \frac{1}{2} \right) = \left( \frac{1}{2} \right)^2 + \left( \frac{1}{2} \right)^2 = \frac{1}{2}
]
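As a quick numerical cross-check of this result (not part of the original derivation), the same problem can be handed to SciPy's SLSQP solver with an equality constraint:

```python
import numpy as np
from scipy.optimize import minimize

# min x^2 + y^2 subject to x + y = 1 (the example above).
f    = lambda v: v[0]**2 + v[1]**2
cons = [{"type": "eq", "fun": lambda v: v[0] + v[1] - 1.0}]

res = minimize(f, x0=np.array([0.0, 0.0]), method="SLSQP", constraints=cons)
print(res.x, res.fun)   # approximately [0.5, 0.5] and 0.5
```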


14.5 Practice Problems#

  1. Minimization with Equality Constraint:
    Minimize f(x, y) = x² + y² subject to the constraint x + y = 3 using the Lagrange multiplier method.

  2. Optimization with Inequality Constraint:
    Minimize f(x, y) = x² + y² subject to the constraint x - y ≥ 1 using the KKT conditions.

  3. Quadratic Programming Problem:
    Solve the following constrained optimization problem using the Lagrange multiplier method:
    [
    \min_{x, y} f(x, y) = x^2 + y^2
    ]
    subject to:
    [
    x + y = 1, \quad x - y \geq 0
    ]



15. Nonlinear Optimization#

Nonlinear optimization refers to the optimization of a function that is nonlinear with respect to its decision variables, and may also involve nonlinear constraints. Nonlinear optimization problems can be more complex to solve compared to linear optimization problems, as the objective function or constraints may have non-linearities, which may introduce multiple local minima or maxima, saddle points, and complex geometries.

The general form of a nonlinear optimization problem is:

[
\min_{\mathbf{x}} f(\mathbf{x})
]
subject to:
[
g_i(\mathbf{x}) \leq 0, \quad i = 1, 2, \dots, m \quad \text{(inequality constraints)}
]
[
h_j(\mathbf{x}) = 0, \quad j = 1, 2, \dots, p \quad \text{(equality constraints)}
]

Where:
- f(x) is a nonlinear objective function,
- gi(x) are nonlinear inequality constraints,
- hj(x) are nonlinear equality constraints.

Nonlinear optimization problems can have various challenges, such as the presence of non-convexity, multiple local minima or saddle points, and the difficulty in finding global optima. Therefore, different strategies are employed to handle nonlinear optimization effectively.


15.1 Unconstrained Nonlinear Optimization#

An unconstrained nonlinear optimization problem is of the form:

[
\min_{\mathbf{x}} f(\mathbf{x})
]

The techniques for solving unconstrained nonlinear optimization problems are similar to those discussed in unconstrained optimization (like gradient descent, Newton's method, etc.), but now the objective function f(x) is nonlinear.

15.1.1 Gradient Descent for Nonlinear Optimization#

For unconstrained nonlinear optimization, gradient descent is a commonly used iterative method, especially when the function f(x) is smooth (differentiable).

  1. Algorithm:
     • Initialize x_0 (the initial guess).
     • At each iteration k, update x_k by moving in the direction of the negative gradient:

[
\mathbf{x}_{k+1} = \mathbf{x}_k - \eta \nabla f(\mathbf{x}_k)
]
Where:
- η is the learning rate (step size),
- ∇f(x_k) is the gradient of f at x_k.

  2. Convergence Criteria:
     • The algorithm continues until the gradient ∇f(x_k) is sufficiently small (i.e., close to zero), or a predetermined number of iterations is reached.

15.1.2 Newton’s Method for Nonlinear Optimization#

Newton's method uses both the gradient and the Hessian to find the optimal solution. For nonlinear optimization, the update rule is:

[
\mathbf{x}_{k+1} = \mathbf{x}_k - H^{-1}(\mathbf{x}_k) \nabla f(\mathbf{x}_k)
]

Where H(x_k) is the Hessian matrix, which is the matrix of second derivatives of f(x).

  • This method converges faster than gradient descent when the Hessian is available, but it can be computationally expensive due to the calculation and inversion of the Hessian matrix.

15.2 Constrained Nonlinear Optimization#

For constrained nonlinear optimization problems, we need to deal with both the objective function and the constraints. The most common methods used to solve these problems are:

  1. Lagrange Multipliers Method:
     • Used when dealing with equality constraints: the Lagrangian function is formed by adding the constraints to the objective function, as discussed earlier in Constrained Optimization.

  2. KKT (Karush-Kuhn-Tucker) Conditions:
     • The KKT conditions are a set of necessary conditions for a solution to be optimal when inequality constraints are present. They generalize the Lagrange multipliers method for problems involving inequality constraints.

15.3 Trust-Region Methods#

Trust-region methods are an important class of algorithms for solving nonlinear optimization problems, particularly when dealing with non-convex functions. These methods are based on approximating the objective function using a model that is valid in a local region around the current iterate.

  1. Trust-Region Algorithm:
     • In each iteration, the algorithm approximates the objective function by a quadratic model and solves a subproblem to minimize the model within a "trust region", a region where the model is assumed to be an accurate representation of the original function.

  2. Steps:
     • Model Step: Construct a quadratic model m(x) of the objective function.
     • Trust-Region Subproblem: Solve the subproblem within a predefined region, subject to the model’s validity.
     • Update Step: Accept or reject the step based on how well the model approximates the objective function.

Trust-region methods are often used for nonlinear optimization problems with constraints where the model and objective function may exhibit sharp changes.
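SciPy exposes several trust-region variants through scipy.optimize.minimize (for example "trust-ncg" and "trust-constr"). A minimal call on the built-in Rosenbrock test function, used here purely as an illustration, might look like this:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

# Trust-region Newton-CG on the Rosenbrock function (minimum at [1, 1]).
x0 = np.array([-1.2, 1.0])
res = minimize(rosen, x0, method="trust-ncg", jac=rosen_der, hess=rosen_hess)
print(res.x)   # approximately [1.0, 1.0]
```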


15.4 Sequential Quadratic Programming (SQP)#

Sequential Quadratic Programming is a method for solving nonlinear optimization problems with nonlinear constraints. It involves solving a sequence of quadratic approximation problems, where each subproblem is a quadratic programming problem with linearized constraints.

  1. SQP Algorithm:
     • At each step, SQP approximates the objective and constraints by quadratic models and solves the resulting quadratic programming problem.
     • The solutions from each subproblem are used to update the estimate for the decision variables (a minimal SciPy illustration follows this list).
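SciPy's SLSQP method is one widely available SQP implementation. As a rough sketch (the objective, constraint, and starting point are arbitrary illustrative choices), the example below minimizes a quadratic objective over a nonlinear equality constraint:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative problem: minimize (x - 2)^2 + (y - 1)^2 subject to x^2 + y^2 = 1.
f    = lambda v: (v[0] - 2.0)**2 + (v[1] - 1.0)**2
cons = [{"type": "eq", "fun": lambda v: v[0]**2 + v[1]**2 - 1.0}]

res = minimize(f, x0=np.array([0.5, 0.5]), method="SLSQP", constraints=cons)
print(res.x)   # approximately [0.894, 0.447], the closest point on the unit circle to (2, 1)
```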

15.5 Nonlinear Programming Example#

Consider a nonlinear optimization problem:

[
\min_{x, y} f(x, y) = x^2 + y^2
]
subject to:
[
g(x, y) = x^2 + y^2 - 1 = 0 \quad \text{(unit circle constraint)}
]

  1. Step 1: Form the Lagrangian:
    [
    \mathcal{L}(x, y, \lambda) = x^2 + y^2 + \lambda (x^2 + y^2 - 1)
    ]

  2. Step 2: Compute the gradients:
    [
    \nabla_x \mathcal{L} = 2x + 2\lambda x = 0
    ]
    [
    \nabla_y \mathcal{L} = 2y + 2\lambda y = 0
    ]
    [
    \nabla_\lambda \mathcal{L} = x^2 + y^2 - 1 = 0
    ]

  3. Step 3: Solve the system of equations:
     From 2x(1 + λ) = 0, we have two cases:
    • Case 1: x = 0, then y² = 1, so y = ±1,
    • Case 2: 1 + λ = 0, so λ = -1; then the stationarity equations hold for any point, and only the constraint x² + y² = 1 needs to be satisfied.

Case 1 gives the particular critical points (x, y) = (0, 1) and (0, -1). Note that this example is degenerate: the objective x² + y² equals 1 at every feasible point, so every point on the unit circle is optimal.


15.6 Practice Problems#

  1. Unconstrained Nonlinear Optimization:
    Minimize the function f(x) = x⁴ - 4x³ + 6x² using gradient descent. Start with an initial guess x_0 = 1 and a learning rate η = 0.1. Calculate the first few iterations.

  2. Constrained Nonlinear Optimization:
    Minimize f(x, y) = x² + y² subject to x² + y² = 4 using the Lagrange multiplier method.

  3. Trust-Region Problem:
    Solve a simple trust-region subproblem: minimize f(x) = x² subject to |x| ≤ 1.

