Optimization is ubiquitous in the realms of computer science, physics, mathematics, and economics. It stands as an essential tool for AI and machine learning (ML) professionals, applicative in diverse domains including decision-making, route planning, and learning parameters in ML models, such as Support Vector Machines (SVM) and neural networks. The most general form of optimization is finding a minimum/maximum of a function with respect to its independent variables, which can be achieved by applying basic concepts of differential calculus. Mathematically, at these extremities the slope (first derivative) of a function is zero, referred to as stationary points. Determining whether such a point represents a maxima or a minima is done by evaluating the curvature (second derivative).
Taking this a step further, we can add constraints to the optimization problem that define a specific domain in space where the function is to be optimized. Consequently, instead of determining the maximum and minimum of a function in all of real (or complex) space, the optimization is now confined to this specific domain. The conventional approach of calculating stationary points is no longer a solution, as these points may fall outside the boundary set by the constraints. In the coming sections, we will analyze the intricacies of constrained optimization problems and explore strategies for their resolution.
Optimization problems with equality constraints are of the form
where f(x) is the function which we seek to minimize, and the constraint g(x) = 0 defines the domain within which the minimization is to be carried out. In these instances, the focus of minimization is inherently confined to the specific domain defined by the constraint. Nonetheless, as previously noted, the conventional application of differential calculus to determine stationary points does not account for the constraint, necessitating an alternative approach.
Lagrangian function
Given that this is a minimization problem, one approach to adapt the conventional method is to assign a value of infinity to the function outside the specified domain. To achieve this, we introduce a new function f’(x) characterized by the following expression:
Such modification eliminates the possibility of minima to occur outside the domain, thereby ensuring that the optimal point occurs within it. Consequently, we can now reformulate the constrained optimization into an unconstrained optimization problem.
However, there is a challenge that comes with this approach. Using differential calculus to optimize the above problem is not possible, since the function f’(x) is not differentiable due to a sudden discontinuity at the at the boundary of the domain. Here is where Lagrangian comes into play. Rather than defining the function f’(x) as in (2), we formulate it as a maximization problem.
The expression on the RHS is called the Lagrangian function and the new variable ???? is the Lagrange multiplier. It is evident from (4) that that at regions where {g(x)<0, g(x)>0}, ???? can take the values {-∞, ∞} to maximize the expression to ∞.
As a result, the optimization in (3) takes the following form.
It’s worth noting that the the problem of non-differentiability still exists as the inner maximization results in the same discontinuous function. However, with the Lagrangian representation, we can use the max-min inequality to convert the max-min problem to the min-max problem to get over this issue.
Here, we first optimize with respect to the independent variable x and then with respect to the Lagrange multiplier ????.
We’ll now analyze the scenarios when the constraint is not a equation but an inequality. Such optimizations are of the form:
We can solve this using a similar approach: we define f’(x) to be the same as f(x) within the domain defined by the constraints and infinite elsewhere:
And correspondingly, the Lagrangian function is defined as:
The Lagrange multipliers corresponding to inequality constrains are denoted by ????. Equation (9) is different in that it also has constrains on the Lagrange multipliers, which was not in (4). Now the optimization problem in (7) takes the form
Applying min-max inequality,
KKT (Karush-Kuhn-Tucker) conditions
The optimization in (10) is called the primal version and (11) is its dual version. According to min-max inequality, the dual version lower bounds the primal version, suggesting that the two versions are not necessarily equal. However, there are conditions where the primal and dual versions are equal, which is called the regularity condition. Assuming regularity, for (x*, ????*) to be the solution point it has to satisfy the following KKT conditions:
- Primal Feasibility
It follows from the problem definition.
2. Dual Feasibility
The dual feasibility follows from (9).
3. Stationarity
This is an interesting property. Since ????* is a zero or a positive, the stationarity condition essentially implies that at the optimal point, the gradients of f(x) and g(x) must be oriented in opposite directions. The rationale behind this is as follows: if the gradients of f(x) and g(x) were aligned in the same direction at the point x = x*, then both f(x) and g(x) would simultaneously decrease in a direction opposite to their gradients. This scenario would permit f(x) to continue decreasing beyond the value f(x*) without violating the constraint, in which case x* no longer qualifies as the optimal point. Therefore for a point to be the optimum, the stationarity property must hold.
4. Complementary Slackness
This is another interesting property which directly follows from equation (9). When the constraint g(x*) < 0, the Lagrange multiplier ????* must equal to zero. Since the the Lagrange multiplier also indicates how sensitive our solution is to the associated constraint, a value of ????* = 0 signifies that the associated constraint has no influence on determining the solution. In other words, whether we consider the solution with or without the constraint, the outcome remains unaltered. One straightforward example is when f(x) has a global minima in the domain where g(x) ≤ 0. For the other example, consider the minimization of the function f(x) subject to two constraints: g¹(x) < 5 and g²(x) < -1. In this case, the Lagrange multiplier ????²* corresponding to the constraint g² is zero, as g¹ already covers the conditions of g², rendering g² insignificant as a constraint.
Application: Support Vector Machine (SVM)
An example of constrained optimization with inequality constraints in machine learning is the Support Vector Machine (SVM). When given a dataset of data points {(x¹, y¹), (x², y²), …} with y ∈ {-1, 1} representing the two classes, the objective is to identify a classifier that maximizes the margin between the classes. Specifically, we formulate SVM as the following minimization problem:
The term ||w|| in the equation represents the inverse of the margin. It’s evident that there are numerous inequality constraints: in fact we have a constrained tied to each data point. However, in practice, the solution is only guided by a few data points that lie in proximity to the classifier boundary; these are referred to as support vectors. As we discussed in complementary slackness, only the Lagrange multipliers corresponding to the constraints linked to the support vectors possess non-zero values. For all other data points, their associated constraints bear Lagrange multiplier values of zero, rendering them insignificant in determining the classifier boundary.
Conclusion
We started with a brief introduction on unconstrained optimization problem and gradually expanding it to incorporate the equality and inequality constraints. Moreover, we discussed how Lagrangian function solves the challenges introduced by the constraints. Delving into the optimality of the Lagrangian, we gained insights into the KKT conditions. Lastly, we provided a succinct overview of how SVMs are formulated as constrained optimization problems and briefly discussed on its solutions.