I’m glad you brought up this question. To get straight to the point, we typically avoid p values less than 1 because they lead to non-convex optimization problems. Let me illustrate this with an image showing the shape of Lp norms for different p values. Take a close look at when p=0.5; you’ll notice that the shape is decidedly non-convex.
This becomes even clearer when we look at a 3D representation, assuming we’re optimizing three weights. In this case, it’s evident that the problem isn’t convex, with numerous local minima appearing along the boundaries.
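To make the non-convexity concrete, here is a minimal sketch (plain NumPy; the vectors chosen are arbitrary) that checks the defining inequality of convexity at the midpoint of two points. For p = 0.5 the value at the midpoint exceeds the average of the values at the endpoints, which is exactly what convexity forbids:

```python
import numpy as np

def lp_penalty(w, p):
    # Sum of |w_i|^p, the quantity an Lp-style penalty adds to the loss.
    return np.sum(np.abs(w) ** p)

# Two arbitrary weight vectors and their midpoint.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid = 0.5 * (a + b)

for p in [0.5, 1.0, 2.0]:
    value_at_mid = lp_penalty(mid, p)
    average_of_values = 0.5 * lp_penalty(a, p) + 0.5 * lp_penalty(b, p)
    # A convex function must satisfy value_at_mid <= average_of_values;
    # p = 0.5 violates this, while p = 1 and p = 2 satisfy it.
    print(f"p={p}: f(midpoint)={value_at_mid:.3f}, "
          f"average={average_of_values:.3f}, "
          f"convexity holds: {value_at_mid <= average_of_values}")
```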
The reason we typically avoid non-convex problems in machine learning is their complexity. With a convex problem, any local minimum is also a global minimum, which generally makes it easier to solve. Non-convex problems, on the other hand, often come with multiple local minima and can be computationally intensive and unpredictable to optimize. It's exactly these kinds of challenges we aim to sidestep in ML.
When we use techniques like Lagrange multipliers to optimize a function under constraints, it's crucial that those constraints are convex functions. Adding a convex constraint to a convex loss keeps the combined problem convex, so the constraint doesn't alter the fundamental properties of the problem we're solving. If the constraint were non-convex, we would be importing exactly the difficulties described above into an otherwise well-behaved problem.
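To connect this back to regularization, here is a small illustrative sketch (toy data and a made-up weight vector) of the constrained view versus the penalized, Lagrangian-style view of L2 regularization. Because the L2 constraint is convex, adding its penalty term to a convex loss keeps the overall objective convex:

```python
import numpy as np

# Toy data, made up purely for illustration.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

def mse_loss(w):
    # Convex in w (a quadratic).
    return np.mean((X @ w - y) ** 2)

# Constrained view:  minimize mse_loss(w)  subject to  ||w||_2^2 <= t
# Penalized view (what we actually optimize), with lam acting as the multiplier:
def penalized_loss(w, lam):
    # ||w||_2^2 is convex, so this sum of convex terms is still convex.
    return mse_loss(w) + lam * np.sum(w ** 2)

w = np.array([0.5, -0.3])
print(penalized_loss(w, lam=0.1))
```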
Your question touches on an interesting aspect of deep learning. It's not that we prefer non-convex problems; it's more accurate to say that we often encounter them and have to deal with them in deep learning. Here's why:
- The nature of deep learning models leads to a non-convex loss surface: Most deep learning models, particularly neural networks with hidden layers, inherently have non-convex loss functions. This is due to the complex, non-linear transformations that occur within these models. The combination of these non-linearities and the high dimensionality of the parameter space typically results in a loss surface that is non-convex.
- Local minima are less of a problem in deep learning: In the high-dimensional spaces typical of deep learning, local minima are not as problematic as they are in lower-dimensional spaces. Research suggests that many of the local minima in deep learning are close in value to the global minimum. Moreover, saddle points (points where the gradient is zero but which are neither maxima nor minima) are more common in such spaces and pose the bigger challenge.
- Advanced optimization techniques are effective in non-convex spaces: Stochastic gradient descent (SGD) and its variants have been particularly effective at finding good solutions in these non-convex spaces. While those solutions might not be global minima, they are often good enough to achieve high performance on practical tasks, as the sketch after this list illustrates.
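Here is a minimal, self-contained sketch of that last point (NumPy only; the architecture, learning rate, and other hyperparameters are arbitrary choices): mini-batch SGD training a tiny one-hidden-layer network. The loss is non-convex in the weights, yet SGD typically ends up at a low-loss solution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: fit y = sin(x) with a 1-16-1 tanh network.
X = rng.uniform(-3, 3, size=(256, 1))
y = np.sin(X)

W1 = rng.normal(scale=0.5, size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)

lr, batch = 0.05, 32
for step in range(3000):
    idx = rng.choice(len(X), size=batch, replace=False)   # random mini-batch
    xb, yb = X[idx], y[idx]

    # Forward pass.
    h = np.tanh(xb @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - yb

    # Backward pass: gradients of the mean squared error.
    dpred = 2 * err / batch
    dW2 = h.T @ dpred
    db2 = dpred.sum(axis=0)
    dh = (dpred @ W2.T) * (1 - h ** 2)
    dW1 = xb.T @ dh
    db1 = dh.sum(axis=0)

    # Plain SGD update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

# Despite the non-convex loss surface, the final training error is typically small.
pred_all = np.tanh(X @ W1 + b1) @ W2 + b2
print("final MSE:", float(np.mean((pred_all - y) ** 2)))
```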
Even though deep learning models are non-convex, they excel at capturing complex patterns and relationships in large datasets. Additionally, research into non-convex functions is continually progressing, enhancing our understanding. Looking ahead, there’s potential for us to handle non-convex problems more efficiently, with fewer concerns.
Recall the image we discussed earlier showing the shapes of Lp norms for various values of p. As p increases, the Lp norm’s shape evolves. For example, at p = 3, it resembles a square with rounded corners, and as p nears infinity, it forms a perfect square.
In the context of our optimization problem, consider higher norms like L3 or L4. Just as with L2 regularization, where the loss function and constraint contours intersect at rounded edges, these higher norms would encourage weights to be close to zero without zeroing them out. (If this part isn't clear, feel free to revisit Part 2 for a more detailed explanation.) With that in mind, there are two crucial reasons why L3 and L4 norms aren't commonly used:
- L3 and L4 norms have a similar effect to L2 (they push weights close to zero) without offering significant new advantages. L1 regularization, in contrast, zeroes out weights and introduces sparsity, which is useful for feature selection. (The small numeric sketch after this list makes the difference concrete.)
- Computational complexity is another vital aspect. Regularization affects the complexity of the optimization process, and L3 and L4 norms are computationally heavier than L2, making them less practical for most machine learning applications.
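One way to see both points at once is to look at the pull each penalty exerts on a weight, i.e. the derivative of |w|^p, which is p·|w|^(p-1)·sign(w). The sketch below (plain NumPy, arbitrary weight values) shows that the L1 pull stays constant as a weight shrinks, so it keeps pushing weights all the way to zero, while the L2, L3, and L4 pulls vanish near zero, so they only shrink weights; the higher powers also cost an extra exponentiation per weight without changing this behavior:

```python
import numpy as np

def penalty_grad(w, p):
    # Derivative of |w|^p with respect to w (for w != 0).
    return p * np.abs(w) ** (p - 1) * np.sign(w)

weights = np.array([1.0, 0.1, 0.01, 0.001])   # a weight shrinking toward zero
for p in [1, 2, 3, 4]:
    pulls = np.abs(penalty_grad(weights, p))
    # p = 1: constant pull of 1.0, so small weights get driven to exactly zero.
    # p >= 2: the pull fades as w -> 0, so weights are shrunk but rarely zeroed.
    print(f"p={p}: pull on each weight = {np.round(pulls, 6)}")
```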
To sum up, while L3 and L4 norms could be used in theory, they don't provide unique benefits over L1 or L2 regularization, and their computational cost makes them a less practical choice.
Yes, it is indeed possible to combine L1 and L2 regularization, a technique often referred to as Elastic Net regularization. This approach blends the properties of both L1 (lasso) and L2 (ridge) regularization and can be useful, though it comes with its own challenges.
Elastic Net regularization is a linear combination of the L1 and L2 regularization terms: it adds both the L1 and the L2 norm of the weights to the loss function, so there are two regularization strengths to tune, lambda1 and lambda2.
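As a minimal sketch of that combined objective (plain NumPy; the function name and the squared-error data term are just illustrative choices, and libraries may parameterize the two strengths differently), the loss looks like this:

```python
import numpy as np

def elastic_net_loss(w, X, y, lambda1, lambda2):
    residual = X @ w - y
    data_term = 0.5 * np.mean(residual ** 2)
    l1_term = lambda1 * np.sum(np.abs(w))   # lasso part: encourages sparsity
    l2_term = lambda2 * np.sum(w ** 2)      # ridge part: encourages small weights
    return data_term + l1_term + l2_term
```

For reference, scikit-learn's ElasticNet expresses the same idea through an overall strength alpha and a mixing ratio l1_ratio rather than two separate lambdas.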
By combining both regularization techniques, Elastic Net can improve the generalization capability of the model, reducing the risk of overfitting more effectively than using either L1 or L2 alone.
Let's break down its advantages:
- Elastic Net provides more stability than L1. L1 regularization can lead to sparse models, which is useful for feature selection, but it can also be unstable in certain situations. For example, among highly correlated variables, L1 regularization tends to select one more or less arbitrarily while driving the others' coefficients to zero, whereas Elastic Net tends to distribute the weights more evenly among those variables (see the small comparison sketch after this list).
- L2 can be more stable than L1 regularization, but it doesn’t encourage sparsity. Elastic Net aims to balance these two aspects, potentially leading to more robust models.
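To make the stability point above concrete, here is a small scikit-learn sketch (synthetic data, arbitrary hyperparameters) that fits Lasso and Elastic Net on two nearly identical, highly correlated features. Lasso will typically concentrate the weight on one of them, while Elastic Net tends to spread it across both:

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(42)

# Two almost-duplicate (highly correlated) features plus one independent feature.
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)     # nearly identical to x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 3 * x2 + 1 * x3 + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.5, max_iter=10_000).fit(X, y)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=10_000).fit(X, y)

print("Lasso coefficients:      ", np.round(lasso.coef_, 3))
print("Elastic Net coefficients:", np.round(enet.coef_, 3))
```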
However, Elastic Net regularization introduces an extra hyperparameter that demands careful tuning. Finding the right balance between the L1 and L2 terms, and with it the best model performance, requires additional computational effort. This added complexity is one reason it isn't used more often.