eBook › Chapter 7 · Regression

Section 7.4

Regularization

Having discussed the source of the overfitting problem, we now discuss methods to alleviate overfitting. The method we focus on here is regularization. Regularization means that instead of seeking the model parameters by minimizing the training loss alone, we add a penalty term to force the parameters to “behave better”. As a preview of the technique, we change the original training loss

\calE_{\text{train}}(\vtheta) = \underset{\textcolor{blue}{\text{data fidelity}}}{\underbrace{\sum_{n=1}^N \bigg(y_n - \sum_{p=0}^{d-1} \theta_p \phi_p(x_n)\bigg)^2}},

which consists of only the data fidelity term, to a modified training loss

\calE_{\text{train}}(\vtheta) = \underset{\textcolor{blue}{F(\vtheta), \; \text{data fidelity}}}{\underbrace{\sum_{n=1}^N \bigg(y_n - \sum_{p=0}^{d-1} \theta_p \phi_p(x_n)\bigg)^2}} + \underset{\textcolor{blue}{\lambda \cdot R(\vtheta), \; \text{regularization}}}{\underbrace{ \lambda \cdot \sum_{p=0}^{d-1} \theta_p^2}}.

Putting this into the matrix form, we define the data fidelity term as

F(\vtheta) = \|\mX\vtheta - \vy\|^2.

The newly added term $R(\vtheta)$ is called the regularization function or the penalty function. It can take a variety of forms, e.g.,

sep0ex
Ridge regression: $R(\vtheta) = \sum_{p=0}^{d-1} \theta_p^2 = \|\vtheta\|^2$.
LASSO regression: $R(\vtheta) = \sum_{p=0}^{d-1} |\theta_p| = \|\vtheta\|_1$.

In this section we aim to understand the role of the regularization functions by studying these two examples of $R(\vtheta)$.

7.4.1Ridge regularization

To explain the meaning of Eq. (7.34) we write it in terms of matrices and vectors:

\minimize{\vtheta \in \R^d}\;\; \|\mX\vtheta - \vy\|^2 + \lambda \|\vtheta\|^2,

where $\lambda$ is called the regularization parameter. It needs to be tuned by the user. We refer to Eq. (7.36) as the ridge regression. (In signal processing and optimization, Eq. (7.36) is called the Tikhonov regularization. We follow the statistics community in calling it the ridge regression.)

How can the regularization function help to mitigate the overfitting problem? First let's find the solution to this problem.

Practice Exercise 1

Prove that the solution to Eq. (7.36) is

\vthetahat = (\mX^T\mX + \lambda \mI)^{-1}\mX^T\vy.

Solution

Take the derivative with respect to $\vtheta$. (The solution here requires some basic matrix calculus. You may refer to the University of Waterloo's Matrix Cookbook https://www.math.uwaterloo.ca/ hwolkowi/matrixcookbook.pdf.) This yields

\begin{aligned} \nabla_{\vtheta}\bigg\{\|\mX\vtheta - \vy\|^2 + \lambda \|\vtheta\|^2\bigg\} = 2\mX^T(\mX\vtheta - \vy) + 2\lambda \vtheta = 0. \end{aligned}

Rearranging the terms gives

\begin{aligned} (\mX^T\mX+\lambda\mI)\vtheta = \mX^T\vy. \end{aligned}

Inverting the matrix on the left-hand side yields the solution.

Let us compare the ridge regression solution with the vanilla regression solutions:

\begin{aligned} \vthetahat_{\text{vanilla}} &= (\mX^T\mX)^{-1}\mX^T\vy,\\ \vthetahat_{\text{ridge}}(\lambda) &= (\mX^T\mX + \lambda \mI)^{-1}\mX^T\vy. \end{aligned}

Clearly, the only difference is the presence of the parameter $\lambda$:

sep0ex
If $\lambda \rightarrow 0$, then $\vthetahat_{\text{ridge}}(0) = \vthetahat_{\text{vanilla}}$. This is because
$\calE_{\text{train}}(\vtheta) = \|\mX\vtheta - \vy\|^2 + \underset{=0}{\underbrace{\lambda \|\vtheta\|^2}}.$
Hence, when $\lambda \rightarrow 0$, the regression problem goes back to the vanilla version, and so does the solution.
If $\lambda \rightarrow \infty$, then $\vthetahat_{\text{ridge}}(\infty) = 0$. This happens because
$\calE_{\text{train}}(\vtheta) = \underset{=0}{\underbrace{\frac{1}{\lambda}\|\mX\vtheta - \vy\|^2}} + \|\vtheta\|^2.$
Since we are now minimizing $\|\vtheta\|^2$, the solution will be $\vtheta = 0$ because zero is the smallest value a squared function can achieve.

For any $0 < \lambda < \infty$, the net effect of $(\mX^T\mX + \lambda \mI)$ is the constant $\lambda$ added to all the eigenvalues of $\mX^T\mX$. By taking the eigendecomposition of $\mX^T\mX$,

[\mU,\mS] = \texttt{eig}(\mX^T\mX),

we have that

\begin{aligned} \mX^T\mX + \lambda \mI &= \mU\mS\mU^T + \lambda \mI \\ &= \mU\mS\mU^T + \lambda \mU\mU^T = \mU(\mS+\lambda\mI)\mU^T. \end{aligned}

Therefore, if the eigenvalue matrix $\mS$ has a zero eigenvalue it will be offset by $\lambda$:

\begin{aligned} \mS = \begin{bmatrix} \clubsuit & & & \\ & \heartsuit & & \\ & & \spadesuit & \\ & & & 0 \\ \end{bmatrix} \;\;\; \longrightarrow \;\;\; \mS+\lambda\mI = \begin{bmatrix} \clubsuit+\lambda & & & \\ & \heartsuit+\lambda & & \\ & & \spadesuit +\lambda& \\ & & & \lambda \\ \end{bmatrix} \end{aligned}

As a result, even if $\mX^T\mX$ is not invertible (or close to not invertible), the new matrix $\mX^T\mX+\lambda\mI$ is guaranteed to be invertible.

Practice Exercise 2

You may be wondering what happens if $\mX^T\mX$ has a negative eigenvalue so that when we add a positive $\lambda$, the resulting matrix may have a zero eigenvalue. Prove that $\mX^T\mX$ will never have a negative eigenvalue, and $\mX^T\mX+\lambda\mI$ always has positive eigenvalues.

Solution

Eigenvalues of a matrix $\mA$ are nonnegative if and only if $\vv^T\mA\vv \ge 0$ for any $\vv$. Thus we need to check whether $\vv^T\mX^T\mX\vv \ge 0$ for all $\vv$. However, this is easy:

\begin{aligned} \vv^T\mX^T\mX\vv = \|\mX\vv\|^2, \end{aligned}

which must be nonnegative for any $\vv$. Matrices satisfying this property are called positive semidefinite. Therefore, $\mX^T\mX$ is positive semidefinite.

Implementation

Solving the ridge regression is easy. First, we observe that the regularization function $R(\vtheta) = \|\vtheta\|^2$ is a quadratic function. Therefore, it can be combined with the data fidelity term as

\begin{aligned} \vthetahat &= \argmin{\vtheta \in \R^d} \;\; \|\mX\vtheta - \vy\|^2 + \lambda \|\vtheta\|^2\\ &= \argmin{\vtheta \in \R^d} \;\; \|\mX\vtheta - \vy\|^2 + \|\sqrt{\lambda}\mI \vtheta - \vzero \|^2\\ &= \argmin{\vtheta \in \R^d} \;\; \left\|\begin{bmatrix}\mX \\ \sqrt{\lambda} \mI \end{bmatrix}\vtheta - \begin{bmatrix}\vy \\ \vzero \end{bmatrix}\right\|^2. \end{aligned}

Therefore, all we need to do is to concatenate the matrix $\mX$ with a $d \times d$ identity operator $\sqrt{\lambda}\mI$, and concatenate $\vy$ with a $d \times 1$ all-zero vector.

In MATLAB and Python, the implementation of the ridge regression is done by defining a new matrix A and a new vector b, as shown below:

MATLAB

% MATLAB command for ridge regression
    A = [X; sqrt(lambda)*eye(d)];
    b = [y(:); zeros(d,1)];
    theta = A\b;

Python

% MATLAB command for ridge regression
    A = np.vstack((X, np.sqrt(lambd)*np.eye(d)))
    b = np.hstack((y, np.zeros(d)))
    theta = np.linalg.lstsq(A, b, rcond=None)[0]

Example 7.6

Consider a dataset of $N = 20$ data points. These data points are constructed from the model

y_n = 0.5 - 2x_n - 3x_n^2 + 4x_n^3 + 6x_n^4 + e_n, \qquad n = 1,\ldots, N,

where $e_n \sim \text{Gaussian}(0,0.25^2)$ is the noise. Fit the data using

sep0ex
(a) Vanilla linear regression with a $4$th-order polynomial.
(b) Vanilla linear regression with a 20th-order polynomial.
(c) Ridge regression with a 20th-order polynomial, by considering three choices of $\lambda$: $\lambda = 10^{-6}$, $\lambda = 10^{-3}$, and $\lambda = 10$.

Solution

sep0ex
(a) We first fit the data using a 4th-order polynomial. This fitting is relatively straightforward. In the MATLAB / Python programs below, set $d = 4$ and $\lambda = 0$. The result is shown in Figure 7.16(a).

**Figure 7.16.** Overfitting occurs when the model is too complex for the number of training samples. When using a vanilla regression with a 20th-order polynomial, the curve overfits the data and causes a catastrophic fitting error.

sep0ex
(b) Suppose we use a 20th-order polynomial $g(x) = \sum_{p=0}^{20} \theta_p x^p$ to fit the data. We plot the result in Figure 7.16(b). Since the order of the polynomial is very high relative to the number of training samples, it comes as no surprise that the fitting is poor. This is overfitting, and we know the reason.

sep0ex
(c) Next, we consider a ridge regression using three choices of $\lambda$. The result is shown in Figure 7.17. If $\lambda$ is too small, we observe that some overfitting still occurs. If $\lambda$ is too large, then the curve underfits the data. For an appropriately chosen $\lambda$, it can be seen that the fitting is reasonably good.

**Figure 7.17.** Ridge regression addresses the overfitting problem by adding a regularization term to the training loss. Depending on the strength of the parameter $\lambda$, the fitted curve can vary from overfitting to underfitting.

The MATLAB and Python codes used to generate the above plots are shown below.

MATLAB

% MATLAB code to demonstrate a ridge regression example
    % Generate data
    N = 20;
    x = linspace(-1,1,N);
    a = [0.5, -2, -3, 4, 6];
    y = a(1)+a(2)*x(:)+a(3)*x(:).^2+a(4)*x(:).^3+a(5)*x(:).^4+0.25*randn(N,1);
    
    % Ridge regression
    lambda = 0.1;
    d = 20;
    X = zeros(N, d);
    for p=0:d-1
        X(:,p+1) = x(:).^p;
    end
    A = [X; sqrt(lambda)*eye(d)];
    b = [y(:); zeros(d,1)];
    theta = A\b;
    
    % Interpolate and display results
    t    = linspace(-1, 1, 500);
    Xhat = zeros(length(t), d);
    for p=0:d-1
        Xhat(:,p+1) = t(:).^p;
    end
    yhat = Xhat*theta;
    plot(x,y,   'ko','LineWidth',2, 'MarkerSize', 10); hold on;
    plot(t,yhat,'LineWidth',4,'Color',[0.2 0.2 0.9]);

Python

# Python code to demonstrate a ridge regression example
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.special import eval_legendre
    np.set_printoptions(precision=2, suppress=True)
    
    N = 20
    x = np.linspace(-1,1,N)
    a = np.array([0.5, -2, -3, 4, 6])
    y = a[0] + a[1]*x    + a[2]*x**2 + \
               a[3]*x**3 + a[4]*x**4 + 0.25*np.random.randn(N)
    d = 20
    X = np.zeros((N, d))
    for p in range(d):
      X[:,p] = x**p
    
    lambd = 0.1
    A = np.vstack((X, np.sqrt(lambd)*np.eye(d)))
    b = np.hstack((y, np.zeros(d)))
    theta = np.linalg.lstsq(A, b, rcond=None)[0]
    
    t    = np.linspace(-1, 1, 500)
    Xhat = np.zeros((500,d))
    for p in range(d):
      Xhat[:,p] = t**p
    yhat = np.dot(Xhat, theta)
    
    plt.plot(x,y,'o',markersize=12)
    plt.plot(t,yhat, linewidth=4)
    plt.show()

Why does ridge regression work?

sep0ex
The penalty term $\|\vtheta\|^2$ in
$\vthetahat_{\text{ridge}} = \argmin{\vtheta \in \R^d} \;\; \|\mX\vtheta - \vy\|^2 + \lambda\|\vtheta\|^2$
does not allow solutions with a very large $\|\vtheta\|^2$.
The penalty term adds a positive offset to the eigenvalues of $\mX^T\mX$.
Since the denominator in $(\mX^T\mX+\lambda\mI)^{-1}\mX^T\vy$ becomes larger than that of $(\mX^T\mX)^{-1}\mX^T\vy$, noise in $\vy$ is less amplified.

Choosing the parameter

How should we choose the parameter $\lambda$? The honest answer is that there is no answer because the optimal $\lambda$ can only be found if we have access to the testing samples. If we do, we can plot the MSE (the testing error) with respect to $\lambda$, as shown in Figure 7.18(a).

**Figure 7.18.** (a) Determining the optimal $\lambda$ requires knowledge of the testing samples. In practice, we can replace the testing samples with the validation samples, which are subsets of the training data. Then by plotting the validation error as a function of $\lambda$ we can determine the optimal $\lambda$. (b) The alternative is to plot $F(\vthetahat_{\lambda})$ versus $R(\vthetahat_{\lambda})$. The optimal $\lambda$ can be found by locating the elbow point.

Of course in reality we do not have access to the testing data. However, we can reserve a small portion of the training samples and treat them as validation samples. Then we run the ridge regression for different choices of $\lambda$. The $\lambda$ that minimizes the error on these validation samples is the one that you should deploy. If the training set is small, we can shuffle the validation samples randomly and compute the average. This scheme is known as cross-validation.

For some problems, there are “tactics” you may be able to employ for determining the optimal $\lambda$. The first approach is to ask yourself what would be the reasonable range of $\|\vtheta\|^2$ or $\|\mX\vtheta-\vy\|^2$? Are you expecting them to be large or small? Approximately in what order of magnitude? If you have some clues about this, then you can plot the function $F(\vthetahat_{\lambda}) = \|\mX\vthetahat_{\lambda}-\vy\|^2$ as a function of $R(\vthetahat_{\lambda}) = \|\vthetahat_{\lambda}\|^2$, where $\vthetahat_{\lambda}$ is a shorthand notation for $\vthetahat_{\text{ridge}}(\lambda)$, which is the estimated parameter using a specific value of $\lambda$. Figure 7.18(b) shows an example of such a plot. As you can see, by varying $\lambda$ we have different values of $F(\vthetahat_{\lambda})$ and $R(\vthetahat_{\lambda})$.

If you have some ideas about what $\|\vtheta\|^2$ should be, say you want $\|\vtheta\|^2 \le \tau$, you can go to the $F(\vthetahat_{\lambda})$ versus $R(\vthetahat_{\lambda})$ curve and find a point such that $R(\vthetahat_{\lambda}) \le \tau$. On the other hand, if you want $\|\mX\vtheta-\vy\|^2 \le \epsilon$, you can also go to the $F(\vthetahat_{\lambda})$ versus $R(\vthetahat_{\lambda})$ curve and find a point such that $\|\mX\vtheta-\vy\|^2 \le \epsilon$. In either case, you have the freedom to shift the difficulty of finding $\lambda$ to that of finding $\tau$ or $\epsilon$. Note that $\tau$ and $\epsilon$ have better physical interpretations. The quantity $\epsilon$ tells us the upper bound of the prediction error, and $\tau$ tells us the upper bound of the parameter magnitude. If you have been working on your dataset long enough, the historical data (and your experience) will help you determine these values.

Another feasible option suggested in the literature is finding the anchor point of the $F(\vthetahat_{\lambda})$ and $R(\vthetahat_{\lambda})$. The idea is that if the curve has a sharp elbow, the turning point would indicate a rapid increase/decrease in $F(\vthetahat_{\lambda})$ (or $R(\vthetahat_{\lambda})$).

How to determine $\lambda$

sep0ex
Cross-validation: Reserve a few training samples as validation samples. Check the prediction error w.r.t. these validation samples. The $\lambda$ that minimizes the validation error is the one you deploy.
$\|\vtheta\|^2 \le \tau$: Plot the $F(\vthetahat_{\lambda})$ and $R(\vthetahat_{\lambda})$. Then go along the $R$-axis to find the position where $R(\vthetahat_{\lambda}) \le \tau$.
$\|\mX\vtheta-\vy\|^2 \le \epsilon$: Plot the $F(\vthetahat_{\lambda})$ and $R(\vthetahat_{\lambda})$. Then go along the $F$-axis to find the position where $F(\vthetahat_{\lambda}) \le \epsilon$.
Find the elbow point of $F(\vthetahat_{\lambda})$ and $R(\vthetahat_{\lambda})$.

Bias and variance trade-off for ridge regression

We now discuss the bias and variance trade-off of the ridge regression.

Theorem 7.7

Let $\vy = \mX\vtheta+\ve$ be the training data, where $\ve$ is zero-mean and has a covariance $\sigma^2\mI$. Consider the ridge regression

\vthetahat_{\lambda} = \argmin{\vtheta \in \R^d} \;\; \|\mX\vtheta -\vy\|^2 + \lambda\|\vtheta\|^2.

Then the estimate has the properties that

\begin{aligned} \vthetahat_{\lambda} &= (\mX^T\mX+\lambda\mI)^{-1}\mX^T\mX\vtheta + (\mX^T\mX+\lambda\mI)^{-1}\mX^T\ve, \\ \E[\vthetahat_{\lambda}] &= (\mX^T\mX+\lambda\mI)^{-1}\mX^T\mX\vtheta = \mW_{\lambda}\vtheta, \\ \Cov[\vthetahat_{\lambda}] &= \sigma^2(\mX^T\mX+\lambda\mI)^{-1}\mX^T\mX(\mX^T\mX + \lambda\mI)^{-1}, \\ \text{MSE}(\vthetahat_{\lambda},\vtheta) &= \sigma^2\text{Tr}\bigg\{\mW_\lambda(\mX^T\mX)^{-1}\mW_\lambda^T\bigg\} + \vtheta^T(\mW_\lambda-\mI)^T(\mW_\lambda-\mI)\vtheta, \end{aligned}

where $\mW_\lambda = (\mX^T\mX+\lambda\mI)^{-1}\mX^T\mX$.

Proof. The proof of this theorem involves some tedious matrix operations that will be omitted here. If you are interested in the proof you can consult van Wieringen's “Lecture notes on ridge regression”, https://arxiv.org/pdf/1509.09169.pdf.

■

The results of this theorem provide a way to assess the bias and variance. Specifically, from the MSE we know that

\begin{aligned} \text{MSE}(\vthetahat_{\lambda},\vtheta) &= \E_{\ve} \left[\|\vthetahat_{\lambda} - \vtheta\|^2\right]\\ &= \|\E_{\ve}[\vthetahat_{\lambda}] - \vtheta\|^2 + \text{Tr}\left\{\Cov[\vthetahat_{\lambda}]\right\} \\ &= \underset{\text{bias}}{\underbrace{\vtheta^T(\mW_\lambda-\mI)^T(\mW_\lambda-\mI)\vtheta}} + \underset{\text{variance}}{\underbrace{\sigma^2\text{Tr}\bigg\{\mW_\lambda(\mX^T\mX)^{-1}\mW_\lambda^T\bigg\}}}. \end{aligned}

The bias and variance are defined respectively as

\begin{aligned} \text{Bias}(\vthetahat_{\lambda},\vtheta) &= \vtheta^T(\mW_\lambda-\mI)^T(\mW_\lambda-\mI)\vtheta,\\ \text{Var}(\vthetahat_{\lambda},\vtheta) &= \sigma^2\text{Tr}\bigg\{\mW_\lambda(\mX^T\mX)^{-1}\mW_\lambda^T\bigg\}. \end{aligned}

We can then plot the bias and variance as a function of $\lambda$. An example is shown in Figure 7.19.

The result in Figure 7.19 can be summarized in three points:

sep0ex
Bias $\uparrow$ as $\lambda \uparrow$. This is because a large $\lambda$ pushes the solution towards $\vtheta = 0$. Therefore, the bias with respect to the ground truth $\vtheta$ will increase.
Variance $\downarrow$ as $\lambda \uparrow$. Since variance is caused by noise, increasing $\lambda$ forces the solution $\vtheta$ to be small. Hence, it becomes less sensitive to noise.
MSE reaches a minimum point somewhere in the middle. The MSE is the sum of bias and variance. Therefore, it drops to the minimum and then rises again as $\lambda$ increases.

With appropriate choice of $\lambda$, the ridge regression can have a lower mean squared error than the vanilla regression. The following result is due to C. M. Theobald: (Theobald, C. M. (1974). Generalizations of mean square error applied to ridge regression. Journal of the Royal Statistical Society. Series B (Methodological), 36(1), 103-106.)

Theorem 7.8

For $\lambda < 2\sigma^2\|\vtheta\|^{-2}$,

\begin{aligned} \text{MSE}\left(\vthetahat_{\text{ridge}}(\lambda),\vtheta\right) < \text{MSE}\left(\vthetahat_{\text{vanilla}},\vtheta\right). \end{aligned}

This theorem says that as long as $\lambda$ is small enough, the ridge regression will have a lower MSE than the vanilla regression. Thus ridge regression is almost always helpful. Of course, the optimal $\lambda$ is not provided by the theorem, which only tells us where to search for a good $\lambda$.

Why does ridge regression reduce the testing error?

sep0ex
The regularization reduces the variance (see Figure 7.19 when $\lambda > 0$)
It pays the price of increasing the bias.
Usually, the drop in variance outweighs the increase in bias. So the overall MSE drops.
Bias is not always a bad thing.

7.4.2LASSO regularization

The ridge regression we discussed in the previous subsection is just one of the many possible ways of doing regularization. One alternative is to replace $\|\vtheta\|^2$ by $\|\vtheta\|_1$, where

\|\vtheta\|_1 = \sum_{p=0}^{d-1} |\theta_p|.

This change from the sum-squares to sum-absolute-values has been the main driving force in data science, machine learning, and signal processing for at least the past two decades. The optimization associated with $\|\vtheta\|_1$ is

\minimize{\vtheta \in \R^d} \;\; \|\mX\vtheta-\vy\|^2 + \lambda\|\vtheta\|_1,

\calE_{\text{train}}(\vtheta) = \underset{\textcolor{blue}{F(\vtheta), \; \text{data fidelity}}}{\underbrace{\sum_{n=1}^N \bigg(y_n - \sum_{p=0}^{d-1} \theta_p \phi_p(x_n)\bigg)^2}} + \underset{\textcolor{blue}{\lambda \cdot R(\vtheta), \; \text{regularization}}}{\underbrace{\lambda \cdot \sum_{p=0}^{d-1} |\theta_p|}}.

Seeking a sparse solution

To understand the choice of $\|\cdot\|_1$, we need to introduce the concept of sparsity.

Definition 7.1

A vector $\vtheta$ is called sparse if it has only a few non-zero elements.

As illustrated in Figure 7.20, a sparse $\vtheta$ ensures that only a very few columns of the data matrix $\mX$ are active. This is an attractive property because, in some of the regression problems, it is indeed possible to have just a few dominant factors. The LASSO regression says that if our problem possesses this sparse solution, then the $\|\cdot\|_1$ can help us find the sparse solution.

How can $\|\vtheta\|_1$ promote sparsity? If we consider the sets

\begin{aligned} \Omega_1 &= \{\vtheta \;|\; \|\vtheta\|_1 \le \tau \} = \{(\theta_1,\theta_2) \;|\; |\theta_1| + |\theta_2| \le \tau\},\\ \Omega_2 &= \{\vtheta \;|\; \|\vtheta\|^2 \le \tau \} = \{(\theta_1,\theta_2) \;|\; \theta_1^2 + \theta_2^2 \le \tau\}, \end{aligned}

we note that $\Omega_1$ has a diamond shape whereas $\Omega_2$ has a circular shape. Since the data fidelity term $\|\mX\vtheta-\vy\|^2$ is an ellipsoid, seeking the optimal value in the presence of the regularization term can be viewed as moving the ellipsoid until it touches the set defined by the regularization. As illustrated in Figure 7.21, since $\{\vtheta \;|\; \|\vtheta\|^2 \le \tau \}$ is a circle, the solution will be somewhere in the middle. On the other hand, since $\{\vtheta \;|\; \|\vtheta\|_1 \le \tau \}$ is a diamond, the solution will be one of the vertices. The difference between “somewhere in the middle” and “a vertex” is that the vertex is a sparse solution, since by the definition of a vertex one coordinate must be zero and the other coordinate must be non-zero. We can easily extrapolate this idea to the higher-dimensional spaces. In this case, we will see that the solution for the $\|\cdot\|_1$ problem has only a few non-zero entries.

**Figure 7.21.** Ridge regression defines a circle whereas LASSO regression defines a diamond. When we minimize the loss function, LASSO will return us a vertex which is sparse.

The optimization formulated in Eq. (7.41) is known as the least absolute shrinkage and selection operator (LASSO). LASSO problems are difficult, but over the past two decades we have increased our understanding of the problem. The most significant breakthrough is that we now have algorithms to solve the LASSO problem efficiently. This is important because, unlike the ridge regression, where we have a (very simple) closed-form solution, the LASSO problem can only be solved using iterative algorithms.

What is so special about LASSO?

sep0ex
LASSO regularization promotes a sparse solution.
If the underlying model has a sparse solution, e.g., you choose a 50th-order polynomial, but the underlying model is a third-order polynomial, then there should only be four non-zero regression coefficients in your 50th-order polynomial. LASSO will help in this case.
If the underlying model has a dense solution, then LASSO is of limited value. A ridge regression could be better.
While $\|\vtheta\|_1$ is not differentiable (at 0), there exist polynomial-time convex algorithms to solve the problem, e.g., interior-point methods.

Solving the LASSO problem

Today, there are many open-source packages to solve the LASSO problem. They are mostly developed in the convex optimization literature. One of the most user-friendly packages is the CVX package developed by S. Boyd and colleagues at Stanford University. (The MATLAB version is here: http://cvxr.com/cvx/. The Python version is here: https://cvxopt.org/. Follow the instructions to install the package.) Once you have downloaded and installed the package, solving the optimization can be done literally by typing in the data fidelity term and the regularization term. An example is given below.

MATLAB

cvx_begin
        variable theta(d)
        minimize(sum_square(X*theta-y) + lambda*norm(theta,1))
    cvx_end

As you can see, the program is extremely simple. You start by calling cvx_begin and end it with cvx_end. Inside the box we create a variable theta(d), where d denotes the dimension of the vector theta. The main command is minimize. However, this line is almost self-explanatory. As long as you follow the syntax given by the user guidelines, you will be able to set it up properly.

In Python, we can call the cvxpy library.

Python

import cvxpy as cvx
    theta     = cvx.Variable(d)
    objective = cvx.Minimize( cvx.sum_squares(X*theta-y) \
                                + lambd*cvx.norm1(theta) )
    prob      = cvx.Problem(objective)
    prob.solve()

To see a concrete example, we use the crime rate data obtained from https://web.stanford.edu/ hastie/StatLearnSparsity/data.html. A snapshot of the data is shown in the table below. In this dataset, the vector $\vy$ is the crime rate, which is the last column of the table. The feature/basis vectors are funding, hs, not-hs, college.

city	`crime rate`	`funding`	`hs`	`no-hs`	`college`
1	478	40	74	11	31
2	494	32	72	11	43
3	643	57	71	18	16
4	341	31	71	11	25

50	940	66	67	26	18

We consider two optimizations:

\begin{aligned} \widehat{\vtheta}_{1}(\lambda) &= \argmin{\vtheta} \;\; \calE_1(\vtheta) \bydef \|\mX\vtheta - \vy\|^2 + \lambda \|\vtheta\|_1,\\ \widehat{\vtheta}_{2}(\lambda) &= \argmin{\vtheta} \;\; \calE_2(\vtheta) \bydef \|\mX\vtheta - \vy\|^2 + \lambda \|\vtheta\|^2. \end{aligned}

As we have discussed, the first optimization uses the $\|\cdot\|_1$ regularized least squares, which is the LASSO problem. The second optimization is the standard $\|\cdot\|^2$ regularized least squares. Since both solutions depend on the parameter $\lambda$, we parameterize the solutions in terms of $\lambda$. Note that the optimal $\lambda$ for $\widehat{\vtheta}_{1}$ is not necessarily the optimal $\lambda$ for $\widehat{\vtheta}_{2}$.

One thing we would like to demonstrate in this example is visualizing the linear regression coefficients $\widehat{\vtheta}_{1}(\lambda)$ and $\widehat{\vtheta}_{2}(\lambda)$ as $\lambda$ changes. To solve the optimization, we use CVX; the MATLAB and Python implementation is shown below.

MATLAB

data = load('./dataset/data_crime.txt');
    y    = data(:,1);        % The observed crime rate
    X    = data(:,3:end);    % Feature vectors
    [N,d]= size(X);
    
    lambdaset = logspace(-1,8,50);
    theta_store   = zeros(d,50);
    for i=1:length(lambdaset)
        lambda = lambdaset(i);
        cvx_begin
            variable theta(d)
            minimize( sum_square(X*theta-y) + lambda*norm(theta,1) )
    %       minimize( sum_square(X*theta-y) + lambda*sum_square(theta) )
        cvx_end
        theta_store(:,i) = theta(:);
    end
    
    figure(1);
    semilogx(lambdaset, theta_store, 'LineWidth', 4);
    legend('funding','% high', '% no high', '% college', ...
            '% graduate', 'Location','NW');
    xlabel('lambda');
    ylabel('feature attribute');

Python

import cvxpy as cvx
    import numpy as np
    import matplotlib.pyplot as plt
    
    data = np.loadtxt("/content/data_crime.txt")
    y = data[:,0]
    X = data[:,2:7]
    N,d = X.shape
    
    lambd_set = np.logspace(-1,8,50)
    theta_store = np.zeros((d,50))
    for i in range(50):
      lambd = lambd_set[i]
      theta = cvx.Variable(d)
      objective   = cvx.Minimize( cvx.sum_squares(X*theta-y) \
                    + lambd*cvx.norm1(theta) )
    # objective   = cvx.Minimize( cvx.sum_squares(X*theta-y) \
                    + lambd*cvx.sum_squares(theta) )
      prob        = cvx.Problem(objective)
      prob.solve()
      theta_store[:,i] = theta.value
    
    for i in range(d):
      plt.semilogx(lambd_set, theta_store[i,:])

Figure 7.22 shows some interesting differences between the two regression models.

**Figure 7.22.** Ridge and LASSO regression on the crime-rate dataset. (a) The LASSO regression suggests that there are only a few active components as we change $\lambda$. (b) The ridge regression returns a set of dense solutions for all choices of $\lambda$.

sep0ex
Trajectory. For the $\|\cdot\|^2$ estimate $\vthetahat_2(\lambda)$, the trajectory of the regression coefficients is smooth. This is attributable to the fact that the training loss $\calE_2(\vtheta)$ is continuously differentiable in $\vtheta$, and so the solution trajectory is smooth. By contrast, the $\|\cdot\|_1$ estimate $\vthetahat_1(\lambda)$ has a more disruptive trajectory.
Active members. For the LASSO problem, $\vthetahat_1(\lambda)$ switches the active member as $\lambda$ changes. For example, the feature high-school is the first one being activated when $\lambda \downarrow$. This implies that if we limit ourselves to only one feature, then high-school is the feature we should select. The ridge regression does not have this feature-selection property. How about when $\lambda = 10^6$? In this case, the LASSO has two active members: funding and high-school. This suggests that if there are two contributing factors, funding and high-school are the two. When $\lambda = 10^4$, we see that in LASSO, the green curve goes to zero but then the red curve rises. This means a correlation between high school and no high school, which should not be a surprise because they are complementary to each other.
Magnitude of solutions. The magnitude of the solutions does not necessarily convey a clear conclusion because the feature vectors (e.g., high school) and the observable crime rate have different units.
Limiting solutions. As $\lambda \rightarrow 0$, both $\vthetahat_1(\lambda)$ and $\vthetahat_2(\lambda)$ reach the same solution, because the training losses are identical when $\lambda = 0$.

LASSO for overfitting

Does LASSO help to mitigate the overfitting problem? Not always, but it often does. In Figure 7.23 we consider fitting a dataset of $N=20$ data points. The ground truth model we use is

$$y_n = L_0(x_n) + 0.5L_1(x_n) + 0.5L_2(x_n) + 1.5L_3(x_n) + L_4(x_n) + e_n,$$

where $e_n \sim \text{Gaussian}(0,\sigma^2)$ for $\sigma = 0.25$. When fitting the data, we purposely choose a 20th-order Legendre polynomial as the regression model. With only $N=20$ data points, we can be almost certain that there is overfitting.

The MATLAB and Python codes for solving this LASSO problem are shown below.

MATLAB

% MATLAB code to demonstrate overfitting and LASSO
    % Generate data
    N = 20;
    x = linspace(-1,1,N)';
    a = [1, 0.5, 0.5, 1.5, 1];
    y = a(1)*legendreP(0,x)+a(2)*legendreP(1,x)+a(3)*legendreP(2,x)+ ...
        a(4)*legendreP(3,x)+a(5)*legendreP(4,x)+0.25*randn(N,1);
    
    % Solve LASSO using CVX
    d = 20;
    X = zeros(N, d);
    for p=0:d-1
        X(:,p+1) = reshape(legendreP(p,x),N,1);
    end
    lambda = 2;
    cvx_begin
        variable theta(d)
        minimize( sum_square( X*theta - y ) + lambda * norm(theta , 1) )
    cvx_end
    
    % Plot results
    t    = linspace(-1, 1, 200);
    Xhat = zeros(length(t), d);
    for p=0:d-1
        Xhat(:,p+1) = reshape(legendreP(p,t),200,1);
    end
    yhat = Xhat*theta;
    plot(x,y,   'ko','LineWidth',2, 'MarkerSize', 10); hold on;
    plot(t,yhat,'LineWidth',6,'Color',[0.2 0.5 0.2]);

Python

# Python code to demonstrate overfitting and LASSO
    import cvxpy as cvx
    import numpy as np
    import matplotlib.pyplot as plt
    
    # Setup the problem
    N = 20
    x = np.linspace(-1,1,N)
    a = np.array([1, 0.5, 0.5, 1.5, 1])
    y = a[0]*eval_legendre(0,x) + a[1]*eval_legendre(1,x) + \
        a[2]*eval_legendre(2,x) + a[3]*eval_legendre(3,x) + \
        a[4]*eval_legendre(4,x) + 0.25*np.random.randn(N)
    
    # Solve LASSO using CVX
    d = 20
    lambd = 1
    X = np.zeros((N, d))
    for p in range(d):
      X[:,p] = eval_legendre(p,x)
    theta       = cvx.Variable(d)
    objective   = cvx.Minimize( cvx.sum_squares(X*theta-y) \
                                + lambd*cvx.norm1(theta) )
    prob        = cvx.Problem(objective)
    prob.solve()
    thetahat = theta.value
    
    # Plot the curves
    t = np.linspace(-1, 1, 500)
    Xhat = np.zeros((500,d))
    for p in range(P):
      Xhat[:,p] = eval_legendre(p,t)
    yhat = np.dot(Xhat, thetahat)
    plt.plot(x, y, 'o')
    plt.plot(t, yhat, linewidth=4)

**Figure 7.23.** We fit a dataset of $N = 20$ data points. (a) The ground truth model that generates the data. The model is a 4th-order ordinary polynomial. (b) Vanilla regression result, without any regularization. Note that there is severe overfitting because the model complexity is too high. (c) Ridge regression result, by setting $\lambda = 0.5$. (d) LASSO regression result, by setting $\lambda = 2$.

Let us compare the various regression results. Figure 7.23(b) shows the vanilla regression, which as you can see fits the $N = 20$ data points very well. However, no one would believe that such a fitting curve can generalize to unseen data. Figure 7.23(c) shows the ridge regression result. When performing the analysis, we sweep a range of $\lambda$ and pick the value $\lambda = 0.5$ so that the fitted curve is neither too “wild” nor too “flat”. We can see that the fitting is improved. However, since the ridge regression only penalizes large-magnitude coefficients, the fitting is still not ideal. Figure 7.23(d) shows the LASSO regression result. Since the true model is a 4th-order polynomial and we use a 20th-order polynomial, the true solution is sparse. Therefore, LASSO is helpful, and hence we can pick a sparse solution.

The significance of LASSO is often not about the fitting of the data points but the number of active coefficients. In Figure 7.24 we show a comparison between the ground truth coefficients, the vanilla regression coefficients, the ridge regression coefficients, and the LASSO regression coefficients. It is evident that the LASSO solution contains a much smaller number of non-zeros compared to the ridge regression. Most of the high-order coefficients are zero. By contrast, the vanilla regression coefficients are wild. The ridge regression is better, but there are many non-zero high-order coefficients.

**Figure 7.24.** Coefficients of the regression models. (a) The ground truth model, which is a 4th-order polynomial. There are only 5 non-zero coefficients. (b) The vanilla regression coefficients. Note that the values are wild and large, although the curve fits the training data points very well. (c) The ridge regression coefficients. While the overall magnitudes are significantly improved from the vanilla, some high-order coefficients are still non-zero. (d) The LASSO regression coefficients. There are very few non-zeros, and the non-zeros match well with the ground truth.

Closing remark. In this section, we discussed two regularization techniques: ridge regression and LASSO regression. Both techniques are about adding a penalty term to the training loss to constrain the regression coefficients. In the optimization literature, writings on ridge and LASSO regression are abundant, covering both algorithms and theoretical properties. An example of a theoretical question addressed in the literature is: Under what conditions is LASSO guaranteed to recover the correct support of the solution, i.e., locating the correct positions of the non-zeros? Problems like these are beyond the scope of this book.