Regularized Linear Regression Models

Implementing Pathwise Coordinate Descent For The Lasso and The Elastic Net In Python Using NumPy

Explanations for Solving Some of the Most Popular Supervised Learning Algorithms


Date Written: 2021-01-15

Cover image: Model Coefficient Value Changes With Growing Regularization Penalty Values

Hey there! 👋

Welcome to the final part of a three-part deep-dive on regularized linear regression modeling! In part one, linear modeling was established with the derivation of OLS showing how to solve for model coefficients to make predictions of the response given new feature data. Next, in part two, overfitting issues of the OLS model were discussed and Ridge Regression was presented as a technique to help reduce overfitting through regularization. Building off the same concept as Ridge Regression, the Lasso and the Elastic Net are now presented. The series concludes with general considerations of use cases for the techniques presented.

The models and included implementations were tested on a wine quality prediction dataset; the code and results can be viewed at the project repository here.


The Lasso for Regression

The Lasso, or Least Absolute Shrinkage and Selection Operator, adds an L₁ penalty to the OLS loss function (Part One Eq #7), bringing selected model coefficients exactly to zero for a large enough value of an associated tuning parameter, λ. In other words, the Lasso performs automated feature selection, producing a vector of model coefficients whose sparsity (the number of elements that are zero) varies with the magnitude of the tuning parameter.

The Lasso loss function can be formalized as (equation #1):

$$L(\beta) = \lVert \mathbf{y} - \mathbf{X}\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \quad \text{with tuning parameter } \lambda \geq 0$$

Similar to previous cases, an intercept term can be included through data augmentation of the design matrix with a column of 1s. Furthermore, formulating the problem as a least-squares optimization problem produces (equation #2):

$$\hat{\beta} = \underset{\beta}{\text{argmin}} \, L(\beta) = \underset{\beta}{\text{argmin}} \, \frac{1}{2n} \lVert \mathbf{y} - \mathbf{X}\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1$$
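
Before turning to the optimization itself, it can be handy to evaluate this objective directly, for example to monitor progress while fitting. A minimal sketch of Eq. #2 in NumPy (the names lasso_objective, beta, and lam are purely illustrative and not part of the implementation shown later) is:

import numpy as np

def lasso_objective(X, y, beta, lam):
    # Squared-error term scaled by 1/(2n) plus the L1 penalty (Eq. #2).
    # Assumes X has already been augmented with a column of 1s if an intercept is desired.
    n = len(y)
    resid = y - X @ beta
    return (resid @ resid) / (2 * n) + lam * np.sum(np.abs(beta))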

However, unlike previous cases, no closed-form solution exists for this problem. The addition of the L₁ penalty makes the loss function no longer continuously differentiable, because the absolute value component is non-smooth. To remedy this issue, an iterative optimization technique needs to be applied to search for an optimal solution.

Numerous algorithms exist to this end, such as LARS (Least Angle Regression) and Forward Stepwise Regression; however, the Pathwise Coordinate Descent algorithm is leveraged in this work. In short, this algorithm optimizes one parameter at a time while holding all other parameters constant.

Pathwise Coordinate Descent

Before beginning the algorithm, all features should be standardized to have zero mean and variance of one. From there, a p+1 length coefficient vector is initialized to zero. Cycles are then run across all coefficients until convergence — where values of the coefficients stabilize and do not change more than a certain tolerance — is reached. Within each cycle, for every coefficient, an update is calculated and subsequently has the soft-thresholding operator applied to it.
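
As a quick illustration, the standardization step takes only a couple of NumPy operations; the sketch below (the function name standardize is illustrative) also returns the column means and standard deviations so new feature data can be transformed consistently:

import numpy as np

def standardize(X):
    # Center each feature to zero mean and scale it to unit variance,
    # as required before running coordinate descent.
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std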

The simplest form of Coordinate Descent updates calculates — for each coefficient — the simple (single variable as opposed to multiple regression) least-squares coefficient value using the partial residuals across all other features in the design matrix. Partial residuals, in this case, are found as (equation #3):

$$r_{i,j} = y_i - \sum_{k \neq j} x_{i,k} \beta_k$$

Therefore, the estimate for a particular coefficient value can be found as (equation #4):

$$\beta^*_j = \frac{1}{n} \sum_{i=1}^n x_{i,j} r_{i,j}$$

Now, the penalty, as dictated by the tuning parameter, is included in the model through the soft-thresholding operator. This is expressed as (equation #5):

$$\beta_j = \mathbf{S}(\beta^*_j, \lambda) = \text{sign}(\beta^*_j)(\lvert \beta^*_j \rvert - \lambda)_+ = \begin{cases} \beta^*_j - \lambda & \text{if } \beta^*_j > 0 \text{ and } \lambda < \lvert \beta^*_j \rvert \\ \beta^*_j + \lambda & \text{if } \beta^*_j < 0 \text{ and } \lambda < \lvert \beta^*_j \rvert \\ 0 & \text{if } \lambda \geq \lvert \beta^*_j \rvert \end{cases}$$
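
This operator is simple enough to express in a couple of lines; one possible sketch (soft_threshold is an illustrative name, separate from the full implementation further below) is:

import numpy as np

def soft_threshold(b, lam):
    # S(b, lam) from Eq. #5: shrink b toward zero by lam,
    # returning exactly zero once |b| <= lam.
    return np.sign(b) * max(abs(b) - lam, 0.0)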

Naive updates can be utilized for improved efficiency. These updates are found via (equation #6):

$$\beta^*_j = \frac{1}{n} \sum_{i=1}^n x_{i,j} r_i + \beta_j$$

where $r_i$ is the current model residual for sample $i$ and $n$ is the number of samples.

When the number of samples is much greater than the number of features ($n \gg p$), further efficiency improvements can be derived by using covariance updates. For these updates, the first term of the naive update equation above (Eq. #6) is replaced as shown by (equation #7):

$$\sum_{i=1}^n x_{i,j} r_i = \langle x_j, y \rangle - \sum_{k: \lvert \beta_k \rvert > 0} \langle x_j, x_k \rangle \beta_k$$
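
To make the distinction concrete, the two ways of computing the same single-coefficient update might be sketched as follows (assuming standardized features; the function names are illustrative):

import numpy as np

def naive_update(X, y, beta, j):
    # Eq. #6: inner product of feature j with the current full residual, plus beta_j.
    n = len(y)
    r = y - X @ beta
    return (X[:, j] @ r) / n + beta[j]

def covariance_update(X, y, beta, j):
    # Eq. #7: the same quantity built from inner products <x_j, y> and <x_j, x_k>,
    # with the sum taken only over currently nonzero coefficients.
    n = len(y)
    nonzero = np.flatnonzero(beta)
    return (X[:, j] @ y - (X[:, j] @ X[:, nonzero]) @ beta[nonzero]) / n + beta[j]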

Utilizing warm starts can bring efficiency boosts as well. With warm starts, a sequence of models is fitted, with tuning parameter values running from a maximum value down to a minimum value that is some small fraction (e.g., one-thousandth) of the maximum, initializing the coefficients of each iteration to the solution of the previous one. In many cases, it is actually faster to fit this entire path of models than a single model for some small $\lambda$. The path can be expressed as (equation #8):

$$\lambda_{max} \rightarrow \lambda_{min} = \epsilon \cdot \lambda_{max}$$

where $\epsilon$ is typically 0.001 and there are 100 values spaced on a log scale.

Furthermore, the max value of the tuning parameter at which to begin the path can be found as the minimum value that brings the estimates of all model coefficients to zero, since any value above it also results in total sparsity of the coefficient vector. The max value of the path (the starting point) can be found as (equation #9):

$$\lambda_{max} = \frac{\underset{l}{\max} \, \lvert \langle x_l, y \rangle \rvert}{n}$$
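
Putting Eq. #8 and Eq. #9 together, the tuning parameter path can be generated in a few lines; a possible sketch (assuming X holds the standardized features without the intercept column, and lambda_path is an illustrative name) is:

import numpy as np

def lambda_path(X, y, eps=1e-3, path_length=100):
    # Eq. #9: the smallest lambda that brings every coefficient to zero.
    n = len(y)
    lam_max = np.max(np.abs(X.T @ y)) / n
    # Eq. #8: log-spaced values from lambda_max down to eps * lambda_max.
    return np.geomspace(lam_max, eps * lam_max, num=path_length)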

By providing a starting place to begin searching for optimality, warm starting can often speed up convergence, and it also puts the "pathwise" in the Pathwise Coordinate Descent algorithm.

Implementation of the Lasso In Python Using NumPy

One possible way to implement pathwise coordinate descent for the Lasso (with options for tuning the convergence tolerance, path length, and returning the path) is:

import numpy as np

def lasso(X, y, l1, tol=1e-6, path_length=100, return_path=False):
    """The Lasso Regression model with intercept term.
    Intercept term included via design matrix augmentation.
    Pathwise coordinate descent with co-variance updates is applied.
    Path from max value of the L1 tuning parameter to input tuning parameter value.
    Features must be standardized (centered and scaled to unit variance)
    Params:
        X - NumPy matrix, size (N, p), of standardized numerical predictors
        y - NumPy array, length N, of numerical response
        l1 - L1 penalty tuning parameter (positive scalar)
        tol - Coordinate Descent convergence tolerance (exited if change < tol)
        path_length - Number of tuning parameter values to include in path (positive integer)
        return_path - Boolean indicating whether model coefficients along path should be returned
    Returns:
        if return_path == False:
            NumPy array, length p + 1, of fitted model coefficients
        if return_path == True:
            List, length 3, of last fitted model coefficients, tuning parameter path and coefficient values
    """
    X = np.hstack((np.ones((len(X), 1)), X))
    m, n = np.shape(X)
    B_star = np.zeros((n))
    l_max = max(list(abs(np.dot(np.transpose(X[:, 1:]), y)))) / m
    # At or above l_max, all coefficients (except intercept) will be brought to 0
    if l1 >= l_max:
        return np.append(np.mean(y), np.zeros((n - 1)))
    l_path = np.geomspace(l_max, l1, path_length)
    coefficients = np.zeros((len(l_path), n))
    for i in range(len(l_path)):
        while True:
            B_s = B_star.copy()  # snapshot of coefficients for the convergence check
            for j in range(n):
                k = np.where(B_star != 0)[0]
                update = (1/m)*((np.dot(X[:,j], y)- \
                                np.dot(np.dot(X[:,j], X[:,k]), B_star[k]))) + \
                                B_star[j]
                B_star[j] = (np.sign(update) * max(abs(update) - l_path[i], 0))
            if np.all(abs(B_s - B_star) < tol):
                coefficients[i, :] = B_star
                break
    if return_path:
        return [B_star, l_path, coefficients]
    else:
        return B_star
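
As a quick sanity check, the function above might be exercised on synthetic, standardized data along these lines (the data here are made up purely for illustration):

import numpy as np

np.random.seed(0)
X = np.random.randn(200, 5)
X = (X - X.mean(axis=0)) / X.std(axis=0)            # standardize the features
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.5 * np.random.randn(200)

coefs = lasso(X, y, l1=0.1)
print(coefs)  # intercept followed by 5 coefficients; uninformative features shrink toward 0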

The Elastic Net

In this form of regularized linear regression, the OLS loss function is changed by the addition of both an $L_1$ and an $L_2$ penalty, with tuning parameters controlling the intensity of regularization and the balance between the two penalties. This loss function can be formalized as (equation #10):

$$L(\beta) = \lVert \mathbf{y} - \mathbf{X}\beta \rVert_2^2 + \lambda \left[ (1-\alpha) \tfrac{1}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right] \quad \text{with tuning parameters } \lambda \geq 0, \; 0 \leq \alpha \leq 1$$

Having both $L_1$ and $L_2$ penalties, the Elastic Net serves to deliver a compromise between Ridge Regression and the Lasso, shrinking coefficients towards zero while setting some selectively to zero. Here, $\alpha$ can be considered the parameter determining the proportion of the $L_1$ penalty to apply, whereas $\lambda$ can be thought of as the overall intensity of regularization.

The Elastic Net loss function is also used to formalize a least-squares optimization problem (equation #11):

$$\hat{\beta} = \underset{\beta}{\text{argmin}} \, L(\beta) = \underset{\beta}{\text{argmin}} \, \frac{1}{2n} \lVert \mathbf{y} - \mathbf{X}\beta \rVert_2^2 + \lambda \left[ (1-\alpha) \tfrac{1}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]$$

Similarly to the Lasso, an intercept is included through design matrix augmentation, a $\frac{1}{2n}$ term is added for mathematical completeness, and pathwise coordinate descent is used to solve the problem, since the $L_1$ penalty term again means that no closed-form solution exists.

Just as in the Lasso, pathwise coordinate descent is used to solve for the model coefficient estimates; however, the update equations must be changed to account for the dual penalties. In this case, the $j$th coefficient value obtained after soft-thresholding is now found as (equation #12):

$$\beta_j = \frac{\mathbf{S}(\beta^*_j, \lambda \alpha)}{1 + \lambda (1 - \alpha)}$$

The soft-thresholding operator is the same operator applied in the Lasso update (Eq. #5), now with threshold $\lambda \alpha$ (equation #13):

$$\mathbf{S}(\beta^*_j, \lambda \alpha) = \text{sign}(\beta^*_j)(\lvert \beta^*_j \rvert - \lambda \alpha)_+$$
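
In code, this combined update is only a small change to the Lasso update; a minimal sketch (elastic_net_update is an illustrative name) is:

import numpy as np

def elastic_net_update(b_star, lam, alpha):
    # Eq. #12 and #13: soft-threshold at lam * alpha, then scale down
    # by 1 + lam * (1 - alpha) to account for the L2 penalty.
    thresholded = np.sign(b_star) * max(abs(b_star) - lam * alpha, 0.0)
    return thresholded / (1 + lam * (1 - alpha))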

Furthermore, naive updates or covariance updates can be used along with warm starts. For warm starts, the max $\lambda$ value can be calculated as (equation #14):

$$\lambda_{max} = \frac{\underset{l}{\max} \, \lvert \langle x_l, y \rangle \rvert}{n \alpha}$$

Implementation in Python Using NumPy

One possible way to implement pathwise coordinate descent to solve the Elastic Net (with options for tuning the convergence tolerance, path length, and returning the path) is:

import numpy as np

def elastic_net(X, y, l, alpha, tol=1e-4, path_length=100, return_path=False):
    """The Elastic Net Regression model with intercept term.
    Intercept term included via design matrix augmentation.
    Pathwise coordinate descent with co-variance updates is applied.
    Path from max value of the L1 tuning parameter to input tuning parameter value.
    Features must be standardized (centered and scaled to unit variance)
    Params:
        X - NumPy matrix, size (N, p), of standardized numerical predictors
        y - NumPy array, length N, of numerical response
        l - Penalty tuning parameter lambda (positive scalar)
        alpha - Mixing parameter between the L1 and L2 penalties (scalar between 0 and 1)
        tol - Coordinate Descent convergence tolerance (exited if change < tol)
        path_length - Number of tuning parameter values to include in path (positive integer)
        return_path - Boolean indicating whether model coefficients along path should be returned
    Returns:
        if return_path == False:
            NumPy array, length p + 1, of fitted model coefficients
        if return_path == True:
            List, length 3, of last fitted model coefficients, tuning parameter path and coefficient values
    """
    X = np.hstack((np.ones((len(X), 1)), X))
    m, n = np.shape(X)
    B_star = np.zeros((n))
    if alpha == 0:
        alpha = 1e-15  # avoid division by zero below; pure Ridge is approached, not reached
    l_max = max(list(abs(np.dot(np.transpose(X[:, 1:]), y)))) / m / alpha
    # At or above l_max, all coefficients (except intercept) will be brought to 0
    if l >= l_max:
        return np.append(np.mean(y), np.zeros((n - 1)))
    l_path = np.geomspace(l_max, l, path_length)
    coefficients = np.zeros((len(l_path), n))
    for i in range(path_length):
        while True:
            B_s = B_star.copy()  # snapshot of coefficients for the convergence check
            for j in range(n):
                k = np.where(B_star != 0)[0]
                update = (1/m)*((np.dot(X[:,j], y)- \
                                np.dot(np.dot(X[:,j], X[:,k]), B_star[k]))) + \
                                B_star[j]
                B_star[j] = (np.sign(update) * max(
                    abs(update) - l_path[i] * alpha, 0)) / (1 + (l_path[i] * (1 - alpha)))
            if np.all(abs(B_s - B_star) < tol):
                coefficients[i, :] = B_star
                break
    if return_path:
        return [B_star, l_path, coefficients]
    else:
        return B_star
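
A short, hypothetical usage sketch on synthetic, standardized data (mirroring the earlier Lasso example) might look like:

import numpy as np

np.random.seed(0)
X = np.random.randn(200, 5)
X = (X - X.mean(axis=0)) / X.std(axis=0)            # standardize the features
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.5 * np.random.randn(200)

coefs_en = elastic_net(X, y, l=0.1, alpha=0.5)      # even blend of L1 and L2 penalties
print(coefs_en)  # intercept followed by 5 coefficients; alpha=1 recovers the Lasso fit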

Conclusion

Throughout this series, different regularized forms of linear regression have been examined as tools to overcome the Ordinary Least Squares model's tendency to overfit training data. The Elastic Net, thanks to its balance between the two regularization varieties, tends to be the most robust remedy for overfitting; however, the Lasso can certainly prove helpful in many situations due to its automated feature selection. Ridge Regression is also a good tool for reducing possible model overfitting, as it shrinks model coefficients towards zero, reducing model variance.

The header image of this series demonstrates the difference between Ridge Regression and the Lasso well: on the left, the model coefficients all shrink towards zero in the Ridge Regression case, while on the right, coefficients are brought to zero in a selective order in the Lasso case.

Congratulations! 🎉🎊🥳

You made it to the end of the series!

I hope that these posts were informative and helpful for you to learn more about regularized linear regression and the necessary optimization needed to solve the associated models.

See here for the different sources utilized to create this series of posts.

If you are just now beginning the series, make sure to check out part one and part two!

Please leave a comment if you would like! I am always trying to improve my posts (logically, syntactically, or otherwise) and am happy to discuss anything related! 👋