Hey there! 👋
Welcome to the final part of a three-part deep dive on regularized linear regression modeling! In part one, linear modeling was established with the derivation of OLS, showing how to solve for model coefficients in order to make predictions of the response given new feature data. Next, in part two, the overfitting issues of the OLS model were discussed and Ridge Regression was presented as a technique that helps reduce overfitting through regularization. Building on the same concept as Ridge Regression, the Lasso and the Elastic Net are now presented. The series concludes with general considerations on use cases for the techniques presented.
The models and included implementations were tested on a wine quality prediction dataset; the code and results can be viewed at the project repository here.
The Lasso for Regression
The Lasso, or Least Absolute Shrinkage and Selection Operator, adds an L₁ penalty to the OLS loss function, driving selected model coefficients to zero for a large enough value of the associated tuning parameter, λ. In other words, the Lasso performs automated feature selection, producing a coefficient vector whose sparsity (the number of elements that are zero) varies with the magnitude of the tuning parameter.
The Lasso loss function can be formalized as:
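$$ L_{\text{lasso}}(\beta) \;=\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 \;+\; \lambda \lVert \beta \rVert_1 \;=\; \frac{1}{2n} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert $$

where n is the number of samples, p is the number of features, and the 1/(2n) scaling is the same one used for the Elastic Net later in this post.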
As in previous cases, an intercept term can be included by augmenting the design matrix with a column of 1s. Formulating the problem as a least-squares optimization problem then produces:
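$$ \hat{\beta} \;=\; \underset{\beta}{\arg\min} \left\{ \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 \;+\; \lambda \lVert \beta \rVert_1 \right\} $$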
However, unlike previous cases, no closed-form solution exists for this problem: the non-smooth absolute value in the L₁ penalty means the loss function is no longer continuously differentiable. To remedy this, an iterative optimization technique needs to be applied to search for an optimal solution.
Numerous algorithms exist to this end, such as LARS (Least Angle Regression) and Forward Stepwise Regression; however, the Pathwise Coordinate Descent algorithm is leveraged in this work. In short, this algorithm optimizes one parameter at a time while holding all other parameters constant.
Pathwise Coordinate Descent
Before beginning the algorithm, all features should be standardized to have zero mean and unit variance. From there, a coefficient vector of length p + 1 is initialized to zero. Cycles are then run across all coefficients until convergence is reached, that is, until the coefficient values stabilize and change by no more than a set tolerance. Within each cycle, an update is calculated for every coefficient and the soft-thresholding operator is then applied to it.
The simplest form of coordinate descent update calculates, for each coefficient, the simple (single-variable, as opposed to multiple-regression) least-squares coefficient value using the partial residuals across all other features in the design matrix. Partial residuals, in this case, are found as:
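$$ r_i^{(j)} \;=\; y_i \;-\; \sum_{k \neq j} x_{ik}\,\hat{\beta}_k $$

that is, the residual for sample i when the jth feature's contribution is excluded from the prediction.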
Therefore, the estimate for a particular coefficient value can be found as:
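$$ \beta_j^{*} \;=\; \frac{1}{n} \sum_{i=1}^{n} x_{ij}\, r_i^{(j)} $$

(the single-variable least-squares coefficient of the partial residuals on the jth feature, which simplifies to this form because the standardized features satisfy (1/n)·Σᵢ xᵢⱼ² = 1).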
Now, the penalty, as dictated by the tuning parameter, is included in the model through the soft-thresholding operator. This is expressed as:
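$$ \hat{\beta}_j \;=\; S\big(\beta_j^{*},\, \lambda\big) \;=\; \operatorname{sign}\big(\beta_j^{*}\big)\big( \lvert \beta_j^{*} \rvert - \lambda \big)_{+} \;=\;
\begin{cases}
\beta_j^{*} - \lambda & \text{if } \beta_j^{*} > \lambda \\
0 & \text{if } \lvert \beta_j^{*} \rvert \le \lambda \\
\beta_j^{*} + \lambda & \text{if } \beta_j^{*} < -\lambda
\end{cases} $$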
Naive updates can be utilized for improved efficiency. These updates are found via:
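$$ \beta_j^{*} \;=\; \frac{1}{n} \sum_{i=1}^{n} x_{ij}\, r_i \;+\; \hat{\beta}_j $$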
where rᵢ = yᵢ − ŷᵢ is the current model residual for sample i, taken over all n samples.
When the number of samples is much greater than the number of features (n >> p), further efficiency improvements can be gained by using covariance updates. For these updates, the first term of the naive update above is replaced as shown by:
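$$ \frac{1}{n} \sum_{i=1}^{n} x_{ij}\, r_i \;=\; \frac{1}{n} \Big( \langle x_j, y \rangle \;-\; \sum_{k:\, \lvert \hat{\beta}_k \rvert > 0} \langle x_j, x_k \rangle\, \hat{\beta}_k \Big) $$

where ⟨·,·⟩ denotes the inner product over the n samples; the inner products of each feature with y and with the other features can be computed once and then reused across iterations.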
Utilizing warm starts can bring efficiency boosts as well. With warm starts, a sequence of models is fitted, with tuning parameter values running from a maximum value down to a minimum value that is some small fraction (e.g., a thousandth) of the maximum, and the coefficients of each fit are initialized to the solution of the previous one. In many cases, it is actually faster to fit this entire path of models than to fit a single model at some small λ value.
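A common choice, and the one assumed in the implementation below, is a log-spaced sequence of the form

$$ \lambda_k \;=\; \lambda_{\max}\; \epsilon^{\,k/(K-1)}, \qquad k = 0, 1, \dots, K-1, $$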
where 𝜖 is typically 0.001 and there are 100 values spaced on a log scale.
Furthermore, the maximum value of the tuning parameter at which to begin the path is the smallest value that brings all model coefficient estimates to zero, since any value above it produces a completely sparse coefficient vector. The max value of the path (starting point) can be found as:
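$$ \lambda_{\max} \;=\; \frac{1}{n}\, \max_{j} \Big| \sum_{i=1}^{n} x_{ij}\, y_i \Big| \;=\; \frac{1}{n}\, \max_{j} \big| \langle x_j, y \rangle \big| $$

(for the standardized-feature Lasso formulation given above).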
By providing a starting place for each search for optimality, warm starting can often speed up convergence, and it also puts the pathwise in the Pathwise Coordinate Descent algorithm.
Implementation of the Lasso in Python Using NumPy
One possible way to implement pathwise coordinate descent for the Lasso (with options for tuning the convergence tolerance, path length, and returning the path) is:
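(The sketch below assumes the features have already been standardized to zero mean and unit variance; it appends the intercept column internally and leaves it unpenalized, and the function name lasso_pathwise_cd and its keyword arguments are illustrative rather than the exact interface of the project repository.)

```python
import numpy as np


def soft_threshold(z, threshold):
    """Soft-thresholding operator: sign(z) * max(|z| - threshold, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)


def lasso_pathwise_cd(X, y, lam, tol=1e-6, path_length=100, return_path=False):
    """Illustrative pathwise coordinate descent for the Lasso.

    Assumes the columns of X are already standardized to zero mean and unit
    variance. A column of ones is appended internally for the intercept,
    which is left unpenalized. `lam` (> 0) is the target tuning parameter.
    """
    X = np.hstack((np.ones((X.shape[0], 1)), X))  # augment with intercept column
    n, m = X.shape
    beta = np.zeros(m)                            # coefficients, warm-started along the path
    # Smallest lambda that zeros out every penalized coefficient
    lam_max = np.max(np.abs(X[:, 1:].T @ y)) / n
    # Log-spaced path from lam_max down to the requested lambda (warm starts)
    lambdas = np.geomspace(max(lam_max, lam), lam, path_length)
    path = []
    for l in lambdas:
        while True:
            beta_old = beta.copy()
            for j in range(m):
                # Partial residual: add back feature j's current contribution
                r_j = y - X @ beta + X[:, j] * beta[j]
                z_j = X[:, j] @ r_j / n           # simple least-squares coefficient
                beta[j] = z_j if j == 0 else soft_threshold(z_j, l)
            if np.max(np.abs(beta - beta_old)) < tol:  # converged within tolerance
                break
        path.append(beta.copy())
    return (beta, path) if return_path else beta
```

Because the features are standardized, the term (1/n)·Σᵢ xᵢⱼ² that would normally divide each coordinate update equals one, so the update reduces to soft-thresholding zⱼ directly.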
The Elastic Net
In this form of regularized linear regression, both an L₁ and an L₂ penalty are added to the OLS loss function, with tuning parameters controlling the intensity of regularization and the balance between the two penalties. This loss function can be formalized as:
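$$ L_{\text{enet}}(\beta) \;=\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 \;+\; \lambda \left( \alpha\,\lVert \beta \rVert_1 \;+\; \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right) $$

written here in the common parameterization in which α ∈ [0, 1] sets the mix between the two penalties (α = 1 recovers the Lasso and α = 0 approaches Ridge Regression).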
Having both L₁ and L₂ penalties, the Elastic Net serves as a compromise between Ridge Regression and the Lasso, shrinking coefficients towards zero while also setting some selectively to zero. Here, α can be considered the parameter determining the proportion of L₁ penalty to add, whereas λ can be thought of as the intensity of regularization to apply.
The Elastic Net loss function is also used to formalize a least-squares optimization problem:
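$$ \hat{\beta} \;=\; \underset{\beta}{\arg\min} \left\{ \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 \;+\; \lambda \left( \alpha\,\lVert \beta \rVert_1 + \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right) \right\} $$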
As with the Lasso, an intercept is included through design matrix augmentation and a 1/(2n) term is included for mathematical convenience, and pathwise coordinate descent is used to solve the problem since the L₁ penalty term means no closed-form solution exists.
Just as in the Lasso, pathwise coordinate descent is used to solve for the model coefficient estimates; however, the update equations need to change to account for the dual penalties. In this case, the jth coefficient value obtained after soft-thresholding is now found as:
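$$ \hat{\beta}_j \;=\; \frac{ S\!\left( \frac{1}{n} \sum_{i=1}^{n} x_{ij}\, r_i^{(j)},\; \lambda\alpha \right) }{ 1 \;+\; \lambda\,(1-\alpha) } $$

(under the parameterization of the loss given above: the soft-thresholding level becomes λα, and the 1 + λ(1 − α) factor in the denominator comes from the L₂ penalty).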
The soft-thresholding operator is the same operator applied in the Lasso update:
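$$ S(z, \gamma) \;=\; \operatorname{sign}(z)\,\big( \lvert z \rvert - \gamma \big)_{+} $$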
Furthermore, naive updates or covariance updates can be used along with warm starts. For warm starts, the max λ value can be calculated as:
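$$ \lambda_{\max} \;=\; \frac{ \max_{j} \big| \langle x_j, y \rangle \big| }{ n\,\alpha } $$

(assuming α > 0; as α shrinks, a larger λ is needed to zero out all coefficients because the L₁ share of the penalty shrinks with it).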
Implementation in Python Using NumPy
One possible way to implement pathwise coordinate descent to solve the Elastic Net (with options for tuning the convergence tolerance, path length, and returning the path) is:
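(As with the Lasso sketch, this assumes pre-standardized features, and the names elastic_net_pathwise_cd, lam, and alpha are illustrative rather than the repository's exact interface. The only changes from the Lasso version are the λα threshold, the 1 + λ(1 − α) shrinkage in the coordinate update, and the α in the λ_max calculation.)

```python
import numpy as np


def elastic_net_pathwise_cd(X, y, lam, alpha=0.5, tol=1e-6,
                            path_length=100, return_path=False):
    """Illustrative pathwise coordinate descent for the Elastic Net.

    Assumes the columns of X are already standardized to zero mean and unit
    variance. A column of ones is appended internally for the unpenalized
    intercept. `lam` (> 0) sets the overall regularization strength and
    `alpha` (0 < alpha <= 1) the L1/L2 mix; alpha = 1 recovers the Lasso.
    """
    X = np.hstack((np.ones((X.shape[0], 1)), X))  # augment with intercept column
    n, m = X.shape
    beta = np.zeros(m)                            # warm-started coefficient vector
    # Smallest lambda that zeros out every penalized coefficient
    lam_max = np.max(np.abs(X[:, 1:].T @ y)) / (n * alpha)
    # Log-spaced path of tuning parameter values (warm starts)
    lambdas = np.geomspace(max(lam_max, lam), lam, path_length)
    path = []
    for l in lambdas:
        while True:
            beta_old = beta.copy()
            for j in range(m):
                # Partial residual: add back feature j's current contribution
                r_j = y - X @ beta + X[:, j] * beta[j]
                z_j = X[:, j] @ r_j / n
                if j == 0:
                    beta[j] = z_j                 # intercept is not penalized
                else:
                    # Soft-threshold by lambda*alpha, then shrink by the L2 term
                    num = np.sign(z_j) * max(abs(z_j) - l * alpha, 0.0)
                    beta[j] = num / (1.0 + l * (1.0 - alpha))
            if np.max(np.abs(beta - beta_old)) < tol:  # converged within tolerance
                break
        path.append(beta.copy())
    return (beta, path) if return_path else beta
```

For example, calling elastic_net_pathwise_cd(X_std, y, lam=0.1, alpha=0.5), where X_std is the standardized feature matrix, would return the fitted coefficient vector (intercept first) for λ = 0.1 with an even L₁/L₂ mix.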
Conclusion
Throughout this series, different regularized forms of linear regression have been examined as tools to overcome the Ordinary Least Squares model's tendency to overfit training data. The Elastic Net, thanks to its balance between the two regularization varieties, tends to be the most robust against overfitting; however, the Lasso can certainly prove helpful in many situations due to its automated feature selection. Ridge Regression is also a good tool for reducing possible model overfitting, as it shrinks model coefficients towards zero and thereby reduces model variance.
The header image of this series demonstrates the difference between Ridge Regression and the Lasso well: on the left, the model coefficients all shrink towards zero in the Ridge Regression case, while on the right, coefficients are brought to zero in a selective order in the Lasso case.
Congratulations! 🎉🎊🥳
You made it to the end of the series!
I hope that these posts were informative and helped you learn more about regularized linear regression and the optimization needed to solve the associated models.
See here for the different sources utilized to create this series of posts.
If you are just now beginning the series, make sure to check out part one and part two!
Please leave a comment if you would like! I am always trying to improve my posts (logically, syntactically, or otherwise) and am happy to discuss anything related! 👋