Regularized Linear Regression Models

Using Ridge Regression to Overcome Drawbacks of Ordinary Least Squares (OLS)

Weaknesses of OLS, Optimization to Obtain the Ridge Model Estimator, and an Implementation in Python Using Numpy

Date Written: 2021-01-14

Blog post cover image
Model Coefficient Value Changes With Growing Regularization Penalty Values

Hello again and hopefully welcome back 👋

In the last part of this three-part deep-dive exploration into regularized linear regression modeling techniques, several topics were covered: the equation between the response and feature variables underlying linear regression models, the sum of squared error (SSE) loss function, the Ordinary Least Squares (OLS) model, and the necessary optimization steps to find an OLS model estimator that can be trained on sample data to produce predictions of a response given new feature data.

Moving forward, in this part, drawbacks of OLS, potential remedies, and the Ridge Regression model will be discussed.

Similar to the last part, all implementations suggested here were validated on a wine quality prediction dataset that, along with other related files, can be found at the project’s repository, here.

Drawbacks of OLS

As is with most things, there are tradeoffs that have to made when modeling. One such major tradeoff is that of the bias-variance tradeoff. Any model’s error can be broken down into two components: bias and variance. Bias can be considered the error implicit in the modeling algorithm, whereas variance can be considered the error derived from differences in idiosyncrasies across training datasets. A good model is one that should have the overall error minimized, thus both bias and variance should be minimized. However, there is a tradeoff to consider as increasing bias will often decrease variance.

For the OLS model, a high variance is a concern. Since the SSE is being optimized, the model tends to fit outlier data points since they will produce higher error values due to the squared term within the loss function. By fitting these outlier points, the OLS model can subsequently base predictions off modeled patterns that are only present in the training data — idiosyncratic outlying points — and not representative of the entire population. This phenomenon is called overfitting and can lead to predictive models with low accuracy when generalizing to new predictions.

Since OLS is a low bias model, it is well-suited to have its variance lowered through bias addition, which may result in higher overall predictive ability. One way to add bias is through shrinkage, biasing model coefficient estimates toward zero. This shrinkage can be achieved through the addition of a regularization penalty to the loss function which applies a unique form of shrinkage to the overall coefficient estimates.

In the next section, I’ll cover Ridge Regression regularization by the addition of a tuning parameter (λ\lambda) coefficient controlled L2L_2 penalty to the OLS loss function.

Make sure to check out the next and final part of the series to learn about the other two forms of regularized linear regression, the Lasso and the Elastic Net.

Ridge Regression

This form of regression is also known as Tikhonov Regularization and modifies the OLS loss function Part One (equation #7) with the addition of an L2L_2 penalty with an associated tuning parameter, λ\lambda. This loss function can be described using vector notation as (equation #1):

L(β)=yXβ22+λβ22 with tuning parameter λ0L(\beta) = \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \text{ with tuning parameter } \lambda \geq 0

Similarly to the OLS case, this loss function can then be formulated as a least-squares optimization problem to find estimates for the model coefficients that minimize the loss function as (equation #2):

β^=argminβ  L(β)=argminβ12nyXβ22+λβ22\hat\beta = \underset{\beta}{\operatorname{argmin}} \; L(\beta) = \underset{\beta}{\operatorname{argmin}} \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2

Just like the OLS case, a 12n\frac{1}{2n} term is added in order to simply solving the gradient and allow the objective function to converge to the expected value of the model error by the Law of Large Numbers.

This problem is also unconstrained and a closed-form solution for the Ridge estimator can be found by setting the gradient of the loss function (objective) equal to zero and solving the resultant equation. This produces an estimator result of (equation #3):

β^=(XTX+λI)1XTy\hat\beta = (\textbf{X}^{\textbf{T}}\textbf{X} + \lambda \textbf{I})^{-1}\textbf{X}^{\textbf{T}}\textbf{y}

This estimator should also be shown to be unique. In this case, the associated Hessian matrix is (equation #4):

H=2XTX+2λI\textbf{H} = 2\textbf{X}^{\textbf{T}}\textbf{X} + 2\lambda \textbf{I}

It turns out that this matrix can be shown to be positive definite by (equation #5):

βT(XTX+λI)β=(Xβ)Xβ+λβTβ=Xβ22+λβ220β0\beta^{\textbf{T}}(\textbf{X}^{\textbf{T}}\textbf{X} + \lambda \textbf{I})\beta = (\textbf{X}\beta)\textbf{X}\beta + \lambda \beta^{\textbf{T}}\beta = \lVert \textbf{X}\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \succeq 0 \: \forall \: \beta \neq 0

Thus, since the associated Hessian matrix of the Ridge loss function is positive definite, the function is strongly convex, which implies that the Ridge estimator (equation #3) is the unique global minimizer to the Ridge Regression problem.

Implementations of the estimator can be simplified by noticing that the problem can be reformulated as an OLS problem through data augmentation. This would take the form of (equation #6):

[β^0β^]=argminβ0,β[Y0][1nX0λ×I][β0β]22\begin{bmatrix} \hat\beta_0 \\ \hat\beta \end{bmatrix} = \underset{\beta_0, \beta}{\text{argmin}} \left\| \begin{bmatrix} \textbf{Y} \\ 0 \end{bmatrix} - \begin{bmatrix} 1_n & \textbf{X} \\ 0 & \lambda \times \textbf{I} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta \end{bmatrix} \right\|_2^2

Thus, by utilizing the above data augmentation the same result as Part One Equation #9 from the last part of this series can be used to solve for the coefficient estimates. That result is reproduced here:

β^=(XTX)1XTy\hat\beta = (\textbf{X}^{\textbf{T}}\textbf{X})^{-1}\textbf{X}^{\textbf{T}}\textbf{y}

Implementing the Estimator Using Python and NumPy

Similar to the OLS case, the matrix inverse does not scale well, thus the NumPy function solve, which employs the LAPACK _gesv routine, is used to find the least-squares solution. This function solves the equation in the case where A is square and full-rank (linearly independent columns). However, in the case that A is not full-rank, then the function lstsq should be used, which utilizes the xGELSD routine and thus finds the singular value decomposition of A.

One possible Python implementation of Ridge Regression with an optional intercept term is:


Thanks for reading part two of Regularized Linear Regression Models! 🙌

If you have not already, make sure to check out part one!

Continue on to part three to learn about the Lasso and the Elastic Net, the last two regularized linear regression techniques!

See here for the different sources utilized to create this series of posts.

Please leave a comment if you would like! I am always trying to improve my posts (logically, syntactically, or otherwise) and am happy to discuss anything related! 👍