Date Written: 2021-01-14

In the last part of this three-part deep-dive exploration into **regularized linear regression** modeling techniques, several topics were covered: the equation between the response and feature variables underlying linear regression models, the sum of squared error (SSE) loss function, the ** Ordinary Least Squares** (OLS) model, and the necessary optimization steps to find an OLS model estimator that can be trained on sample data to produce predictions of a response given new feature data.

Moving forward, in this part, drawbacks of OLS, potential remedies, and the ** Ridge Regression** model will be discussed.

Similar to the last part, all implementations suggested here were validated on a wine quality prediction dataset that, along with other related files, can be found at the project’s repository, **here**.

As is with most things, there are tradeoffs that have to made when modeling. One such major tradeoff is that of the **bias-variance tradeoff.** Any model’s error can be broken down into two components: bias and variance. Bias can be considered the error implicit in the modeling algorithm, whereas variance can be considered the error derived from differences in idiosyncrasies across training datasets. A good model is one that should have the overall error minimized, thus both bias and variance should be minimized. However, there is a tradeoff to consider as increasing bias will often decrease variance.

For the OLS model, a high variance is a concern. Since the SSE is being optimized, the model tends to fit outlier data points since they will produce higher error values due to the squared term within the loss function. By fitting these outlier points, the OLS model can subsequently base predictions off modeled patterns that are only present in the training data — idiosyncratic outlying points — and not representative of the entire population. This phenomenon is called ** overfitting** and can lead to predictive models with low accuracy when generalizing to new predictions.

Since OLS is a low bias model, it is well-suited to have its variance lowered through bias addition, which may result in higher overall predictive ability. One way to add bias is through **shrinkage**, biasing model coefficient estimates toward zero. This shrinkage can be achieved through the addition of a **regularization penalty** to the loss function which applies a unique form of shrinkage to the overall coefficient estimates.

In the next section, I’ll cover ** Ridge Regression —** regularization by the addition of a tuning parameter ($\lambda$) coefficient controlled $L_2$ penalty to the OLS loss function.

Make sure to check out the next and final part of the series to learn about the other two forms of regularized linear regression,

the Lassoandthe Elastic Net.

This form of regression is also known as ** Tikhonov Regularization** and modifies the OLS loss function Part One (equation #7) with the addition of an $L_2$ penalty with an associated tuning parameter, $\lambda$. This loss function can be described using vector notation as (equation #1):

$L(\beta) = \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \text{ with tuning parameter } \lambda \geq 0$

Similarly to the OLS case, this loss function can then be formulated as a *least-squares* optimization problem to find estimates for the model coefficients that minimize the loss function as (equation #2):

$\hat\beta = \underset{\beta}{\operatorname{argmin}} \; L(\beta) = \underset{\beta}{\operatorname{argmin}} \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2$

Just like the OLS case, a $\frac{1}{2n}$ term is added in order to simply solving the gradient and allow the objective function to converge to the expected value of the model error by **the Law of Large Numbers**.

This problem is also unconstrained and a closed-form solution for **the Ridge estimator** can be found by setting the gradient of the loss function (objective) equal to zero and solving the resultant equation. This produces an estimator result of (equation #3):

$\hat\beta = (\textbf{X}^{\textbf{T}}\textbf{X} + \lambda \textbf{I})^{-1}\textbf{X}^{\textbf{T}}\textbf{y}$

This estimator should also be shown to be unique. In this case, the associated ** Hessian** matrix is (equation #4):

$\textbf{H} = 2\textbf{X}^{\textbf{T}}\textbf{X} + 2\lambda \textbf{I}$

It turns out that this matrix can be shown to be **positive definite** by (equation #5):

$\beta^{\textbf{T}}(\textbf{X}^{\textbf{T}}\textbf{X} + \lambda \textbf{I})\beta = (\textbf{X}\beta)\textbf{X}\beta + \lambda \beta^{\textbf{T}}\beta = \lVert \textbf{X}\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \succeq 0 \: \forall \: \beta \neq 0$

Thus, since the associated Hessian matrix of the Ridge loss function is positive definite, the function is strongly convex, which implies that the Ridge estimator (equation #3) is the **unique** global minimizer to the ** Ridge Regression** problem.

Implementations of the estimator can be simplified by noticing that the problem can be reformulated as an OLS problem through **data augmentation**. This would take the form of (equation #6):

$\begin{bmatrix} \hat\beta_0 \\ \hat\beta \end{bmatrix} = \underset{\beta_0, \beta}{\text{argmin}} \left\| \begin{bmatrix} \textbf{Y} \\ 0 \end{bmatrix} - \begin{bmatrix} 1_n & \textbf{X} \\ 0 & \lambda \times \textbf{I} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta \end{bmatrix} \right\|_2^2$

Thus, by utilizing the above data augmentation the same result as Part One Equation #9 from the last part of this series can be used to solve for the coefficient estimates. That result is reproduced here:

$\hat\beta = (\textbf{X}^{\textbf{T}}\textbf{X})^{-1}\textbf{X}^{\textbf{T}}\textbf{y}$

Similar to the OLS case, the matrix inverse does not scale well, thus the NumPy function `solve`

, which employs the LAPACK *_gesv* routine, is used to find the least-squares solution. This function solves the equation in the case where A is square and full-rank (linearly independent columns). However, in the case that A is not full-rank, then the function `lstsq`

should be used, which utilizes the xGELSD routine and thus finds the singular value decomposition of A.

One possible Python implementation of ** Ridge Regression** with an optional intercept term is:

Thanks for reading part two of ** Regularized Linear Regression Models**! 🙌

If you have not already, make sure to check out **part one**!

Continue on to part three to learn about ** the Lasso** and

See **here** for the different sources utilized to create this series of posts.

Please leave a comment if you would like! I am always trying to improve my posts (logically, syntactically, or otherwise) and am happy to discuss anything related! 👍