Date Written: 2021-01-15

Welcome to the final part of a three-part deep-dive on **regularized linear regression modeling**! In part one, linear modeling was established with the derivation of **OLS**, showing how to solve for model coefficients and make predictions of the response given new feature data. Next, in part two, the **overfitting** issues of the OLS model were discussed, and **Ridge Regression** was introduced as a regularized remedy. This final part covers **the Lasso** and **the Elastic Net**.

The models and included implementations were tested on a wine quality prediction dataset; the code and results can be viewed at the project repository **here**.

The Lasso, or **Least Absolute Shrinkage and Selection Operator**, includes the addition of an $L_1$ penalty to the OLS loss function.

**The Lasso** loss function can be formalized as (equation #1):

$L(\beta) = \lVert \textbf{y} - \textbf{X}\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \text{ with tuning parameter } \lambda \geq 0$

Similar to previous cases, an intercept term can be included through data augmentation of the design matrix with a column of **1**s. Furthermore, formulating the problem as a *least-squares* optimization problem, with a $\frac{1}{2n}$ scaling term added for convenience, produces (equation #2):

$\hat{\beta} = \underset{\beta}{\text{argmin}} \: L(\beta) = \underset{\beta}{\text{argmin}} \frac{1}{2n} \lVert \textbf{y} - \textbf{X}\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1$

However, unlike previous cases, no closed-form solution exists for this problem. This is because the addition of the **L₁** penalty makes the loss function no longer continuously differentiable, owing to its non-smooth absolute value component. To remedy this issue, an iterative optimization technique needs to be applied to search for an optimal solution.

Numerous algorithms exist to this end, such as **LARS** (Least Angle Regression) and **Pathwise Coordinate Descent**, the approach used here.

Before beginning the algorithm, all features should be standardized to have zero mean and unit variance. From there, a coefficient vector of length **p + 1** is initialized to zero. Cycles are then run across all coefficients until convergence is reached, that is, until the coefficient values stabilize and change by less than a given tolerance. Within each cycle, an update is calculated for every coefficient and the soft-thresholding operator is then applied to it.
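For concreteness, standardization can be performed directly with NumPy; the sketch below uses randomly generated data as a stand-in for a real feature matrix:

```
import numpy as np

# Example raw feature matrix: 100 samples, 3 features (illustrative data only)
X_raw = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(100, 3))

# Standardize each column to zero mean and unit variance
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
```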

The simplest form of **coordinate descent updates** calculates, for each coefficient, the simple (single-variable, as opposed to multiple-regression) least-squares coefficient value using the partial residuals across all other features in the design matrix. Partial residuals, in this case, are found as (equation #3):

$r_{i, j} = y_i - \sum_{k \neq j} x_{i, k} \beta_k$

Therefore, the estimate for a particular coefficient value can be found as (equation #4):

$\beta^*_j = \frac{1}{n} \sum_{i=1}^n x_{i, j} r_{i, j}$
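As a sketch of Eqs. #3 and #4, the partial residuals and the single-coefficient estimate could be computed as follows (the function and argument names here are purely illustrative):

```
import numpy as np

def simple_coordinate_estimate(X, y, beta, j):
    # Partial residuals (Eq. #3): response minus the fit from every feature except j
    r_j = y - X @ beta + X[:, j] * beta[j]
    # Simple least-squares estimate (Eq. #4); features assumed standardized
    n = len(y)
    return (1 / n) * (X[:, j] @ r_j)
```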

Now, the penalty, as dictated by the tuning parameter, is included in the model through the soft-thresholding operator. This is expressed as (equation #5):

$\beta_j = \textbf{S}(\beta^*_j, \lambda) = \text{sign}(\beta^*_j)(\lvert \beta^*_j \rvert - \lambda)_+ =
\begin{cases}
\beta^*_j - \lambda & \text{if } \beta^*_j \gt 0 \text{ and } \lambda \lt \lvert \beta^*_j \rvert \\
\beta^*_j + \lambda & \text{if } \beta^*_j \lt 0 \text{ and } \lambda \lt \lvert \beta^*_j \rvert \\
0 & \text{if } \lambda \geq \lvert \beta^*_j \rvert
\end{cases}$
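In code, the soft-thresholding operator of Eq. #5 reduces to a couple of lines; one possible NumPy version (the helper name is illustrative):

```
import numpy as np

def soft_threshold(b, l1):
    # Shrink b toward zero by l1; set it exactly to zero when |b| <= l1 (Eq. #5)
    return np.sign(b) * max(abs(b) - l1, 0)
```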

**Naive updates** can be utilized for improved efficiency. These updates are found via (equation #6):

$\beta^*_j = \frac{1}{n} \sum_{i=1}^n x_{i, j} r_i + \beta_j$

where $r_i$ is the current model residual for sample $i$, with the sum taken over all $n$ samples.

When the number of samples is much greater than the number of features ($n \gg p$), further efficiency improvements can be derived by using **covariance updates**. For these updates, the first term of the **naive update** equation above (Eq. #6) is replaced as shown by (equation #7):

$\sum_{i=1}^n x_{i, j}r_i = \langle x_j, y \rangle - \sum_{k: \lvert\beta_k\rvert\gt0}\langle x_j, x_k \rangle \beta_k$
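A possible sketch of the covariance update (Eqs. #6 and #7), in which the inner products $\langle x_j, y \rangle$ and $\langle x_j, x_k \rangle$ are computed once and reused across cycles (function and argument names here are illustrative):

```
import numpy as np

def covariance_update(X, y, beta, j, Xty=None, XtX=None):
    n = len(y)
    # Precomputable inner products: <x_j, y> and <x_j, x_k> for all j, k
    Xty = X.T @ y if Xty is None else Xty
    XtX = X.T @ X if XtX is None else XtX
    # Only features with nonzero coefficients contribute to the second term (Eq. #7)
    k = np.flatnonzero(beta)
    return (1 / n) * (Xty[j] - XtX[j, k] @ beta[k]) + beta[j]
```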

Utilizing **warm starts** can bring efficiency boosts as well. Using warm starts, the model is fit along a decreasing sequence of tuning parameter values, with each solution used to initialize the coefficients for the next fit (equation #8):

$\lambda_{max} \rightarrow \lambda_{min} = \epsilon \cdot \lambda_{max}$

where $\epsilon$ is typically 0.001 and there are 100 values spaced on a log scale.

Furthermore, the max value of the tuning parameter, at which the path begins, is the minimum value that brings the estimates of all model coefficients to zero; any value above it results in total sparsity of the coefficient vector. This starting point of the path can be found as (equation #9):

$\lambda_{max} = \frac{\underset{l}{\text{max}}\lvert\langle x_l, y \rangle\rvert}{n}$
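Putting Eqs. #8 and #9 together, the tuning parameter path might be constructed as follows (a sketch assuming a standardized design matrix `X` without the intercept column):

```
import numpy as np

def lambda_path(X, y, eps=0.001, path_length=100):
    n = len(y)
    # Smallest lambda that zeroes out every coefficient (Eq. #9)
    l_max = np.max(np.abs(X.T @ y)) / n
    # Log-spaced values from lambda_max down to eps * lambda_max (Eq. #8)
    return np.geomspace(l_max, eps * l_max, path_length)
```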

By providing a starting place from which to begin searching for optimality, warm starting can often speed up convergence, and it also puts the *pathwise* in the **Pathwise Coordinate Descent** algorithm.

One possible way to implement **pathwise coordinate descent** to solve **the Lasso** in Python is shown below:

```
import numpy as np

def lasso(X, y, l1, tol=1e-6, path_length=100, return_path=False):
    """The Lasso Regression model with intercept term.
    Intercept term included via design matrix augmentation.
    Pathwise coordinate descent with covariance updates is applied.
    Path runs from the max value of the L1 tuning parameter down to the input value.
    Features must be standardized (centered and scaled to unit variance).
    Params:
        X - NumPy matrix, size (N, p), of standardized numerical predictors
        y - NumPy array, length N, of numerical response
        l1 - L1 penalty tuning parameter (positive scalar)
        tol - Coordinate descent convergence tolerance (exit if change < tol)
        path_length - Number of tuning parameter values to include in path (positive integer)
        return_path - Boolean indicating whether model coefficients along path should be returned
    Returns:
        if return_path == False:
            NumPy array, length p + 1, of fitted model coefficients
        if return_path == True:
            List, length 3, of last fitted model coefficients, tuning parameter path,
            and coefficient values along the path
    """
    # Augment the design matrix with a column of 1s for the intercept term
    X = np.hstack((np.ones((len(X), 1)), X))
    m, n = np.shape(X)
    B_star = np.zeros((n))
    # Smallest tuning parameter value that zeroes out all non-intercept coefficients (Eq. #9)
    l_max = max(list(abs(np.dot(np.transpose(X[:, 1:]), y)))) / m
    # At or above l_max, all coefficients (except the intercept) are brought to 0
    if l1 >= l_max:
        return np.append(np.mean(y), np.zeros((n - 1)))
    # Log-spaced path of tuning parameter values from l_max down to the requested l1 (Eq. #8)
    l_path = np.geomspace(l_max, l1, path_length)
    coefficients = np.zeros((len(l_path), n))
    for i in range(len(l_path)):
        while True:
            B_s = B_star.copy()  # snapshot of the previous cycle for the convergence check
            for j in range(n):
                k = np.where(B_star != 0)[0]
                # Covariance update (Eqs. #6-#7) ...
                update = (1 / m) * (np.dot(X[:, j], y)
                                    - np.dot(np.dot(X[:, j], X[:, k]), B_star[k])) + B_star[j]
                # ... followed by soft-thresholding (Eq. #5)
                B_star[j] = np.sign(update) * max(abs(update) - l_path[i], 0)
            if np.all(abs(B_s - B_star) < tol):
                coefficients[i, :] = B_star
                break
    if return_path:
        return [B_star, l_path, coefficients]
    return B_star
```
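As a quick usage sketch, the function can be exercised on synthetic data along these lines (the data below is made up for illustration and is not the wine quality dataset from the project repository):

```
import numpy as np

# Synthetic example: 200 samples, 5 standardized features, sparse true coefficients
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 2.0 + 3.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

B = lasso(X, y, l1=0.1)
print(B)  # first entry is the intercept; the truly-zero coefficients shrink to (or near) zero
```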

In this next form of regularized linear regression, **the Elastic Net**, the OLS loss function is again changed, this time by the addition of both an $L_1$ and an $L_2$ penalty, with tuning parameters controlling the intensity of regularization and the balance between the two penalties. This loss function can be formalized as (equation #10):

$L(\beta) = \lVert \mathbf{y} - \mathbf{X}\beta \rVert_2^2 + \lambda [ (1-\alpha) \frac{1}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1] \\ \text{ with tuning parameters } \lambda \geq 0, \: 0 \leq \alpha \leq 1$

Having both $L_1$ and $L_2$ penalties, **the Elastic Net** serves as a compromise between Ridge regression and the Lasso, shrinking coefficients towards zero while also setting some of them exactly to zero. Here, $\alpha$ can be considered the parameter determining the proportion of the $L_1$ penalty in the mix, whereas $\lambda$ can be thought of as the overall intensity of regularization to apply.

**The Elastic Net** loss function is also used to formalize a *least-squares* optimization problem (equation #11):

$\hat\beta = \underset{\beta}{\text{argmin}} \: L(\beta) = \underset{\beta}{\text{argmin}} \frac{1}{2n}\lVert \mathbf{y} - \mathbf{X}\beta \rVert_2^2 + \lambda [ (1-\alpha) \frac{1}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1]$

Similarly to the Lasso, an intercept is included through design matrix augmentation, a $\frac{1}{2n}$ term is added for mathematical completeness, and ** pathwise coordinate descent** is implemented to solve since a closed-form solution does not exist due to the $L_1$ penalty term.

Just as in **the Lasso**, **pathwise coordinate descent** is used to solve for the model coefficient estimates; however, the update equations need to change to account for the dual penalties. In this case, the $j$th coefficient value obtained after soft-thresholding is now found as (equation #12):

$\beta_j = \frac{\mathbf{S}(\beta^*_j, \lambda \alpha)}{1 + \lambda (1 - \alpha)}$

The soft-thresholding operator is the same one applied in **the Lasso** update (Eq. #5), now with a threshold of $\lambda \alpha$ (equation #13):

$\text{sign}(\beta^*_j)(\lvert \beta^*_j \rvert - \lambda \alpha)_+$
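Reusing the soft-thresholding idea, the Elastic Net coordinate update of Eqs. #12 and #13 might be written as follows (a sketch; `b_star` stands for the unpenalized update of Eq. #6):

```
import numpy as np

def elastic_net_update(b_star, l, alpha):
    # Soft-threshold at l * alpha (Eq. #13), then shrink by the L2 term (Eq. #12)
    return (np.sign(b_star) * max(abs(b_star) - l * alpha, 0)) / (1 + l * (1 - alpha))
```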

Furthermore, **naive updates** or **covariance updates** can be used along with **warm starts**. For **warm starts**, the max $\lambda$ value can be calculated as (equation #14):

$\lambda_{\text{max}} = \frac{\underset{l}{\text{max}} \lvert \langle x_l, y \rangle \rvert}{n \alpha}$

One possible way to implement **pathwise coordinate descent** to solve **the Elastic Net** in Python is shown below:

```
import numpy as np

def elastic_net(X, y, l, alpha, tol=1e-4, path_length=100):
    """The Elastic Net Regression model with intercept term.
    Intercept term included via design matrix augmentation.
    Pathwise coordinate descent with covariance updates is applied.
    Path runs from the max value of the L1 tuning parameter down to the input value.
    Features must be standardized (centered and scaled to unit variance).
    Params:
        X - NumPy matrix, size (N, p), of standardized numerical predictors
        y - NumPy array, length N, of numerical response
        l - penalty tuning parameter (positive scalar)
        alpha - mixing parameter balancing the L1 and L2 penalties (scalar between 0 and 1)
        tol - Coordinate descent convergence tolerance (exit if change < tol)
        path_length - Number of tuning parameter values to include in path (positive integer)
    Returns:
        NumPy array, length p + 1, of fitted model coefficients
    """
    # Augment the design matrix with a column of 1s for the intercept term
    X = np.hstack((np.ones((len(X), 1)), X))
    m, n = np.shape(X)
    B_star = np.zeros((n))
    # Guard against division by zero below; alpha == 0 corresponds to pure Ridge regression
    if alpha == 0:
        alpha = 1e-15
    # Smallest tuning parameter value that zeroes out all non-intercept coefficients (Eq. #14)
    l_max = max(list(abs(np.dot(np.transpose(X[:, 1:]), y)))) / m / alpha
    if l >= l_max:
        return np.append(np.mean(y), np.zeros((n - 1)))
    l_path = np.geomspace(l_max, l, path_length)
    for i in range(path_length):
        while True:
            B_s = B_star.copy()  # snapshot of the previous cycle for the convergence check
            for j in range(n):
                k = np.where(B_star != 0)[0]
                # Covariance update (Eqs. #6-#7) ...
                update = (1 / m) * (np.dot(X[:, j], y)
                                    - np.dot(np.dot(X[:, j], X[:, k]), B_star[k])) + B_star[j]
                # ... followed by the Elastic Net soft-thresholding and L2 shrinkage (Eq. #12)
                B_star[j] = (np.sign(update) * max(abs(update) - l_path[i] * alpha, 0)) \
                    / (1 + l_path[i] * (1 - alpha))
            if np.all(abs(B_s - B_star) < tol):
                break
    return B_star
```
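And a corresponding usage sketch on made-up data, sweeping the mixing parameter to see how the $L_1$/$L_2$ balance shifts:

```
import numpy as np

# Synthetic example: 200 samples, 5 standardized features, sparse true coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 1.0 + 2.5 * X[:, 1] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

for alpha in (0.1, 0.5, 0.9):
    B = elastic_net(X, y, l=0.1, alpha=alpha)
    # Larger alpha puts more weight on the L1 penalty, favoring exact zeros
    print(alpha, np.round(B, 3))
```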

Throughout this series, different regularized forms of linear regression have been examined as tools to overcome the **Ordinary Least Squares** model's tendency to overfit the training data.

The header image of this series demonstrates well the difference between **Ridge Regression** and **the Lasso**.

Congratulations! 🎉🎊🥳

You made it to the end of the series!

I hope that these posts were informative and helpful for you to learn more about **regularized linear regression** and the concepts necessary to apply these models in practice.

See **here** for the different sources utilized to create this series of posts.

If you are just now beginning the series, make sure to check out part one and part two!

Please leave a comment if you would like! I am always trying to improve my posts (logically, syntactically, or otherwise) and am happy to discuss anything related! 👋