### Least Squares Linear Regression In Python

As the name implies, the method of **Least Squares** minimizes the sum of the **squares** of the residuals between the observed targets in the dataset, and the targets predicted by the linear approximation. In this proceeding article, we’ll see how we can go about finding the best fitting line using linear algebra as opposed to something like gradient descent.

### Algorithm

Contrary to what I had initially thought, the `scikit-learn`

implementation of Linear Regression minimizes a cost function of the form:

using the singular value decomposition of X.

If you’re already familiar with Linear Regression, you might see some similarities to the preceding equation and the mean square error (MSE).

As quick refresher, suppose we had the following scatter plot and regression line.

We calculate the distance from the line to a given data point by subtracting one from the other. We take the square of the difference because we don’t want the predicted values below the actual values to cancel out with those above the actual values. In mathematical terms, the latter can be expressed as follows:

The cost function used in the `scikit-learn`

library is similar, only we’re calculating it simultaneously using matrix operations.

For those of you that have taken a Calculus course, you’ve probably encountered this kind of notation before.

In this case, **x** is a vector and we are calculating its magnitude.

In the same sense, when we surround the variable for a matrix (i.e. A) by vertical bars, we are saying that we want to go from a matrix of rows and columns to a scalar. There are multiple ways of deriving a scalar from a matrix. Depending on which one is used, you’ll see a different symbol to the right of the variable (the extra 2 in the equation wasn’t put there by accident).

The additional 2 implies that we are taking the Euclidean norm of the matrix.

Suppose we had a matrix A, the Euclidean norm of A is equal to the square root of the largest eigenvalue of the transpose of A dot A. For the sake of clarity, let’s walk through an example.

So that’s how we quantify the error. However, that gives rise to a new question. Specifically, how do we actually go about minimizing it? Well, as it turns out, the minimum norm least squares solution (coefficients) can be found by calculating the pseudoinverse of the input matrix X and multiplying that by the output vector y.

where the pseudo-inverse of X is defined as:

The matrices from above can all be obtain from the Singular Value Decomposition (SVD) of X. Recall that the SVD of X can be described as follows:

If you’re curious as to how you actually determine U, sigma and the transpose of V, check out **this** article I wrote a while back which goes over how to use SVD for dimensionality reduction.

We construct the diagonal matrix D^+ by taking the inverse of the values within the `sigma`

matrix. The `+`

refers to the fact that all the elements must be greater than 0 since we can’t divide by 0.

### Python Code

Let’s take a look to see how we could go about implementing Linear Regression from scratch using basic `numpy`

functions. To begin, we import the following libraries.

```
from sklearn.datasets import make_regression
from matplotlib import pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
```

Next, we generate data using the `scikit-learn`

library.

```
X, y, coefficients = make_regression(
n_samples=50,
n_features=1,
n_informative=1,
n_targets=1,
noise=5,
coef=True,
random_state=1
)
```

We store the the rank and the number of columns of the matrix as variables.

```
n = X.shape[1]
r = np.linalg.matrix_rank(X)
```

We find the equivalent to our matrix of features using singular value decomposition.

`U, sigma, VT = np.linalg.svd(X, full_matrices=False)`

Then, D^+ can be derived from sigma.

`D_plus = np.diag(np.hstack([1/sigma[:r], np.zeros(n-r)]))`

*V* is of course equal to the transpose of its transpose as described in the following identity.

Finally, we determine **Moore-Penrose pseudoinverse** of X.

`X_plus = V.dot(D_plus).dot(U.T)`

As we saw in the preceding section, the vector of coefficients can calculated by multiplying the pseudoinverse of the matrix X by y.

`w = X_plus.dot(y)`

To obtain the actual error, we compute the residual sum of squares using the very first equation we saw.

`error = np.linalg.norm(X.dot(w) - y, ord=2) ** 2`

To verify we obtained the correct answer, we can make use a `numpy`

function that will compute and return the least squares solution to a linear matrix equation. To be specific, the function returns 4 values.

- Least Squares solution
- Sums of residuals (error)
- Rank of the matrix (X)
- Singular values of the matrix (X)

`np.linalg.lstsq(X, y)`

We can visually determine if the coefficient actually lead to the optimal fit by plotting the regression line.

```
plt.scatter(X, y)
plt.plot(X, w*X, c='red')
```

Let’s do the same thing using the `scikit-learn`

implementation of Linear Regression.

`lr = LinearRegression()`

`lr.fit(X, y)`

`w = lr.coef_[0]`

Finally, we plot the regression line using the newly found coefficient.

```
plt.scatter(X, y)
plt.plot(X, w*X, c='red')
```