### Machine Learning Algorithms Part 6: K-Nearest Neighbors In Python

K-Nearest Neighbors (or KNN) is one of the simplest machine learning algorithms and is widely used in practice. **KNN** is a **non-parametric, lazy** learning algorithm. When we say a technique is **non-parametric**, it means that it does not make any assumptions about the underlying distribution of the data. Being a **lazy** learning algorithm implies there is little to no training phase: the model simply stores the training data and defers the real work to prediction time.

### Some pros and cons of KNN

**Pros**:

- No assumptions about data
- Simple algorithm — easy to understand
- Versatile — classification or regression

**Cons**:

- High memory requirement — Stores all of the training data
- Sensitive to irrelevant features and the scale of the data
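To make the point about scale concrete, here is a minimal sketch using only the standard library. The three points and the `min_max_scale` helper are made up for illustration; they are not taken from the dataset used later. The first feature is measured in the hundreds and the second in fractions, so the first dominates the raw Euclidean distance.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(points):
    """Rescale each feature to [0, 1] across the given points.

    Assumes every feature takes at least two distinct values.
    """
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]

# Hypothetical (area, compactness) values, invented for this example.
query = (500.0, 0.10)
point_a = (560.0, 0.11)  # far in area, close in compactness
point_b = (505.0, 0.30)  # close in area, far in compactness

# Raw distances: the area difference swamps the compactness difference.
raw_a = euclidean(query, point_a)
raw_b = euclidean(query, point_b)
print(raw_a, raw_b)

# After rescaling, both features contribute on a comparable footing.
scaled_query, scaled_point_a, scaled_point_b = min_max_scale([query, point_a, point_b])
scaled_a = euclidean(scaled_query, scaled_point_a)
scaled_b = euclidean(scaled_query, scaled_point_b)
print(scaled_a, scaled_b)
```

On the raw values, `point_b` looks roughly twelve times closer than `point_a`. After rescaling, the two distances are nearly equal. This is why features are commonly rescaled before applying KNN.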

### How it works

1. Pick a value for **K** (e.g. 5).

2. Take the **K** nearest neighbors of the new data point according to their Euclidean distance.

3. Among these neighbors, count the number of data points in each category and assign the new data point to the category where you counted the most neighbors.
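The three steps above can be sketched in a few lines of plain Python. This is an illustrative toy version, not how `sklearn` implements it; the function name `knn_predict` and the sample points are invented for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=5):
    """Classify new_point by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the labels among those neighbors; the most common one wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D points forming two well-separated groups.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

first = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
second = knn_predict(train_points, train_labels, (5.1, 5.0), k=3)
print(first, second)
```

Note that there is no training step: every prediction scans the full training set, which is where the memory and speed costs of KNN come from. Choosing an odd **K** for two-class problems avoids tied votes.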

### Code

Let’s take a look at how we could go about classifying data using the K-Nearest Neighbors algorithm in Python. For this tutorial, we’ll be using the breast cancer dataset from the `sklearn.datasets` module.

As always, we need to start by importing the required libraries.

```
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
```

The dataset classifies tumors into two categories (malignant and benign) and contains 30 features. In the real world, you’d look at the correlations and select the subset of features that play the greatest role in determining whether a tumor is malignant or not. However, for the sake of simplicity, we’ll pick a couple at random.

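We must encode categorical data as numbers for it to be interpreted by the model. `LabelBinarizer` sorts the class names alphabetically, so benign becomes `0` and malignant becomes `1` (the reverse of the raw codes in `breast_cancer.target`).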

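```
# Load the dataset and keep two features for simplicity
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
X = X[['mean area', 'mean compactness']]

# Turn the integer targets into named categories, then encode them as 0/1
y = pd.Categorical.from_codes(breast_cancer.target, breast_cancer.target_names)
binarizer = LabelBinarizer()
# ravel() flattens the (n, 1) column returned by LabelBinarizer into a 1-D array
encoded_y = binarizer.fit_transform(y).ravel()
```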

As mentioned in another tutorial, the point of building a model is to classify new data. Therefore, we need to set aside some data to verify whether our model does a good job of predicting unseen data or is overfitting. By default, the test set created by `train_test_split` is 25% of the original data.

`train_X, test_X, train_y, test_y = train_test_split(X, encoded_y, random_state=1)`
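As a quick check of the default split size, the following standalone sketch repeats the split on the full feature matrix and prints the resulting sizes. It uses the raw `data` and `target` arrays rather than the two-column frame above, which does not affect the row counts.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target

# No test_size is given, so scikit-learn falls back to its default of 0.25.
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=1)

print(len(X), len(train_X), len(test_X))
print(round(len(test_X) / len(X), 3))
```

Because the dataset size is not a multiple of four, the test share comes out slightly above 25%. Fixing `random_state` makes the split reproducible between runs.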

The `sklearn` library provides a layer of abstraction on top of Python. To use **KNN**, it’s sufficient to create an instance of `KNeighborsClassifier`. By default, the `KNeighborsClassifier` looks for the **5** nearest neighbors. Its default metric (Minkowski with `p=2`) is already equivalent to Euclidean distance, but we pass `metric='euclidean'` explicitly to make the choice clear.

```
knn_model = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_model.fit(train_X, train_y)
```

Using our newly trained model, we predict the class given the features in the test set.

`knn_test_predictions = knn_model.predict(test_X)`

The numbers on the diagonal of the confusion matrix correspond to correct predictions, whereas the off-diagonal numbers are the misclassifications (false positives and false negatives).

`confusion_matrix(test_y, knn_test_predictions)`

Given our confusion matrix, our model has an accuracy of 121/143 = 84.6%.
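The accuracy figure is simply the sum of the diagonal divided by the total number of test samples. A minimal sketch of that arithmetic follows; the helper name `accuracy_from_confusion` and the counts in `toy_matrix` are made up for illustration and are not the output of the model above.

```python
def accuracy_from_confusion(matrix):
    """Fraction of correct predictions: diagonal total over grand total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Illustrative counts only. Rows are true classes, columns are predicted classes,
# which matches the layout returned by sklearn's confusion_matrix.
toy_matrix = [
    [50, 10],
    [5, 35],
]

toy_accuracy = accuracy_from_confusion(toy_matrix)
print(f"{toy_accuracy:.1%}")  # 85.0%
```

In practice, `sklearn.metrics.accuracy_score(test_y, knn_test_predictions)` should give the same number without going through the matrix.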

**Cory Maklin**
