kNN Classification in R
Visualize a k-Nearest-Neighbors (kNN) classification in R with Tidymodels.
New to Plotly?
Plotly is a free and open-source graphing library for R. We recommend you read our Getting Started guide for the latest installation or upgrade instructions, then move on to our Plotly Fundamentals tutorials or dive straight in to some Basic Charts tutorials.
kNN Classification in R
Visualize Tidymodels' k-Nearest Neighbors (kNN) classification in R with Plotly.
Basic binary classification with kNN
This section gets us started with displaying basic binary classification using 2D data. We first show how to display training versus testing data using various marker styles, then demonstrate how to evaluate our classifier's performance on the test split using a continuous color gradient to indicate the model's predicted score.
We will use Tidymodels for training our model and for loading and splitting data. Tidymodels is a popular Machine Learning (ML) library that offers various tools for creating and training ML algorithms, feature engineering, data cleaning, and evaluating and testing models.
We will train a k-Nearest Neighbors (kNN) classifier. First, the model records the label of each training sample. Then, whenever we give it a new sample, it will look at the k
closest samples from the training set to find the most common label, and assign it to our new sample.
Display training and test splits
Using Tidymodels, we first generate synthetic data that form the shape of a moon. We then split it into a training and testing set. Finally, we display the ground truth labels using a scatter plot.
In the graph, we display all the negative labels as squares, and positive labels as circles. We differentiate the training and test set by adding a dot to the center of test data.
library(tidyverse)
library(tidymodels)
library(plotly)
make_moons <- read.csv(file = "data/make_moons.csv")
make_moons$y <- as.character(make_moons$y)
set.seed(123)
make_moons_split <- initial_split(make_moons, prop = 3/4)
make_moons_training <- make_moons_split %>%
training()
make_moons_test <- make_moons_split %>%
testing()
train_index <- as.integer(rownames(make_moons_training))
test_index <- as.integer(rownames(make_moons_test))
make_moons[train_index,'split'] = 'Train Split Label'
make_moons[test_index,'split'] = 'Test Split Label'
make_moons$y <- paste(make_moons$split,make_moons$y)
fig <- plot_ly(data = make_moons, x = ~X1, y = ~X2, type = 'scatter', mode = 'markers',alpha = 0.5, symbol = ~y, symbols = c('square','circle','square-dot','circle-dot'),
marker = list(size = 12,
color = 'lightyellow',
line = list(color = 'black',width = 1)))
fig
Visualize predictions on test split
Now, we train the kNN model on the same training data displayed in the previous graph. Then, we predict the confidence score of the model for each of the data points in the test set. We will use shapes to denote the true labels, and the color will indicate the confidence of the model for assign that score.
Notice that plot_ly
only requires one function call to plot both negative and positive labels, and can additionally set a continuous color scale based on the yscore
output by our kNN model.
library(plotly)
library(tidymodels)
db <- read.csv('data/make_moons.csv')
db$y <- as.factor(db$y)
db_split <- initial_split(db, prop = 3/4)
train_data <- training(db_split)
test_data <- testing(db_split)
x_test <- test_data %>% select(X1, X2)
y_test <- test_data %>% select(y)
knn_dist <- nearest_neighbor(neighbors = 15, weight_func = 'rectangular') %>%
set_engine('kknn') %>%
set_mode('classification') %>%
fit(y~., data = train_data)
yscore <- knn_dist %>%
predict(x_test, type = 'prob')
colnames(yscore) <- c('yscore0','yscore1')
yscore <- yscore$yscore1
pdb <- cbind(x_test, y_test)
pdb <- cbind(pdb, yscore)
fig <- plot_ly(data = pdb,x = ~X1, y = ~X2, type = 'scatter', mode = 'markers',color = ~yscore, colors = 'RdBu', symbol = ~y, split = ~y, symbols = c('square-dot','circle-dot'),
marker = list(size = 12, line = list(color = 'black', width = 1)))
fig
Probability Estimates with Contour
Just like the previous example, we will first train our kNN model on the training set.
Instead of predicting the confidence for the test set, we can predict the confidence map for the entire area that wraps around the dimensions of our dataset. To do this, we use meshgrid
to create a grid, where the distance between each point is denoted by the mesh_size
variable.
Then, for each of those points, we will use our model to give a confidence score, and plot it with a contour plot.
library(plotly)
library(pracma)
library(kknn)
library(tidymodels)
make_moons <- read.csv(file = "data/make_moons.csv")
make_moons_classification <- make_moons
make_moons$y <- as.character(make_moons$y)
set.seed(123)
make_moons_split <- initial_split(make_moons, prop = 3/4)
make_moons_training <- make_moons_split %>%
training()
make_moons_test <- make_moons_split %>%
testing()
train_index <- as.integer(rownames(make_moons_training))
test_index <- as.integer(rownames(make_moons_test))
mesh_size = .02
margin = 0.25
x_min = min(make_moons$X1) - margin
x_max = max(make_moons$X1) + margin
y_min = min(make_moons$X2) - margin
y_max = max(make_moons$X2) + margin
xrange <- seq(x_min, x_max, mesh_size)
yrange <- seq(y_min, y_max, mesh_size)
xy <- meshgrid(x = xrange, y = yrange)
xx <- xy$X
yy <- xy$Y
make_moons_classification$y <- as.factor(make_moons_classification$y)
knn_dist <- nearest_neighbor(neighbors = 15, weight_func = 'rectangular') %>%
set_engine('kknn') %>%
set_mode('classification') %>%
fit(y~., data = make_moons_classification)
dim_val <- dim(xx)
xx1 <- matrix(xx, length(xx), 1)
yy1 <- matrix(yy, length(yy), 1)
final <- data.frame(xx1, yy1)
colnames(final) <- c('X1','X2')
pred <- knn_dist %>%
predict(final, type = 'prob')
predicted <- pred$.pred_1
Z <- matrix(predicted, dim_val[1], dim_val[2])
fig <- plot_ly(x = xrange, y= yrange, z = Z, colorscale='RdBu', type = "contour")
fig
Now, let's try to combine our Contour
plot with the first scatter plot of our data points, so that we can visually compare the confidence of our model with the true labels.
library(plotly)
library(pracma)
library(kknn)
library(tidymodels)
make_moons <- read.csv(file = "data/make_moons.csv")
make_moons_classification <- make_moons
make_moons$y <- as.character(make_moons$y)
set.seed(123)
make_moons_split <- initial_split(make_moons, prop = 3/4)
make_moons_training <- make_moons_split %>%
training()
make_moons_test <- make_moons_split %>%
testing()
train_index <- as.integer(rownames(make_moons_training))
test_index <- as.integer(rownames(make_moons_test))
mesh_size = .02
margin = 0.25
x_min = min(make_moons$X1) - margin
x_max = max(make_moons$X1) + margin
y_min = min(make_moons$X2) - margin
y_max = max(make_moons$X2) + margin
xrange <- seq(x_min, x_max, mesh_size)
yrange <- seq(y_min, y_max, mesh_size)
xy <- meshgrid(x = xrange, y = yrange)
xx <- xy$X
yy <- xy$Y
make_moons_classification$y <- as.factor(make_moons_classification$y)
knn_dist <- nearest_neighbor(neighbors = 15, weight_func = 'rectangular') %>%
set_engine('kknn') %>%
set_mode('classification') %>%
fit(y~., data = make_moons_classification)
make_moons[train_index,'split'] = 'Train Split Label'
make_moons[test_index,'split'] = 'Test Split Label'
make_moons$y <- paste(make_moons$split,make_moons$y)
dim_val <- dim(xx)
xx1 <- matrix(xx, length(xx), 1)
yy1 <- matrix(yy, length(yy), 1)
final <- data.frame(xx1, yy1)
colnames(final) <- c('X1','X2')
pred <- knn_dist %>%
predict(final, type = 'prob')
predicted <- pred$.pred_1
Z <- matrix(predicted, dim_val[1], dim_val[2])
fig <- plot_ly(symbols = c('square','circle','square-dot','circle-dot'))%>%
add_trace(x = xrange, y= yrange, z = Z, colorscale='RdBu', type = "contour", opacity = 0.5) %>%
add_trace(data = make_moons, x = ~X1, y = ~X2, type = 'scatter', mode = 'markers', symbol = ~y ,
marker = list(size = 12,
color = 'lightyellow',
line = list(color = 'black',width = 1)))
fig