What is the formula in kknn?

Let us first implement K-nearest neighbour in R, and later understand how it actually works.

Install and load the R library “kknn”. This installs all the methods required to implement K-nearest neighbour.

> install.packages("kknn")
> library(kknn)

Now load the built-in iris dataset into memory using data(iris), and then partition the data into training and testing sets. imp is a vector (an R data type) containing numbers sampled from the range 1:m, with length one third of the number of rows in the iris dataset and an equal probability of drawing any number in 1:m. iris[imp,] selects all the rows of the iris dataset whose index is present in imp; iris[-imp,] does the opposite, selecting every row except those in imp. As an end result we have our dataset partitioned into two parts, with train:test as 2:1.

> data(iris)
> m <- nrow(iris)
> imp <- sample(1:m, m/3, prob = rep(1/m, m))
> iris.train <- iris[-imp,]
> iris.test <- iris[imp,]

Now let us generate the “model” (this will be clear in a moment) for our algorithm using the expression given below. formula = formula(Species~.) specifies the formula to be used for generating the model; Species~. is a shortcut for using all the attributes of the dataset to predict the class Species. k specifies the number of “neighbours” to take into consideration while assigning a class to a test instance. distance = 1 specifies the parameter (p) of the Minkowski distance.

> iris.knn <- kknn(formula = formula(Species~.), train = iris.train, test = iris.test, k = 7, distance = 1)

fitted() is a generic R method for extracting the classes predicted by a model. Then, using table(), the actual classes and the predicted classes are compared against each other.

> fit <- fitted(iris.knn)
> table(iris.test$Species, fit)
            fit
             setosa versicolor virginica
  setosa         13          0         0
  versicolor      0         16         4
  virginica       0          2        15

Results may vary depending on the training and testing partition on your system, but they won't vary much. You can also use the predict() command with a trained model to predict the classes of new test objects.
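For example, kknn also provides train.kknn(), which returns a model object that works with predict(). A sketch, repeating the random split from above so the snippet is self-contained:

```r
library(kknn)

# Recreate a 2:1 train/test split of iris.
data(iris)
m <- nrow(iris)
imp <- sample(1:m, m/3)
iris.train <- iris[-imp, ]
iris.test  <- iris[imp, ]

# train.kknn() fits a reusable model; kknn() above predicted in a single call.
iris.model <- train.kknn(Species ~ ., data = iris.train, kmax = 7)
iris.pred  <- predict(iris.model, iris.test)
table(iris.test$Species, iris.pred)
```

Here kmax = 7 asks train.kknn() to evaluate neighbourhood sizes up to 7 and keep the best via leave-one-out cross-validation.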
If at any point you wish to find out more about a command, run the following and the help page will come up:

>??command

How does K-NN work?

To make it really simple, k-NN works on the premise that if k instances of the same class “look like” the current instance, then it is highly likely that the object's class is the same as that of those k instances. Formalizing this: k-NN classification finds a group of k objects in the training set that are closest to the test object, and assigns the test object a class on the basis of the predominant class in its neighbourhood.

Defining the closeness of two objects

Minkowski distance is a general metric for defining the distance between two objects. The smaller the value of this distance, the closer the two objects are. The general formula for calculating the distance between two objects P and Q is:

Dist(P, Q) = ( Σᵢ |Pᵢ − Qᵢ|^p )^(1/p)

where the sum runs over all the attributes i of the two objects, and p ≥ 1 is the Minkowski parameter: p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
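This distance takes only a couple of lines of base R (a sketch; the function name minkowski_dist is my own, not part of kknn):

```r
# Minkowski distance between two numeric vectors P and Q with parameter p.
# p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
minkowski_dist <- function(P, Q, p = 2) {
  sum(abs(P - Q)^p)^(1/p)
}

minkowski_dist(c(0, 0), c(3, 4), p = 2)  # Euclidean: 5
minkowski_dist(c(0, 0), c(3, 4), p = 1)  # Manhattan: 7
```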

Algorithm:

Simply put: for the test object z, the distance between z and every object in the training set is calculated; the classes of the k closest objects are collected, and the most predominant class among them is assigned to z. Ties are broken in an unspecified manner, either by randomly picking one of the tied classes, or by weighting the classes among the k neighbours (e.g. by distance) and assigning the highest-weighted one.
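The steps above can be sketched in a few lines of base R. This is an illustration of plain unweighted k-NN, not kknn's internal implementation, and knn_classify is a made-up name:

```r
# Classify one test object z by majority vote among its k nearest
# training objects, using Euclidean distance.
knn_classify <- function(z, train_x, train_y, k = 7) {
  dists <- apply(train_x, 1, function(row) sqrt(sum((row - z)^2)))
  nearest <- order(dists)[1:k]       # indices of the k closest training objects
  votes <- table(train_y[nearest])   # count each class in the neighbourhood
  names(votes)[which.max(votes)]     # predominant class (ties: first level wins)
}

# Example on iris: predict the class of the first flower from all the others.
data(iris)
knn_classify(as.numeric(iris[1, 1:4]), iris[-1, 1:4], iris$Species[-1], k = 7)
```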

Using kknn through parsnip


kknn::train.kknn() fits a model that uses the K most similar data points from the training set to predict new samples.

For this engine, there are multiple modes: classification and regression.

This model has 3 tuning parameters:

  • neighbors: # Nearest Neighbors (type: integer, default: 5L)

  • weight_func: Distance Weighting Function (type: character, default: ‘optimal’)

  • dist_power: Minkowski Distance Order (type: double, default: 2.0)
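A sketch of how these parameters are set through the parsnip interface (assuming the parsnip package is installed; the parameter values here are arbitrary):

```r
library(parsnip)

# The tuning parameters map onto kknn arguments:
# neighbors -> ks, weight_func -> kernel, dist_power -> distance.
spec <- nearest_neighbor(neighbors = 7,
                         weight_func = "triangular",
                         dist_power = 1) %>%
  set_engine("kknn") %>%
  set_mode("classification")

spec
```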

## K-Nearest Neighbor Model Specification (regression)
##
## Main Arguments:
##   neighbors = integer(1)
##   weight_func = character(1)
##   dist_power = double(1)
##
## Computational engine: kknn
##
## Model fit template:
## kknn::train.kknn(formula = missing_arg(), data = missing_arg(),
##     ks = min_rows(0L, data, 5), kernel = character(1), distance = double(1))

min_rows() will adjust the number of neighbors if the chosen value is not consistent with the actual data dimensions.

## K-Nearest Neighbor Model Specification (classification)
##
## Main Arguments:
##   neighbors = integer(1)
##   weight_func = character(1)
##   dist_power = double(1)
##
## Computational engine: kknn
##
## Model fit template:
## kknn::train.kknn(formula = missing_arg(), data = missing_arg(),
##     ks = min_rows(0L, data, 5), kernel = character(1), distance = double(1))

Factor/categorical predictors need to be converted to numeric values (e.g., dummy or indicator variables) for this engine. When using the formula method via fit(), parsnip will convert factor columns to indicators.
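Base R's model.matrix() shows what such a conversion looks like (a sketch for illustration; parsnip performs this step for you when you fit via a formula):

```r
# A factor column with 3 levels becomes 3 numeric 0/1 indicator columns.
df <- data.frame(color = factor(c("red", "green", "blue")))
dummies <- model.matrix(~ color - 1, data = df)  # "-1" drops the intercept
dummies
```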

Predictors should have the same scale. One way to achieve this is to center and scale each so that each predictor has mean zero and a variance of one.
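In base R, scale() does exactly this, centering each column to mean zero and scaling it to unit variance:

```r
# Center and scale the four numeric iris predictors.
data(iris)
iris_scaled <- scale(iris[, 1:4])

round(colMeans(iris_scaled), 10)  # column means are (numerically) zero
apply(iris_scaled, 2, sd)         # column standard deviations are one
```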

The underlying model implementation does not allow for case weights.


I have trained a model on my data using kknn in R and was able to predict on a new data set. However, I'd like to know what the actual final equation is so I can reproduce the prediction manually.

My training code is as follows:

train.kknn(mod1 ~ S + T + H + W, train, kmax = 25, kernel = c("triangular"))
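There is no single closed-form equation: the prediction is recomputed from the k nearest training rows for each new point. With a triangular kernel, the fitted value is a distance-weighted average of the neighbours' targets, which can be sketched by hand. This is an illustration of the idea only, not kknn's exact internals (kknn, for instance, also scales predictors before computing distances), and kknn_by_hand is a made-up name:

```r
# Triangular-kernel weighted k-NN regression for one test point z:
# d_i   = distance of z to each of its k nearest neighbours,
# d_max = distance to the (k+1)-th neighbour (used to normalise),
# w_i   = 1 - d_i / d_max, prediction = weighted mean of neighbour targets.
kknn_by_hand <- function(z, train_x, train_y, k) {
  dists <- apply(train_x, 1, function(row) sqrt(sum((row - z)^2)))
  ord   <- order(dists)
  d_k   <- dists[ord[1:k]]
  d_max <- dists[ord[k + 1]]
  w     <- pmax(1 - d_k / d_max, 0)       # triangular kernel weights
  sum(w * train_y[ord[1:k]]) / sum(w)
}

# Tiny 1-D example: targets equal to the predictor values.
kknn_by_hand(0, matrix(c(1, 2, 3, 10)), c(1, 2, 3, 10), k = 2)
```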


For reference, the examples from the kknn package documentation:

library(kknn)
## Not run:
data(miete)
(train.con <- train.kknn(nmqm ~ wfl + bjkat + zh, data = miete, kmax = 25,
    kernel = c("rectangular", "triangular", "epanechnikov", "gaussian", "rank", "optimal")))
plot(train.con)
(train.ord <- train.kknn(wflkat ~ nm + bjkat + zh, miete, kmax = 25,
    kernel = c("rectangular", "triangular", "epanechnikov", "gaussian", "rank", "optimal")))
plot(train.ord)
(train.nom <- train.kknn(zh ~ wfl + bjkat + nmqm, miete, kmax = 25,
    kernel = c("rectangular", "triangular", "epanechnikov", "gaussian", "rank", "optimal")))
plot(train.nom)
## End(Not run)
data(glass)
glass <- glass[,-1]
(fit.glass1 <- train.kknn(Type ~ ., glass, kmax = 15,
    kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))
(fit.glass2 <- train.kknn(Type ~ ., glass, kmax = 15,
    kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))
plot(fit.glass1)
plot(fit.glass2)
