# What is the best method to choose the 'k' value in the KNN algorithm?

Discussion in 'Masters Program - Customers only' started by Sahil bali, Mar 27, 2017.

1. ### Sahil bali New Member

As we know, the value of k is generally chosen as the square root of the number of observations in the dataset. But what if other values of k give better model accuracy?
For example:
- The dataset has 1000 observations in total.
- The data is split into train -> 900 and test -> 100.

k --> accuracy percentage
30 --> 73%
32 --> 64%
5 --> 64%
6 --> 66%
7 --> 67%
8 --> 69%
9 --> 69%
10 --> 70%
11 --> 73%
12 --> 75%
13 --> 72%
14 --> 71%
15 --> 73%
16 --> 73%
17 --> 74%
18 --> 72%
19 --> 73%
20 --> 72%

In the experiments above with different values of k, the maximum accuracy, 75%, occurs at k = 12. If we take the square root of 1000, we get about 32, and with k = 32 the accuracy drops to 64%.
So how should we choose the value of k, and why?
Also, in the case of the k-means clustering algorithm, should we use the same trick to find k, or would the approach be somewhat different?
TIA
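A sweep like the one described above could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code behind the numbers in the table: the synthetic two-class data from `make_data`, the helper names, and the list `candidate_ks` are all assumptions made for the example.

```python
import math
import random
from collections import Counter

def make_data(n, rng):
    """Hypothetical synthetic data: two overlapping 2-D Gaussian classes."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 0.0 if label == 0 else 1.5
        data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data

def knn_predict(train, point, k):
    """Majority vote among the k training points nearest to `point`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test points whose predicted label matches the true one."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

rng = random.Random(0)
data = make_data(1000, rng)
train, test = data[:900], data[900:]   # same 900/100 split as in the question

# Odd values avoid ties in a two-class vote (an assumption of this sketch).
candidate_ks = [5, 7, 9, 11, 13, 15, 21, 31]
scores = {k: accuracy(train, test, k) for k in candidate_ks}
best_k = max(scores, key=scores.get)

for k in candidate_ks:
    print(f"k={k:2d}  accuracy={scores[k]:.2%}")
print("best k on this split:", best_k)
```

With only 100 test points, differences of a few percentage points between neighbouring values of k can easily be noise, and a k picked on the test set tends to look better than it really is. Scoring each k on a separate validation split, or with cross-validation, is the usual way around that.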

2. ### Ambika_2 Well-Known Member Simplilearn Support

Hi Sahil,

For choosing k in k-means clustering, you can apply the elbow method:

First, compute the sum of squared errors (SSE) for a range of values of k. The SSE is defined as the sum of the squared distances between each member of a cluster and its centroid. Mathematically:

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in c_i} \mathrm{dist}(x, c_i)^2$$

If you plot k against the SSE, you will see that the error decreases as k gets larger: with more clusters, each cluster is smaller, so the distortion is smaller too. The idea of the elbow method is to choose the k at which the decrease in SSE slows abruptly, which produces an "elbow" in the graph.
Sometimes there is more than one elbow, or no elbow at all. In those situations you usually end up picking the best k by evaluating how well k-means performs in the context of the particular clustering problem you are trying to solve.
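As a rough illustration of the elbow method, the sketch below computes the SSE for k = 1 to 6 on synthetic data and picks the k with the sharpest bend. Everything here is an assumption made for the example: the three hypothetical blobs, the deterministic farthest-point seeding in `farthest_first` (used only to keep the sketch reproducible; libraries usually use random or k-means++ seeding), and the second-difference rule used to locate the elbow, which is just one simple heuristic.

```python
import math
import random

def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, max_iters=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = farthest_first(points, k)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]          # keep an empty cluster's centroid
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:                  # converged
            break
        centroids = updated
    return centroids

def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

# Hypothetical data: three well-separated 2-D blobs of 50 points each.
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [
    (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
    for cx, cy in blob_centers
    for _ in range(50)
]

sse_by_k = {k: sse(points, kmeans(points, k)) for k in range(1, 7)}

# Second difference: how much the SSE drop shrinks after each k.
bend = {
    k: (sse_by_k[k - 1] - sse_by_k[k]) - (sse_by_k[k] - sse_by_k[k + 1])
    for k in range(2, 6)
}
elbow_k = max(bend, key=bend.get)

for k, value in sse_by_k.items():
    print(f"k={k}  SSE={value:.1f}")
print("elbow at k =", elbow_k)
```

On data this cleanly separated the bend should land on the true number of blobs. On real data the curve is often smoother, which is the "no elbow at all" case mentioned above.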