What is the best method to choose the value of 'k' in the KNN algorithm?

Discussion in 'Masters Program - Customers only' started by Sahil bali, Mar 27, 2017.

  1. Sahil bali

    Sahil bali New Member

    As we know, the value of k is generally chosen as the square root of the number of observations in the dataset. But what if other values of k give the model better accuracy?
    For example:
    - Total observations in the dataset: 1000.
    - Splitting the data into train -> 900 and test -> 100

    k --> accuracy percentage
    30 --> 73%
    32 --> 64%
    5 --> 64%
    6 --> 66%
    7 --> 67%
    8 --> 69%
    9 --> 69%
    10 --> 70%
    11 --> 73%
    12 --> 75%
    13 --> 72%
    14 --> 71%
    15 --> 73%
    16 --> 73%
    17 --> 74%
    18 --> 72%
    19 --> 73%
    20 --> 72%

    In the experiments above with different k values, the maximum accuracy, 75%, occurs at k = 12. The square root of 1000 is about 32, and at k = 32 the accuracy drops to 64%.
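The experiment described above can be sketched roughly as follows. This is only an illustration: the data is synthetic (two overlapping Gaussian blobs), `make_data` and `sweep_k` are hypothetical helper names, and the accuracies it prints will not match the table in this post. It uses the same 900/100 split and the same k values, plus the square-root rule of thumb for comparison.

```python
import math
import random
from collections import Counter


def make_data(n, seed=0):
    """Synthetic two-class data (an assumption, not the poster's dataset)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        center = 0.0 if label == 0 else 1.5
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    rng.shuffle(data)
    return data


def sweep_k(train, test, ks):
    """Return {k: test accuracy} for a plain majority-vote KNN classifier."""
    correct = {k: 0 for k in ks}
    for x, y in test:
        # Rank the training points by Euclidean distance once per test point,
        # then reuse the ranking for every k.
        ranked = sorted(train, key=lambda p: math.dist(p[0], x))
        for k in ks:
            votes = Counter(label for _, label in ranked[:k])
            # On a tied vote, most_common returns one of the tied labels.
            if votes.most_common(1)[0][0] == y:
                correct[k] += 1
    return {k: correct[k] / len(test) for k in ks}


data = make_data(1000)
train, test = data[:900], data[900:]

ks = list(range(5, 21)) + [30, 32]
results = sweep_k(train, test, ks)

rule_of_thumb_k = round(math.sqrt(len(data)))  # square root of 1000, rounded
best_k = max(results, key=results.get)

for k in ks:
    print(f"k={k:2d}  accuracy={results[k]:.0%}")
print("square-root rule suggests k =", rule_of_thumb_k)
print("best k on this particular split =", best_k)
```

Keep in mind that the "best" k found this way depends on the particular 100 test observations, so differences of a few percentage points between neighbouring k values may simply be noise from the split.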
    So how should the k value be chosen, and why?
    Also, in the case of the k-means clustering algorithm, should we use the same trick to find the k value, or would the approach be somewhat different?
    TIA
     
  2. Ambika_2

    Ambika_2 Well-Known Member
    Simplilearn Support

    Hi Sahil,

    For k-means clustering, you can apply the Elbow method:

    First of all, compute the sum of squared errors (SSE) for a range of values of k. The SSE is defined as the sum of the squared distances between each member of a cluster and its centroid. Mathematically:

    $$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in c_i} \operatorname{dist}(x, c_i)^2$$

    If you plot k against the SSE, you will see that the error decreases as k gets larger; this is because as the number of clusters increases, each cluster becomes smaller, so the distortion is also smaller. The idea of the elbow method is to choose the k at which the SSE stops dropping abruptly and begins to level off. This produces an "elbow effect" in the graph.
    Sometimes there is more than one elbow, or no elbow at all. In those situations you usually end up choosing the best k by evaluating how well k-means performs in the context of the particular clustering problem you are trying to solve.
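As a rough illustration of the SSE computation described above, here is a minimal sketch using only the Python standard library. The data is synthetic (three well-separated blobs), `make_blobs` and `kmeans_sse` are hypothetical helper names, and the k-means loop is a plain single-run version with random initial centroids, so treat it as an assumption-laden sketch rather than a reference implementation.

```python
import math
import random


def make_blobs(centers, per_cluster, spread, seed=0):
    """Synthetic 2-D points scattered around the given centers (illustrative)."""
    rng = random.Random(seed)
    return [
        (rng.gauss(cx, spread), rng.gauss(cy, spread))
        for cx, cy in centers
        for _ in range(per_cluster)
    ]


def kmeans_sse(points, k, iters=50, seed=0):
    """Run one k-means fit and return the SSE of the final clustering."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct data points as initial centroids
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # SSE: squared distance from every point to the centroid it belongs to.
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)


points = make_blobs([(0, 0), (5, 5), (10, 0)], per_cluster=50, spread=0.7)
sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 9)}

for k, sse in sse_by_k.items():
    print(f"k={k}  SSE={sse:.1f}")
```

Plotting `sse_by_k` (k on the x-axis, SSE on the y-axis) gives the curve in which you look for the elbow. Because each fit starts from random centroids, a single run per k can land in a poor local minimum, so the curve may not be perfectly smooth.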
    Please see the link below for reference:

    http://stackoverflow.com/questions/...in-r-determine-the-optimal-number-of-clusters

    Regards,

    Ambika
    GTA-Simplilearn
     