linear regression

Discussion in 'Big Data and Analytics' started by Narayana Surya, Dec 21, 2018.

  1. Narayana Surya

    Narayana Surya Well-Known Member
    Alumni

    Joined:
    Feb 27, 2018
    Messages:
    59
    Likes Received:
    0
    Hi

    Can you please clarify the following doubts about linear regression?

    Thanks

    1) Why should the dependent variable follow a normal distribution in linear regression, and what should we do if it does not?
    2) Why should we normalize variables before building the model? (As I understand it, linear regression involves no distance calculations between points.)
    3) Should I drop variables that show autocorrelation?
    4) What should I do with a variable of string type? (Generally we drop it, but in some cases these values also affect the outcome, e.g. RACE: black for a house cost predictor.)
    5) How can we know whether a value is an outlier? (Generally we treat values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR) as outliers, but in some cases even the salary of a CEO appears in the salary data. Should we consider it an outlier then?)
    6) Can you please explain how we get the correlation values? (As humans we know that the cost of a house increases with area, bathrooms, etc. and decreases with crime rate, but how does the algorithm know the difference?)
    7) What does the keyword FIT generally do? (As I understand it, it produces the linear regression equation so that it can be used on the test data to predict the outcome.)
    8) Why do we call score(independent variables, dependent variables) on the model, and what does it represent?
    9) What is polynomial regression, and how can we know that the data is not suitable for linear regression?

    Please explain each of these in simple terms so that it is easy to understand.
     
    #1
  2. Vikas Kumar_18

    Vikas Kumar_18 Well-Known Member
    Simplilearn Support Alumni

    Joined:
    Dec 17, 2018
    Messages:
    174
    Likes Received:
    31
    Hi Narayan,

    Q1: Ans:
    For all machine learning algorithms, our prior assumption is that the data is normally distributed (a bell-shaped, Gaussian curve), because all the rules are built on this assumption. If the data is not normally distributed, we simply assume that it is.

    Q2: Ans:
    We normalize or standardize to bring all the independent features onto a single, uniform scale. For example, take two independent features, "age" and "salary": "age" usually ranges over (0-100), but salary ranges over (100000 - 1000000), so salary would get higher importance during model building. Standardize them so that age and salary are both on the same scale. The formula for normalizing data is easy to find with a Google search.
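
    A minimal sketch of two common rescaling formulas, z-score standardization and min-max normalization, using only the Python standard library. The function names `standardize` and `min_max` and the sample age and salary values are assumptions made for illustration, not something given in this thread.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample data on very different scales
ages = [25, 35, 45, 55]
salaries = [100000, 400000, 700000, 1000000]

age_z = standardize(ages)
salary_z = standardize(salaries)
age_01 = min_max(ages)
salary_01 = min_max(salaries)
```

    After either transform, both features cover a comparable range regardless of their original units.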

    Q3: Ans:
    It depends.

    Q4: Ans:
    You have to treat it as categorical data and check the ratio or strength of its categories. For example, if the string data has 4 types, such as "aa,ab,ac,ad", make it categorical and check their ratio. If dummification is required, do it before model building.

    Q5. Ans:
    Again, it depends. A domain expert should decide what counts as an outlier, because if we build and interpret the model without knowing the domain, the model won't predict well. For example, flight ticket prices change dynamically. A box-and-whisker plot will flag a few prices as outliers, but a domain expert would treat them as important data: a person who books a flight at the very last moment has to pay a very high price, which goes beyond the limit (Q3 + 1.5 IQR). So domain expertise is required.
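
    The box-and-whisker rule mentioned above can be sketched as follows. This is only an illustration: the function name `iqr_outliers` and the salary figures are assumptions, and the exact quartile values depend on the quantile method used (here the default of `statistics.quantiles`). The rule only flags candidates; whether to keep them is still the domain decision described above.

```python
from statistics import quantiles


def iqr_outliers(values):
    """Return the values that fall outside the 1.5 * IQR fences."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical salaries in thousands; the last one is the CEO
salaries = [30, 32, 35, 36, 38, 40, 41, 45, 500]
flagged = iqr_outliers(salaries)
```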

    Q6. Ans:
    Algorithms do what the packages and functions tell them to do. What you described in the question is what the system is told to compute from the data, and the result can be displayed as a heatmap through the corrplot function.
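
    As a rough illustration of where the sign of a correlation comes from, here is a hand-written Pearson correlation. The algorithm has no notion of "area" or "crime"; it only sees whether two columns tend to move together or in opposite directions. The function name `pearson` and the house data are assumptions made for this sketch.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations: positive when x and y rise together
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / sqrt(ss_x * ss_y)


# Hypothetical house data
area = [50, 80, 100, 120, 150]
crime_rate = [9, 7, 5, 4, 2]
price = [100, 160, 210, 230, 300]

r_area = pearson(area, price)         # positive: price rises with area
r_crime = pearson(crime_rate, price)  # negative: price falls as crime rises
```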

    Q7: Ans:

    FIT means fitting the line: finding the line that passes as close as possible to the data points, which is why it is called the fitted line.
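
    A minimal sketch of what fitting does for one independent variable, assuming ordinary least squares: it picks the slope and intercept that minimize the squared vertical distances to the points, and the resulting equation is then used to predict new values. The name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    slope = co_dev / ss_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data that lies exactly on y = 2x + 1
train_x = [1, 2, 3, 4]
train_y = [3, 5, 7, 9]

slope, intercept = fit_line(train_x, train_y)

# Use the fitted equation on an unseen value
prediction = slope * 5 + intercept
```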

    Q8: Ans:

    Score returns a value that works as a metric for comparing which model predicts better.
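
    In scikit-learn, for example, a regressor's `score(X, y)` returns the coefficient of determination R^2. The sketch below computes that quantity by hand, assuming this is the metric meant here; the function name `r_squared` and the numbers are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


actual = [3, 5, 7, 9]

perfect = r_squared(actual, [3, 5, 7, 9])   # predictions match exactly
baseline = r_squared(actual, [6, 6, 6, 6])  # always predicting the mean
```

    A model that predicts perfectly scores 1, and one that always predicts the mean scores 0, so a higher score indicates a better fit on that data.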

    Q9: Ans:
    I already explained this in the "Polynomial" thread; please go through that.


    It's very important to know that data science is an interdisciplinary domain, not a concrete one where I can say "this is the way to do it, follow this approach". It is based on trying things and comparing the results.

    I would request that you keep to 2 or 3 questions per thread, because more than that will only confuse you with the concepts.
     
    #2
    Last edited: Jan 3, 2019
  3. Narayana Surya

    Thanks for your detailed explanation.
    Q1: Ans:
    For all machine learning algorithms, our prior assumption is that the data is normally distributed (a bell-shaped, Gaussian curve), because all the rules are built on this assumption. If the data is not normally distributed, we simply assume that it is.

    Doubt:
    So, according to the above statement, do we also consider the data to be normally distributed for other models (decision tree, random forest, etc.)?

    Q4: Ans:
    You have to treat it as categorical data and check the ratio or strength of its categories. For example, if the string data has 4 types, such as "aa,ab,ac,ad", make it categorical and check their ratio. If dummification is required, do it before model building.


    I did not understand what dummification means. Can you please explain it?
     
    #3
  4. Vikas Kumar_18

    Q1. Ans: Yes, for all ML models our prior assumption is that the data should be normally distributed.

    Q4. Ans: If a column contains 4 types, "aa,ab,ac,ad", then dummification creates 4 separate columns, with 1 wherever a type is present and 0 wherever it is absent. I have enclosed a file to give you a basic understanding. Google it for a better understanding.
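
    The dummification described above can be sketched in plain Python. The function name `dummify` and the `is_` column prefix are assumptions made for illustration. In a linear regression with an intercept, one of the resulting columns is commonly dropped, because the full set always sums to 1.

```python
def dummify(values):
    """Turn a list of category labels into rows of 0/1 indicator columns."""
    categories = sorted(set(values))
    return [
        {f"is_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]


# Hypothetical column with the four types from the example
column = ["aa", "ab", "ac", "ad", "ab"]
rows = dummify(column)

for label, row in zip(column, rows):
    print(label, row)
```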
     

    #4
  5. Narayana Surya

    Q1. Ans: Yes, for all ML models our prior assumption is that the data should be normally distributed.

    But in real-world problems, most data is not normally distributed. What should we do in that case?
    Should we convert it into a normal form?
     
    #5
  6. Vikas Kumar_18

    It depends, because usually we want results based on our true data. If the data is transformed, then the model is built on the transformed data, and it won't predict correctly for test data that belongs to the same family (that is, data distributed the same way as the original training data).
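
    One option, sketched here under assumptions rather than as a recommendation from this thread: a log transform can reduce right skew, provided the same transform is applied to both training and test data and predictions are mapped back to the original scale with the inverse. The function name `skewness` and the sample values are hypothetical.

```python
from math import exp, log


def skewness(values):
    """Moment-based skewness: 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed target values (all positive, as log requires)
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

logged = [log(v) for v in raw]       # transform before model building
restored = [exp(v) for v in logged]  # inverse maps back to the original scale

skew_before = skewness(raw)
skew_after = skewness(logged)
```

    Whether a transform helps is still something to try and compare on held-out data, in line with the point above.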
     
    #6
