Machine Learning | June 23 - July 21

Discussion in 'Big Data and Analytics' started by Priyanka_Mehta, Jun 22, 2018.

  1. Priyanka_Mehta

    Priyanka_Mehta Well-Known Member
    Simplilearn Support

    Joined:
    May 25, 2017
    Messages:
    726
    Likes Received:
    50
    Hello All,

    Greetings from Simplilearn!!

    Let us use this thread to discuss the course, explore it, and resolve all our queries related to it.

    Happy Learning!!

    Regards,
    Priyanka
    GTA - Simplilearn
     
    #1
  2. _28259

    _28259 Member

    Joined:
    Apr 3, 2018
    Messages:
    12
    Likes Received:
    1
    Hi Aayushi,

    While working with the bf_train and test datasets, I went through your code and tried to replicate it with my own understanding.

    This is your code:
    # encoding categorical variables
    for var in categorical_columns:
        lb = LabelEncoder()
        full_var_data = pd.concat((train[var], test[var]), axis=0).astype('str')
        lb.fit(full_var_data)
        train[var] = lb.transform(train[var].astype('str'))
        test[var] = lb.transform(test[var].astype('str'))

    Now I am trying to understand the purpose of the extra pieces like axis=0 and the astype method, so I thought of doing it without them:

    My code:

    from sklearn.preprocessing import LabelEncoder
    for var in cat_col:
        lb = LabelEncoder()
        full_var = pd.concat((train[var], test[var]))
        train[var] = lb.fit_transform(train[var])
        test[var] = lb.fit_transform(test[var])

    As you can see, I have used fit and transform together, which is okay, but the real question here is that I have not used the astype method and it still works fine. I wonder how! The reason I did not use astype('str') is that the machine understands binary format and not the string value in the column. (Please correct me if I am wrong.)

    Basically, it converts my categorical column of object type (string) to numerical data directly (even the 'Age' column is converted to a numerical value), so there is no need to replace values for data points like +4 and +55, etc.

    Output :

    upload_2018-7-2_15-25-51.png

    So my question here is: can it be done this way as well?

    Please correct me if I am wrong.

    Many Thanks
    Pritam
     

    Attached Files:

    #2
  3. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5
    Hi Pritam,

    1) For pd.concat, axis=0 is the default value, so both ways are correct.

    2) Regarding the astype() function, you can avoid using it as long as it is not raising any ValueError/TypeError.

    Hope it helps!

    Thank you.
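    One point worth adding for future readers: the reason the original snippet concatenates train and test before calling fit is to build one shared mapping, so the same category cannot end up with different integer codes in the two sets and an unseen test category cannot break transform. A toy sketch of the difference, using hypothetical column values:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # Hypothetical categorical values, just to illustrate the point.
    train_city = pd.Series(['A', 'B', 'C'])
    test_city = pd.Series(['B', 'C'])

    # Separate fits: 'B' becomes 1 in train but 0 in test (inconsistent codes).
    print(LabelEncoder().fit_transform(train_city))  # [0 1 2]
    print(LabelEncoder().fit_transform(test_city))   # [0 1]

    # Combined fit (as in the original code): one mapping shared by both sets.
    lb = LabelEncoder()
    lb.fit(pd.concat((train_city, test_city)).astype('str'))
    print(lb.transform(train_city.astype('str')))    # [0 1 2]
    print(lb.transform(test_city.astype('str')))     # [1 2]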
     
    #3
    _28259 likes this.
  4. _28259

    _28259 Member

    Joined:
    Apr 3, 2018
    Messages:
    12
    Likes Received:
    1
    Hi Aayushi,

    Could you kindly share the dimensionality reduction code file as well? It is missing from the materials I downloaded from the LMS.

    Thanks
    Pritam
     
    #4
  5. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5

    Thanks for the reminder. Shared now. Please check.
     
    #5
  6. _28259

    _28259 Member

    Joined:
    Apr 3, 2018
    Messages:
    12
    Likes Received:
    1
    Hi Aayushi,

    While working on the BigMart dataset, I am facing an issue while standardizing the "Item_Visibility" column.

    My code :

    upload_2018-7-8_13-18-24.png

    Error :

    upload_2018-7-8_13-19-39.png

    upload_2018-7-8_13-19-7.png

    Some troubleshooting I tried:

    1. First, we can see that the column here is a one-D array, which is causing the problem.
    upload_2018-7-8_13-21-4.png
    2. Then I tried to reshape this column:
    upload_2018-7-8_13-24-44.png

    But it is still not working! I guess I am doing something wrong.

    Please help.

    Many Thanks
    Pritam
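    For reference, scikit-learn transformers expect 2-D input, so the usual way around this kind of error is to pass the column as a one-column DataFrame (double brackets) or to reshape the underlying array to (n_samples, 1). A minimal sketch with a hypothetical stand-in frame (not necessarily how the issue was resolved in this thread):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Hypothetical stand-in for the BigMart training frame.
    train = pd.DataFrame({'Item_Visibility': [0.016, 0.019, 0.017, 0.000, 0.054]})

    scaler = StandardScaler()

    # Option 1: double brackets keep a one-column DataFrame, which is already 2-D.
    train[['Item_Visibility']] = scaler.fit_transform(train[['Item_Visibility']])

    # Option 2 (equivalent alternative): reshape the 1-D array to (n_samples, 1)
    # and flatten the scaled result back to 1-D before assigning it.
    # values_2d = train['Item_Visibility'].values.reshape(-1, 1)
    # train['Item_Visibility'] = scaler.fit_transform(values_2d).ravel()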
     
    #6
  7. Prabhakar M R(2685)

    Joined:
    Jun 7, 2014
    Messages:
    3
    Likes Received:
    0
    Hi Aayushi,
    I am facing the following error during the final step of prediction using LR. Please help, as I am not able to work out what the actual issue is.

    Thanks,
    Prabhakar


    upload_2018-7-8_23-4-50.png
     
    #7
  8. _32200

    _32200 Member
    Alumni

    Joined:
    Jun 13, 2018
    Messages:
    4
    Likes Received:
    0
    How do I decide, based on the % of missing values, whether to impute them or remove the column?
     
    #8
  9. Neha_155

    Neha_155 Member

    Joined:
    May 21, 2018
    Messages:
    7
    Likes Received:
    0
    Hi Aayushi,

    You have shared the following code to remove outliers from the dataframe:

    # delete the observations

    Q1 = train['Item_Visibility'].quantile(0.25)
    Q3 = train['Item_Visibility'].quantile(0.75)
    IQR = Q3 - Q1
    filt_train = train.query('(@Q1 - 1.5 * @IQR) <= Item_Visibility <= (@Q3 + 1.5 * @IQR)')


    This works if we have separate training and testing data. In case we have only one file, after the split the data is converted into an array.

    So, how do we remove outliers from the array using this?
    Or should we use this code before the split?

    Please suggest.

    Thanks,
    Neha
     
    #9
  10. _28259

    _28259 Member

    Joined:
    Apr 3, 2018
    Messages:
    12
    Likes Received:
    1
    I cracked it! Thanks for your support.

    Pritam
     
    #10
  11. gaint461(1538146)

    Joined:
    May 28, 2014
    Messages:
    10
    Likes Received:
    0
    Hi All,

    This is Kiran; I am new to this thread.

    Feel free to write to me at:

    Kirankumarp@microland.net
    or
    Personal email
    gaint461@gmail.com


    Hi Aayushi,

    I have started the project ,

    but would like to know how to proceed further.

    Questions about the project:

    1) How do I separate the data into a train dataset and a test dataset? If I am right, should we use the 80-20 split rule?
    2) I have reached the 2nd point of the problem statement, which is handling the missing values. Please find the PDF file attached. Please let me know whether I am on the right track.
    3) I need help with the 3rd point of the problem statement.

    I know I am going too slowly; I am picking up the pace a little, and I need help.

    Regards
    Kiran
     

    Attached Files:

    #11
  12. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5
    Hi Prabhakar,

    This error means your input to the model contains NULL values.
    Please cross-check your model input values.
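    A quick way to cross-check is to count the missing values per feature right before fitting; a minimal sketch with hypothetical inputs (replace X and y with your own feature matrix and target):

    import numpy as np
    import pandas as pd

    # Hypothetical model inputs, for illustration only.
    X = pd.DataFrame({'total_bedrooms': [2.0, np.nan, 4.0],
                      'median_income': [3.5, 2.1, 8.0]})
    y = pd.Series([150000.0, 220000.0, 310000.0])

    print(X.isnull().sum())           # nulls per feature column
    print(np.isinf(X.values).any())   # infinities can sneak in after log/ratio transforms
    print(y.isnull().any())           # the target should be complete as well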
     
    #12
  13. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5
    Hi Kiran,

    1) Yes, you need to split it using the 80-20 rule.
    2) I have seen your attached doc; so far it is good. The only point I fail to understand is why you have reset the index to the ocean proximity feature.
    3) For the 3rd point, you need to perform label encoding to convert the categorical column values to numerical values.
    Hope it helps!
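    A minimal sketch of those two steps, assuming the project file is housing.csv with the usual ocean_proximity column and median_house_value target (adjust the names to your own file):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import train_test_split

    housingdata = pd.read_csv('housing.csv')

    # Point 3: label-encode the categorical column into numerical codes.
    le = LabelEncoder()
    housingdata['ocean_proximity'] = le.fit_transform(housingdata['ocean_proximity'].astype('str'))

    # Point 1: 80-20 split into train and test sets.
    X = housingdata.drop('median_house_value', axis=1)
    y = housingdata['median_house_value']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)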
     
    #13
  14. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5
    We remove the column when we have more than 90% of the values as null.
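    As a rough sketch of that rule of thumb (the 90% cut-off above, with a hypothetical DataFrame):

    import numpy as np
    import pandas as pd

    # Hypothetical frame: 'mostly_empty' is entirely null, 'total_bedrooms' has a few gaps.
    df = pd.DataFrame({'mostly_empty': [np.nan] * 10,
                       'total_bedrooms': [2, 3, np.nan, 4, 5, np.nan, 3, 2, 4, 5]})

    null_ratio = df.isnull().mean()                            # fraction of nulls per column
    df = df.drop(columns=null_ratio[null_ratio > 0.90].index)  # drop columns with >90% nulls
    df = df.fillna(df.median(numeric_only=True))               # impute the rest (median is one option)
    print(df)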
     
    #14
  15. _32200

    _32200 Member
    Alumni

    Joined:
    Jun 13, 2018
    Messages:
    4
    Likes Received:
    0
  16. gaint461(1538146)

    Joined:
    May 28, 2014
    Messages:
    10
    Likes Received:
    0

    OK Aayushi,

    I will look into label encoding and try to convert the categorical column values to numerical values.

    As for resetting the index to ocean proximity, the thought was to keep the data visibly clean, sorted by region, and to look at it that way rather than with the normal indexing. If I am wrong, do let me know.

    Regards
    Kiran
     
    #16
  17. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5
    Hi Neha,

    Yes, we should always perform such preprocessing steps before splitting.
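    A minimal sketch that reuses the IQR filter from the earlier post and only splits afterwards (assuming the single file has been read into a DataFrame named train with an Item_Outlet_Sales target, as in the BigMart example; adjust the names to your data):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    train = pd.read_csv('train.csv')  # hypothetical path to the single data file

    # 1) Remove outliers on the full DataFrame first.
    Q1 = train['Item_Visibility'].quantile(0.25)
    Q3 = train['Item_Visibility'].quantile(0.75)
    IQR = Q3 - Q1
    filt_train = train.query('(@Q1 - 1.5 * @IQR) <= Item_Visibility <= (@Q3 + 1.5 * @IQR)')

    # 2) Only then split, so both resulting arrays come from the cleaned data.
    X = filt_train.drop('Item_Outlet_Sales', axis=1)
    y = filt_train['Item_Outlet_Sales']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)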
     
    #17
  18. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5
    Hi Kiran,

    Looking at and exploring the data from different perspectives is okay. Just make sure you don't forget to use "ocean proximity" as a feature.
     
    #18
  19. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5
    Great !!!
     
    #19
  20. Prabhakar M R(2685)

    Joined:
    Jun 7, 2014
    Messages:
    3
    Likes Received:
    0
    Hi Aayushi, I checked and did not find any null values in any of the features. I tried immediately to edit the post and add this information, but couldn't, as it asked me to log in to add a comment and wouldn't let me edit it.

    Any other reason which could cause this?

    Regards,
    Prabhakar
     

    Attached Files:

    #20
    Last edited: Jul 10, 2018
  21. Neha_155

    Neha_155 Member

    Joined:
    May 21, 2018
    Messages:
    7
    Likes Received:
    0
    Thanks, Aayushi.
     
    #21
  22. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5

    Hi Prabhakar,

    I hope you have imputed the data before splitting, and have used the fillna() function for imputation.
     
    #22
  23. _28259

    _28259 Member

    Joined:
    Apr 3, 2018
    Messages:
    12
    Likes Received:
    1
    Hi Aayushi ,

    I am not satisfied with the RMSE values for test and train in my DT algorithm.

    upload_2018-7-11_19-35-41.png

    Also, my RF algorithm is not working as it should:

    upload_2018-7-11_19-36-21.png

    Please suggest some reference or solution for hyperparameter tuning, or maybe something more I should apply to my dataset, like scaling, normalizing, or transformation.

    The way I see it, I can only apply label encoding to my Ocean Proximity column.

    Thanks
    Pritam
     
    #23
  24. Gaurav Verma_8

    Joined:
    May 31, 2018
    Messages:
    3
    Likes Received:
    0
    Hi Aayushi - In the housing project, is it fine to use MinMaxScaler instead of StandardScaler? Also, even though I am getting an R-squared value of 0.863, the quality of predictions is not good. The LR, DT, and RF RMSE values are all in the 60k to 80k range, but all test values are less than the train values.
     
    #24
    Last edited: Jul 12, 2018
  25. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5
    Hi Pritam,

    Good to see your efforts. Keep it up.
    Regarding hyper-parameter tuning, in the coming class I am going to demonstrate Grid Search, which is going to be of great help.
    For now, in order to improve the results of Random Forest, I suggest tuning n_estimators with values of 100, 200, and 500 (20 seems very low) and correspondingly changing the value of max_depth.
    Generally, with proper tuning, Random Forest performs better than DT.

    Hope it helps!!
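    As a preview, a minimal sketch of how Grid Search can cover those values (assuming X_train and y_train from the earlier split; the grid below is only a starting point):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        'n_estimators': [100, 200, 500],   # the values suggested above; 20 is usually too few
        'max_depth': [5, 10, 20, None],    # vary the depth alongside the number of trees
    }

    grid = GridSearchCV(RandomForestRegressor(random_state=42),
                        param_grid,
                        scoring='neg_mean_squared_error',
                        cv=5)
    grid.fit(X_train, y_train)

    print(grid.best_params_)
    print(np.sqrt(-grid.best_score_))      # cross-validated RMSE of the best combination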
     
    #25
  26. _28259

    _28259 Member

    Joined:
    Apr 3, 2018
    Messages:
    12
    Likes Received:
    1
    Thank you. Tuning n_estimators helped to some extent, lowering my test RMSE in both DT and RF, but my headache remains because the test RMSE is still high compared with the train RMSE in both (RF & DT).

    upload_2018-7-12_21-11-55.png


    Looking forward to your upcoming classes to see some other techniques.

    Thanks
    Pritam
     
    #26
  27. Prabhakar M R(2685)

    Joined:
    Jun 7, 2014
    Messages:
    3
    Likes Received:
    0
    Hi Aayushi,

    Yes, I have imputed the data with the fillna() function: housingdata["total_bedrooms"] = housingdata["total_bedrooms"].fillna(housingdata["total_bedrooms"].mean()).

    All this has worked. I am facing the issue only at the last step.

    Regards,
    Prabhakar
     
    #27
  28. JUPUDI AVINASH

    JUPUDI AVINASH New Member

    Joined:
    Jun 19, 2018
    Messages:
    1
    Likes Received:
    0
    Hi aayushi,

    In the California housing project, I am getting really odd values for the intercept and coefficients. The values are very large. I am attaching a PDF; please have a look. Because of this, the R2 value is very off. There is definitely an issue. Can you help me with this?

    Thanks
    Jupudi Avinash
     

    Attached Files:

    #28
  29. gaint461(1538146)

    Joined:
    May 28, 2014
    Messages:
    10
    Likes Received:
    0
    Hi Aayushi,

    I am starting from scratch again, building the environment and updating it with all the packages, then trying to understand the basics and learn.

    With regard to the project, I will work on it and try to complete it at the earliest.

    One doubt I had was about splitting the dataset: with the housing.csv file, I thought I needed to separate the dataset by the 80%-20% rule manually, whereas I missed the point that this can be done programmatically. Maybe scrolling back to the previous lessons would have shown this.

    I know everyone is one step ahead of me, and I need practice, and that's what I am going to do.

    Practice and learn.

    If there is anything I need to ask, I will surely let you know.

    Regards
    Kiran Kumar P
     
    #29
  30. Neha_155

    Neha_155 Member

    Joined:
    May 21, 2018
    Messages:
    7
    Likes Received:
    0
    Hi Aayushi,

    What is the allowed % difference between the RMSE values of the test and train sets?

    As for the project, I am getting a "Test" RMSE greater than the "Training" RMSE in DT and RF.

    Please suggest something, as I have tried lots of combinations but failed to get the "Test" RMSE comparable to "Training".

    Thanks,
    Neha
     
    #30
  31. gaint461(1538146)

    Joined:
    May 28, 2014
    Messages:
    10
    Likes Received:
    0
    Ha ha ha... I just got the Label encoder part.

    Understood that with health.csv in the training class.

    I know I have a long way to go, but I am not giving up... grrr... I will make it.
     
    #31
  32. Neha_155

    Neha_155 Member

    Joined:
    May 21, 2018
    Messages:
    7
    Likes Received:
    0
    Hi Aayushi,

    After using GridSearch for Random Forest and Decision Tree, I am getting a difference of "0.10 - 0.14" between the "Test" and "Training" scores.

    Is this okay, or do I have to tune it further? How much difference in the "Score" value can be ignored?

    Thanks,
    Neha
     
    #32
  33. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5

    Hi Neha,

    That would be absolutely fine.
    We generally accept up to 5% of variation.
     
    #33
  34. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5
    Hi Kiran,

    Keep it up !!!
     
    #34
  35. Neha_155

    Neha_155 Member

    Joined:
    May 21, 2018
    Messages:
    7
    Likes Received:
    0
    But this becomes 10-15% of variation.
    So, do I need to tune it to bring it down to 5%?
     
    #35
  36. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5
    Hi Neha,

    Try to optimize your results as much as you can before the last date of project submission. This practice will enhance your learning capabilities.
    Then you can submit.

    Thank you.
     
    #36
  37. _28259

    _28259 Member

    Joined:
    Apr 3, 2018
    Messages:
    12
    Likes Received:
    1
    Hi Aayushi,

    Kindly explain the field called "Writeup" on the Project Submission page: what exactly should I upload there? Snap below:

    upload_2018-7-17_17-1-17.png

    As for the others, I believe the screenshot of my code and the source code should be attached.

    Please suggest.

    Thanks
    Pritam
     

    Attached Files:

    #37
  38. Gaurav Verma_8

    Joined:
    May 31, 2018
    Messages:
    3
    Likes Received:
    0
    Hi Aayushi
    In each of LR, DT, RF and LR (median_income), do we have to create a scatter plot with
    plt.scatter(predicted, actual)
    plt.title("Title of the plot")
    plt.show()

    Again, my RMSE values are very high, in the range of 60,000 to 80,000, though my test values are less than train.
    I hope this is fine... because there are people whose values are very low, in single digits.

    Thanks
     
    #38
  39. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5
    Yes Gaurav.
    Don't worry about the RMSE values.
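    For the plots, labeling the axes makes the comparison easier to read; a minimal sketch along the same lines (hypothetical values; use the predictions and targets from each model):

    import matplotlib.pyplot as plt

    # Hypothetical predicted and actual values for one model.
    predicted = [210000, 180000, 320000]
    actual = [200000, 195000, 300000]

    plt.scatter(predicted, actual, alpha=0.5)
    plt.xlabel("Predicted median house value")
    plt.ylabel("Actual median house value")
    plt.title("Linear Regression: predicted vs actual")
    plt.show()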
     
    #39
  40. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5
    Hi Pritam,

    In the writeup, you just have to write a quick summary of your code: what features you have built, what algorithms you have used, etc.
     
    #40
  41. gaint461(1538146)

    Joined:
    May 28, 2014
    Messages:
    10
    Likes Received:
    0
    Hi Aayushi,

    I think I have almost completed the project code.

    Please help me understand whether I have done it properly or have missed anything.

    If everything is alright, then I will do the write-up and the explanation of how I did the project, and then submit them.

    If I am wrong in any way, then I will redo it and get back to you.

    Attached a copy of the html file:
    California Housing Project.zip

    Regards
    Kiran
     

    Attached Files:

    #41
  42. gaint461(1538146)

    Joined:
    May 28, 2014
    Messages:
    10
    Likes Received:
    0
    Yes, the accuracy is at 62%, so should I do some tuning, or change the model?
     
    #42
  43. gaint461(1538146)

    Joined:
    May 28, 2014
    Messages:
    10
    Likes Received:
    0
    I had super fun and the exploration was amazing. I am still learning, and with some guidance I can achieve more. Thanks Aayushi and Simplilearn for this. Even if I don't get anything, I will still not regret it, as I had fun. :)
     
    #43
  44. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5
    Hi Kiran,

    So far it is good. Please build the decision tree and random forest models on the data as given in the project document steps.
    Thanks
     
    #44
  45. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5
    Glad to hear.
    Thank you.
    Keep learning :)
     
    #45
  46. gaint461(1538146)

    Joined:
    May 28, 2014
    Messages:
    10
    Likes Received:
    0
    Hi Aayushi,

    I have completed the project and also a few write-ups based on my understanding.

    Please let me know whether I can upload the project and complete it for my certification.

    Regards

    Kiran Kumar P
     
    #46
  47. Aayushi_6

    Aayushi_6 Well-Known Member

    Joined:
    Sep 19, 2016
    Messages:
    65
    Likes Received:
    5
    Hi Kiran,

    Yes, you can upload it for submission.
    Thank you.
     
    #47
  48. gaint461(1538146)

    Joined:
    May 28, 2014
    Messages:
    10
    Likes Received:
    0
  49. Vanka Anand Rao

    Vanka Anand Rao New Member

    Joined:
    May 17, 2018
    Messages:
    1
    Likes Received:
    0
    Hi,
    I am getting the below error while uploading the project document.
     

    Attached Files:

    #49
  50. Neha_155

    Neha_155 Member

    Joined:
    May 21, 2018
    Messages:
    7
    Likes Received:
    0
    Hi Priyanka,

    On project submission page, on trying to add "Writeup" file, getting error:
    Unable to Connect to tcp://secure.simplicdn.net.s3.amazonaws.com:80. Error #110: Connection timed out

    Please look into this issue so that I can submit my project.

    Thanks,
    Neha
     
    #50
