Programming Basics and Data Analytics with Python | Anand

Discussion in 'Big Data and Analytics' started by Nishant_Singh, Apr 25, 2020.

  1. Akshay Singh_8

    Joined:
    Mar 31, 2020
    Messages:
    4
    Likes Received:
    0
    df.dropna(inplace=True) is the correct syntax. It worked for me.
     
    #101
  2. Akshay Singh_8

    Joined:
    Mar 31, 2020
    Messages:
    4
    Likes Received:
    0
    Awesome, Thanks Sakthi
     
    #102
  3. Akshay Singh_8

    Joined:
    Mar 31, 2020
    Messages:
    4
    Likes Received:
    0
    Hi,

    Do you have a solution for this now?
    Regards,
    Akshay Singh
     
    #103
  4. Sedric Hibler

    Sedric Hibler Member

    Joined:
    Mar 30, 2020
    Messages:
    14
    Likes Received:
    5
    Hi - no, they never did. I got them to go down a point, haha.
    I tried using the median for all the NA rows that I could (some were still dropped if they had NA in multiple columns). One thing I didn't try towards the end was dropping some of the numerical columns from my dataset when building x_data for the model. I didn't do that because I simply didn't see any good correlations at all (all 10% or less). Since we got different numbers (i.e. your R2 score being double mine), I do wonder what we did differently.
     
    #104
  5. Gaurav Kilania

    Joined:
    Feb 29, 2020
    Messages:
    4
    Likes Received:
    0
    Hi sir,

    In the Size column, what should we do with the "Varies with device" values?
     
    #105
  6. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    Impute with the median of the column.
     
    #106
  7. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    Impute those rows with the median of the column. That's what I tried, and he also said so in yesterday's class.
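
    A minimal sketch of the median imputation suggested above. The toy values are made up, and it assumes the other Size entries are already plain numbers stored as text:

```python
import pandas as pd

# Toy data standing in for the project's Size column (values are made up).
df = pd.DataFrame({"Size": ["19000", "Varies with device", "8700", "25000"]})

# Text that is not a number (the placeholder) becomes NaN.
df["Size"] = pd.to_numeric(df["Size"], errors="coerce")

# Fill the gaps with the median of the rows that do have a size.
size_median = df["Size"].median()
df["Size"] = df["Size"].fillna(size_median)

print(df)
```

    Using the median rather than the mean keeps a few very large apps from pulling the filled-in value upwards.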
     
    #107
  8. Piyush Sharma_10

    Piyush Sharma_10 New Member

    Joined:
    Dec 28, 2019
    Messages:
    1
    Likes Received:
    0
    Hi Anand,

    I'm stuck at step 11 (linear regression) in the App Rating project, where Python throws an error.

    Code:
    lm1=lm.fit(X_train,y_train)

    Error:
    ValueError: could not convert string to float: 'TOOLS'

    My guess at the cause: when creating dummies for Category, Genres, and Content Rating, there are two columns for 'Tools' (one from Category and the other from Content Rating). I've also checked the data type: it is float, as I set dtype to float while creating the dummies. But I'm not able to get beyond this step. What am I missing?
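
    One way to narrow this kind of error down, sketched on a made-up frame: list the columns of X_train that are still non-numeric before calling fit. The column names and values below are illustrative assumptions, not the project data.

```python
import pandas as pd

# Toy frame: one numeric feature plus a text column that was never encoded.
X_train = pd.DataFrame({
    "Reviews": [120.0, 45.0, 980.0],
    "Category": ["TOOLS", "GAME", "TOOLS"],
})

# Any column that is not numeric will make LinearRegression.fit fail.
text_columns = X_train.select_dtypes(exclude="number").columns.tolist()
print(text_columns)

# Encoding those columns replaces them with 0/1 dummy columns.
X_encoded = pd.get_dummies(X_train, columns=text_columns, dtype=float)
print(X_encoded.dtypes)
```

    If the list is not empty, the original text column is still in the frame that goes into the model, even though dummy columns were created somewhere else.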
     
    #108
  9. Kalindi Dharamsey

    Joined:
    Nov 25, 2019
    Messages:
    14
    Likes Received:
    0
    Hello Anand Sir,

    Have you uploaded the KNN file (.ipynb)? I am unable to locate it on Google Drive. I see all the other files on the topics that you have covered. If it is uploaded, where will it be? If not, kindly upload it at the soonest. Will we be able to look at your shared files for this class/session, and for how long?

    Thanks,
    Kalindi
     
    #109
  10. MANOHAR TATKARE

    Joined:
    Apr 9, 2020
    Messages:
    5
    Likes Received:
    1
    Hello Sir

    Please note that the live class recording dated 24-05-2020 has no audio. Kindly upload a fresh recording at the earliest. Thanks.

    Regards

    Manohar
     
    #110
  11. Firoz Syed

    Firoz Syed Member

    Joined:
    Mar 9, 2020
    Messages:
    14
    Likes Received:
    0
    Hi Sakthi,

    Yeah, we get around 160-167 columns, but a lot of them have NaN values, so how should I proceed? Should I drop those NaNs or impute them with the respective medians? Also, I concatenated inp1 and inp2, so do I need to drop any columns?
     
    #111
  12. Firoz Syed

    Firoz Syed Member

    Joined:
    Mar 9, 2020
    Messages:
    14
    Likes Received:
    0
    Hi Anand,

    I am reviewing the last lecture (session 8), but there is no audio. Can you please check?
     
    #112
  13. Firoz Syed

    Firoz Syed Member

    Joined:
    Mar 9, 2020
    Messages:
    14
    Likes Received:
    0
    Hi Sakthi,

    When I tried the R2 score, the system threw a compute error. Did you import any package for this?

    Below is what I did:

    #from sklearn.metrics import r2_score
     
    #113
  14. Firoz Syed

    Firoz Syed Member

    Joined:
    Mar 9, 2020
    Messages:
    14
    Likes Received:
    0
    Hi Sakthi,

    My r2 scores are the same, and I feel the value is very low; I am not sure whether it is correct. Apart from this, I identified the dependent and independent variables below. What is the whole point of having dummy columns here? Also, can you give any input on how to identify the independent variables when there are so many?
    ### DEFINE THE INDEPENDENT VARIABLES
    ### y = mx + b
    X = inp1[['Reviews', 'Size','Price']]

    ### DEFINE THE DEPENDENT VARIABLE
    y = inp1['Rating']

    upload_2020-5-25_19-2-33.png

    upload_2020-5-25_19-2-58.png
     
    #114
  15. Joanna Quintero

    Joined:
    Dec 2, 2019
    Messages:
    9
    Likes Received:
    2
    Anand,

    There is no audio in the last class recording (Sunday 24th). Please help. Thank you.
     
    #115
  16. Joanna Quintero

    Joined:
    Dec 2, 2019
    Messages:
    9
    Likes Received:
    2
    Hi Anand,
    My r2_score(y_test,pred_y) is very low: 0.0518302688187805
    I know I have to keep working on the data by removing outliers and independent variables with very high and low correlation, to increase the accuracy.
    However, due to the lack of time, is it OK if I submit the project with this value?
     
    #116
  17. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    Hi,
    I believe the get_dummies function won't create any NaN values. Have you dropped all NaN values before calling get_dummies? I got 156 columns. I then reduced the columns: after creating a box plot, I merged the categories that had the same median into one category and ran get_dummies again from inp0, so I ended up with around 126 columns.
     
    #117
  18. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    Yes, I did the same package import too. Can you post the error you got? I will try to help.
     
    #118
  19. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    Hi,
    Make 'Rating' the dependent variable y and use all the other columns as the independent variables X. Selecting only 3 columns will not get you a good r2 score, and then there is no point in creating dummies for the categorical columns.

    from sklearn.model_selection import train_test_split
    df_train, df_test = train_test_split(inp3, train_size = 0.7, random_state = 100)

    y_train = df_train.pop("Rating")
    X_train = df_train

    y_test = df_test.pop("Rating")
    X_test = df_test

    from sklearn.linear_model import LinearRegression
    lr = LinearRegression()
    lr.fit(X_train, y_train)

    from sklearn.metrics import r2_score
    y_train_pred= lr.predict(X_train)
    r2_score(y_train, y_train_pred)

    y_test_pred= lr.predict(X_test)
    r2_score(y_test, y_test_pred)


    Try this code. You can find it in the solution document for the practice project given in the course itself.
     
    #119
  20. Sedric Hibler

    Sedric Hibler Member

    Joined:
    Mar 30, 2020
    Messages:
    14
    Likes Received:
    5
    I submitted with a similar score. I was told that it isn't about the score but more about whether the process you followed to get there is correct, so I think you will be OK.
    One thing I recall in the process is checking for correlated data along the way. The data simply was not correlated in this case, so I expect the final scores are supposed to be low. When you look at other projects like the bike project, you will see some high correlations there and different end results with the same process.
     
    #120
  21. Firoz Syed

    Firoz Syed Member

    Joined:
    Mar 9, 2020
    Messages:
    14
    Likes Received:
    0
    Hi Vijaya Kumar,

    Thanks a lot for your response. My bad, no NaNs got created. After the dummies were created, I dropped the original columns Category, Genres and Content_Rating, so in total I have around 139 columns. All the dummy columns contain zeros and ones.
     
    #121
  22. Firoz Syed

    Firoz Syed Member

    Joined:
    Mar 9, 2020
    Messages:
    14
    Likes Received:
    0

    Thanks Vijaya Kumar. So x should contain all the columns as independent variables, which means x = inp1.iloc[:,0:140] (139 columns) and y = inp1['Rating']. Can you please confirm this part? Calculating the R2 score has no issues now.
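
    A positional slice only works if 'Rating' happens to sit outside the sliced range. A sketch that selects the features by name instead, on a made-up frame (the columns here are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in for inp1 after the dummy columns were created.
inp1 = pd.DataFrame({
    "Reviews": [120, 45, 980],
    "Rating": [4.1, 3.8, 4.5],
    "Category_TOOLS": [1, 0, 1],
})

# Dependent variable.
y = inp1["Rating"]

# Independent variables: every column except the target, wherever it sits.
X = inp1.drop(columns=["Rating"])

print(X.columns.tolist())
```

    Dropping by name guarantees the target never leaks into the features, regardless of column order.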
     
    #122
  23. Firoz Syed

    Firoz Syed Member

    Joined:
    Mar 9, 2020
    Messages:
    14
    Likes Received:
    0
    One last question: we have a column called 'Type' which holds string values such as FREE, and this causes a problem when we try to fit the regression model. Is it safe to drop it, or do I need to do something else with it?
     
    #123
    Last edited: May 26, 2020
  24. _69197

    _69197 New Member

    Joined:
    Nov 8, 2019
    Messages:
    1
    Likes Received:
    0
    Hi, like others before me I am stuck on project no. 4.1. I have tried converting the string to int or float, but that results in all entries turning into 140933. This cannot be correct. Could someone please indicate how to impute / index / split the values in the column "Size"? I don't remember being shown how to do this with a loaded CSV file. Thanks.
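
    A hedged sketch of one way to convert such a column. It assumes the sizes are stored as text with an "M" or "k" suffix and treats 1 MB as 1000 kB; the helper name size_to_kb and the sample values are made up for illustration.

```python
import pandas as pd

def size_to_kb(value):
    """Convert a size string such as '19M' or '201k' to kilobytes."""
    text = str(value).strip()
    if text.endswith("M"):
        return float(text[:-1]) * 1000
    if text.endswith("k"):
        return float(text[:-1])
    # Anything else, e.g. "Varies with device", is treated as missing.
    return float("nan")

sizes = pd.Series(["19M", "201k", "Varies with device", "2.5M"])
size_kb = sizes.apply(size_to_kb)
print(size_kb)
```

    The missing entries can then be imputed or dropped in a separate step, once the whole column is numeric.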
     
    #124
  25. Firoz Syed

    Firoz Syed Member

    Joined:
    Mar 9, 2020
    Messages:
    14
    Likes Received:
    0
    In which format should the project be uploaded to the LMS? I tried converting the Jupyter notebook to PDF, but it doesn't seem to work. I missed the last class, so I'm not sure what was discussed about this.
     
    #125
  26. SAURABH MALVI

    SAURABH MALVI Member

    Joined:
    Feb 6, 2020
    Messages:
    3
    Likes Received:
    0
    Hi Nishant,

    On the last day of the batch, i.e. 24th May 2020, you announced that you would send all participants a mail with instructions on how to post feedback on social networking sites. Have you sent that mail? I am asking because I haven't seen any mail from you.

    Please confirm.

    Regards,

    Saurabh Malvi
     
    #126
    Last edited: May 27, 2020
  27. Firoz Syed

    Firoz Syed Member

    Joined:
    Mar 9, 2020
    Messages:
    14
    Likes Received:
    0
    Hi Saurabh,

    Have you submitted the project? Any idea about the deadline for the project?

    Thanks
    Firoz Syed
     
    #127
  28. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    Hi,
    You can either drop the Type column or get dummies for it. Both gave the same results, so I dropped that column.
     
    #128
  29. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    Yes you are right
     
    #129
  30. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    Hi,
    After get_dummies is applied to these columns, the original Category, Genres and Content_Rating columns are dropped from the table automatically, so there is no issue if you dropped them manually.
     
    #130
  31. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25

    Akshay, your approach is correct. pd.get_dummies() will create one column for every value of a categorical column. We saw this in our logistic regression example.
    If there is a column called gender with the values male and female, then after pd.get_dummies() you will see a separate column called male, where the value is 1 for every row that has male and 0 for every row that has female.

    Also, as seen in class, pd.get_dummies() has a parameter called drop_first, which can help in reducing the number of columns created.
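
    A small sketch of the behaviour described above, on a made-up gender column. The dtype=int argument is only there so the output shows 0/1 instead of True/False.

```python
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# One dummy column per value; the original gender column is replaced.
all_columns = pd.get_dummies(df, columns=["gender"], dtype=int)
print(all_columns.columns.tolist())

# drop_first=True leaves out the first category, here "female".
reduced = pd.get_dummies(df, columns=["gender"], drop_first=True, dtype=int)
print(reduced)
```

    With two categories, the single remaining column carries the same information, because a 0 in gender_male can only mean female.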
     
    #131
  32. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Sedric et al.,
    To improve the accuracy of the models, one or all of the following approaches can be taken:
    1. Data cleansing - remove missing values and any non-useful characters.
    2. Impute missing values with the mean or median if the column is numeric. Here, please see if you can do some localised imputation. Refer to the methodology we used for imputing age in our logistic regression class. This will increase your score.
    3. Build new columns. For example, I see the following opportunities to create new features (columns):
    - create a rating category column: high rating (>4 and <5), medium rating (>3 and <4) and low rating (>1 and <3)
    4. You can build new features as categories around size, installs and price.
    5. From the last updated date, you can get some perspectives, like how recently an app was updated. The more recently an app is updated, the more likely it is that the developers are listening to customers and updating their product at a faster rate. You can then use this to see how the rating of the app relates to it.

    6. Once the above is done, you will end up with many more columns. You can then apply correlation to find out which of the numeric columns are most closely related to the dependent variable.
    6.1 - For this, you will need to drop all the categorical columns which you converted to dummies. There is no point correlating them with your dependent variable.

    7. Check for multicollinearity, meaning which of your independent variables are highly correlated with each other. You can use the VIF that we learnt in our class.
    7.1 Identify the independent variables that have a high VIF.
    7.2 Of those, you can keep the ones that have a high correlation with the dependent variable in your ML model.
    8. Split the data into test and training sets.
    9. Run your data through different models. Even though we only tried one regression model, you all now have the code to try out other models.
    Just use the same code and see which of those regression models do well.

    I am sure your r2 score will improve after that.
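
    A rough sketch of the VIF check in step 7. This is a hand-rolled computation using VIF = 1 / (1 - R^2), not necessarily the function used in class; the helper name vif_table and the synthetic columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(features):
    """Regress each column on all the others and return 1 / (1 - R^2)."""
    scores = {}
    for column in features.columns:
        others = features.drop(columns=[column])
        target = features[column]
        r2 = LinearRegression().fit(others, target).score(others, target)
        scores[column] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(scores, name="VIF")

# Synthetic data: a_copy is almost identical to a, b is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
features = pd.DataFrame({
    "a": a,
    "b": b,
    "a_copy": a + rng.normal(scale=0.05, size=200),
})

vif = vif_table(features)
print(vif.sort_values(ascending=False))
```

    Columns that largely duplicate each other show a very large VIF, while an unrelated column stays close to 1.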
     
    #132
    Firoz Syed and Sedric Hibler like this.
  33. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Firoz, please check now. You should be able to get the audio.
     
    #133
  34. jeeban Patro

    jeeban Patro Member

    Joined:
    Feb 16, 2016
    Messages:
    3
    Likes Received:
    0
    Why does my Jupyter notebook get stuck whenever I call a seaborn function?
     
    #134
  35. Firoz Syed

    Firoz Syed Member

    Joined:
    Mar 9, 2020
    Messages:
    14
    Likes Received:
    0
    Thank you Vijaya Kumar.
     
    #135
  36. SAURABH MALVI

    SAURABH MALVI Member

    Joined:
    Feb 6, 2020
    Messages:
    3
    Likes Received:
    0
    Hi Firoz,

    I have submitted the project on Monday 25th May 2020. The Deadline was 25th May 2020 till end of the day.

    Regards,

    Saurabh Malvi
     
    #136
  37. neethdjay

    neethdjay New Member

    Joined:
    Jan 21, 2020
    Messages:
    1
    Likes Received:
    0
    Hi All,

    I have submitted the project.

    Could you please let me know how long it would take to be evaluated?

    Thanks,
    Neeth
     
    #137
  38. MANOHAR TATKARE

    Joined:
    Apr 9, 2020
    Messages:
    5
    Likes Received:
    1
    Sir, the recording still has no audio. Please check and upload it again. Meanwhile, my project has been approved. Thank you for your guidance.
     
    #138
    rishi_wmalhotra likes this.
  39. rishi_wmalhotra

    Joined:
    Dec 2, 2019
    Messages:
    7
    Likes Received:
    2
    Sir, there is still no audio in the last session.
     
    #139
  40. Firoz Syed

    Firoz Syed Member

    Joined:
    Mar 9, 2020
    Messages:
    14
    Likes Received:
    0

    Still the same Anand. Kindly help !
     
    #140
  41. rishi_wmalhotra

    Joined:
    Dec 2, 2019
    Messages:
    7
    Likes Received:
    2
    Congratulations Manohar
     
    #141
    MANOHAR TATKARE likes this.
  42. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25

    Piyush,
    Please check your code again. I ONLY see the value "TOOLS" under the Category column. The "Content Rating" column does not have any value called "TOOLS". So, unless there is some code error, you will not be getting 2 columns for TOOLS.
     
    #142
  43. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25

    Jeeban,
    Seaborn gets stuck only if you try to plot a graph with a lot of data points. Please share your code here, and I will see if I can help.
     
    #143
  44. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Firoz, yes, you are correct. I think we need to check back with Simplilearn on this. They should be able to reload it. Can you please raise a Simplilearn ticket?
     
    #144
  45. MANOHAR TATKARE

    Joined:
    Apr 9, 2020
    Messages:
    5
    Likes Received:
    1
    It is still the same despite raising ticket(s).
     
    #145
  46. Gaurav Kilania

    Joined:
    Feb 29, 2020
    Messages:
    4
    Likes Received:
    0
    Still, there is no audio in the last lecture's recording. Please look into it Anand Sir and Simplilearn Team
     
    #146
  47. jeeban Patro

    jeeban Patro Member

    Joined:
    Feb 16, 2016
    Messages:
    3
    Likes Received:
    0
    Hi Anand,

    Good afternoon!

    I have been stuck for a long time at the point where I have to omit all the outliers
    for the columns Price, Reviews and Installs.
    For Price, I found a number of outliers in the boxplot and am unable to work out which values I need to omit.
    My Price field summary is as follows:
    count 8886.000000
    mean 0.963526
    std 16.194792
    min 0.000000
    25% 0.000000
    50% 0.000000
    75% 0.000000
    max 400.000000
    Name: Price, dtype: float64

    After an operation like
    Price = playstore_app_analysis1.loc[(playstore_app_analysis1["Price"] >= 0) & (playstore_app_analysis1["Price"] < 200), ['Price']]
    Price.count()

    the boxplot I get still has outstanding outliers:
    upload_2020-6-2_18-30-17.png

    So please advise me on the appropriate step to take in order to proceed further.

    Thanks.
    Jeeban
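
    A note on the summary above: since the 75th percentile of Price is 0, the box in the boxplot collapses to 0 and every paid app is drawn as a point beyond the whiskers, whatever cutoff is applied. A possible way to apply a cutoff to whole rows, sketched on made-up data (the cutoff of 200 is taken from the code in the post; the frame and its values are assumptions):

```python
import pandas as pd

# Toy stand-in for the apps table (values are made up).
apps = pd.DataFrame({
    "App": ["a", "b", "c", "d", "e"],
    "Price": [0.0, 0.0, 2.99, 0.0, 399.99],
})

price_cutoff = 200  # threshold from the post, not a recommended value

# Keep whole rows, not only the Price column, and assign the result back.
apps = apps[apps["Price"] < price_cutoff].copy()

print(len(apps))
print(apps["Price"].describe())
```

    Filtering the full DataFrame keeps the other columns aligned with the remaining Price values for the later modelling steps.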
     
    #147
  48. MANOHAR TATKARE

    Joined:
    Apr 9, 2020
    Messages:
    5
    Likes Received:
    1
    It is still the same even after 10 days have passed. I hope we will get the recording with proper audio. Thanks.
     
    #148
  49. rishi_wmalhotra

    Joined:
    Dec 2, 2019
    Messages:
    7
    Likes Received:
    2
    Hi all,

    I raised a ticket too; however, it is an issue at the Webex end. It might take some more time to get the recording. I read through Anand Sir's Jupyter notebooks on Logistic Regression and Principal Component Analysis, appeared for the exam, and have cleared it. My project got approved too.
     
    #149
    MANOHAR TATKARE likes this.
  50. prateekjain6342

    prateekjain6342 New Member

    Joined:
    Jul 12, 2020
    Messages:
    1
    Likes Received:
    0
    Hi, I am getting the following error when I try to create a Linear Regression model:

    ValueError: array must not contain infs or NaNs

    My Code:

    from sklearn.model_selection import train_test_split
    df_train,df_test = train_test_split(inp1,test_size = 0.3)
    df_train.replace([np.inf, -np.inf], np.nan)
    df_test.replace([np.inf, -np.inf], np.nan)
    X_train = df_train.drop('Rating',axis = 1)
    y_train = df_train['Rating']
    X_test = df_test.drop('Rating',axis = 1)
    y_test = df_test['Rating']
    X_train.dropna()
    y_train.dropna()
    X_test.dropna()
    y_train.dropna()
    from sklearn.linear_model import LinearRegression
    lm = LinearRegression()
    lm.fit(X_train,y_train)

    I have tried removing all the NaN and inf values, but nothing works.
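
    One likely cause, judging only from the code as posted: replace() and dropna() return new objects, so calling them without assigning the result leaves the data unchanged. A minimal sketch on a made-up frame that cleans first and splits afterwards (the toy values are assumptions):

```python
import numpy as np
import pandas as pd

# Toy stand-in for inp1 with one inf and one NaN.
inp1 = pd.DataFrame({
    "Reviews": [120.0, np.inf, 980.0, 45.0],
    "Rating": [4.1, 3.8, np.nan, 4.4],
})

# Assign the result; without the assignment nothing changes.
cleaned = inp1.replace([np.inf, -np.inf], np.nan).dropna()

# Split into features and target only after cleaning, so rows stay aligned.
X = cleaned.drop("Rating", axis=1)
y = cleaned["Rating"]

print(cleaned)
```

    Dropping rows on X and y separately can also leave them with different lengths, which is why the cleaning is done on the combined frame here.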
     
    #150
