Programming Basics and Data Analytics with Python | Anand

Discussion in 'Big Data and Analytics' started by Nishant_Singh, Apr 25, 2020.

  1. Sedric Hibler

    Sedric Hibler Member

    Joined:
    Mar 30, 2020
    Messages:
    14
    Likes Received:
    5
    Thanks for confirming. I see the same boxplot for the Price field and wasn't sure why, but your summary also shows the 25%, 50% and 75% values as 0. It's probably a statistics question why those values are 0; I believe they are the reason the box appears as a flat line.
    Most of the values in the dataframe are 0 (free apps) or less than 1 dollar, so I also wonder whether the plot would look different if we shortened the x axis to less than 50. I will test that and get back to you in a few minutes.

    Update: I think I answered my own question, hah.
    A 75% value of 0 means that at least 75% of the prices are 0, so the box has no height in this case. The median line sits at 0, and the 25%, 50% and 75% values all fall on 0 as well, which is why the box collapses. The graph is therefore accurate for the data.
    Only about 800 rows in the original data file have a non-zero price, out of roughly 10,000 rows, so it makes sense that we can't really use this plot for much.
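A small sketch of the reasoning above, using made-up prices rather than the project data: when more than 75% of the values are 0, all three quartiles are 0, so the interquartile range (the height of the box) is 0 and only the paid apps show up as outlier points.

```python
import pandas as pd

# Made-up prices for illustration only: 9 free apps and 1 paid app.
price = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 4.99], name="Price")

# The three quartiles that define the box of a boxplot.
quartiles = price.quantile([0.25, 0.50, 0.75])
print(quartiles)

# Interquartile range: the height of the box.
iqr = quartiles[0.75] - quartiles[0.25]
print("IQR:", iqr)

# Share of rows with a non-zero price.
paid_share = (price > 0).mean()
print("Paid share:", paid_share)
```

On the real data, `df['Price'].describe()` should show the same pattern if most apps are free.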
     
    #51
    Last edited: May 20, 2020
    Sakthi Vijaya Kumar likes this.
  2. Siva Shankar Biswal

    Joined:
    Mar 23, 2020
    Messages:
    3
    Likes Received:
    0
    Hi Anand,

    I am facing an issue while dropping a row by its index number: the row still shows in the table. Please help. I have attached a screenshot of the code. Please let me know if the code or the function is wrong.
     

    Attached Files:

    #52
  3. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    Please assign the result back to the DataFrame when dropping a row,
    like df = df.dropna()
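To illustrate the point, here is a minimal sketch on a toy DataFrame (the column names and values are invented). By default `dropna` and `drop` return a new DataFrame, so the result has to be assigned back, or `inplace=True` used, for the change to persist.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"App": ["A", "B", "C"], "Rating": [4.1, None, 3.9]})

df.dropna()                 # returns a new frame; df itself is unchanged
rows_without_assign = len(df)

df = df.dropna()            # assign the result back to keep the change
rows_with_assign = len(df)

print(rows_without_assign, rows_with_assign)
```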
     
    #53
  4. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    Yes, same thought. After dropping all NaN and other rows, there are only 642 paid rows, so the graph is correct for the data.
    Thanks for sharing your thoughts.
     
    #54
  5. Joanna Quintero

    Joined:
    Dec 2, 2019
    Messages:
    9
    Likes Received:
    2
    Thank you so much for your help. I'm still unable to do it. I have checked the recorded classes and Anand's files, but I can't get it.
    I wish Anand could help, but it seems he is not answering questions this week either. That is unfortunate, since the project is due this coming Sunday.
     
    #55
    Sedric Hibler likes this.
  6. Joanna Quintero

    Joined:
    Dec 2, 2019
    Messages:
    9
    Likes Received:
    2
    I'm getting a similar boxplot for Price
    upload_2020-5-20_18-14-25.png
     
    #56
    Sakthi Vijaya Kumar likes this.
  7. Sedric Hibler

    Sedric Hibler Member

    Joined:
    Mar 30, 2020
    Messages:
    14
    Likes Received:
    5
    I am hopeful he provides some insight here before Saturday's class as well. The 5th class is where he goes through functions and loops, by the way. It seems that both options can work for that question.
    upload_2020-5-20_20-16-35.png

    I am not exactly sure how much I am allowed to help through the forums as a student, but if you post what is happening or where your code is going wrong, we may be able to offer tips on how to fix it. But definitely take a look at our 5th class first to see if it helps.

    Also, I think your price boxplot is probably correct. That's what we talked about a little bit today: it accurately represents what we have in the data for that column. I'm not sure whether Anand has other thoughts or could confirm our thinking there, though that would be great.
     
    #57
  8. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    Hi, have you finished your model building? What is your R2 score after building a linear regression model?
    Here is mine. I am not sure whether the value is correct, so I am just cross-checking with you.

    upload_2020-5-21_19-57-40.png
     
    #58
    Sedric Hibler likes this.
  9. Siva Shankar Biswal

    Joined:
    Mar 23, 2020
    Messages:
    3
    Likes Received:
    0
    Hi, it's not working; it's dropping the whole table. Could you please help me with the code to drop the rows that have more than 2 million reviews?
     
    #59
  10. Shishir Dwarkanath

    Joined:
    Dec 9, 2019
    Messages:
    5
    Likes Received:
    1
    Hi Sakthi,

    Which 'Column' did you use for 'Y'?

    Thanks
    Shishir
     
    #60
  11. Sedric Hibler

    Sedric Hibler Member

    Joined:
    Mar 30, 2020
    Messages:
    14
    Likes Received:
    5
    Nice! I got to this point, but I do not feel confident that I did it correctly. Your score looks quite a bit better than what I came out with. I feel like I am missing something in the last couple of steps. Give me a sec to review the requirements and I'll update you again with my final result.
    upload_2020-5-21_12-6-4.png
     
    #61
  12. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    inp0 = inp0[inp0.Reviews < 2000000]; my data frame is named inp0 and this code works for me.
    For this, you should convert the 'Reviews' column to int/float before running the code.
    Check that and reply.
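A minimal sketch of those two steps on a toy frame (the app names and review counts are invented; only `inp0` and `Reviews` come from the post). It assumes the Reviews column was read in as strings, in which case the comparison with a number does not work until the column is converted.

```python
import pandas as pd

# Toy stand-in for the project data; Reviews is stored as strings here.
inp0 = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Reviews": ["1500", "78158306", "3000000"],
})

# Convert the strings to numbers first.
inp0["Reviews"] = pd.to_numeric(inp0["Reviews"])

# Keep only the rows with fewer than 2 million reviews.
inp0 = inp0[inp0["Reviews"] < 2000000]
print(inp0)
```

The boolean mask selects the rows to keep, so nothing needs to be passed to `drop` at all.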

    Thanks
    Vijay
     
    #62
  13. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    I selected 'Rating' for Y
     
    #63
  14. Sedric Hibler

    Sedric Hibler Member

    Joined:
    Mar 30, 2020
    Messages:
    14
    Likes Received:
    5
    I may go back and review what I dropped along the way. My R2 score is still very low. Out of curiosity, did you encode the Type column as a category or just drop it? The instructions didn't say to do either, but I noticed you can't use it with the fit command since it isn't numerical.

    upload_2020-5-21_14-31-30.png

    I think my main question is - do you see that any of these columns correlate to ratings? because I do not.
     
    #64
    Last edited: May 21, 2020
  15. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    Regarding the Type column, I tried both encoding it and dropping it; both approaches give me the same results.
    I just checked the correlation of Price, Reviews, and Size against Rating (as the dependent variable). The correlation values are around 0.33 for all three. I checked them individually and didn't create any heat maps. So I don't think these columns will affect the model much, and I also didn't drop any of them. I am waiting for Saturday's class; once Anand has explained model building, I will try dropping those columns and check the R2 scores.
    According to the problem statement given, this is the score I got. If we are free to improve the model, I will try dropping some columns and check the score.
     
    #65
    Sedric Hibler likes this.
  16. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    Hi,
    After some fine-tuning, I got this result. The train and test R2 scores are the same, but I am still not confident whether the R2 score is correct.
    upload_2020-5-22_20-6-6.png
     
    #66
  17. Joanna Quintero

    Joined:
    Dec 2, 2019
    Messages:
    9
    Likes Received:
    2
    Hi Anand,
    Can we have some time in Saturday's class to answer questions about the assessment, since they haven't been answered through the community forum?
    It would be appreciated.
    Thank you so much!
     
    #67
  18. rishi_wmalhotra

    Joined:
    Dec 2, 2019
    Messages:
    7
    Likes Received:
    2
    Hi Anand Sir,

    My Jupyter Notebook in the Simplilearn Lab is not able to read the file, and every time it gives the error "File Not Found"; however, my downloaded Anaconda Jupyter notebook is able to read it. Therefore, as I am already delayed, I have started my project in my Anaconda Jupyter notebook and will share it with you on Google Drive or by any other means that you suggest.

    Hope that's ok?
     
    #68
  19. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Sanjeev,
    Please follow the installation procedures. Do let me know what you did in exact steps, so that i can understand your problem.
    "Unable to run Jupyter" is very generic for me.
     
    #69
  20. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Hi,
    As the error clearly indicates, it's an "indentation error". Meaning:
    1. You intended the if statement to sit inside the for loop.
    2. But your if statement starts on the next line at exactly the same column as the for loop.
    3. To resolve this, please indent the statements as below.
    4. All you need to do is follow the color coding of keywords that Jupyter automatically displays for you.


    mylist = [1000,2000,3000,3999,4999,6000]
    for num in mylist:
        # everything that belongs to the loop is indented one level under the for
        if num % 2 == 0:
            print('num is even',num)
        elif num % 2 != 0:
            print('num is odd',num)
        else:
            # never reached: an integer is either even or odd
            print('this is the last statement')
     
    #70
  21. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    please go to self-learning. scroll to the bottom. you will see, ebooks
     
    #71
  22. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    #72
  23. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    All the recordings are there. Kindly check again
     
    #73
  24. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Has been shared. Pls check again
     
    #74
  25. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    No Problem. Please feel free to post your questions here. i will try to answer as soon as i can
     
    #75
  26. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    This has nothing to do with packages. It's the code structure, which needs indentation. Python is an indentation-based language.
    All you need to do is be sensitive to that. You will get the hang of it once you practice more.

    Do not compare it with R.
     
    #76
  27. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Hi ,
    As mentioned clearly in the project requirements, you can apply np.log1p to the columns Reviews and Installs. But before that, please do the below:
    1. Check for missing values and remove or impute them.
    2. Check for any string values (I see one in row 10472) and change or remove them.
    3. Perform df['Reviews'] = pd.to_numeric(df['Reviews']) -- this will convert Reviews from object to numeric.
    4. Then apply np.log1p on the Reviews column.
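The four steps above could be sketched roughly as follows. The data is a toy stand-in with invented values, and `errors="coerce"` is an assumption added here (it is not part of the post): it turns unparseable strings into NaN so that they can be dropped together with the missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the project data; "3.0M" plays the role of a stray string.
df = pd.DataFrame({"Reviews": ["159", "967", "3.0M", None, "87510"]})

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
df["Reviews"] = pd.to_numeric(df["Reviews"], errors="coerce")

# Remove the rows that are missing or could not be converted.
df = df.dropna(subset=["Reviews"])

# log1p(x) = log(1 + x), so a value of 0 maps to 0 instead of -inf.
df["Reviews"] = np.log1p(df["Reviews"])
print(df)
```

Imputing instead of dropping would replace the `dropna` line with a `fillna` call; which one is appropriate depends on the project requirements.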

    please check this link for further references
     
    #77
  28. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Hi,
    as mentioned in the pandas class, before you decide to drop anything, try to find out if anything can be imputed.


    - For example, you can convert all values to numbers in one unit such as KB or MB. If a row has a value of 6000 bytes, it becomes 6 KB or 0.006 MB after conversion (taking 1 KB = 1000 bytes).

    - Then, where you see "Varies with device", you can choose to drop them or impute them with mean or median
     
    #78
  29. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Sakthi
    You are correct. Sorry to have made you work harder by googling. As you can understand, class notes are there to help you with an approach or with ideas. For example, you first need to decide whether to drop or impute the column; if you decide to impute, you then apply the logic that you had written. The class notes definitely cover that :). If every answer were already in the class notes, you would lose the fun of doing the project.
    Data science is a field of seeking. The more you seek, the more you learn.
     
    #79
  30. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
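    R2 is the goodness-of-fit score, loosely the "accuracy" of a regression model: the share of variance in the target that the model explains.
    What is important is RMSE. We will cover it in the class today.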
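For reference, a minimal sketch of how the two metrics are computed, using made-up actual and predicted ratings (none of these numbers come from the project):

```python
import numpy as np

# Made-up actual and predicted ratings, for illustration only.
y_true = np.array([4.1, 3.9, 4.7, 4.5, 4.3])
y_pred = np.array([4.2, 4.0, 4.4, 4.4, 4.3])

residuals = y_true - y_pred

# RMSE: typical size of the prediction error, in the same units as the target.
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("RMSE:", round(rmse, 3))
print("R2:", round(r2, 3))
```

scikit-learn provides `r2_score` and `mean_squared_error` in `sklearn.metrics` for the same quantities; the manual version is shown only to make the formulas visible.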
     
    #80
  31. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Hi, as discussed in class, the syntax for dropping a row is
    df.drop(rowindex, inplace=True)
    so, in our case, it would be df.drop(2454, inplace=True).

    - As mentioned earlier, this deletes the row with index label 2454. The remaining labels are not renumbered, but every later row moves up one position, so position 2454 (which is what iloc uses) is now occupied by the next row.

    - To understand this, please do the below:
    1. df.iloc[2454] - you should see the values as App KBA-EZ Health Guide
    2. df.drop(2454, inplace=False) - this returns a copy without the row; df itself is unchanged
    3. df.drop(2454, inplace=True) - this permanently drops the row from the dataframe
    4. df.iloc[2454] - you will now see the next row at this position, with values as "App Foothillsvet"

    The above means the row that you wanted to delete has been deleted, and position 2454 is now held by the next row.
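The same experiment can be reproduced on a toy frame (app names invented, default 0..4 index assumed). It shows that `drop` removes the index label, while `iloc` looks rows up by position:

```python
import pandas as pd

# Toy frame with the default 0..4 index; app names are invented.
df = pd.DataFrame({"App": ["A", "B", "C", "D", "E"]})

before = df.iloc[2]["App"]        # row at position 2

df.drop(2, inplace=True)          # removes the row whose index label is 2

labels = list(df.index)           # label 2 is gone; other labels are unchanged
after = df.iloc[2]["App"]         # position 2 now holds the next row
label_exists = 2 in df.index      # df.loc[2] would raise a KeyError now

print(before, labels, after, label_exists)

# Optional: renumber the labels so that label and position match again.
df = df.reset_index(drop=True)
```

Whether to call `reset_index` afterwards is a design choice; later steps that refer to rows by their original label would need the old index kept.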
     
    #81
  32. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Joanna,
    I have to answer as much as possible here. Since most of you are starting to look at the project in the last week, there is a high volume of questions.
    I am open to answering questions in the community forum after 11 pm India time.
    I would like to complete at least linear and logistic regression today.
     
    #82
  33. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Sedric,
    Please look at the notes on box plots in the pandas and visualization notebook. I also discussed in class what the upper, lower and middle quartiles are and what their purpose is. I understand that you are all now focusing on the project, but I would be obliged if you could look at the class notes and past recordings as well.
     
    #83
  34. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    As mentioned, imputing with the mean or median is not the perfect solution for every imputation problem. That was why I showed a custom way of imputing. In spite of that, there may be scenarios where you can still see outliers.
    In such circumstances, you can drop the rows.

    Since this is a basic course, advanced techniques like balancing the data are not covered. You can search for techniques like SMOTE (oversampling to balance classes) and MICE (multiple imputation of missing values). I would suggest doing so only after you get the hang of solving a problem at the basic level. If you consider yourself at a mature level, please explore the techniques I suggested.
     
    #84
  35. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    We will discuss this in detail in today's class. All your project questions will be answered indirectly through our class today.
    Great that, you guys are exploring by yourself. Really happy about it
     
    #85
  36. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Yes, the R2 score is low. This means there is a need for more data transformation.
    That was one of the reasons why the requirement itself mentions doing a log transformation. We will try another scaling method today with another dataset. This might help you solve the problem.
     
    #86
  37. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Joanna,
    I have answered this question, with options.
     
    #87
  38. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Hi Jeeban,
    The following code is working for me

    mydict = {"a": 102, "b": 222, "c": 322, "d": 422}
    listdata = list(mydict.items())
    print(listdata)

    output
    [('a', 102), ('b', 222), ('c', 322), ('d', 422)]
     
    #88
  39. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Joanna/Others,
    Class notes are there to give you an approach, not the exact solution for the project.
    I understand I did not answer the questions for the past 2 weeks, and I apologize for it. That said, I have been constantly asking all of you to take a look at the project right from the first day. I am not here to give excuses, but I ask you to understand that we need to work together here.


    That said, the problem you are trying to solve needs logic to be built. As mentioned in my previous responses, and as mentioned in the classes,

    1. The column Size is a string. You cannot directly convert an alphanumeric string to a number.
    2. You will first need to make the string completely numeric.
    3. You can create a function or simply use a for loop.
    3.1. For this, please apply logic to remove the alphabets from the string, as below.
    3.1.1. Check whether a string is alphanumeric with isalnum() (taught in the class).
    3.1.2. If it is alphanumeric, strip the alphabet from the string using string manipulation (also taught in the class).
    3.1.2.1. Wherever there is a value "Varies with device", move those rows to a separate dataframe and either impute them with the mean or median, or
    3.1.2.2. remove those rows completely from the original dataframe.
    3.1.3. Convert the string to numeric (we have seen quite a few built-in functions in the class, please explore them).
    3.1.4. This will correct values such as 3.0M to 3.0.
    3.1.5. Then convert the values to one common unit, like MB or KB or bytes.
    You will now have a numeric column that you can work with.
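One possible shape for this logic, as a rough sketch and not the project solution. The helper name `size_to_mb`, the sample values, the `M`/`k` suffix formats, and the 1 MB = 1000 KB convention are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def size_to_mb(value):
    """Convert strings such as '19M' or '201k' to megabytes; return NaN otherwise."""
    text = str(value).strip()
    suffix = text[-1:].upper()        # last character, e.g. "M" or "K"
    number = text[:-1]                # everything before the suffix
    try:
        if suffix == "M":
            return float(number)
        if suffix == "K":
            return float(number) / 1000   # assumes 1 MB = 1000 KB
    except ValueError:
        pass
    return np.nan                     # e.g. "Varies with device"

# Toy stand-in for the Size column.
df = pd.DataFrame({"Size": ["19M", "3.0M", "201k", "Varies with device"]})

df["Size_MB"] = df["Size"].apply(size_to_mb)

# One of the two options from the steps: impute the unknown sizes with the median.
df["Size_MB"] = df["Size_MB"].fillna(df["Size_MB"].median())
print(df)
```

The sketch checks the last character rather than calling `isalnum()`, because a value like "3.0M" contains a dot and `isalnum()` returns False for it. Dropping the unknown rows instead of imputing would replace the `fillna` line with `dropna(subset=["Size_MB"])`.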
     
    #89
  40. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    I have answered this before and will answer it again now. The Price boxplot has outliers and may have outliers even after transformation.
    For handling any imputations with Price for now, you can impute with the median and not the mean.

    Another way is to scale all the columns to bring them onto a common scale. This is something we will try to cover in Sunday's class, but for this course it might be slightly off topic.
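A short sketch of why the median is the safer choice here, using made-up prices with one extreme value (not the project data): the mean is pulled toward the outlier, the median is not.

```python
import pandas as pd

# Made-up prices with one extreme value and one missing value.
price = pd.Series([0.0, 0.0, 0.99, 1.99, 2.99, 399.99, None])

mean_price = price.mean()        # pulled upward by the extreme value
median_price = price.median()    # barely affected by it
print("mean:", mean_price, "median:", median_price)

# Fill the missing price with the median rather than the mean.
price_filled = price.fillna(median_price)
```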
     
    #90
  41. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    201
    Likes Received:
    25
    Hi all, since most of you are starting to look at the project now, please be prepared to wait until after 11 pm India time today to get your questions answered.
    I will be covering linear regression, logistic regression and clustering today. After that, I will answer your questions on the project.

    Please be mindful that I will not share the code, but will provide guidelines only.
     
    #91
    Sedric Hibler likes this.
  42. Sedric Hibler

    Sedric Hibler Member

    Joined:
    Mar 30, 2020
    Messages:
    14
    Likes Received:
    5
    Thanks - this makes sense, and I will update my answer to use the median instead of mean (since the mean can be skewed in this case).
     
    #92
    Last edited: May 23, 2020
  43. Firoz Syed

    Firoz Syed Member

    Joined:
    Mar 9, 2020
    Messages:
    14
    Likes Received:
    0
    Hi Anand ,

    Any hint for the below would be appreciated!

    "Check out the records with very high price" -------> How do I decide what counts as a very high price? I am asking because we need to drop those records, and without a threshold that is not possible. I know the max price.

    Box Plot i got is like this :

    upload_2020-5-23_10-51-25.png
     
    #93
    Last edited: May 23, 2020
  44. Shishir Dwarkanath

    Joined:
    Dec 9, 2019
    Messages:
    5
    Likes Received:
    1
    I used 'Installs'. This is because we had been asked to apply quantiles to this column.
     
    #94
  45. Surendra Kumar Sarki

    Joined:
    Jan 10, 2020
    Messages:
    6
    Likes Received:
    0

    I am also stuck at the same point!
     
    #95
  46. Sedric Hibler

    Sedric Hibler Member

    Joined:
    Mar 30, 2020
    Messages:
    14
    Likes Received:
    5
    I am not sure that using Installs makes sense here. I say that because Y is your dependent variable (i.e. what you are trying to predict). In the problem statement we are trying to predict the rating and not installs, so to me Y would be Rating, and Installs could possibly be an X since it is independent.
     
    #96
  47. Akshay Singh_8

    Joined:
    Mar 31, 2020
    Messages:
    4
    Likes Received:
    0
    Hi Anand,

    I am working on the Python project and am stuck at point 8.3: Get dummy columns for Category, Genres, and Content Rating. This needs to be done as the models do not understand categorical data, and all data should be numeric. Dummy encoding is one way to convert character fields to numeric. Name of dataframe should be inp2

    I have used get_dummies, but this adds many different columns to the dataframe with values 0 and 1. Am I missing something, or is it correct?

    Regards,
    Akshay Singh
     
    #97
  48. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    What is your R2 score when using Installs as Y?

    'Installs' is not the dependent variable (the target) for the given problem statement, so using 'Installs' as Y will give you a wrong model, even if you get high R2 values.
     
    #98
    Last edited: May 24, 2020
  49. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6
    Has your R2 score increased?
     
    #99
  50. Sakthi Vijaya Kumar

    Sakthi Vijaya Kumar Active Member

    Joined:
    Apr 27, 2020
    Messages:
    26
    Likes Received:
    6

    Hi,
    After applying dummy variables to the categorical columns, you will get roughly 150-156 columns. That is not an issue; you are doing fine, so go ahead. Make sure you apply dummy variables only to the categorical columns, i.e. 'Category', 'Content Rating' and 'Genres'.
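A minimal sketch of what `get_dummies` does, on a toy frame with invented values (only the column names and the `inp2` name come from the thread; `inp1` is a hypothetical name for the input frame). Each distinct value of a listed column becomes its own 0/1 column, which is why the column count grows so much on the real data.

```python
import pandas as pd

# Toy stand-in for the project data; values are invented.
inp1 = pd.DataFrame({
    "Rating": [4.1, 3.9, 4.7],
    "Category": ["GAME", "TOOLS", "GAME"],
    "Content Rating": ["Everyone", "Teen", "Everyone"],
})

# One column is created per distinct value of each listed column.
# dtype=int gives 0/1 values rather than True/False.
inp2 = pd.get_dummies(inp1, columns=["Category", "Content Rating"], dtype=int)
print(inp2.columns.tolist())
print(inp2)
```

Numeric columns such as Rating pass through unchanged because they are not listed in `columns`.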


    Regards
    Sakthi Vijaya Kumar
     
    #100
