Data Science with Python | Samridhi

Discussion in 'Big Data and Analytics' started by Nishant_Singh, Jun 3, 2019.

  1. Nishant_Singh

    Nishant_Singh Well-Known Member
    Simplilearn Support

    Joined:
    Aug 1, 2018
    Messages:
    222
    Likes Received:
    30
    #1
  2. _56897

    _56897 New Member

    Joined:
    Jan 22, 2019
    Messages:
    1
    Likes Received:
    0
    The Recordings section of the Data Science with Python course lists the recordings from Session 1 to Session 12. However, Session 1 and Session 2 have the same name for their recording files, and so do Session 8 and Session 9. Please check and upload the appropriate recordings. Thank you.
     
    #2
  3. Aniruddha Gaikwad

    Aniruddha Gaikwad New Member

    Joined:
    May 29, 2019
    Messages:
    1
    Likes Received:
    0
    Hi,
    I was just going through all the files and I came across something that I have to ask.
    I will post the screenshots.
    Where did the first line go? "But soft what light through yonder window breaks" ... Was this line omitted?
    Screenshot (202).png Screenshot (204).png
     
    #3
  4. madhanpradeep

    madhanpradeep Member
    Alumni

    Joined:
    Jul 1, 2015
    Messages:
    2
    Likes Received:
    0
    1. In the Assessment section under Project 1, there is a download link for the data set at the end. I could download one zip folder from there. I can also see a README.MD file, but there is nothing in it. What is it for?

    2. After installing Anaconda (Jupyter), do we need to install Python? How does Jupyter differ from Python, other than showing the output?
     
    #4
  5. Abel C Dixon

    Abel C Dixon Member

    Joined:
    Jun 4, 2019
    Messages:
    3
    Likes Received:
    0
    Python needs to be installed if it is not already on your machine.
    Python is the base framework, so you do need to install it.
    If you are using a Mac, Python 2 or 3 may be pre-installed; otherwise you need to install the latest version, Python 3.7.
    Link: https://www.python.org/downloads/release/python-373/
     
    #5
  6. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20
    There is no need to separately install Python after installing Anaconda, as Python comes bundled with Anaconda. Anaconda also comes with a large number of packages pre-installed, which will not be the case if you separately install Jupyter Notebook or any other IDE.
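
One way to confirm which Python a notebook is actually using is to ask the interpreter itself. This is a minimal sketch using only the standard library; the exact path and version printed depend on the machine, so no particular output is assumed.

```python
import sys

# Path of the interpreter running this code. In an Anaconda install it
# typically sits inside the Anaconda (or conda environment) folder.
print(sys.executable)

# Full version string of that interpreter.
print(sys.version)

# True when the interpreter is Python 3.
is_python3 = sys.version_info.major == 3
print(is_python3)
```

Running this in a Jupyter cell and in a plain terminal shows whether both are using the same installation.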
     
    #6
  7. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20
    Hi,

    Please re-execute the code; it gives the correct output.

    Regards,
    Samridhi
     
    #7
    Aniruddha Gaikwad likes this.
  8. madhanpradeep

    madhanpradeep Member
    Alumni

    Joined:
    Jul 1, 2015
    Messages:
    2
    Likes Received:
    0
    Hi ,
    In the slide provided in the Python e-book, on page 162, under Negative Indices, -1 is shown as the second-to-last value. I'm confused. If I am right, the index starts at 0 from the beginning and at -1 from the end. Please comment on this. See the attached image with markings.
     
    #8
  9. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20
    The last index is denoted by -1 as discussed in the class.

    Anyway, thanks for pointing out the error on Pg 162. Will request our support team to review this.
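
The indexing rule discussed here can be checked directly. This is a minimal sketch with a made-up list named `letters`; it is not the example from page 162.

```python
letters = ["a", "b", "c", "d", "e"]

first = letters[0]         # forward indexing starts at 0
last = letters[-1]         # -1 is the last element
second_last = letters[-2]  # -2 is the one before the last

# For a sequence of length n, index -k refers to the same element as n - k.
same_element = letters[-2] == letters[len(letters) - 2]

print(first, last, second_last, same_element)  # a e d True
```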

    Regards,
    Samridhi
     
    #9
  10. Ashwini Bharat Kakade

    Joined:
    May 31, 2019
    Messages:
    4
    Likes Received:
    0
    Hi,
    I have already installed Anaconda on my PC. When I use Jupyter Notebook, it does not show any output and also shows an error, as in the screenshot (Capture1.PNG). Please tell me what the error is.
     
    #10
  11. Ashwini Bharat Kakade

    Joined:
    May 31, 2019
    Messages:
    4
    Likes Received:
    0
  12. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20
    Hi, are you on Windows or Mac? Did you install Anaconda using the default settings? You can try another IDE such as JupyterLab (its interface is similar to Jupyter Notebook) and check whether it works. If nothing works, try uninstalling Anaconda and re-installing it without changing the default settings. While installing, please make sure the 32-bit / 64-bit choice matches your OS.

    Also, until your Anaconda issue is resolved, please use the Simplilearn Lab for practice.

    Regards,
    Samridhi
     
    #12
  13. Jyoti Saxena

    Jyoti Saxena Member

    Joined:
    May 20, 2019
    Messages:
    2
    Likes Received:
    0
    Hi Samridhi,

    When I define the Person class the same way you explained, I still get this error message:
    NameError: name 'Person' is not defined.

    Could you please help me fix this issue?

    Also, please let us know what the next step is once the class is defined.


    upload_2019-7-5_1-4-51.png
     
    #13
  14. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20
    Hi Jyoti,
    as explained in the class!
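
For readers without the class notes: a NameError like the one reported above generally means the name is used before the code that defines it has run (in a notebook, the cell containing the class was not executed), or the name is spelled with different capitalisation. The sketch below reproduces the error and then fixes it. The `Person` class here is a hypothetical stand-in; the one shown in class may look different.

```python
# Using the class before its definition has run raises an error.
try:
    p = Person("Alice")  # Person does not exist yet at this point
except NameError as err:
    error_message = str(err)
    print(error_message)


# Hypothetical minimal class, for illustration only.
class Person:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return "Hello, " + self.name


# Once the definition has run, creating an object works.
p = Person("Alice")
print(p.greet())  # Hello, Alice
```

In Jupyter, the usual fix is to run the cell with the class definition first and then the cell that creates the object.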

    Regards
    Samridhi
     
    #14
  15. Abel C Dixon

    Abel C Dixon Member

    Joined:
    Jun 4, 2019
    Messages:
    3
    Likes Received:
    0
    I have a doubt about the MovieLens project. One question asks us to create a separate column for each genre category with a one-hot encoding (1 and 0) indicating whether or not the movie belongs to that genre. I didn't understand what exactly the question is asking for.
     
    #15
  16. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20
    Hi,

    The steps are as follows:
    1. Split the genres column on |.
    2. Concatenate the resulting lists from the genres column.
    3. Find the unique genres using the set function.
    4. These unique genres will be the new columns.
    5. Populate the columns with binary values depending on whether the genre is present in the genres column. Hint: use the isin function.
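
The steps above can be sketched on a tiny made-up table. The rows and the column names `Title` and `Genres` are assumptions for illustration, and the sketch uses a plain membership test (`in`) instead of the `isin` function mentioned in the hint, so it is one possible approach rather than the reference solution.

```python
import pandas as pd

# Tiny stand-in for the movies table (made-up rows).
movies = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C"],
    "Genres": ["Action|Comedy", "Comedy", "Drama|Action"],
})

# Step 1: split the Genres column on "|", giving one list per movie.
genre_lists = movies["Genres"].str.split("|")

# Steps 2-3: pool all the lists and keep the unique genres.
unique_genres = sorted(set(g for lst in genre_lists for g in lst))

# Steps 4-5: one new column per genre, 1 if the movie has it, else 0.
for genre in unique_genres:
    movies[genre] = genre_lists.apply(lambda lst: int(genre in lst))

print(movies)
```

pandas also offers `Series.str.get_dummies(sep="|")`, which produces the same kind of 0/1 columns in a single call.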

    If required we can discuss in the class.

    Regards,
    Samridhi
     
    #16
  17. Manoj Kumar Sahoo_1

    Joined:
    Jun 13, 2019
    Messages:
    6
    Likes Received:
    0
    Hi Samridhi,
    I was trying to understand feature engineering. In class you discussed the following ways to impute NaN values:
    #Multiple ways to impute for embarked:
    # - 1st is to impute with the mode value
    # - 2nd, for Pclass = 1 and fare close to 80, find the embarked category and impute with that
    # - 3rd, for Pclass = 1, Sex = female and fare close to 80, find the embarked category and impute with that

    I understand the first way, but for the 2nd and 3rd, how do we determine which categorical columns to select, so that we can calculate the mean or median for any continuous column using those categorical values?

    For example, why are we using the Pclass and Sex columns and not other categorical columns?
     
    #17
  18. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20
    Hi Manoj,

    We need to find the columns that lead to the maximum variance in the percentage of people across the different categories of embarked. Choose the categorical column that leads to the maximum variance in the embarked status.

    The more significant columns you choose for imputation, the better the chances of predicting the missing values correctly.
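
A rough sketch of this idea on made-up data follows. The column names mirror the Titanic dataset, but the rows, the fare window of 70 to 90 and the variable names are all assumptions for illustration. It first looks at how the share of each embarked category changes across Pclass, then imputes the missing value with the mode among similar passengers.

```python
import pandas as pd

# Made-up rows; the last passenger has a missing Embarked value.
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 1],
    "Sex":      ["female", "female", "male", "male", "female", "male", "female"],
    "Fare":     [80.0, 78.0, 50.0, 8.0, 7.0, 9.0, 80.0],
    "Embarked": ["C", "C", "S", "S", "S", "Q", None],
})

# Share of each Embarked category within each Pclass (each row sums to 1).
# A column whose rows differ strongly is informative for imputation.
share_by_pclass = pd.crosstab(df["Pclass"], df["Embarked"], normalize="index")
print(share_by_pclass)

# 1st way: the overall mode of Embarked.
overall_mode = df["Embarked"].mode()[0]

# 3rd way: the mode among passengers similar to the one with the gap.
similar = df[(df["Pclass"] == 1)
             & (df["Sex"] == "female")
             & (df["Fare"].between(70, 90))]
group_mode = similar["Embarked"].mode()[0]

# Fill the gap with the group-based value.
df["Embarked"] = df["Embarked"].fillna(group_mode)
print(overall_mode, group_mode)
```

On this toy data the two ways disagree (the overall mode is S, the group mode is C), which is the situation where choosing informative columns matters.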

    Regards,
    Samridhi
     
    #18
  19. Abel C Dixon

    Abel C Dixon Member

    Joined:
    Jun 4, 2019
    Messages:
    3
    Likes Received:
    0
    In the MovieLens project there is a question asking:
    1. Determine the features affecting the ratings of any particular movie. I think the correlation matrix is the best choice, but which parameters should be taken into account in the correlation matrix apart from the movie name, genres and age?
     
    #19
  20. Soumya Ranjan Sethi

    Joined:
    Mar 12, 2019
    Messages:
    5
    Likes Received:
    0
    I don't understand how to solve Project 1, the movie project. How can I merge the 3 data sets and create the master data? I am new to Python; could someone please help me? Untitled.png
     
    #20
  21. Manoj Kumar Sahoo_1

    Joined:
    Jun 13, 2019
    Messages:
    6
    Likes Received:
    0




    Could you please explain this for 5 minutes in the next class?
     
    #21
  22. Manoj Kumar Sahoo_1

    Joined:
    Jun 13, 2019
    Messages:
    6
    Likes Received:
    0
    Hi Abel,

    As per my understanding, there are various ways.
    The ones I have used so far are:
    1) RandomForest helps in finding the criticality of the features through an attribute called feature_importances_.
    2) A heatmap to understand the correlation between different features and their weightage.
     
    #22
  23. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20
    Sure
     
    #23
  24. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20
    Hi,

    I have uploaded a file on google drive where merging of 3 files has been done. Please refer to the movie lens ipynb file on google drive.
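
For readers without access to that notebook, the merge can be sketched roughly as follows. The tiny frames and the column names (`UserId`, `MovieId`, and so on) are assumptions for illustration, not the actual project files.

```python
import pandas as pd

# Made-up miniature versions of the three data sets.
users = pd.DataFrame({"UserId": [1, 2], "Gender": ["F", "M"]})
movies = pd.DataFrame({"MovieId": [10, 20], "Title": ["Movie A", "Movie B"]})
ratings = pd.DataFrame({
    "UserId": [1, 1, 2],
    "MovieId": [10, 20, 10],
    "Rating": [5, 3, 4],
})

# ratings links the other two tables: it carries both UserId and MovieId,
# so it is merged with users first and with movies second.
master = (ratings
          .merge(users, on="UserId", how="inner")
          .merge(movies, on="MovieId", how="inner"))

print(master)
```

Each row of the result is one rating together with the details of the user who gave it and the movie it refers to.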

    Regards,
    Samridhi
     
    #24
  25. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20
    Hi Abel,
    As per the updated project guidelines you don't need to attempt this question.

    However, I have uploaded the code on google drive to do this.

    Regards,
    Samridhi
     
    #25
  26. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20
    Hi Abel,

    Please treat the ratings variable as an ordered factor variable. We can find the association between two categorical variables using chi-square tests, and between a continuous and a categorical variable using ANOVA.

    If we treat ratings as a continuous variable, we'll notice extremely poor results, though technically it can be treated as a regression problem. Feature importances can also be found using a linear regression in statsmodels (using the p-values).
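
To make the chi-square idea concrete, here is a minimal sketch that computes the Pearson chi-square statistic by hand for two categorical columns. The data is made up and the column names `Gender` and `Rating` are only illustrative. In practice `scipy.stats.chi2_contingency` returns the statistic together with the p-value; this sketch stops at the statistic so that it needs only pandas.

```python
import pandas as pd

# Made-up data: 8 ratings, 4 from each gender.
df = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Rating": [5, 5, 4, 4, 4, 4, 3, 3],
})

# Observed counts for every Gender x Rating combination.
observed = pd.crosstab(df["Gender"], df["Rating"])
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.values.sum()

# Expected counts if the two variables were independent:
# row total * column total / grand total.
expected = pd.DataFrame(
    row_totals.values[:, None] * col_totals.values[None, :] / n,
    index=observed.index,
    columns=observed.columns,
)

# Pearson chi-square statistic and its degrees of freedom.
chi2_stat = ((observed - expected) ** 2 / expected).values.sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(chi2_stat, dof)  # 4.0 2
```

A larger statistic for the same degrees of freedom means stronger evidence against independence.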

    Regards,
    Samridhi
     
    #26
  27. Simon_35

    Simon_35 Member

    Joined:
    Jun 19, 2019
    Messages:
    3
    Likes Received:
    0
    Support team and Samridhi,

    I was in Samridhi's Python for DS classes, July 1-19. Could you send me the URL to download the class recordings? My ID was registered at the back end, so I have had no access to them.

    Thanks
    Simon

    Registration ID: 380922
    Session number: 579 407 191
     
    #27
  28. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20
    Sure Simon,

    I am sending your request to the support team.

    Regards,
    Samridhi
     
    #28
  29. SUNNY BHAVEEN CHANDRA

    SUNNY BHAVEEN CHANDRA Well-Known Member

    Joined:
    Feb 4, 2019
    Messages:
    55
    Likes Received:
    8
    Hi Simon,

    As per your request, I've shared the recordings with you. Kindly download them within 24 hrs; otherwise the links will expire.

    Regards,
    Sunny
    Teaching Assistant
     
    #29
  30. Rathakrishnan R K

    Joined:
    Jun 4, 2019
    Messages:
    3
    Likes Received:
    0
    1. Find the ratings for all the movies reviewed by a particular user with user id = 2696

    I am a little confused about how to get all the ratings given by the specific user id = 2696.
    Please help!
    Thank you!
     
    #30
  32. Manoj Kumar Sahoo_1

    Joined:
    Jun 13, 2019
    Messages:
    6
    Likes Received:
    0
    Hi Krishnan, you can use filtering on the dataframe to achieve that, as in the example below:
    dataframe[dataframe.user_id == 2696]
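
The same filter as a small runnable sketch. The frame `master` and its column names are made up for illustration; the real column may be called `user_id`, `UserId` or something else depending on how the data was loaded.

```python
import pandas as pd

# Made-up master data with two users.
master = pd.DataFrame({
    "UserId": [2696, 2696, 17],
    "Title": ["Movie A", "Movie B", "Movie A"],
    "Rating": [4, 2, 5],
})

# Boolean mask selects the rows for user 2696;
# the column list keeps only what the question asks for.
user_ratings = master.loc[master["UserId"] == 2696, ["Title", "Rating"]]

print(user_ratings)
```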
     
    #32
  33. Manoj Kumar Sahoo_1

    Joined:
    Jun 13, 2019
    Messages:
    6
    Likes Received:
    0
    Hi Samridhi,

    I have the doubts below. Could you please clarify them?

    1) To determine the criticality of features influencing an outcome, what is used in industry today: statistical tests like the t-test, ANOVA and chi-square, or machine learning models like Random Forest?
    I don't see much point in running a statistical test for each feature; it will be very time consuming to find the set of critical features influencing an outcome.

    2) Why is RandomSearch better than Grid Search? Isn't Random Search trying to find the right parameters by luck?
     
    #33
  34. Rathakrishnan R K

    Joined:
    Jun 4, 2019
    Messages:
    3
    Likes Received:
    0
  35. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20
    Hi Manoj,

    Industry is moving more and more towards machine learning models like Decision Trees and Random Forests. However, many companies use statistical modeling as well, so it depends on the industry and the company.

    RandomizedSearch and GridSearch usually give similar results; the difference is in the approach of the two algorithms.
    In Grid Search, we try every combination of a preset list of values of the hyper-parameters and choose the best combination based on the cross-validation score.

    Random Search tries random combinations drawn from a range of values. It is good at testing a wide range of values and normally reaches a very good combination very fast, but it doesn't guarantee finding the best parameter combination.

    On the other hand, Grid Search will give the best combination among the preset values, but it can take a lot of time.
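
The difference between the two approaches can be sketched without any machine learning library. Everything here is an assumption for illustration: `score` is a made-up stand-in for a cross-validation score, and the hyper-parameter names and ranges are hypothetical. With scikit-learn, the corresponding tools are `GridSearchCV` and `RandomizedSearchCV`.

```python
import itertools
import random


# Hypothetical stand-in for a cross-validation score; a real project
# would train and validate a model here. The maximum is at (6, 150).
def score(max_depth, n_trees):
    return -((max_depth - 6) ** 2) - ((n_trees - 150) / 50) ** 2


depths = [2, 4, 6, 8]
trees = [50, 100, 150, 200]

# Grid search: evaluate every combination of the preset values.
grid = list(itertools.product(depths, trees))
best_grid = max(grid, key=lambda params: score(*params))

# Random search: evaluate a fixed budget of randomly drawn combinations
# from a wider range. Cheaper, but the best point may be missed.
rng = random.Random(0)
budget = 6
samples = [(rng.randint(2, 10), rng.randint(50, 250)) for _ in range(budget)]
best_random = max(samples, key=lambda params: score(*params))

print(len(grid), best_grid)       # 16 evaluations
print(len(samples), best_random)  # 6 evaluations
```

Grid search costs 16 evaluations here and is certain to find the best preset combination; random search costs 6 and returns whatever was best among its draws.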

    Regards,
    Samridhi
     
    #35
  36. Manoj Kumar Sahoo_1

    Joined:
    Jun 13, 2019
    Messages:
    6
    Likes Received:
    0



    Thanks for clarifying! That was really helpful.
    I had a question regarding the chi2 test.

    How do we understand which category inside a categorical column is influencing the output categorical column, if the p-value obtained from the chi2 test is less than the agreed alpha value?
    For example, in the Titanic dataset, pclass and survived are two categorical columns.

    How do we find out which class within pclass has a higher rate of survival, so that one can buy a ticket for that class?
     
    #36
  37. Rajkumar Tripathi

    Joined:
    May 20, 2019
    Messages:
    10
    Likes Received:
    0
    Hi Samridhi,
    I was doing hands-on practice with tuples, and while working through the practical examples I need the clarifications below.
    I created a nested tuple, shown in the examples:

    Clarification 1:
    test =(["hello","rajkumar",33,45,66],("rajesh","atul",33,45.8),[55.6,66.3,99.85],(15,10,3,6,9))
    How would I print output like:
    first element of tuple is: List
    second element of tuple is: Tuple
    third element of tuple is: List
    fourth element of tuple is: Tuple

    Clarification 2:
    test =(["hello","rajkumar",33,45,66],("rajesh","atul",33,45.8),[55.6,66.3,99.85],(15,10,3,6,9))
    How would I get the output printed as:
    This tuple contains 2 lists and 2 tuples

    Clarification 3:
    test =(["hello","rajkumar",33,45,66],("rajesh","atul",33,45.8),[55.6,66.3,99.85],(15,10,3,6,9))
    How would I get output as below:
    The length of first element in tuple (list) is: 5
    The length of second element in tuple is: 4
    The length of third element in tuple is: 3
    The length of fourth element in tuple is: 5
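
One possible way to produce these three outputs is sketched below, using only built-ins (`type`, `isinstance`, `len`). The helper list `ordinals` and the other variable names are introduced here for illustration, and the wording of the printed lines is simplified slightly.

```python
test = (["hello", "rajkumar", 33, 45, 66], ("rajesh", "atul", 33, 45.8),
        [55.6, 66.3, 99.85], (15, 10, 3, 6, 9))
ordinals = ["first", "second", "third", "fourth"]

# Clarification 1: the type name of each element, capitalised.
type_names = []
for position, item in zip(ordinals, test):
    name = type(item).__name__.capitalize()  # "List" or "Tuple"
    type_names.append(name)
    print(position, "element of tuple is:", name)

# Clarification 2: count how many elements are lists and how many are tuples.
n_lists = 0
n_tuples = 0
for item in test:
    if isinstance(item, list):
        n_lists += 1
    elif isinstance(item, tuple):
        n_tuples += 1
summary = f"This tuple contains {n_lists} lists and {n_tuples} tuples"
print(summary)

# Clarification 3: the length of each element.
lengths = []
for position, item in zip(ordinals, test):
    lengths.append(len(item))
    print("The length of", position, "element in tuple is:", len(item))
```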
     
    #37
  38. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20
    Hi Manoj,
    The chi-square test only tests whether the 2 variables are independent or not. You can find the pclass leading to a higher survival rate using barplots / group-by commands (please refer to the barplot visualizations between 2 categorical variables). It can't be found from the p-values of the chi-square test.
    If the p-value is smaller than the alpha value, H0 is rejected, i.e. there is a relationship between the 2 variables.

    Repeated chi-square / ANOVA tests are primarily used for feature selection in case the data has high dimensions. They should be included in your data exploration / data preprocessing step.
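
The group-by approach mentioned above can be sketched as follows. The rows are made up and are not the real Titanic figures; only the column names `pclass` and `survived` follow the question.

```python
import pandas as pd

# Made-up passengers: survived is 1 for yes, 0 for no.
titanic = pd.DataFrame({
    "pclass":   [1, 1, 1, 2, 2, 3, 3, 3],
    "survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Mean of a 0/1 column within each class is the survival rate of that class.
survival_rate = titanic.groupby("pclass")["survived"].mean()

# Raw counts behind those rates, one row per class.
counts = pd.crosstab(titanic["pclass"], titanic["survived"])

# Class with the highest survival rate in this toy data.
best_class = survival_rate.idxmax()

print(survival_rate)
print(counts)
print(best_class)  # 1
```

Calling `survival_rate.plot(kind="bar")` would give the barplot version of the same comparison, assuming matplotlib is installed.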

    Regards,
    Samridhi
     
    #38
    Last edited: Aug 2, 2019
  39. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20

    Hi Everyone,

    This is a basic Python question. Could other learners also try their hand at the above question and post their solutions here? I will review the answers.

    Rajkumar, could you also please work further on your question and post your code here? This will help me understand where your understanding of the concepts goes wrong.

    All this has been discussed in the class, and I would expect all of you to be able to solve these questions.

    Regards,
    Samridhi
     
    #39
  40. Viraja Vedula

    Viraja Vedula Member
    Alumni

    Joined:
    Oct 19, 2017
    Messages:
    2
    Likes Received:
    0
    Hi, I wanted to know which are the best SQL programming courses available for data analytics. Any suggestions are appreciated.
     
    #40
  41. Devanshu_5

    Devanshu_5 Member

    Joined:
    May 10, 2019
    Messages:
    3
    Likes Received:
    0
    Ma'am, I am not able to install Anaconda on my laptop.
     
    #41
  42. Devanshu_5

    Devanshu_5 Member

    Joined:
    May 10, 2019
    Messages:
    3
    Likes Received:
    0
    Ma'am, I am trying this list comprehension by myself, but char_to_int[data[4]], or the same lookup with any other data index, shows an error saying that a list index cannot be a string.
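
The attached code is not available here, so the following is only a guess at the cause: an error of this kind appears when `char_to_int` is a list and is indexed with a character. A lookup from character to integer needs a dict. The input string and variable names below are hypothetical.

```python
data = "hello world"  # hypothetical input

# Unique characters in sorted order: a plain list.
chars = sorted(set(data))

# Dict comprehension: maps each character to an integer code.
char_to_int = {ch: i for i, ch in enumerate(chars)}

# Works, because a dict can be indexed by a string key.
code = char_to_int[data[4]]
print(data[4], code)

# Fails, because a list can only be indexed by integers or slices.
try:
    chars[data[4]]
except TypeError as err:
    error_message = str(err)
    print(error_message)
```

If the original `char_to_int` was built with square brackets (a list comprehension), switching to curly braces with a `key: value` pair turns it into a dict.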
     


    #42
  43. mohit ranjan mishra

    mohit ranjan mishra New Member

    Joined:
    Jun 20, 2019
    Messages:
    1
    Likes Received:
    0
    Good afternoon ma'am,
    I don't understand how to solve Project 1 (MovieLens). Could you please give some time to explain this project tomorrow?
     
    #43
  44. vignesh s s

    vignesh s s New Member

    Joined:
    Feb 28, 2019
    Messages:
    1
    Likes Received:
    0
    Samridhi,

    I'm facing an issue while importing a file into Jupyter Notebook, while doing demo assignment01 in NLP. The command I used is:

    df_SpamCollection =pd.read_csv('C:\Users\Vignesh_Ss\Desktop\SpamCollection')

    The error is as follows,

    "SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 292-293: truncated \UXXXXXXXX escape during import".

    I searched Google for this error and found this solution:
    Prefixing with 'r' works very well, but it needs to be in the correct syntax. For example:

    passwordFile = open(r'''C:\Users\Bob\SecretPasswordFile.txt''')

    When I tried this, I got another error: "ParserError: Error tokenizing data. C error: Expected 2 fields in line 12, saw 4".

    I tried my best but couldn't resolve this. Could you help me with it?

    Thanks,
    vignesh s s.
     
    #44
  45. amitv08

    amitv08 Member

    Joined:
    Mar 23, 2016
    Messages:
    3
    Likes Received:
    0
    Have you hit Run button or used Shift+Enter? Please try that. See if that helps
     
    #45
  46. Samridhi Dutta

    Samridhi Dutta Well-Known Member
    Trainer

    Joined:
    Aug 16, 2017
    Messages:
    157
    Likes Received:
    20
    Hi Vignesh,

    Please put the SpamCollection file in your working directory (note: the file, not the entire folder) and then execute the following command:
    pd.read_csv("SpamCollection", sep = "\t", names = ["response", "message"])

    Please note that if you don't set sep to "\t", it will throw an error.
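
Both problems from this exchange can be reproduced in a small self-contained sketch. The two sample lines written to the temporary file are made up; only the file name, the tab separator and the column names come from the posts above.

```python
import os
import tempfile

import pandas as pd

# A raw string (r"...") keeps backslashes literal, so "\U" in a Windows
# path is not read as the start of a unicode escape.
windows_path = r"C:\Users\Bob\SecretPasswordFile.txt"
print(windows_path)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a tiny tab-separated file with made-up content.
    path = os.path.join(tmp_dir, "SpamCollection")
    with open(path, "w", encoding="utf-8") as f:
        f.write("ham\tSee you at 5, ok?\n")
        f.write("spam\tYou won a prize, call now\n")

    # sep="\t" tells pandas the fields are tab-separated; with the default
    # comma separator, commas inside messages would be read as extra fields.
    df = pd.read_csv(path, sep="\t", names=["response", "message"])

print(df)
```

The "Expected 2 fields ... saw 4" message in the question is consistent with the default comma separator splitting a message that contains commas.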

    Regards,
    Samridhi
     
    #46
  47. VIGNESH G_1

    VIGNESH G_1 Member

    Joined:
    Mar 16, 2019
    Messages:
    2
    Likes Received:
    0
    Hi Samridhi,

    I have been trying the MovieLens project and am stuck on the visualization below:
    1. Top 25 movies by viewership rating.
    I am able to create a cross tab from the master data with the Title and Rating:

    import pandas as pd

    # '::' is a multi-character separator, so the python parsing engine is needed
    movies_df = pd.read_csv("movies.dat", sep='::', header=None, engine='python')
    movies_cols = ['MovieId', 'Title', 'Genres']
    movies_df.columns = movies_cols

    users_df = pd.read_csv("users.dat", sep='::', header=None, engine='python')
    user_cols = ['UserId', 'Gender', 'Age', 'Occupation', 'Zip']
    users_df.columns = user_cols

    ratings_df = pd.read_csv("ratings.dat", sep='::', header=None, engine='python')
    rating_cols = ['UserId', 'MovieId', 'Rating', 'TimeStamp']
    ratings_df.columns = rating_cols

    inter_df=(users_df.merge(ratings_df,how='inner',on='UserId'))
    inter_df.head()
    master_df = inter_df.merge(movies_df,how='inner',on='MovieId')
    master_df.head()

    top_df=master_df.filter(items=['Title','Rating'])
    top_crss_df = pd.crosstab(index=top_df.Title,columns=top_df.Rating)
    top_crss_df

    I am attaching the ipynb for your reference. Please help me visualize/get the top 25 movies with the cross tab.
     


    #47
  48. VIGNESH G_1

    VIGNESH G_1 Member

    Joined:
    Mar 16, 2019
    Messages:
    2
    Likes Received:
    0
    I found another way of getting the top 25 movies list by viewers' rating. I guess the question is not very specific. Now I calculate the mean of all the ratings for each movie and sort by Rating to pull the top 25. The code is below.

    import pandas as pd

    # '::' is a multi-character separator, so the python parsing engine is needed
    movies_df = pd.read_csv("movies.dat", sep='::', header=None, engine='python')
    movies_cols = ['MovieId', 'Title', 'Genres']
    movies_df.columns = movies_cols

    users_df = pd.read_csv("users.dat", sep='::', header=None, engine='python')
    user_cols = ['UserId', 'Gender', 'Age', 'Occupation', 'Zip']
    users_df.columns = user_cols

    ratings_df = pd.read_csv("ratings.dat", sep='::', header=None, engine='python')
    rating_cols = ['UserId', 'MovieId', 'Rating', 'TimeStamp']
    ratings_df.columns = rating_cols

    inter_df=(users_df.merge(ratings_df,how='inner',on='UserId'))
    inter_df.head()
    master_df = inter_df.merge(movies_df,how='inner',on='MovieId')
    master_df.head()

    top_df=master_df.filter(items=['Title','Rating'])
    top_df.groupby(by='Title').mean().sort_values(by="Rating",ascending=False ).head(25)

    But here, taking the mean of all the ratings isn't a great idea: if a particular movie was viewed/reviewed by only 1 user who liked it, he would have rated it 5, while good movies that were watched by many users and rated variously will end up lower in the list. Please help me proceed with the MovieLens project.
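
One common way around this concern is to compute the number of ratings alongside the mean and to rank only movies with enough ratings. The sketch below uses made-up data, and the threshold `min_ratings` is a hypothetical choice; it is one possible reading of the question, not the official solution.

```python
import pandas as pd

# Made-up ratings: Movie C has a perfect score but only one rating.
master_df = pd.DataFrame({
    "Title":  ["Movie A"] * 4 + ["Movie B"] * 3 + ["Movie C"],
    "Rating": [5, 4, 4, 5, 3, 4, 3, 5],
})

# Mean rating and number of ratings per movie.
stats = master_df.groupby("Title")["Rating"].agg(["mean", "count"])

# Hypothetical threshold: ignore movies with fewer than 3 ratings.
min_ratings = 3
top = (stats[stats["count"] >= min_ratings]
       .sort_values("mean", ascending=False)
       .head(25))

print(top)
```

Movie C drops out despite its mean of 5 because a single rating says little about the movie. Ranking by `count` alone (most-rated movies) is another plausible reading of "top 25 by viewership".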
     
    #48
  49. Deepak Shanthaiah

    Joined:
    Jul 25, 2019
    Messages:
    2
    Likes Received:
    0
    Hello Samridhi,

    I was doing the MovieLens project and I am stuck on the last 2 questions, which are:

    1) Determine the features affecting the ratings of any particular movie.
    2) Develop an appropriate model to predict the movie ratings.

    I have concatenated dataframe and data5k and removed the columns that are of no use with the drop function. Now I have to convert the Gender values from character form to numerical form and find solutions for the questions above.

    I know you asked us to use ANOVA or chi-square, but I am still not able to understand what exactly to do.

    Could you please help me with the syntax and the steps by uploading the updated file to Google Drive as soon as possible? That will help me complete the project.
     
    #49
    Last edited: Aug 28, 2019
  50. Pranaya Kumar Panda

    Pranaya Kumar Panda New Member

    Joined:
    Aug 6, 2019
    Messages:
    1
    Likes Received:
    0
    Hi Samridhi,

    Please find the below code for Assignment-1 of Aug-24 batch.

    def pyramid_one(n):
        # outer loop to handle number of rows
        # n in this case
        for i in range(0, n):
            # leading spaces: one more on each row
            print(" " * i, end="")

            # inner loop to handle number of columns
            # values changing acc. to outer loop
            for j in range(0, n - i):
                # printing 1
                print("1", end="")

            # ending line after each row
            print()

    # Driver Code
    n = 4
    pyramid_one(n)

    o/p:-
    # 1111
    #  111
    #   11
    #    1
    Please let me know if any further improvement is required.

    Regards,
    Pranaya
     
    #50
    Last edited: Aug 30, 2019
