DS Python | Anand | 1st Feb - 7th March

Discussion in 'Big Data and Analytics' started by Vikas Kumar_18, Feb 5, 2020.

  1. Vikas Kumar_18

    Vikas Kumar_18 Well-Known Member
    Simplilearn Support Alumni

    Joined:
    Dec 17, 2018
    Messages:
    205
    Likes Received:
    35
    This is the dedicated community link for this batch to discuss with your peers and faculty.
     
    #1
  2. Abhishek Kumar_41

    Abhishek Kumar_41 New Member

    Joined:
    May 31, 2018
    Messages:
    1
    Likes Received:
    0
    Hi Anand,
    As discussed yesterday, '_' (underscore) is by itself a valid variable in Python. Are there any other special characters that can be used in variable names?
    Thanks,
    Abhishek
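
    For reference, a minimal sketch of how the underscore behaves as an identifier (the names below are only illustrative); in the interactive interpreter, _ additionally holds the last evaluated result:

    _ = 10             # a lone underscore is an ordinary variable name
    print(_)           # 10
    total_count = 3    # apart from letters and digits, the underscore is the only other character allowed in a name
    for _ in range(total_count):
        print("hi")    # _ is often used for a loop variable whose value is not needed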
     
    #2
  3. sonnur

    sonnur New Member

    Joined:
    Feb 16, 2016
    Messages:
    1
    Likes Received:
    0
    Could you please help me reach your Google Drive? I have already downloaded Anaconda3.
     
    #3
  4. EKTA TOMAR

    EKTA TOMAR New Member

    Joined:
    Nov 13, 2019
    Messages:
    1
    Likes Received:
    0
    #4
  5. Sergio Eduardo Capozzi

    Joined:
    Dec 30, 2019
    Messages:
    3
    Likes Received:
    1
    Hi Anand, I am trying to do exercise 8.16 and I am having trouble importing sklearn into the exercise; I get the following message:

    ModuleNotFoundError: No module named 'sklearn.cross_validation'


    can you please help me understand what I am doing wrong?
     

    Attached Files:

    #5
  6. Swetha_62

    Swetha_62 Member

    Joined:
    Jan 23, 2020
    Messages:
    2
    Likes Received:
    0
    Hi Anand,
    I was trying arange() with slice notation and came across an issue: I wasn't sure whether we need to write arange(start:stop:step), separating the values with colons, or separate them with commas. When I tried it in Jupyter, it threw an error when I used colons. Please find below the example I was trying and kindly let me know if I am wrong:

    import numpy as np
    a = np.arange(0,40)
    a.shape = (4,10)
    print(a)

    Output:
    [[ 0 1 2 3 4 5 6 7 8 9]
    [10 11 12 13 14 15 16 17 18 19]
    [20 21 22 23 24 25 26 27 28 29]
    [30 31 32 33 34 35 36 37 38 39]]

    #arange using Slice notation:
    a = np.arange(1:24:3)
    print(a)

    output:
    File "<ipython-input-58-9c8e6fd1391c>", line 1
    a = np.arange(1:24:3)
    ^
    SyntaxError: invalid syntax

    But the same call works when the values are separated by commas:
    a = np.arange(1,24,3)
    print(a)
    Output:
    [ 1 4 7 10 13 16 19 22]
     
    #6
  7. navinjainbca

    navinjainbca New Member

    Joined:
    Jan 20, 2020
    Messages:
    1
    Likes Received:
    0
    #7
  8. SWETA GUPTA_3

    SWETA GUPTA_3 Member

    Joined:
    Jan 17, 2020
    Messages:
    2
    Likes Received:
    1
    Hi Anand, could you explain what U3 means here?

    mylist10 = ['a','b',100]
    myarr = np.array(mylist10)
    print(myarr.dtype)


    Output:
    <U3
     
    #8
    Shashank R likes this.
  9. Chirag Shah_5

    Chirag Shah_5 Member

    Joined:
    Jan 27, 2020
    Messages:
    8
    Likes Received:
    0
    Hi, how can I apply two conditions to pull data? For example, in the case below, I would like to pull the numbers greater than 15 and less than 19.

    Code:

    print(numarr_cond)
    print(type(numarr_cond))
    print(numarr_cond[numarr_cond > 15])

    Output:

    [10 11 12 13 14 15 16 17 18 19 20]
    <class 'numpy.ndarray'>
    [16 17 18 19 20]
     
    #9
  10. Chirag Shah_5

    Chirag Shah_5 Member

    Joined:
    Jan 27, 2020
    Messages:
    8
    Likes Received:
    0
    Got the code, please ignore:
    print(numarr_cond[(numarr_cond > 15) & (numarr_cond < 20)])
     
    #10
  11. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    126
    Likes Received:
    12
    Hi
    The dtype “<U3” refers to a Unicode string of up to 3 characters; the “<” marks the byte order (little-endian).
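
    A minimal sketch showing where the 3 comes from: NumPy promotes the whole list to one common dtype, and the longest string here ('100', from the integer 100) is 3 characters wide.

    import numpy as np

    mylist10 = ['a', 'b', 100]
    myarr = np.array(mylist10)   # the integer 100 becomes the string '100'
    print(myarr)                 # ['a' 'b' '100']
    print(myarr.dtype)           # <U3 -> Unicode strings up to 3 characters
    print(np.array(['a', 'b', 10000]).dtype)   # <U5, since '10000' is 5 characters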
     
    #11
  12. Amol Suhas Bhimanwar

    Joined:
    Aug 31, 2019
    Messages:
    5
    Likes Received:
    0
    Hello Anand,
    Date/time handling is often required in the real world for analysing data. Can you please help and suggest where we can get more information about the various options available in pandas for converting dates that may appear in different formats (DD-MM-YY, DD/MM/YYYY, DD-MM-YYYY, YY/MM/DD, etc.) within the same data file, so that all dates and times in the file end up in the same format?
     
    #12
  13. Sergio Eduardo Capozzi

    Joined:
    Dec 30, 2019
    Messages:
    3
    Likes Received:
    1
    Hi Anand, hope all is well.
    I am working on the final assignment using the (311_Service_Requests_from_2010_to_Present) Data.
    I am trying to use the datetime conversion but it is not changing the column type from "object" to "datetime":
    Input code is:
    def convert_time(column_name):
        ## create an empty list
        y = []
        ## loop over the column to be converted to datetime
        for x in df_final_data_set[column_name]:
            ### build an output list of converted values
            y.append(datetime.datetime.strptime(x, "%m/%d/%Y %H:%M%S"))
        df_final_data_set[column_name] = y

    df_final_data_set.info()

    Output is:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 298024 entries, 0 to 300697
    Data columns (total 6 columns):
    Created Date 298024 non-null object
    Closed Date 298024 non-null object
    Agency Name 298024 non-null object
    Complaint Type 298024 non-null object
    City 298024 non-null object
    Incident Zip 298024 non-null float64
    dtypes: float64(1), object(5)
    memory usage: 15.9+ MB

    I have attached the zip file from my First Assignment (311NYC)V2.ipynb
     

    Attached Files:

    #13
    Last edited: Feb 23, 2020
  14. Pankaj Kumar Mourya

    Joined:
    Jan 27, 2020
    Messages:
    2
    Likes Received:
    0
    Hi Anand,
    Please upload the last session's notebook to Google Drive.
     
    #14
  15. Prakash Meghani

    Joined:
    Jul 19, 2019
    Messages:
    13
    Likes Received:
    0
    Hi Anand and all members in this community,
    Can anybody explain the following question to me?

    Order the complaint types based on the average ‘Request_Closing_Time’, grouping them for different locations.

    This is question number 4 in Assessment - Project 1 -
    Customer Service Requests Analysis

    I am unable to understand what exactly this question is asking,

    Thanks
     
    #15
  16. Prakash Meghani

    Joined:
    Jul 19, 2019
    Messages:
    13
    Likes Received:
    0
    Hi,
    Second question -
    Can I do correlation between 2 categorical variables?
    For example, location (zip code) and complaint type?
    If yes, may I know how?

    Thanks
     
    #16
  17. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    126
    Likes Received:
    12
    Hi,
    cross_validation is no longer part of the sklearn package; in newer versions of scikit-learn its functionality has moved to sklearn.model_selection. That is why you are getting the error.

    To handle it, please use the code below:

    from sklearn.model_selection import train_test_split
    Please check https://scikit-learn.org/stable/index.html for information about the latest packages.
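
    A minimal sketch of the replacement import in use (the arrays below are made up for illustration):

    import numpy as np
    from sklearn.model_selection import train_test_split   # replaces sklearn.cross_validation

    X = np.arange(20).reshape(10, 2)   # illustrative features
    y = np.arange(10)                  # illustrative target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    print(X_train.shape, X_test.shape)   # (7, 2) (3, 2)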
     
    #17
  18. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    126
    Likes Received:
    12
    Hi Eduardo,

    cross_validation is no longer part of the sklearn package; in newer versions of scikit-learn its functionality has moved to sklearn.model_selection. That is why you are getting the error.

    To handle it, please use the code below:

    from sklearn.model_selection import train_test_split
    Please check https://scikit-learn.org/stable/index.html for information about the latest packages.
     
    #18
  19. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    126
    Likes Received:
    12
    Hi Swetha,
    We covered this during the NumPy session as well.

    arange is a function, not a Python data type, and a function works with arguments.

    The arguments that the arange function takes are the start, the stop and the step (the spacing between consecutive values).

    Slice notation does not work inside the arange function call, which is why arange(1:24:3) is a syntax error.


    If you really want to use slice notation, apply it to the resulting array, as below:

    a = np.arange(1,24,3)
    print(a)
    print(type(a))

    ### output ###
    [ 1 4 7 10 13 16 19 22]
    <class 'numpy.ndarray'>

    print(a.shape)

    ### output ###
    (8,)

    ### slice the array a to extract the first 6 items and reshape it to a 2x3 matrix
    a[0:6].reshape(2,3)

    ### output ###
    array([[ 1, 4, 7],
    [10, 13, 16]])
     
    #19
  20. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    126
    Likes Received:
    12
  21. Ritwik Mukherjee

    Joined:
    Apr 25, 2019
    Messages:
    2
    Likes Received:
    0
    Hello. I was just going through a few videos on the LMS. Is cross_validation no longer a module in scikit-learn?

    I was trying to figure out the methodology of carrying out a linear regression model, and the error was as follows:

    cannot import name 'cross_validation' from (directory)
     
    #21
  22. Chirag Shah_5

    Chirag Shah_5 Member

    Joined:
    Jan 27, 2020
    Messages:
    8
    Likes Received:
    0
    Hi, I'm converting a DataFrame date string column to dates. The column has different date formats. A couple of questions: 1) how can I identify the unique date formats in the column? 2) the function we learned in class doesn't help to convert the different date formats in the column; can you suggest alternate code for it?
     
    #22
  23. Sergio Eduardo Capozzi

    Joined:
    Dec 30, 2019
    Messages:
    3
    Likes Received:
    1
    # Since items in Created Date and Closed Date do not all have the same format, it is a better approach to use pandas' "to_datetime" method to calculate the difference "Closed Date - Created Date" in datetime format.
    # THE LINE BELOW TAKES SOME MINUTES TO FINISH THE RUN
    dt=pd.to_datetime(df_final_data_set.iloc[:,1])-pd.to_datetime(df_final_data_set.iloc[:,0])
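
    If the closing time is then needed in hours, the timedelta Series can be converted as below (the 'Request_Closing_In_Hr' name follows a later post in this thread and is otherwise an assumption):

    # convert the Closed Date - Created Date difference to hours
    df_final_data_set['Request_Closing_In_Hr'] = dt.dt.total_seconds() / 3600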
     
    #23
    Chirag Shah_5 likes this.
  24. Chirag Shah_5

    Chirag Shah_5 Member

    Joined:
    Jan 27, 2020
    Messages:
    8
    Likes Received:
    0
    Hi Anand, this is the query I wanted to ask in today's class. Kindly check.
     
    #24
  25. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    126
    Likes Received:
    12
    Eduardo,
    Give me some time. I need to take a look at the code and will respond in a day or two.
     
    #25
  26. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    126
    Likes Received:
    12
    Chirag,
    As mentioned in our pandas class, if the same column has different date formats, you cannot identify them automatically. It is always recommended to convert the entire column into one format: you will have to identify manually how many different date formats there are and write code to convert them.
    For converting, use the code we used in the pandas class.

    For further info, please refer to the following:

    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
    https://towardsdatascience.com/basic-time-series-manipulation-with-pandas-4432afee64ea
    https://www.geeksforgeeks.org/python-pandas-to_datetime/
    https://www.datacamp.com/community/tutorials/converting-strings-datetime-objects
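
    Following the manual approach above, a minimal sketch that converts a column containing a few known formats one format at a time (the example strings and formats are illustrative only):

    import pandas as pd

    s = pd.Series(['11-01-2019', '11/01/2019', '2019-01-11'])   # illustrative mixed formats

    # parse with the first known format; values that do not match become NaT
    parsed = pd.to_datetime(s, format='%d-%m-%Y', errors='coerce')
    # retry the remaining values with the next known format, and so on
    parsed = parsed.fillna(pd.to_datetime(s, format='%d/%m/%Y', errors='coerce'))
    parsed = parsed.fillna(pd.to_datetime(s, format='%Y-%m-%d', errors='coerce'))
    print(parsed)   # all three rows become 2019-01-11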
     
    #26
  27. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    126
    Likes Received:
    12
    Already responded: you cannot automatically identify different date formats in the same column. You need to check them manually and convert.

    The most you can do is consider the column as a categorical variable and convert it using one-hot encoding or label encoding; you will then have an individual value for each distinct date string.

    For example, if your dataset looks like this:

    date-col

    11/jan/2019
    11-jan-2019
    11-january-2019
    11-01-2019

    after applying one-hot encoding, label encoding or pd.get_dummies, your date column can become

    1
    2
    3
    4

    Now you can easily identify that 1, 2, 3 and 4 refer to different date formats.

    You will then need to write code to convert each date format to one common format.
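
    A minimal sketch of the encoding step described above, applied to the four example strings (pd.factorize is used here as the label encoder; any equivalent would do):

    import pandas as pd

    date_col = pd.Series(['11/jan/2019', '11-jan-2019', '11-january-2019', '11-01-2019'])
    codes, uniques = pd.factorize(date_col)
    print(codes)     # [0 1 2 3] -> one label per distinct raw value
    print(uniques)   # the distinct strings, each of which still needs its own conversion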
     
    #27
  28. Puneet Biyani

    Puneet Biyani New Member

    Joined:
    Aug 6, 2019
    Messages:
    1
    Likes Received:
    0
    ***************
    - The ML team applies filters to transactions; if a transaction doesn't trigger any filters, it is good to go at the backend itself.
    - If a transaction triggers the filters, it is sent to the risk team for human intervention, where they check whether the transaction is good or should be tagged as fraud.
    - So, based on the experience and data I have, I know we can reduce the inventory that is sent to the risk team for analysis.
    - There are transactions that actually escape the filters and reach the risk team because they are correlated with some other variables which may not be part of the initial ML filter.
    *** As the saying goes, "every rule has an exception and every exception has a rule." ***

    - Hence I was asking whether there is a possibility of reducing these false positives by running another piece of code after the ML filters have done the initial job.
    - Changing the parameters to the other variables would stop these transactions from going to the risk team.
    *************
     
    #28
  29. Somanath Kolekar

    Joined:
    Jan 26, 2020
    Messages:
    3
    Likes Received:
    0
    Hello Sir,

    I need your guidance on the below task from Project 1:
    Customer Service Requests Analysis



    1. Perform a statistical test for the following:
    Please note: For the below statements you need to state the Null and Alternate and then provide a statistical test to accept or reject the Null Hypothesis along with the corresponding ‘p-value’.

    · Whether the average response time across complaint types is similar or not (overall)

    · Are the type of complaint or service requested and location related?
     
    #29
  30. Somanath Kolekar

    Joined:
    Jan 26, 2020
    Messages:
    3
    Likes Received:
    0
    Hi Sir,

    This is what Prakash has asked earlier. Please help!
    ---------------------------------------------------------------------------

    Hi Anand and all members in this community,
    Can anybody explain the following question to me?

    Order the complaint types based on the average ‘Request_Closing_Time’, grouping them for different locations.

    This is question number 4 in Assessment - Project 1 -
    Customer Service Requests Analysis

    I am unable to understand what exactly this question is asking,

    Thanks
     
    #30
  31. Prakash Meghani

    Joined:
    Jul 19, 2019
    Messages:
    13
    Likes Received:
    0
    Hi Anand,
    I am totally confused about how to start this hypothesis testing:
    # Whether the average response time across complaint types is similar or not (overall)
    # Are the type of complaint or service requested and location related?

    Please suggest some initial steps.
    Thanks
     
    #31
  32. Ritwik Mukherjee

    Joined:
    Apr 25, 2019
    Messages:
    2
    Likes Received:
    0
    Hi Anand. Hello everybody. I am working on the Amazon recommendation system project. I have downloaded the dataset, ingested it into a Jupyter notebook and started the process of data exploration. The dataset contains a huge number of null values, which I have dropped keeping a particular threshold limit in mind.

    My problem is: after imputing statistical values in place of NaN, how do I develop the model? I am unsure how to start coding the algorithm. Is there a particular method to follow? Also, I have now reduced the number of columns from 206 to 36; is that alright?
     
    #32
  33. VIGNESHBABU RANGARAJ

    Joined:
    May 31, 2019
    Messages:
    2
    Likes Received:
    0
    Hello Anand

    I am from the DS Python batch (1st Feb - 7th March) which is going on now, and I am joining this community forum for the first time. Could you please confirm whether I am at the right place, or do I need to join somewhere else for queries related to the topics you have covered?
    Thanks for your time.
     
    #33
  34. Shwetha Nayak

    Shwetha Nayak Member

    Joined:
    Jan 11, 2020
    Messages:
    3
    Likes Received:
    0
    Hi,
    I am working on the MovieLens project for submission, and in the last lap I am getting an error in model creation.
    Initially I got "ValueError: could not convert string to float", which I was able to resolve by using LabelEncoder.

    Now I am not able to identify the below error. Any help is appreciated.

    ValueError: y contains previously unseen labels: '3254'
     
    #34
  35. VIGNESHBABU RANGARAJ

    Joined:
    May 31, 2019
    Messages:
    2
    Likes Received:
    0
    ==================================

    Hi, am I on the right forum?
     
    #35
  36. intergalacticfare

    intergalacticfare New Member

    Joined:
    Dec 23, 2016
    Messages:
    1
    Likes Received:
    0

    Hi,
    Yes, this is the Forum for members of the Python Class Batch Feb 01 - Mar 07 led by Anand S.
     
    #36
  37. Swetha_62

    Swetha_62 Member

    Joined:
    Jan 23, 2020
    Messages:
    2
    Likes Received:
    0

    That's right, you are on the right forum.
     
    #37
  38. Chirag Shah_5

    Chirag Shah_5 Member

    Joined:
    Jan 27, 2020
    Messages:
    8
    Likes Received:
    0
    Hi Anand, the below question is related to Project 1. I'm wondering how we can do hypothesis testing for two categorical variables that have multiple categories each.

    Are the type of complaint or service requested and location related?
     
    #38
  39. Shwetha Nayak

    Shwetha Nayak Member

    Joined:
    Jan 11, 2020
    Messages:
    3
    Likes Received:
    0
    Hello,
    I am working on the MovieLens project for submission and have coded it using different algorithms. However, I see the busy symbol when I run the code and have been waiting for results for a long time without getting any.

    Attaching the screenshot for your reference.
    I got the logistic regression result, but not the others. I have highlighted the busy symbol in the screenshot.

    Any suggestions?

    Regards,
    Shwetha Nayak
     

    Attached Files:

    #39
  40. Pankaj Kumar Mourya

    Joined:
    Jan 27, 2020
    Messages:
    2
    Likes Received:
    0
    Hi Anand,

    Please suggest how to test the relationship between two categorical columns.

    Thanks and regards
    Pankaj Kr. Mourya
     
    #40
  41. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    126
    Likes Received:
    12

    Puneet,
    The moment you say the ML team is applying filters, I am forced to think that they are not algorithmic (data-driven, pattern-recognition) filters, but mostly business rules.

    It is because of this that some of the cases you mentioned are escaping the filters.

    Here is what I would do if I were you:

    1. Work with the ML team to understand how they are applying the filters: is it a purely learning model or is it a hybrid?
    2. If it is a learning model, the only way to handle your case is to rebuild and retrain the model on your data so that it handles false positives better.
    3. If it is a hybrid model, then you have a chance to understand what rules have gone wrong, and so on.

    It is very difficult to assess this without actually looking at the data and the filters.
     
    #41
  42. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    126
    Likes Received:
    12


    Hi,
    What we are trying to establish with this question is whether the average response time is the same or different for every complaint type.

    For this, you first need to group by complaint type and get the average response time for each complaint type; it is a simple group-by operation. Create a column or a dataframe with the complaint types and the average response time per complaint type.

    Then you can approach it in the following ways:

    1) 1-sample t-test: take the mean of the average response times and check whether the values in the average response time column are consistent with that mean. Your null hypothesis should be what the existing values are; the alternate hypothesis is what should be proven.

    2) 2-sample t-test: take the average response times of 2 complaint types separately and do a 2-sample t-test. This can help you show that the response times are not the same.

    Please follow the notebooks in the hypothesis example.
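
    A minimal sketch of the group-by plus t-test flow above, on a tiny synthetic stand-in for the 311 data (the column names follow this thread; the values, the complaint-type labels and the exact test framing are assumptions):

    import numpy as np
    import pandas as pd
    from scipy import stats

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        'Complaint Type': rng.choice(['Blocked Driveway', 'Illegal Parking', 'Noise'], size=300),
        'Request_Closing_In_Hr': rng.exponential(scale=4.0, size=300),
    })

    # step 1: average closing time (hours) per complaint type, a simple group-by
    avg_by_type = df.groupby('Complaint Type')['Request_Closing_In_Hr'].mean()
    print(avg_by_type.sort_values())

    # 1-sample t-test: does one complaint type's closing time differ from the overall mean?
    overall_mean = df['Request_Closing_In_Hr'].mean()
    one_type = df.loc[df['Complaint Type'] == 'Blocked Driveway', 'Request_Closing_In_Hr']
    print(stats.ttest_1samp(one_type, popmean=overall_mean))

    # 2-sample t-test: do two complaint types have the same average closing time?
    other_type = df.loc[df['Complaint Type'] == 'Illegal Parking', 'Request_Closing_In_Hr']
    print(stats.ttest_ind(one_type, other_type, equal_var=False))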
     
    #42
  43. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    126
    Likes Received:
    12
    I think it is one of the ways; I have responded to another post as well.
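
    For the related question about whether two categorical columns (e.g. complaint type and location) are associated, one standard option (not necessarily the one intended in class) is a chi-square test of independence on their contingency table; a minimal sketch with made-up data:

    import pandas as pd
    from scipy import stats

    # tiny made-up example: complaint type vs. city
    df = pd.DataFrame({
        'Complaint Type': ['Noise', 'Noise', 'Illegal Parking', 'Noise', 'Illegal Parking', 'Noise'],
        'City': ['BROOKLYN', 'QUEENS', 'BROOKLYN', 'BROOKLYN', 'QUEENS', 'QUEENS'],
    })

    # contingency table of counts, then the chi-square test of independence
    table = pd.crosstab(df['Complaint Type'], df['City'])
    chi2, p_value, dof, expected = stats.chi2_contingency(table)
    print(table)
    print(chi2, p_value, dof)   # H0: the two columns are independent; reject if p_value < 0.05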
     
    #43
  44. anand.s.subramaniam

    anand.s.subramaniam Well-Known Member
    Alumni

    Joined:
    Mar 28, 2018
    Messages:
    126
    Likes Received:
    12
    Somanath,

    1. First group the requests by location.
    2. Then order them based on the average response time per location.

    The output should be an ordered list of average response times per location.
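
    A minimal sketch of that grouping and ordering, assuming df is the cleaned 311 dataframe with 'City', 'Complaint Type' and 'Request_Closing_In_Hr' (hours) columns:

    # average closing time per complaint type within each location
    avg_closing = (df.groupby(['City', 'Complaint Type'])['Request_Closing_In_Hr']
                     .mean()
                     .reset_index())

    # order complaint types by average closing time within each location
    ordered = avg_closing.sort_values(['City', 'Request_Closing_In_Hr'])
    print(ordered.head(20))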
     
    #44
  45. Kalva Mounika

    Kalva Mounika New Member

    Joined:
    Jan 29, 2020
    Messages:
    1
    Likes Received:
    0
    Hi Anand,

    Could you please upload the hypothesis notebook.
     
    #45
  46. Chirag Shah_5

    Chirag Shah_5 Member

    Joined:
    Jan 27, 2020
    Messages:
    8
    Likes Received:
    0
    Hi Anand,

    I'm working on Project 4. There is a column containing the description of the complaint, from which we need to identify the complaint type. It looks like we need to use NLP here, though I couldn't figure out how to do that. Is it something we're going to learn in a pending class?
     
    #46
  47. Prakash Meghani

    Joined:
    Jul 19, 2019
    Messages:
    13
    Likes Received:
    0
    Hi Anand,
    Following your suggestion below:
    1) 1-sample t-test: take the mean of the average response times and check whether every value in the average response time column is the same as that mean. Your null hypothesis should be what the existing values are;
    the alternate hypothesis should be proven.

    here is my data -
    df_nyc_grouped_by_complain_no_index['Request_Closing_In_Hr']

    0 5.258333
    1 5.213240
    2 336.830000
    3 3.766486
    4 4.740904
    5 7.364152
    6 3.558846
    7 3.861859
    9 7.151062
    10 4.365609
    11 2.761190
    12 4.501152
    13 3.147167
    14 3.193283
    15 3.410721
    16 3.445222
    17 3.588984
    18 4.372820
    19 1.975941
    20 4.047500
    21 3.448645
    22 3.626503
    23 4.013897
    Name: Request_Closing_In_Hr, dtype: float64

    But when running one sample T test -
    stats.ttest_1samp(df_nyc_grouped_by_complain_no_index['Request_Closing_In_Hr'], 0)

    getting result -
    Ttest_1sampResult(statistic=1.2851609200530003, pvalue=0.21210175324268477)

    According to this p-value the null hypothesis is accepted, which is wrong.
    My hypotheses are:
    # null hypothesis H0: the mean of the average closing times is the same as every complaint type's value
    # alternate hypothesis Ha: the mean of the average closing times is not the same as every complaint type's value

    Please help
     
    #47
  48. Prakash Meghani

    Joined:
    Jul 19, 2019
    Messages:
    13
    Likes Received:
    0
    Hi Anand,
    Please read the above post; here is another one, about the following question:
    · Are the type of complaint or service requested and location related?
    For Somanath's post on this question you replied:
    2. 2 sample t test- take the average response times of 2 complaint types separately and do a 2 sample t - test. this can help you prove that the the response times are not same.

    I think this question is not asking whether the response times are the same or not; it is asking whether the type of complaint is related to the location. I think they mean: if complaints x, y, z occur for city 1 and the same complaints x, y, z occur for city 2, then the complaints are city-related, but if the complaints for city 2 are different then they are not the same.
    Can you please read their question thoroughly and confirm what they are asking?
    thanks
     
    #48
  49. Prakash Meghani

    Joined:
    Jul 19, 2019
    Messages:
    13
    Likes Received:
    0
    No one is replying to any posts in the community.
     
    #49
  50. Stefanny Ekawati Gunawan

    Alumni

    Joined:
    Jan 22, 2020
    Messages:
    2
    Likes Received:
    0
    Hi Anand,

    I am working on Project 5, and it asks which store(s) had a good quarterly growth rate in Q3 2012.

    ## converting column Date to Datetime format
    df["Date2"] = pd.to_datetime(df["Date"].replace("-", "/"),format='%d-%m-%Y')
    # find quarter number for each date
    df['Year'] = pd.DatetimeIndex(df['Date2']).year
    df['Quarter'] = pd.DatetimeIndex(df['Date2']).quarter
    #Find summary sales for Quarter 3 and group by Store, Year, and Quarter
    df2 = df[df['Quarter'] == 3].groupby(['Store','Year','Quarter'])['Weekly_Sales'].sum()

    After that I want to do a quarter comparison between years using this syntax:
    for i in range(1, len(df2)):
        df2['Growth_Rate'].append((df2 - df2[i-1]) * 100.0 / df2[i-1])
    print(df2['Growth_Rate'])

    And it gives me the error:
    TypeError: 'str' object cannot be interpreted as an integer
    and shows that the column 'Growth_Rate' is what causes the error.

    What should i do?

    Thanks.
     
    #50
