Machine Learning- Aayushi | May 19,20,26,27 Jun 2,3,9,10,16

Discussion in 'Big Data and Analytics' started by Shalini Rana, May 18, 2018.

  1. Shalini Rana

    Shalini Rana Well-Known Member
    Simplilearn Support

    Joined:
    Jul 24, 2017
    Messages:
    212
    Likes Received:
    12
    Hi Learners,

    Welcome to Simplilearn!

    Please use this thread for posting all your queries for Machine Learning batch May 19,20,26,27 Jun 2,3,9,10,16 conducted by trainer Aayushi.

    Thanks!
     
    #1
  2. Ujjwal_16

    Ujjwal_16 New Member

    Joined:
    Apr 30, 2018
    Messages:
    1
    Likes Received:
    0
    Hi,
    I am not able to get started with my project.
    Could you give me a step-by-step procedure for it?
     
    #2
  3. _29131

    _29131 New Member

    Joined:
    Apr 13, 2018
    Messages:
    1
    Likes Received:
    0
    Hello Aayushi,

    Please send the link for the Day 2 documents that were taught on 20th May 2018.
     
    #3
  4. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    #4
  5. ATHUL B KAMMATH

    Alumni

    Joined:
    Dec 31, 2017
    Messages:
    6
    Likes Received:
    0
    Hi,

    Ma'am, I didn't get the documents and notes for 26/5 (Day 3).
    Kindly put them in the drive.

    ATHUL B KAMMATH
     
    #5
  6. Priyanka_Mehta

    Priyanka_Mehta Well-Known Member
    Simplilearn Support

    Joined:
    May 25, 2017
    Messages:
    656
    Likes Received:
    46
    #6
  7. _30970

    _30970 New Member

    Joined:
    May 1, 2018
    Messages:
    1
    Likes Received:
    0
    Can I remove the data frequencies that are greater than 500000, since they show irregular behaviour?
     

    Attached Files:

    #7
  8. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    Yes, you can.
     
    #8
  9. _30785

    _30785 Member

    Joined:
    Apr 30, 2018
    Messages:
    7
    Likes Received:
    0
    Ma'am, I am not able to access the drive.
    Please help.
     
    #9
  10. _25696

    _25696 Member

    Joined:
    Mar 12, 2018
    Messages:
    2
    Likes Received:
    0
    I have the same problem accessing the drive. I'm getting a popup asking me to request access to the drive.
     
    #10
  11. Priyanka_Mehta

    Priyanka_Mehta Well-Known Member
    Simplilearn Support

    Joined:
    May 25, 2017
    Messages:
    656
    Likes Received:
    46
    Hi, kindly try again; it will be accessible.

    I hope this will help you.
     
    #11
  12. _25696

    _25696 Member

    Joined:
    Mar 12, 2018
    Messages:
    2
    Likes Received:
    0
    #12
  13. Rakesh Biswas

    Rakesh Biswas Member
    Alumni

    Joined:
    May 14, 2018
    Messages:
    10
    Likes Received:
    0
    from sklearn.preprocessing import LabelEncoder
    X_labelencoder=LabelEncoder()
    X[:,0]=X_labelencoder.fit_transform(X[:,0])



    ---------------------------------------------------------------------------
    IndexError Traceback (most recent call last)
    <ipython-input-57-63c4e56f2900> in <module>()
    1 from sklearn.preprocessing import LabelEncoder
    2 X_labelencoder=LabelEncoder()
    ----> 3 X[:,0]=X_labelencoder.fit_transform(X[:,0])

    IndexError: too many indices for array

    How can I solve this?
     
    #13
  14. K Manoj

    K Manoj Moderator
    Staff Member Simplilearn Support

    Joined:
    Aug 4, 2017
    Messages:
    196
    Likes Received:
    18
    This error means that X has fewer dimensions than you are indexing, which usually happens when the data is incorrectly formatted.

    What is X? Is it a DataFrame or a NumPy array?
    Print X and check.

    Check that all your columns are correctly separated by tabs or another consistent separator.
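
    A minimal sketch of the check suggested above, assuming X is a NumPy array (the thread does not say what X actually is). The sample values are hypothetical. It reproduces the error with a 1-D array and shows that the same slice works once the array is 2-D.

```python
import numpy as np

# Hypothetical data: a single categorical column stored as a 1-D array.
X_1d = np.array(["NEAR BAY", "INLAND", "NEAR BAY"], dtype=object)

# Indexing a 1-D array with two indices raises the IndexError from the post.
message = ""
try:
    X_1d[:, 0]
except IndexError as err:
    message = str(err)

# Inspect the type, number of dimensions and shape before slicing.
print(type(X_1d), X_1d.ndim, X_1d.shape)

# Once the data is 2-D (rows x columns), X[:, 0] selects the first column.
X_2d = X_1d.reshape(-1, 1)
first_column = X_2d[:, 0]
print(message, first_column.shape)
```

    If X turns out to be a pandas DataFrame instead, positional slicing would go through X.iloc[:, 0] rather than X[:, 0].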
     
    #14
  15. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    I have uploaded Day1 & Day2 again. The drive is updated with all the content. Please check at your end.
     
    #15
  16. _30785

    _30785 Member

    Joined:
    Apr 30, 2018
    Messages:
    7
    Likes Received:
    0
    Ma'am, can you please explain the 6th point of the California housing project?

    6) Limit the data stratum for median income values-
    >>1. To create limited and workable stratums or intervals of median income, derive income_cat feature from median income and within this, mark the ones that are above category 5.
    >>2. This is to eliminate the long tail but limited data at the end of the income_cat scale.
     
    #16
  17. Rakesh Biswas

    Rakesh Biswas Member
    Alumni

    Joined:
    May 14, 2018
    Messages:
    10
    Likes Received:
    0
    from sklearn.preprocessing import Imputer
    missingvalueimputer=Imputer(missing_values='nan',strategy='median',axis=0)
    X[:,:4]=missingvalueimputer.fit_transform(X[:,:4])
    I got this type of error:


    TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

    How can I solve this?
     
    #17
  18. Rakesh Biswas

    Rakesh Biswas Member
    Alumni

    Joined:
    May 14, 2018
    Messages:
    10
    Likes Received:
    0
    from sklearn.preprocessing import Imputer
    missingvalueimputer=Imputer(missing_values='nan',strategy='median',axis=0)
    X[:,4]=missingvalueimputer.fit_transform(X[:,4])



    ValueError: Expected 2D array, got 1D array instead:
    array=[ 129. 1106. 190. ... 485. 409. 616.].
    Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

    How can I solve this?
     
    #18
  19. Rakesh Biswas

    Rakesh Biswas Member
    Alumni

    Joined:
    May 14, 2018
    Messages:
    10
    Likes Received:
    0
    Screenshot (27).png
    How can I solve this problem?
     
    #19
  20. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4

    Hi Rakesh,

    For missing value imputation, I would recommend using the pandas fillna() function, which we discussed in class.
    The syntax is as follows:

    df['column_name'] = df['column_name'].fillna(df['column_name'].mean())

    Let me know if you are still facing an issue after using this approach.
    Hope it helps!
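
    A minimal runnable sketch of the fillna() approach, assuming the data is in a pandas DataFrame. The column name total_bedrooms and the sample values are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 190.0, 485.0]})

# mean() skips NaN by default, so the fill value is the mean of the known entries.
fill_value = df["total_bedrooms"].mean()
df["total_bedrooms"] = df["total_bedrooms"].fillna(fill_value)

print(df)
```

    Swapping mean() for median() gives the median strategy that the earlier Imputer attempts were aiming for.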
     
    #20
  21. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    Hi,
    I couldn't understand your question. Can you please elaborate?
    Are you referring to the bonus exercise (point 9 of the document)?
     
    #21
  22. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    Rakesh,

    I would recommend following the way we performed label encoding in class: select the categorical columns and apply the encoder to them in a for loop.
    Hopefully that will avoid your error.
     
    #22
  23. Vicky Wu

    Vicky Wu Customer
    Customer

    Joined:
    Jan 10, 2018
    Messages:
    1
    Likes Received:
    0
    Hi Aayushi,

    I also have the same question for point 6 on the Housing Project--"Limit the data stratum for median income values". Could you clarify what stratums we are supposed to create from the median income?
     
    #23
  24. Rakesh Biswas

    Rakesh Biswas Member
    Alumni

    Joined:
    May 14, 2018
    Messages:
    10
    Likes Received:
    0
    Screenshot (29).png

    How can I solve this error?
     
    #24
  25. _30785

    _30785 Member

    Joined:
    Apr 30, 2018
    Messages:
    7
    Likes Received:
    0
    No, it is from the final project we have to submit: the California housing project.
    I couldn't understand what "data stratum for median income values" refers to here.
    So can you please explain both sub-points of the 6th point?
     
    #25
  26. Aryan_Singh_97

    Joined:
    Jun 7, 2018
    Messages:
    5
    Likes Received:
    0
    Hi,
    I am not clear about the following step in the California housing project:


    step 6:

    Limit the data stratum for median income values
    1. To create limited and workable stratums or intervals of median income, derive income_cat feature from median income and within this, mark the ones that are above category 5.
    2. This is to eliminate the long tail but limited data at the end of the income_cat scale.
     
    #26
  27. Rakesh Biswas

    Rakesh Biswas Member
    Alumni

    Joined:
    May 14, 2018
    Messages:
    10
    Likes Received:
    0
    I meant to ask: how can I solve "Expected 2D array, got 1D array instead"?
     
    #27
  28. _26856

    _26856 Member

    Joined:
    Mar 22, 2018
    Messages:
    2
    Likes Received:
    1
    Hi Aayushi,
    The final project that you showed in class 4 is different from the one we have in the "Projects for Submission" folder.
    The old project is very different from the new, updated one. Please confirm: is it the new one that we have to submit on the final day?
     
    #28
  29. Abhijit Ghosh_1

    Alumni

    Joined:
    May 10, 2016
    Messages:
    6
    Likes Received:
    0
    Hi Aayushi,

    1. I have completed the missing value imputation and divided the dataset into X and Y. Before further dividing the dataset into train and test, I am a bit confused about whether I need to apply encoding to the column "Ocean Proximity".
    Since there is only 1 string column, I am planning to apply One Hot Encoding. Will that be okay?
    2. After performing standardization with StandardScaler, do we need to apply the PCA algorithm? (Not sure if this is a valid question.)
    3. Once encoding and PCA are applied, do I need to apply all the regression algorithms and find out which fits best by MSE and plotting?

    Thanks in advance.
     
    #29
  30. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    As the error message suggests, try reshaping your data.
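
    A small sketch of what that reshaping looks like, assuming X is a NumPy array. The 4x5 matrix here is a hypothetical stand-in for the real data.

```python
import numpy as np

# Hypothetical feature matrix: 4 rows, 5 columns.
X = np.arange(20, dtype=float).reshape(4, 5)

column_1d = X[:, 4]                      # shape (4,): triggers "Expected 2D array"
column_2d = X[:, 4].reshape(-1, 1)       # shape (4, 1): one feature, many samples
column_slice = X[:, 4:5]                 # shape (4, 1): slicing keeps the array 2-D
single_sample = X[0].reshape(1, -1)      # shape (1, 5): one sample, many features

print(column_1d.shape, column_2d.shape, column_slice.shape, single_sample.shape)
```

    A transformer that returns an (n, 1) array can be written back into a single column after flattening it with ravel().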
     
    #30
  31. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    Hi Abhijit,

    1. Yes, you need to apply encoding, either label encoding or one-hot encoding.
    2. No, there is no need to apply PCA, as the number of columns is not large.
    3. Yes, you need to apply all the regression algorithms, find the best one (the one with the minimum MSE), and do the plots as discussed in class.
    4. Do tune the hyper-parameters as well.

    Hope it helps!
    Thank you
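
    One way to sketch the "try several regressors and keep the one with the lowest MSE" step, assuming scikit-learn is installed. The data comes from make_regression as a synthetic stand-in for the prepared housing features, and the three models and their settings are illustrative choices, not the ones used in class.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical; replace with the prepared housing features).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "linear": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model and record its MSE on the held-out test set.
test_mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test))

# The model with the smallest test MSE is the candidate to keep.
best_model = min(test_mse, key=test_mse.get)
print(test_mse, best_model)
```

    For the hyper-parameter tuning in point 4, scikit-learn's GridSearchCV is one option that could wrap any of these estimators.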
     
    #31
  32. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    Everyone,

    Explanation of point 6
    6) Limit the data stratum for median income values-
    >>1. To create limited and workable stratums or intervals of median income, derive income_cat feature from median income and within this, mark the ones that are above category 5.
    >>2. This is to eliminate the long tail but limited data at the end of the income_cat scale.

    Basically, you have to convert the numerical column values into intervals such as 1-2, 2-3, 3-4
    or 1-3, 3-5, 5-7, or any other scheme, and the new derived column will be known as income_cat.
    This kind of binning is a form of feature engineering.
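
    A minimal sketch of deriving income_cat, assuming a pandas DataFrame with a median_income column. The interval width of 1.5 and the sample values are assumptions for illustration, and "mark the ones that are above category 5" is read here as merging every higher category into category 5.

```python
import numpy as np
import pandas as pd

# Hypothetical median income values.
housing = pd.DataFrame({"median_income": [0.9, 2.4, 3.8, 5.1, 7.3, 15.0]})

# Divide by an assumed interval width and round up to get discrete categories.
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)

# Merge the sparse long tail: everything above category 5 becomes category 5.
housing["income_cat"] = housing["income_cat"].clip(upper=5).astype(int)

print(housing)
```

    pd.cut() with explicit bin edges is another way to build the same kind of column when the intervals are not all the same width.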
     
    #32
  33. _30394

    _30394 Member
    Alumni

    Joined:
    Apr 26, 2018
    Messages:
    3
    Likes Received:
    0
    Hello Aayushi,

    1. I was trying to apply OLS on the housing dataset. On the first attempt I got an R-squared value of 0.152 (details are in the attachment upload_2018-6-8), and after dropping a couple of columns based on that result, I got an R-squared value of 1 (details are in the attachment afterOLS). The R-squared value has increased, but in both cases the standard error column has very high values for all the Xs, and the confidence intervals are all in exponential notation. Could you please look at it and let me know whether I am going in the right direction?

    2. Can we compute MSE for the LREG object that we create for OLS? In the example that you showed last weekend, I didn't see MSE calculated for the OLS object.

    3. Is the explanation for point 6 (which you posted for everyone) a new topic that you will explain in the coming weekend, or is it something that you have already explained but I missed? If you have already explained it, can you please give me some hints on how to approach this problem?
     

    Attached Files:

    #33
  34. Rakesh Biswas

    Rakesh Biswas Member
    Alumni

    Joined:
    May 14, 2018
    Messages:
    10
    Likes Received:
    0
  35. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    Hi Rakesh,

    It would be tough for me to resolve the error without checking the input to the encoder.
    So, I would suggest using the pandas get_dummies() function to perform the same task.

    The syntax is as follows:
    pd.get_dummies(data=your_dataframe, columns=['column_name1', 'column_name2'])

    Hope it helps!
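
    A short runnable sketch of the get_dummies() call, assuming a pandas DataFrame. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one string column and one numeric column.
df = pd.DataFrame(
    {
        "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
        "median_income": [2.5, 3.1, 4.0],
    }
)

# Each distinct value of the listed column becomes its own indicator column;
# the original string column is dropped from the result.
encoded = pd.get_dummies(data=df, columns=["ocean_proximity"])

print(encoded.columns.tolist())
```

    Columns not listed in the columns argument, such as median_income here, pass through unchanged.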
     
    #35
  36. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    Hi,

    Please find the answers below:

    1. Your model is good if the R2 value is increasing, even if the std values are higher.
    2. We haven't computed MSE for OLS because we only used the fit() function. If you want to compute MSE for it, that can also be done.
    You first need to apply the predict() function after the fit() function of the statsmodels OLS model to get predicted values. Then the MSE can be computed.
    Both sklearn regression and statsmodels OLS give similar results; however, their built-in algorithms are different.
    3. Ignore it.

    Hope it helps!
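
    A sketch of the "fit, predict, then compute MSE" sequence from point 2. To stay dependency-light it uses numpy.linalg.lstsq as a stand-in for the statsmodels OLS fit, so this is not the class code; the data is synthetic and every variable name is hypothetical.

```python
import numpy as np

# Synthetic data: y depends linearly on two features plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Add an intercept column, then solve the ordinary least squares problem.
X_design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# "Predict" step: apply the fitted coefficients to the inputs.
y_pred = X_design @ coef

# MSE is the mean of the squared residuals.
mse = np.mean((y - y_pred) ** 2)
print(coef, mse)
```

    With statsmodels, the prediction step would be the predict() call on the fitted results object, and the MSE line stays the same.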
     
    #36
  37. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    Everyone,

    FEATURE ENGINEERING TECHNIQUES

    LINK: https://www.quora.com/What-are-some-best-practices-in-Feature-Engineering

    Another link: http://www.feat.engineering/

    Reference Book: http://shop.oreilly.com/product/0636920049081.do

    For more details: (Good to have knowledge)
    Refer below:

    Feature selection and feature extraction play a vital role in creating an effective predictive model.

    In FEATURE SELECTION we try to find the best subset of the input feature set.

    Feature Selection Algorithms:
    1. Filter methods
    2. Wrapper methods
    3. Embedded methods

    In FEATURE EXTRACTION we create new features based on transformation or combination of the original feature set.

    Feature extraction algorithms:
    1. Color histogram
    2. FAST (Features from Accelerated Segment Test)
    3. SIFT (Scale Invariant Feature Transform)
    4. PCA-SIFT (Principal Component Analysis-SIFT)
    5. F-SIFT (fast-SIFT)
    6. SURF (Speeded Up Robust Features)

    Example:
    Consider a+b+c+d=e, where a, b, c, d are features.

    If we define ab=a+b, then we can write ab+c+d=e. This is called feature extraction.

    If c=0, then there is no use keeping c in our equation, which means we can drop c and select only the required features. Then ab+d=e, which is called feature selection.
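
    The a+b+c+d=e example can be sketched in a few lines, assuming the features are NumPy arrays. The numbers are hypothetical, and "drop columns that are identically zero" is only the simplest possible selection rule.

```python
import numpy as np

# Hypothetical feature columns; c is all zeros, so it carries no information.
a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, 1.5, 2.5])
c = np.zeros(3)
d = np.array([4.0, 5.0, 6.0])
e = a + b + c + d

# Feature extraction: build a new feature from existing ones.
ab = a + b
features = {"ab": ab, "c": c, "d": d}

# Feature selection: keep only the columns that are not identically zero.
selected = {name: col for name, col in features.items() if np.any(col != 0)}

# The reduced feature set still reproduces e.
reconstructed = sum(selected.values())
print(sorted(selected), np.allclose(reconstructed, e))
```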

    Top reasons to use feature selection and feature extraction:
    1. It enables the machine learning algorithm to train faster.
    2. It reduces the complexity of a model and makes it easier to interpret.
    3. It improves the accuracy of a model if the right subset is chosen.
    4. It reduces over-fitting.
     
    #37
  38. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    IMPACT Of False Positives and False Negatives (Real Life Applications)

    Case1:

    In disease diagnosis, we consider False Negatives the worse error: giving medication to a patient who does not have cancer is less harmful than not treating a patient who does.

    Or consider a health prediction case, where you want to diagnose breast cancer based on your patient's mammogram. Imagine that detecting cancer will trigger further analysis (you will not immediately treat your patient), whereas if you don't detect cancer, you send your patient home with a big smile on your face, telling her "Everything is fine, see you in 5 years!".
    This case is thus asymmetric, since you definitely want to avoid sending home a sick patient (False Negative). You can, however, make the patient wait a little longer (and worry more) by asking her to take more tests even if she does not have cancer (False Positive).
    In that situation, you would prefer False Positives to False Negatives.

    Case2:

    Consider a movie recommendation engine that tries to predict whether you will watch Sacha Baron Cohen's movies. In that case, a False Positive (that is to say, wrongly predicting that the user is a big fan of these movies) strongly damages the perceived value of your recommendations.
    In that situation, you would prefer False Negatives to False Positives.

    Case3:

    In the spam filter example, where spam is the positive class, False Negatives are much worse than False Positives: you may click on an email thinking it is not spam and end up losing a huge chunk of money for nothing.

    Case4:

    Consider a model that shortlists resumes for a job interview at a company. Assuming you get more promising candidates than the number of positions you want to fill, a false negative amounts to rejecting a good candidate, which is not much of an issue, given that you will get other such candidates. However, a false positive means you shortlist someone who is not good enough, which will waste company resources in the interview process.

    Therefore there is no intrinsic hierarchy between False Positives and False Negatives. You absolutely need to consider their impacts on your specific problem in order to make a smart trade-off.
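
    To make the trade-off concrete, here is a small pure-Python sketch that counts the four outcomes and derives precision and recall. The labels are made up, and confusion_counts is a hypothetical helper written for this illustration, not a library function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, fn, tn


# Made-up labels: 1 is the positive class (for example "has the disease").
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

# Precision drops with false positives; recall drops with false negatives.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn, precision, recall)
```

    When False Negatives are the costlier error (Cases 1 and 3), recall is the number to watch; when False Positives are costlier (Cases 2 and 4), precision is.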
     
    #38
  39. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    For K-Fold and SVR regression, refer to the Google Drive.
    I will post the material there by EOD.
     
    #39
  40. _30785

    _30785 Member

    Joined:
    Apr 30, 2018
    Messages:
    7
    Likes Received:
    0
    Ma'am, can you please explain this error and how to solve it?
    Capture.JPG
     
    #40
  41. _30394

    _30394 Member
    Alumni

    Joined:
    Apr 26, 2018
    Messages:
    3
    Likes Received:
    0
    Hello Aayushi,

    I was trying to solve the classification problem available in the project folder (Phishing detector with LR). I have the below doubts in that problem statement:
    1. The dataset is a text file. I am able to read it, but I am not able to find the column headings. Is there some other file that we need to read for the headers, or do we need to assign the headers ourselves?
    2. In exercise 1, could you please explain the meaning of "Printing count of misclassified samples"?
    3. In exercise 2:
    a) Which columns are they referring to by Prefix_Suffix and URL_OF_Anchor?
    b) I didn't understand exactly what they meant in the plotting part. Can you please explain?

    Thanks,
    Sanghamitra
     
    #41
  42. Abhijit Ghosh_1

    Alumni

    Joined:
    May 10, 2016
    Messages:
    6
    Likes Received:
    0
    Hi Aayushi,

    Could you please share all possible interview questions (FAQs, tricky ones, etc.) with us, maybe in your drive or here? That would really help me prepare for Machine Learning interviews and know where I stand with respect to my learning.

    Thanks in advance,

    Abhijit
     
    #42
  43. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    I can't see the exact error message, which appears at the end of the traceback.
    Please share it.
     
    #43
  44. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    Hi Abhijit,

    I will share that in the coming class.
     
    #44
  45. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    #45
  46. Aryan_Singh_97

    Joined:
    Jun 7, 2018
    Messages:
    5
    Likes Received:
    0
    Hi Aayushi ma'am,
    I have run the Linear Regression model for the California Housing project and hit the following problem: the train and test accuracy are both coming out as 0.00%. I cannot figure out what is going wrong.
    Also, I got high MSE values (with Test > Train). I re-ran the model after dropping the less-related columns, but the MSE is still higher for Test than for Train. So is the model good?

    -Aryan
     

    Attached Files:

    #46
  47. _30785

    _30785 Member

    Joined:
    Apr 30, 2018
    Messages:
    7
    Likes Received:
    0
    I am getting an error in standardization. Please tell me how to solve it.
    Capture 2.JPG
     
    #47
  48. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    Hi,

    Make sure the columns on which you are applying standardization are numeric only.
    Before applying it, do the following:
    df['column_name'] = df['column_name'].astype(float)
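
    A minimal sketch of that conversion followed by standardization, assuming a pandas DataFrame. The column name and values are hypothetical, and the z-score is written out by hand here instead of calling StandardScaler.

```python
import pandas as pd

# Hypothetical column that was read in as text instead of numbers.
df = pd.DataFrame({"total_rooms": ["1", "2", "3"]})

# Convert the strings to floats before any numeric transformation.
df["total_rooms"] = df["total_rooms"].astype(float)

# Manual z-score; ddof=0 gives the population standard deviation.
mean = df["total_rooms"].mean()
std = df["total_rooms"].std(ddof=0)
df["total_rooms_scaled"] = (df["total_rooms"] - mean) / std

print(df.dtypes)
print(df)
```

    If a column contains values that cannot be parsed as numbers, astype(float) raises an error; pd.to_numeric(..., errors="coerce") turns such entries into NaN instead.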
     
    #48
  49. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    Hi Aryan,

    You are getting strange results, so we need to troubleshoot:
    1) Firstly, don't apply astype(int) to the prediction results.
    2) Check the R2 value using statsmodels OLS.
    3) We prefer a lower RMSE on the test data; however, if the difference from the train RMSE is small, it's okay.
    4) You need to check your data cleaning and feature engineering steps once again.
    5) Try applying other models such as DT, RF, etc., and check whether the results improve.

    Hope it helps!
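
    A pure-Python sketch of the regression metrics mentioned above, with made-up house values. It also shows one plausible reason (not confirmed in this thread) for a 0.00% "accuracy": an exact-match score is almost always zero for continuous predictions. The rmse and r2_score functions are hand-written helpers for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up actual and predicted house values.
y_true = [200000.0, 150000.0, 320000.0, 95000.0]
y_pred = [210000.0, 140000.0, 300000.0, 100000.0]

# Exact-match "accuracy" is meaningless for continuous targets.
exact_match = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(exact_match, rmse(y_true, y_pred), r2_score(y_true, y_pred))
```

    Computing the same two metrics on the train split and on the test split gives the comparison described in point 3.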
     
    #49
  50. Aayushi_6

    Aayushi_6 Active Member

    Joined:
    Sep 19, 2016
    Messages:
    47
    Likes Received:
    4
    Hi Sanghamitra,

    I need to cross check. Will get back to you.
     
    #50
