ML | Amit Jain |

Discussion in 'Big Data and Analytics' started by Vikas Kumar_18, Jun 14, 2019.

1. Vikas Kumar_18 Well-Known Member Simplilearn SupportAlumni

Joined:
Dec 17, 2018
Messages:
172
30
This is the dedicated community thread for discussion with your peers, the TA, and the trainer.

#1
2. Vikas Kumar_18 Well-Known Member Simplilearn SupportAlumni

Joined:
Dec 17, 2018
Messages:
172
30
#2
_3292 and Veronica Leong like this.

Joined:
Nov 2, 2016
Messages:
36
3
#3
4. _3292 TEJAS_PHASE Alumni

Joined:
Nov 2, 2016
Messages:
36
3
Hello Amit Sir,
My question is about correlation. After plotting a heatmap of the correlation between columns (features), on what basis can we decide whether a particular column is useful for prediction? My understanding so far is that if two columns are correlated (either positively or negatively), we choose one of them. But I need more clarification on this.
Do we need to use Pearson correlation for this? I also observed that Pearson correlation is only useful for continuous numeric variables. Is that true?
In the Titanic data problem, 'sex_male' and 'sex_female' are perfectly negatively correlated, but we still chose both of these columns for prediction. So which criterion did we use there?
Thank You.

#4
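A minimal sketch of how the correlation question above could be explored, assuming pandas and a tiny made-up sample rather than the real Titanic file. `DataFrame.corr()` uses Pearson correlation by default; it is defined for numeric columns, and it can still be computed for 0/1 dummy columns. The sketch shows that the two dummies created from 'sex' are perfectly negatively correlated, so one of them carries no extra information and can be dropped with `drop_first=True`. The column names and values below are illustrative assumptions.

```python
import pandas as pd

# Tiny made-up sample standing in for the Titanic data (values are illustrative only).
df = pd.DataFrame({
    "age":      [22, 38, 26, 35, 54, 2, 27, 14],
    "fare":     [7.25, 71.28, 7.92, 53.10, 51.86, 21.07, 11.13, 30.07],
    "sex":      ["male", "female", "female", "female", "male", "male", "female", "female"],
    "survived": [0, 1, 1, 1, 0, 0, 1, 1],
})

# One-hot encode 'sex'; this creates the columns sex_female and sex_male.
encoded = pd.get_dummies(df, columns=["sex"], dtype=int)

# DataFrame.corr() computes Pearson correlation by default.
corr = encoded.corr(method="pearson")
dummy_corr = corr.loc["sex_male", "sex_female"]
print("corr(sex_male, sex_female):", round(dummy_corr, 3))

# The two dummies carry the same information, so one of them can be dropped.
reduced = pd.get_dummies(df, columns=["sex"], drop_first=True, dtype=int)
print(list(reduced.columns))

# Absolute correlation of each remaining feature with the target, strongest first.
target_corr = (
    reduced.corr()["survived"].drop("survived").abs().sort_values(ascending=False)
)
print(target_corr)
```

Correlation between two features points to redundancy, while correlation between a feature and the target hints at usefulness; both are rough guides for linear relationships only.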
5. _3292 TEJAS_PHASE Alumni

Joined:
Nov 2, 2016
Messages:
36
3
Hello Amit Sir,
I am currently working on Titanic Dataset and getting the following Confusion Matrix:
array([[142, 19],
[ 23, 83]], dtype=int64)

So, my question is: how do I interpret this matrix? That is, which values in the matrix are TP, FP, TN and FN?
Are they 142 - TP, 83 - TN, 19 - FN and 23 - FP?
Thank You.

#5
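A short sketch for reading the matrix in the post above, assuming it was produced by scikit-learn's `confusion_matrix(y_true, y_pred)` with labels 0 and 1 and with class 1 treated as the positive class. In that layout rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`. Under that assumption 142 is TN, 19 is FP, 23 is FN and 83 is TP.

```python
import numpy as np

# Matrix from the post. Assumption: it came from
# sklearn.metrics.confusion_matrix(y_true, y_pred) with labels 0 and 1.
cm = np.array([[142, 19],
               [23, 83]])

# Rows are true labels and columns are predicted labels,
# so the flattened order for binary labels is TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy, precision, recall:",
      round(accuracy, 4), round(precision, 4), round(recall, 4))
```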
6. _3292 TEJAS_PHASE Alumni

Joined:
Nov 2, 2016
Messages:
36
3
Hello Sir,
In the Titanic dataset problem, I fitted the model twice: once with the 'Fare' column and once without it. The roc_auc_score is 0.7431094383323683 in the first case (with 'Fare') and 0.83250322278214 in the second case (without 'Fare'). As we discussed in our session, ROC is the ratio of True Positive Rate vs False Positive Rate. So, from these values, can I say that the model predicts less accurately with 'Fare' as a feature and more accurately without it? Or can I say that the model is overfitting when the 'Fare' column is used, and that is why it gets a lower score on the test data? Am I correct, or am I doing something wrong?
Thank You.

#6
Last edited: Jul 5, 2019
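A hedged sketch of one way to compare two feature sets, using synthetic data because the course's Titanic preprocessing is not shown here. Two points it relies on: the ROC curve plots the true positive rate against the false positive rate over all thresholds (it is a curve, not a single ratio), and AUC measures ranking quality rather than accuracy. Overfitting is usually judged from the gap between the train and test scores, not from the test score alone. The `extra` column is a hypothetical stand-in for a column such as 'Fare', and `train_test_auc` is a helper invented for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)   # informative feature
extra = rng.normal(size=n)    # hypothetical stand-in for a column such as 'Fare'
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)


def train_test_auc(X, y):
    """Fit a logistic regression and return (train AUC, test AUC)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y
    )
    model = LogisticRegression().fit(X_tr, y_tr)
    # Score with predicted probabilities of the positive class, not hard 0/1 labels.
    auc_train = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    auc_test = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    return auc_train, auc_test


with_extra = train_test_auc(np.column_stack([signal, extra]), y)
without_extra = train_test_auc(signal.reshape(-1, 1), y)
print("with extra column    (train, test):", with_extra)
print("without extra column (train, test):", without_extra)
```

A large train score combined with a much lower test score would suggest overfitting; similar train and test scores would not.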
7. _3292 TEJAS_PHASE Alumni

Joined:
Nov 2, 2016
Messages:
36
3
Hello Amit Sir,
In the above example, I am getting auc_score = 0.7431 in the first case (with the 'Fare' column) and auc_score = 0.8325 in the second case (without the 'Fare' column). Can I interpret these scores as the percentage of the area under the curve that is covered? I mean, the ideal auc_score is 1, or 100% of the area under the curve. So in the first case I get auc_score = 0.7431, or 74% of the area, and in the second case auc_score = 0.8325, or 83% of the area. Is my interpretation correct?

#7
Last edited: Jul 6, 2019
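A small sketch supporting the area interpretation asked about above, assuming scikit-learn and made-up labels and scores. The ROC curve lives in the unit square (both axes run from 0 to 1), so the area under it is a fraction of a total area of 1. The same number is also the share of positive/negative pairs in which the positive example gets the higher score.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and predicted scores, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9])

# Points of the ROC curve, then the trapezoidal area under them.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area = auc(fpr, tpr)

# roc_auc_score computes the same quantity directly.
direct = roc_auc_score(y_true, y_score)

# Pairwise view: share of (positive, negative) pairs ranked correctly.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean([p > q for p in pos for q in neg])

print(area, direct, pairwise)
```

An AUC of 0.5 corresponds to random ranking, so scores are usually read relative to 0.5 as well as to 1.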
8. _3292 TEJAS_PHASE Alumni

Joined:
Nov 2, 2016
Messages:
36
3
Hello Sir,
I am confused about how we choose the root node in a Decision Tree using entropy and information gain.

#8
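A minimal sketch of the root-node question above, assuming the usual ID3-style rule: compute the information gain of every candidate feature and put the feature with the highest gain at the root. Information gain is the entropy of the labels before the split minus the weighted average entropy of the groups after the split. The toy data and the helper names `entropy` and `information_gain` are made up for illustration.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_gain(feature, labels):
    """Parent entropy minus the weighted entropy of the groups made by `feature`."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    weighted = 0.0
    for value in np.unique(feature):
        mask = feature == value
        weighted += mask.mean() * entropy(labels[mask])
    return entropy(labels) - weighted


# Toy data: 'outlook' separates the classes perfectly, 'windy' not at all.
play = np.array(["yes", "yes", "no", "no"])
outlook = np.array(["sunny", "sunny", "rain", "rain"])
windy = np.array([True, False, True, False])

gains = {
    "outlook": information_gain(outlook, play),
    "windy": information_gain(windy, play),
}
root = max(gains, key=gains.get)  # feature with the highest information gain
print(gains, "-> root:", root)
```

The same calculation is then repeated inside each branch to choose the next split.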
9. _3292 TEJAS_PHASE Alumni

Joined:
Nov 2, 2016
Messages:
36
3
Hello Sir,
What is the use of the parameter 'n_jobs' when creating the object of a particular model?

#9
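A brief sketch of `n_jobs`, assuming a scikit-learn estimator that supports it (here `RandomForestClassifier`) and synthetic data. The parameter controls how many jobs run in parallel: the default `None` normally means one job, and `-1` means use all available CPU cores. It affects how the work is scheduled, not which model is being fitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Default: n_jobs=None, which normally means a single job.
serial = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# n_jobs=-1 asks scikit-learn to build the trees using all available cores.
parallel = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=-1).fit(X, y)

print("serial training accuracy:  ", serial.score(X, y))
print("parallel training accuracy:", parallel.score(X, y))
```

On small data the parallel version may not be faster, because starting the workers has its own overhead.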
10. _3292 TEJAS_PHASE Alumni

Joined:
Nov 2, 2016
Messages:
36
3
Hello Amit Sir,
I am currently working on the income_qualification project and facing some issues. The shape of the dataset is (9557, 143), and it is getting hard to work out which columns to keep and which to drop. As we discussed in our sessions, correlation can be used to decide this. But in this case, how can I find the correlation between such a large number of columns? Is there any other way to do this?

#10
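One possible approach for many columns, sketched on synthetic data: instead of reading a large heatmap, compute the absolute correlation matrix, keep only its upper triangle so each pair is checked once, and list the columns whose correlation with an earlier column exceeds a cut-off. The 0.95 threshold and the column names are assumptions for illustration, not values from the project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
base = rng.normal(size=(200, 3))
df = pd.DataFrame({
    "a": base[:, 0],
    "a_copy": base[:, 0] * 2 + 0.01 * rng.normal(size=200),  # nearly duplicates 'a'
    "b": base[:, 1],
    "c": base[:, 2],
})

threshold = 0.95  # assumed cut-off; tune it for your own data
corr = df.corr().abs()

# Keep only the upper triangle (above the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop every column that is highly correlated with a column to its left.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print("dropped:", to_drop)
print("shape:", df.shape, "->", reduced.shape)
```

The same few lines work unchanged for 143 numeric columns, since nothing depends on the number of features.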
11. _3292 TEJAS_PHASE Alumni

Joined:
Nov 2, 2016
Messages:
36
3
My second question is: how do I check for biases in the dataset? Can you provide a hint for solving this?
Thank You.

#11
12. _3292 TEJAS_PHASE Alumni

Joined:
Nov 2, 2016
Messages:
36
3
Hello Sir,
I am struggling to reduce the feature columns in the income_qualification dataset. Based on our discussion in the classes, this part falls under Feature Engineering. At the moment I can only recall the PCA technique for this, but I don't know how to use it. I have been looking for the video lecture of the class that covers it but can't find it. So, can you tell me in which session we covered this part?
Thank You.

#12
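A minimal sketch of how PCA is typically applied with scikit-learn, using synthetic data in place of the project dataset. The features are standardized first because PCA is sensitive to scale, and passing a float between 0 and 1 as `n_components` keeps just enough components to explain that share of the variance. The 0.95 target is an assumed value for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 5))
# 20 observed columns built as noisy mixtures of 5 hidden factors.
X = latent @ rng.normal(size=(5, 20)) + 0.05 * rng.normal(size=(300, 20))

# Scale first, then keep enough components to explain 95% of the variance.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pipeline.fit_transform(X)

pca = pipeline.named_steps["pca"]
explained = pca.explained_variance_ratio_.sum()
print(X.shape, "->", X_reduced.shape)
print("variance explained:", round(explained, 4))
```

PCA needs purely numeric input without missing values, so categorical columns and NaNs have to be handled before this step.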
13. _3292 TEJAS_PHASE Alumni

Joined:
Nov 2, 2016
Messages:
36
3
Hello Vikas Kumar Sir,
Is this link still active? Should I post further questions here?

#13

Joined:
Nov 2, 2016
Messages:
36
3
#14
15. _3292 TEJAS_PHASE Alumni

Joined:
Nov 2, 2016
Messages:
36
3
Hello Sir,
I have checked all our session recordings for the Feature Engineering discussion, but I didn't find any discussion of it. In session 5 or maybe 6 you said that you would cover this topic after completing 'Supervised Learning'. But I think we forgot to discuss it.
So, can you cover this part in the upcoming session?
Because without Feature Engineering I can't proceed further with my project.
Thanks.

#15
16. sharmistha datta Member

Joined:
Apr 21, 2016
Messages:
3
0
Hi, while discussing Machine Learning techniques, you mentioned "Categorization" under "Machine Learning Techniques". Do you have any code that illustrates this?

#16
17. sharmistha datta Member

Joined:
Apr 21, 2016
Messages:
3
0
I want to read about the model life cycle for Machine Learning. Could you please suggest any study material on it? Thanks.

#17
18. sharmistha datta Member

Joined:
Apr 21, 2016
Messages:
3
0
Hi, I have checked with the Simplilearn team about the project submission deadline. They mentioned that I can submit anytime before my course expiry deadline. As I will not be able to attend tomorrow's session, I will post my doubts here if I have any while working on the project submission.

#18
19. _3292 TEJAS_PHASE Alumni

Joined:
Nov 2, 2016
Messages:
36
3
Hello Amit Sir,
What exactly does "Check for biases in the dataset" mean? Does it mean "Check for outliers in the feature columns"?
Extremely sorry for the trouble of asking questions again and again.
--Tejas

#19
20. Amit_505 Member Trainer

Joined:
Jan 8, 2018
Messages:
3
1
Hi Tejas,

It means doing exploratory data analysis and seeing whether the algorithm's tendency is to consistently learn the wrong thing, either by not taking complete information into account or by taking wrong information (noise) into account. It is part of data pre-processing.

#20
_3292 likes this.
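One concrete check that often forms part of this kind of exploratory analysis, offered as an assumption about what could be inspected and not as the trainer's stated method: look at how the target classes are distributed and how much data is missing. A heavily skewed target or a column with many gaps are both ways a model can end up learning from incomplete information. The column names `Target` and `income` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data; 'Target' and 'income' are hypothetical column names.
df = pd.DataFrame({
    "Target": [4] * 60 + [2] * 20 + [3] * 12 + [1] * 8,
    "income": [1000.0] * 90 + [np.nan] * 10,
})

# Share of rows in each target class.
class_share = df["Target"].value_counts(normalize=True).sort_index()

# Ratio of the most common class to the rarest class.
imbalance_ratio = class_share.max() / class_share.min()

# Share of missing values in each column.
missing_share = df.isna().mean()

print(class_share)
print("imbalance ratio:", imbalance_ratio)
print(missing_share)
```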
21. Amit_505 Member Trainer

Joined:
Jan 8, 2018
Messages:
3
1
As we discussed in this class

#21
22. Amit_505 Member Trainer

Joined:
Jan 8, 2018
Messages:
3
1
This does not look like an error; rather, it is a warning. It essentially means that some of the functions/libraries being used may have been upgraded, so it prompts you to update them. But it is just a warning and ca
Alright Sharmi. Thanks. Wish you the best.

#22
23. _3292 TEJAS_PHASE Alumni

Joined:
Nov 2, 2016
Messages:
36
3
Hello Amit Sir,
In which situations do we use Standardization, and in which do we use Normalization?

#23
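A small sketch contrasting the two, assuming scikit-learn's `StandardScaler` for standardization and `MinMaxScaler` for min-max normalization. Standardization rescales a column to zero mean and unit variance; min-max normalization maps it to the range 0 to 1. As a rough rule of thumb, standardization is a common choice for distance-based or gradient-based models and for PCA, while min-max scaling suits cases that need a bounded range; tree-based models are usually insensitive to either. The single column with an outlier is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up column containing an outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standardized = StandardScaler().fit_transform(X)  # (x - mean) / std
normalized = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min)

print("standardized:", standardized.ravel().round(3))
print("normalized:  ", normalized.ravel().round(3))
```

With min-max scaling the outlier pushes all other values close to 0, which is one reason to inspect the data before choosing a scaler.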
24. _3292 TEJAS_PHASE Alumni

Joined:
Nov 2, 2016
Messages:
36
3
Hello Amit Sir,
In our project 'Income Qualification' there is a column named 'dependency' which is categorical in nature (yes and no). But it also contains some numerical values (.5, 2, 1.5, .33333334 and so on).
So, do I need to keep it as it is?
My intuition is that after normalizing, all those values will be converted into the range 0 to 1. Am I correct?

#24
Last edited: Jul 16, 2019
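A hedged sketch of one way to handle such a mixed column, assuming the text values can be mapped to numbers before any scaling. A scaler cannot process the strings 'yes' and 'no', so the column has to become numeric first. The mapping used here (yes to 1, no to 0) is an assumption that should be checked against the project's data dictionary.

```python
import pandas as pd

# Made-up sample of the mixed 'dependency' column described in the post.
dependency = pd.Series(["yes", "no", ".5", "2", "1.5", ".33333334", "no", "yes"])

# Assumption: 'yes' means 1 and 'no' means 0; confirm against the data dictionary.
mapping = {"yes": 1.0, "no": 0.0}

# Replace the two text values, then parse the remaining numeric strings.
dependency_numeric = pd.to_numeric(dependency.map(lambda v: mapping.get(v, v)))

# Min-max normalization of the now numeric column into the range 0 to 1.
span = dependency_numeric.max() - dependency_numeric.min()
dependency_scaled = (dependency_numeric - dependency_numeric.min()) / span

print(dependency_numeric.tolist())
print(dependency_scaled.round(3).tolist())
```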
25. _3292 TEJAS_PHASE Alumni

Joined:
Nov 2, 2016
Messages:
36
3
Hello Sir,
After applying PCA to the income qualification project dataset, my feature columns are drastically reduced from 143 to 50. But the problem now is that the new data frame has no column names. So how can I interpret this new data frame? I am attaching screenshots of the previous df (before applying PCA) and the new df (after applying PCA).
1. DF before PCA:

2. DF after PCA:

So what should I do next in this situation? Or shall I start building the model?

#25
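A short sketch of why the names disappear and one way to work with the result, assuming scikit-learn's `PCA` and synthetic data. `fit_transform` returns a plain NumPy array, and each new column is a weighted combination of all original columns, so the original names no longer apply. The components can be given generic names such as PC1, PC2, and the loadings in `pca.components_` show how strongly each original column contributes to each component. The feature names below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical feature names standing in for the project's columns.
df = pd.DataFrame(
    rng.normal(size=(100, 6)),
    columns=[f"feature_{i}" for i in range(6)],
)

pca = PCA(n_components=3)
components = pca.fit_transform(StandardScaler().fit_transform(df))

# PCA returns a plain array, so give the new columns generic names.
pc_names = [f"PC{i + 1}" for i in range(pca.n_components_)]
df_pca = pd.DataFrame(components, columns=pc_names, index=df.index)

# Loadings: weight of each original column in each principal component.
loadings = pd.DataFrame(pca.components_.T, index=df.columns, columns=pc_names)

print(df_pca.head())
print(loadings.round(2))
```

The renamed frame can be passed to a model like any other feature matrix; the loadings table is only needed for interpretation.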

Joined:
Nov 2, 2016
Messages:
36