ML | VLAD |

Vlad Skorokhod

Customer
I made a typo in Session 2 while showing how to slice a DataFrame to select columns by column name, which resulted in an error due to a missing square bracket. The correct notation is df[['column_name_1', 'column_name_2', ..., 'column_name_n']]. The outer square brackets do the slicing, and the inner square brackets create the list of column names. Thank you to the student who pointed it out later in the chat box.

Alternatively, use column indices instead of column names with iloc, e.g., df.iloc[:, [1, 5, 7]] or df.iloc[:, 0:3].
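For reference, a minimal sketch of both notations (the DataFrame and column names here are made up purely for illustration):

import pandas as pd

df = pd.DataFrame({'longitude': [1.0, 2.0],
                   'latitude': [3.0, 4.0],
                   'median_income': [5.0, 6.0]})

# By name: the outer brackets do the slicing, the inner brackets build the list of names.
by_name = df[['longitude', 'median_income']]

# By position with iloc: all rows, columns 0 and 2.
by_position = df.iloc[:, [0, 2]]

# Or a contiguous range of columns (end index exclusive): columns 0 and 1.
by_range = df.iloc[:, 0:2]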
 
Hi Vlad, I am working on the California House Price Prediction project. The dataset comprises 20,640 rows. Is it advisable to take a subset of this dataset while building the training model and making predictions? If so, should the Exploratory Data Analysis be done on the entire dataset or on the selected subset?
For converting the categorical column ocean_proximity, which is preferred - LabelEncoder or get_dummies? Is there any advantage of one over the other?
 

Vlad Skorokhod

Customer
Thanks for your question - pd.get_dummies performs one-hot encoding, i.e., it creates a column of 1s and 0s for each category of your categorical variable. LabelEncoder replaces the categories with integers (0, 1, 2, ...) - this may trick your model into assuming some sort of ordinality in your variable, which will probably not make much sense. One-hot is better if you don't mind the extra columns.
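A minimal sketch of the two options, using a made-up ocean_proximity column:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'ocean_proximity': ['INLAND', 'NEAR BAY', 'INLAND', 'NEAR OCEAN']})

# One-hot: one 0/1 column per category, no implied ordering.
one_hot = pd.get_dummies(df, columns=['ocean_proximity'])

# Label encoding: a single integer column (0, 1, 2, ...), which implies an ordering.
le = LabelEncoder()
labels = le.fit_transform(df['ocean_proximity'])

print(one_hot)
print(labels)        # one integer per row
print(le.classes_)   # which category each integer stands for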

Thank you, Vlad. Some more queries:
1. Should I take a subset of the dataset, or train on the entire dataset?
2. I also see columns where the correlation is not strong. Running the commands below:

corr_matrix = housing_df.corr()
corr_matrix['median_house_value'].sort_values(ascending=False)
Result:
median_house_value 1.000000
median_income 0.688075
total_rooms 0.134153
housing_median_age 0.105623
households 0.065843
total_bedrooms 0.049686
population -0.024650
longitude -0.045967
latitude -0.144160

Should I continue using all the columns or drop the ones with low correlation?

3. I also see outliers on most features and on median_house_value, but I am not sure how to eliminate them. Should dropping be based simply on the value of a particular column, or on a combination of columns?

4. Should I apply StandardScaler on the features only, or also on the y value (median_house_value)?

Apologies if these queries seem trivial, but this is my first attempt at an ML problem.
 

Vlad Skorokhod

Customer
Excellent questions!

1. If the size of your dataset is reasonable to run on a stand-alone PC (up to 1 GB?), there is no apparent reason to work with a subset instead of the entire data population.
2. Don't judge the predictive power of your features by the lack of correlation between the output and individual features. Correlation only indicates an obvious linear relationship. The effects of multiple variables can be confounded, so you may start seeing structure only once you include more than one variable. Also, the relationship between the output and an individual feature can be well structured but non-linear, which may result in a low correlation coefficient. Try, for example, corr(x, y) where x^2 + y^2 = 1 (see the quick illustration below).
3. I wouldn't remove the outliers right away. There could be important insights hidden in the "anomalous" data, and most models can handle them. If you realize later on that the outliers are hurting the accuracy of your model, you can then remove those rows entirely.
4. No need to scale the output variable.

Good luck with the dataset.
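A quick numerical illustration of point 2, assuming points on the unit circle, where y is completely determined by x yet the linear correlation is near zero:

import numpy as np

# x^2 + y^2 = 1: a perfectly structured but non-linear relationship.
theta = np.linspace(0, 2 * np.pi, 1000)
x, y = np.cos(theta), np.sin(theta)
print(np.corrcoef(x, y)[0, 1])   # approximately 0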
 

Thank you so much for the response, Vlad.
 
Hi Vlad,

Some more questions while I am working on the Mercedes-Benz Greener Manufacturing project.

1. There are 12 columns in the training dataset which only contain the value 0. I am thinking of dropping these columns, but I did a further analysis of the test dataset and found that the corresponding columns are non-zero there. I am going ahead and dropping these columns, as retaining them will not add information to the model (there is no variance in these columns in the training dataset). Is this a logical decision?

2. Columns X0, X1, X2, X3, X4, X5, X6 and X8 are categorical. I am applying LabelEncoder on these columns. Initially I was doing a LabelEncoder.fit() on the training set features and using the same instance to transform the test set. However, the test set features have values that are not in the training set. So I am concatenating the training and test sets, applying LabelEncoder fit and transform on the entire set, and then splitting them back apart after encoding. Is this the right approach?

3. For dimensionality reduction we were taught two options: Principal Component Analysis and Linear Discriminant Analysis. I am not sure which would work best. PCA is an unsupervised method, LDA is a supervised method. But LDA seems to work on classification problems, whereas the current one is a supervised regression problem. Should I apply PCA and then use the transformed set for Linear Regression and XGBoost?

Can you please share your thoughts.
 

Vlad Skorokhod

Customer
Hi Atreyi,

1. I agree with your reasoning - even if the all-zero columns have more variance in the test set, these variables will not contribute anything to the model, so it should be safe to remove them.

2. Some of the X_ columns have over 40 unique values. Have you looked at their frequency of occurrence? Do some of the unique values occur only once or twice? Remember that each unique category will produce a new feature upon one-hot encoding. The rarely occurring categories will result in sparse features with all zeros except one or two 1s. Instead, you can consolidate rare categories into a single "other" category - this can also take care of previously unseen values in the test set: just put them in the "other" bucket (a sketch follows at the end of this reply).

3. LDA can be used only with categorical labels, so it is not applicable here. If you decide to use PCA, make sure that you use only one-hot encoding for your categorical variables, so that all your features are either binary (0 or 1) or scaled numerical, if there are any. Apply PCA only to the features; don't include the label column.

Thank you.
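A minimal sketch of the rare-category consolidation from point 2, with a toy column standing in for X0 (the threshold of 2 occurrences is arbitrary):

import pandas as pd

train_df = pd.DataFrame({'X0': ['a', 'a', 'a', 'b', 'b', 'c']})
test_df = pd.DataFrame({'X0': ['a', 'b', 'z']})   # 'z' never appears in training

# Keep only categories seen often enough in training.
counts = train_df['X0'].value_counts()
common = counts[counts >= 2].index

# Everything else, including unseen test-set values, goes into the 'other' bucket.
train_df['X0'] = train_df['X0'].where(train_df['X0'].isin(common), 'other')
test_df['X0'] = test_df['X0'].where(test_df['X0'].isin(common), 'other')

# One-hot encode and align the test columns to the training columns.
train_oh = pd.get_dummies(train_df['X0'], prefix='X0')
test_oh = pd.get_dummies(test_df['X0'], prefix='X0').reindex(columns=train_oh.columns, fill_value=0)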


 

Vlad Skorokhod

Customer
Inverse PCA:

There was a question regarding inverse transformation from PCA back to the original variables. I posted an example in Session 4 shared folder showing that inverse transformation reverts precisely to the original variables only if you keep all PCs. If you remove low-variance PCs, the inverse transformation will still work but there will be some error in restored values. Look for "inverse_PCA.ipynb" https://drive.google.com/drive/u/3/folders/1BRndz2OHz0BCXLN14EuOFvkSny4oYRr9
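The effect can be reproduced in a few lines (this is just a sketch on random data, not the shared notebook itself):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))

# Keep all 5 PCs: inverse_transform recovers X essentially exactly.
pca_full = PCA(n_components=5).fit(X)
err_full = np.abs(X - pca_full.inverse_transform(pca_full.transform(X))).max()

# Keep only 2 PCs: the inversion still works but with some reconstruction error.
pca_2 = PCA(n_components=2).fit(X)
err_2 = np.abs(X - pca_2.inverse_transform(pca_2.transform(X))).max()

print(err_full, err_2)   # ~1e-15 vs. a noticeably larger value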
 

Thank you, Vlad.
I will try the encoding approach as you suggested.
For dimensionality reduction I am applying PCA on the features and then using the transformed features for training the model. In which situations is Factor Analysis used? And what other dimensionality reduction techniques are there for regression datasets with labels (supervised regression)?

Thank you.
 

Vlad Skorokhod

Customer
Hi Atreyi,
Factor Analysis is basically a generic umbrella term; PCA is a rigorous algebraic procedure for performing it. Otherwise, factor analysis and dimensionality reduction can be done manually based on the similarity among the columns (as discussed: if a subset of columns has identical values, remove the duplicates and retain only one). If the values are *almost* identical, you can use the same approach at your discretion, keeping in mind what you mean by "almost" - perhaps only one or two values out of 4000 are different. There is no "regression" equivalent of LDA as far as I know.
Thanks,
Vlad
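A minimal sketch of the manual approach for exact duplicate columns (near-duplicates still need the judgment call described above):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [1, 2, 3, 4],     # identical to 'a'
                   'c': [1, 2, 3, 5]})    # differs from 'a' in one value

# Transpose so columns become rows, mark duplicated rows, and keep the rest.
deduped = df.loc[:, ~df.T.duplicated()]
print(deduped.columns.tolist())           # ['a', 'c']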
 

Vlad Skorokhod

Customer
Hello ML Class,
This Sunday, June 16, we will be using xgboost to build ensemble models. Could those of you using local Jupyter for the in-class labs please make sure that you have xgboost up and running? If you don't have it installed, try "pip install xgboost" from your Anaconda prompt.
Thank you.
Vlad
 
Hi Vlad, can you please clarify where the results of KFold are used in the Lesson_8_Assisted_Practice_Cross Validation notebook? We calculated different train and test sets, but I cannot understand how we used them later. I mean, we always use data_input and data_output, but never the results of KFold. Sorry if I missed it!

from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True)
print("Train Set Test Set")
for train_set, test_set in kf.split(data_output):
    print(train_set, test_set)
    print(" ")
 
Hi Vlad,

Wanted a clarification on the Mercedes-Benz assessment. The last instruction in the assessment is to predict the test_df values using xgboost.
1. Since the target data, y, is continuous rather than categorical, should we use the parameter booster = 'gblinear'?
2. Using XGBClassifier(booster = 'gbtree', n_estimators=5, random_state=2, tree_depth=2) takes as long as 10 minutes to predict the output, and the RMSE is worse than Linear Regression: 15.08 with XGBoost/gbtree versus 12.12 with Linear Regression. So I don't see much benefit of using XGBoost here.
3. Using XGBClassifier(booster = 'gblinear', n_estimators=5, random_state=2, tree_depth=2), the process is a little faster, but the RMSE is 16.16 compared to 12.12 for Linear Regression.

Is XGBoost a preferred classifier in the case of continuous (non-categorical) target data?

Regards,
Atreyi
 

Vlad Skorokhod

Customer
Hi Olga,

Please scroll down to Cell 10 where we instantiated the RF classifier as rf_class:

from sklearn.ensemble import RandomForestClassifier
rf_class = RandomForestClassifier(n_estimators=10)


In the next cell we print out the 5 accuracy scores, one calculated on each of the 5 CV folds:

print(cross_val_score(rf_class, data_input, data_output, scoring='accuracy', cv = 5))

cross_val_score is wrapped around rf_class because we want to iterate over it 5 times.

Then, we compute and print the average accuracy in %:

accuracy = cross_val_score(rf_class, data_input, data_output, scoring='accuracy', cv = 5).mean() * 100
print(accuracy)

Let me know if this helps. Thank you.
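For what it's worth, a minimal sketch of how the KFold object itself could be plugged in, so that its splits are the ones actually used for scoring (this is an illustration, not the notebook's code; data_input and data_output are the arrays already defined in the notebook):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Pass the KFold splitter as cv so its 10 shuffled splits drive the cross-validation.
kf = KFold(n_splits=10, shuffle=True)
rf_class = RandomForestClassifier(n_estimators=10)
scores = cross_val_score(rf_class, data_input, data_output, scoring='accuracy', cv=kf)
print(scores, scores.mean() * 100)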

 

Vlad Skorokhod

Customer
Hi Atreyi,

Decision trees can be used for regression (we didn't talk much about regression trees in class). Each split or stump creates a sort of step function: if x < x0, y = y1; if x > x0, y = y2. With a sufficient number of these, you can model complex non-linear relationships. Take a look at the example I showed in class: http://uc-r.github.io/public/images/analytics/gbm/boosted_stumps.gif . I suspect that with n_estimators = 5 you may be underfitting, and I'm not really sure why it is running so slowly. Can you try increasing the number of trees? Also, for each trial, keep track of the MSE on both the training and test sets. Thank you.
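One thing worth double-checking, which the thread doesn't mention explicitly: XGBClassifier is the classification interface, while XGBRegressor is the one meant for a continuous target. A rough sketch, assuming X_train/X_test/y_train/y_test are your hypothetical Mercedes splits:

import numpy as np
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# Fit a boosted-tree regressor and track RMSE on both the training and test sets.
model = XGBRegressor(n_estimators=100, max_depth=2, random_state=2)
model.fit(X_train, y_train)

for name, X, y in [('train', X_train, y_train), ('test', X_test, y_test)]:
    rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
    print(name, rmse)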

 
Hi Vlad,
I was working on the Income Qualification (Proxy Means Test) dataset. It has 143 columns, of which 5 are categorical. I got the description of the columns from Kaggle.
1. The columns 'dependency', 'edjefe', and 'edjefa' have values 'yes' and 'no' apart from numerical values. I am thinking of replacing 'yes' and 'no' with 1 and 0 respectively. That would convert these columns to numerical columns.
2. I can drop the Id column as it will not be required.
3. How do I handle the 'idhogar' column? This is the unique identifier for each household, so it ties several records from the same household together. Should it be part of the features dataset and converted to a numerical value through label encoding? My take is that, after using 'idhogar' to fill the missing Target values or doing any other computation, I should delete it before training the model, since it is another ID.
4. One column has all 0 values; that can be dropped. The remaining 139 columns can be reduced through dimensionality reduction. I am using LDA over PCA, as we have categorical labels in the training dataset. This step is not mentioned in the instructions, but I am including it as the number of columns is quite large. Should I go with this approach, or manually understand the columns and come up with reduced features?
Actually, after writing this, I tried LDA. The dimension was reduced to 3 but accuracy was quite low, so I went ahead with the original 139 dimensions for model training. The next step will be to manually go through each of the features to see if any of them can be removed due to redundant information.
5. I applied a Random Forest Classifier on the dataset to predict y values for the test dataset and checked accuracy. Accuracy was lower when I used a cross-validation split compared to a train-test split.
Hope this is the right approach.

Regards,
Atreyi
 
Thank you, Vlad. I used XGBClassifier with booster=gbtree. I increased the number of trees and the RMSE improved, but the speed of execution was not satisfactory. Using gblinear helped improve the time performance, but the RMSE was higher.
 

Vlad Skorokhod

Customer
Hi Atreyi,
How many features do you have in your model? Is there a possibility of further reducing the number of features? You can try forward-stepwise feature selection (i.e., find the one feature that produces the lowest error, then add another one that, in combination with the first, produces the lowest error, and so on) or remove all low-variance features. I would try to make the gbtree model work rather than trade it off for faster computation with gblinear.
Thanks,
Vlad
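A sketch of the greedy forward-stepwise idea mentioned above, using cross-validated MSE (X and y are assumed to be the feature matrix and target array; the linear estimator is an arbitrary choice):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, n_keep):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_keep:
        # Score each candidate feature added to the current set and keep the best one.
        errors = {j: -cross_val_score(LinearRegression(), X[:, selected + [j]], y,
                                      scoring='neg_mean_squared_error', cv=5).mean()
                  for j in remaining}
        best = min(errors, key=errors.get)
        selected.append(best)
        remaining.remove(best)
    return selected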

 

Vlad Skorokhod

Customer
Atreyi,

Good questions and great reasoning. You seem to be on the right track.

1. Your approach is certainly valid; however, you could potentially do even better. "No" most likely means 0, but "yes" could be treated as a missing value rather than as a 1, which would be an arbitrary choice for a number of education years or dependents. Can you impute "yes" using conditional means (by 'idhogar', for example, or by some other variable)? A sketch follows at the end of this reply.

2. Agree

3. Totally agree - use it for NaN impute and don't include in the model. Excellent!

4. All-zero or all-one columns etc. are obvious candidates for removal. How about low-variance columns (V14a has only 50 zeros, just 0.5% - the remaining 99.5% are ones)? Or those with lots of NaNs (e.g., v2a1)? They don't seem too useful to me. I agree with your next step - understanding your variables and manually weeding out the obvious candidates should generally be done before algorithmic dimensionality reduction (PCA, LDA).

5. N-fold cross-validated accuracy is more representative of your model's performance on unseen data than hold-out validation (i.e., a single test set). The hold-out accuracy came out higher due to a "luck factor" - it could just as easily have gone the other way.


Thank you.
Vlad
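A minimal sketch of the conditional-mean impute from point 1, with toy values (the column names follow the Kaggle data; note a household whose values are all 'yes' would remain NaN and need a fallback):

import numpy as np
import pandas as pd

df = pd.DataFrame({'idhogar': ['h1', 'h1', 'h2', 'h2', 'h2'],
                   'dependency': ['2', 'yes', '.5', 'no', 'yes']})

# 'no' -> 0; treat 'yes' as missing, then fill it with the household (idhogar) mean.
dep = df['dependency'].replace({'no': 0, 'yes': np.nan}).astype(float)
df['dependency'] = dep.fillna(dep.groupby(df['idhogar']).transform('mean'))
print(df)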



 

Thank you for the directions, Vlad.
 
After applying PCA, I had only 6 transformed features in my model. Should I use forward-stepwise feature selection on these? Even if I don't use it here, this is a learning I can apply in other situations.
I am not using gblinear because of the RMSE. I will continue working with gbtree and try to reduce the execution time. From what I found online, gbtree does sometimes tend to have such long execution times on regression datasets.
Thanks,
Atreyi
 
Hi Vlad,

I am attempting the Phishing Detector with LR project. A part of the exercise asks the following:

Plot the test samples along with the decision boundary when trained with index 5 and index 13 parameters.

These parameters take values in {-1, 1} and {-1, 0, 1}, and the target values are in {-1, 1}. So any scatter plot will show no more than 6 distinct points, and some will be overlapping. The decision boundary will also be ambiguous. Can you please suggest how to go about this task?

Thanks,
Atreyi
 

Vlad Skorokhod

Customer

Hi Atreyi,

You can introduce random jitter (for plotting only - don't use it in the analysis) to avoid data points getting superimposed; it makes each data point look like a blob of multiple points scattered around its real position ({1, 1}, etc.). Take a look at this post on Stack Overflow, or simply add np.random.normal in both dimensions:
https://stackoverflow.com/questions...ing-datapoints-in-a-scatter-dot-beeswarm-plot

It will also help if you color-code your plots by the output class.
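Something like this, for instance (x5, x13, and y_test are assumed to be the two selected test-set features and the labels; the jitter scale is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

# Add small Gaussian jitter for plotting only, and color-code by the output class.
jitter = lambda a: np.asarray(a) + np.random.normal(scale=0.05, size=len(a))

plt.scatter(jitter(x5), jitter(x13), c=y_test, cmap='bwr', alpha=0.5)
plt.xlabel('feature index 5')
plt.ylabel('feature index 13')
plt.show()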

Let me know if this works.

Thank you.
 
Did anyone upload the project? When I try to upload the project, I see three options:
Writeup*, Screenshots*, Source Code*

What needs to go in each? (Source code is understood.)
 

KUNTAL PAUL

Member
Alumni
Hi Vlad, I am getting an error while running the perceptron.ipynb file in JupyterLab. I have attached a screenshot of the error message. Please let me know what the issue is here. Is it okay to ignore this error?
 

Attachment: Capture.JPG (60.6 KB)