Hi everyone, this is the dedicated thread to discuss ML concepts with your peers, TA, and trainer.
Hi Vlad, I am working on the California House Price Prediction. The dataset comprises 20,640 rows. Is it advisable to take a subset of this dataset while building the training model and prediction? If so, should the Exploratory Data Analysis be done on the entire dataset or on the selected subset?
For converting the categorical column ocean_proximity, which is preferred: LabelEncoder or get_dummies? Does one have an advantage over the other?
Hello Vlad, could you please upload the notebook for unassisted practice, if possible? Thank you!
Thanks for your question - pd.get_dummies performs one-hot encoding, i.e., it creates a column of 1s and 0s for each category of your categorical variable. LabelEncoder replaces the categories with integers (0, 1, 2, ...); this may trick your model into assuming some sort of ordinality in your variable, which will probably not make much sense. One-hot is better if you don't mind the extra columns.
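To illustrate the difference, here is a minimal sketch on a toy ocean_proximity column (the category values are made up for the example; pandas' `.cat.codes` is used as a stand-in for sklearn's LabelEncoder, which produces the same alphabetically ordered integers):

```python
import pandas as pd

s = pd.Series(["INLAND", "NEAR BAY", "INLAND", "ISLAND"], name="ocean_proximity")

# one-hot encoding: one 0/1 column per category, no implied ordering
dummies = pd.get_dummies(s)
print(list(dummies.columns))  # ['INLAND', 'ISLAND', 'NEAR BAY']

# label encoding: a single integer column, which implies an ordering
codes = s.astype("category").cat.codes
print(codes.tolist())  # [0, 2, 0, 1]
```

Note how label encoding makes "NEAR BAY" (2) look twice as large as "ISLAND" (1), an ordering the data does not actually carry.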
Thank you, Vlad. Some more queries:
1. Should I take a subset of the dataset, or train on the entire dataset?
2. I can also see columns where the correlation with the target is not strong, from running the commands below:
corr_matrix = housing_df.corr()
corr_matrix['median_house_value'].sort_values(ascending=False)
Result:
median_house_value 1.000000
median_income 0.688075
total_rooms 0.134153
housing_median_age 0.105623
households 0.065843
total_bedrooms 0.049686
population -0.024650
longitude -0.045967
latitude -0.144160
Should I continue using all the columns or drop the ones with low correlation?
3. Also, I see outliers in most features and in median_house_value, but I am not sure how to eliminate them. Should I simply drop rows based on the value of a particular column, or on a combination of columns?
4. Should I apply StandardScaler to the features only, or also to the y value (median_house_value)?
Apologies if these queries seem trivial; this is my first attempt at an ML problem.
Excellent questions!
1. If the size of your dataset is reasonable enough to run on a stand-alone PC (up to 1 GB?), there is no apparent reason to work with a subset instead of the entire dataset.
2. Don't judge the predictive power of your features based on the lack of correlation between the output and individual features. Correlation only indicates an obvious linear relationship. The effects of multiple variables can be confounded so that you will start seeing a structure only when you include more than one variable. Also, the relationship between the output and an individual feature can be well structured but non-linear which may result in a low correlation coefficient. Try for example corr(x,y) where x^2+y^2 = 1.
3. I wouldn't remove the outliers right away. There could be important insights hidden in the "anomalous" data. Most models can handle them. If you realize later on that the outliers are affecting the accuracy of your model, you can then remove the entire rows.
4. No need to scale the output variable.
Good luck with the dataset!
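Point 2 above can be demonstrated numerically: for points on the unit circle, x and y are perfectly (but non-linearly) related, yet their correlation is close to zero. A minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 1000)
x, y = np.cos(theta), np.sin(theta)  # every point satisfies x^2 + y^2 = 1
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # near 0, despite a perfect deterministic relationship
```

This is why a small correlation coefficient alone is not grounds for dropping a feature.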
Hi Vlad,
Some more questions while I am working on the Mercedes-Benz Greener Manufacturing project.
1. There are 12 columns in the training dataset which only have the value 0. I am thinking of dropping these columns, but I did a further analysis of the test dataset and found that the corresponding columns are non-zero there. I am going ahead with dropping them, since retaining them will not add information to the model (there is no variance in these columns in the training dataset). Is this a logical decision?
2. Columns X0, X1, X2, X3, X4, X5, X6 and X8 are categorical. I am applying LabelEncoder on these columns. Initially I was doing a LabelEncoder.fit() operation on the training set features and using the same instance to transform the test set. However, the test set features have values that are not in the training set. So I am concatenating the training and test sets, applying LabelEncoder fit and transform on the entire set, and then splitting them back apart after encoding. Is this the right approach?
3. For dimensionality reduction we were taught two options: Principal Component Analysis and Linear Discriminant Analysis. I am not sure which would work best. PCA is an unsupervised method; LDA is supervised. But LDA seems to work only on classification problems, whereas the current one is a supervised regression problem. Should I apply PCA and then use the transformed set for Linear Regression and XGBoost?
Can you please share your thoughts.
Hi Atreyi,
1. I agree with your reasoning - even if the all-zero columns have more variance in the test set, these variables will not contribute anything to the model, therefore it should be safe to remove them.
2. Some of the X_ columns have over 40 unique values. Have you looked at their frequency of occurrence? Do some of the unique values occur only once or twice? Remember that each unique category will produce a new feature upon one-hot encoding. The rarely occurring categories will result in sparse features with all zeros except one or two "1"s. Instead, you can consolidate rare categories into one "other" category - this can also take care of the previously unseen unique values in the test set - just put them in the "other" bucket.
3. LDA can be used only with categorical labels, so it's not applicable here. If you decide to use PCA, make sure that you use only one-hot encoding for your categorical variables so that all your features are either binary (0 or 1) or scaled numerical, if there are any. You should apply PCA only to the features; don't include the label column.
Thank you.
Hi Vlad, can you please clarify where the results of KFold are used in the Lesson_8_Assisted_Practice_Cross Validation notebook. We calculated different train and test sets, but I cannot understand how we used them later. I mean, we always use data_input and data_output, but never the results of KFold. Sorry if I missed it!
kf = KFold(n_splits=10, shuffle=True)
print("Train Set        Test Set")
for train_set, test_set in kf.split(data_output):
    print(train_set, test_set)
    print(" ")
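For reference, the index arrays that KFold yields are normally consumed inside the loop to slice the data for each fold. A minimal sketch with synthetic stand-ins for data_input and data_output (the actual notebook's arrays and model may differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
data_input = rng.normal(size=(100, 3))  # stand-in feature matrix
data_output = data_input @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(data_input):
    model = LinearRegression()
    model.fit(data_input[train_idx], data_output[train_idx])  # train rows only
    pred = model.predict(data_input[test_idx])                # held-out rows
    scores.append(mean_squared_error(data_output[test_idx], pred))
print(round(float(np.mean(scores)), 4))  # average MSE across the 10 folds
```

If the notebook only prints the index arrays without slicing with them, that would explain the confusion: the printout demonstrates what the splits look like, but the indices are not actually fed into a model there.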
Hi Vlad,
I wanted a clarification on the Mercedes-Benz assessment. The last instruction in the assessment is to predict the test_df values using XGBoost.
1. Since the target data,y, is continuous, not categorical, should we use parameter booster = gblinear?
2. Using XGBClassifier(booster='gbtree', n_estimators=5, random_state=2, tree_depth=2) takes as long as 10 minutes to predict the output, and the RMSE is worse than Linear Regression: 15.08 with XGBoost (gbtree) vs. 12.12 with Linear Regression. So I don't see much benefit of using XGBoost here.
3. Using XGBClassifier(booster='gblinear', n_estimators=5, random_state=2, tree_depth=2), the process is a little faster, but the RMSE is 16.16 compared to 12.12 for Linear Regression.
Is XGBoost a preferred model in the case of continuous (non-categorical) target data?
Regards,
Atreyi
Hi Atreyi,
Decision trees can be used for regression (we didn't talk much about regression trees in class). Each split or stump creates a sort of step function: if x < x0, y = y1; if x > x0, y = y2. With a sufficient number of these, you can model complex non-linear relationships. Take a look at the example I showed in class: http://uc-r.github.io/public/images/analytics/gbm/boosted_stumps.gif . I suspect that at n_estimators = 5 you may be underfitting, and I'm not really sure why it is running so slowly. Can you try increasing the number of trees? Also, for each trial, keep track of the MSE on both the training and test sets. Thank you.
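The underfitting point can be checked empirically by sweeping n_estimators and watching train/test RMSE. A minimal sketch on synthetic data, using sklearn's GradientBoostingRegressor as a stand-in for xgboost (for a continuous target, xgboost's XGBRegressor rather than XGBClassifier would be the analogous class; all data and parameters here are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 500)  # noisy non-linear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for n in (5, 200):  # very few trees vs. many trees
    model = GradientBoostingRegressor(n_estimators=n, max_depth=2, random_state=0)
    model.fit(X_tr, y_tr)
    rmse_tr = mean_squared_error(y_tr, model.predict(X_tr)) ** 0.5
    rmse_te = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    results[n] = (rmse_tr, rmse_te)
    print(n, round(rmse_tr, 3), round(rmse_te, 3))
```

With only 5 boosted stumps the model barely moves off the mean prediction; adding trees brings both training and test RMSE down until the test error eventually levels off or rises, which is the signature of under- vs. over-fitting.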
Thank you, Vlad. I used XGBClassifier with booster=gbtree. When I increased the number of trees, the RMSE improved, but the speed of execution was not satisfactory. Using gblinear helped improve the time performance, but the RMSE was higher.
Hi Vlad,
I was working on the Income Qualification (Proxy Means Test) dataset. It has 143 columns out of which 5 are categorical. I got the description of the columns from kaggle.
1. The columns 'dependency', 'edjefe', and 'edjefa' have the values 'yes' and 'no' in addition to numerical values. I am thinking of replacing 'yes' and 'no' with 1 and 0 respectively. That would convert these columns to numerical columns.
2. I can drop the Id column as it will not be required.
3. How do I handle the 'idhogar' column? It is the unique identifier for each household, so it ties together several records from the same household. Should it be part of the features dataset, converted to a numerical value through label encoding? My take is that after using 'idhogar' to fill the missing Target values or do any other computation, I should delete it before training the model, since it is just another id.
4. One column has all-zero values and can be dropped. The remaining 139 columns can be reduced through dimensionality reduction. I am using LDA over PCA, as we have categorical labels in the training dataset. This step is not mentioned in the instructions, but I am including it as the number of columns is quite large. Should I go with this approach, or manually understand the columns and come up with reduced features?
Actually, after writing this, I tried LDA. The dimensionality was reduced to 3, but accuracy was quite low. So I went ahead with the original 139 dimensions for model training. The next step will be to manually examine each of the features to see if any of them can be removed due to redundant information.
5. I am applying a Random Forest Classifier on the dataset to predict y values for the test dataset and check accuracy. Accuracy was lower when I used a cross-validation split compared to a train-test split.
Hope this is the right approach.
Regards,
Atreyi
Atreyi,
Good questions and great reasoning. You seem to be on the right track.
1. Your approach is certainly valid; however, you could potentially do even better. "No" most likely means 0; however, "yes" could be treated as a missing value rather than as 1, which would be an arbitrary choice for the number of education years and dependents. Can you impute "yes" using conditional means (grouped by 'idhogar', for example, or by some other variable)?
2. Agree
3. Totally agree - use it for NaN impute and don't include in the model. Excellent!
4. All-zeros or all-ones etc. are obvious candidates for removal. How about low-variance columns (V14a has only 50 zeros, only 0.5% - the remaining 99.5% are ones)? Or those with lots of NaNs (e.g., v2a1)? They don't seem too useful to me. I agree with your next step - understanding your variables and manually weeding out obvious candidates should generally be done before algorithmic dimensionality reduction (PCA, LDA).
5. N-fold cross-validated accuracy is more representative of your model's performance on unseen data than hold-out validation (i.e., only one test set). Your hold-out accuracy is higher due to a "luck factor" - it could just as easily have come out the other way around.
Thank you.
Vlad
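The conditional-mean imputation suggested in point 1 can be sketched with a groupby-transform (the toy values below are made up; the 'dependency' column in the real data has the same yes/no/numeric mix):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "idhogar": ["h1", "h1", "h2", "h2", "h2"],     # household id
    "dependency": ["no", "yes", "2", "yes", "4"],  # mixed yes/no/numeric values
})

# "no" -> 0; treat "yes" as missing instead of forcing an arbitrary 1
s = df["dependency"].replace({"no": 0, "yes": np.nan}).astype(float)

# impute each missing value with the mean of the same household,
# falling back to the overall mean if the whole household is missing
s = s.fillna(s.groupby(df["idhogar"]).transform("mean")).fillna(s.mean())
print(s.tolist())  # [0.0, 0.0, 2.0, 3.0, 4.0]
```

Here the "yes" in household h2 becomes 3.0, the mean of that household's known values (2 and 4), rather than an arbitrary 1.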
Hi Atreyi,
How many features do you have in your model? Is there a possibility of further reducing the number of features? You can try forward-stepwise feature selection (i.e., find the one feature that produces the lowest error, then add another one that, in combination with the first, produces the lowest error, etc.) or remove all low-variance features. I would try to make the gbtree model work rather than trade it off for faster computation with gblinear.
Thanks,
Vlad
Post application of PCA, I had only 6 transformed features in my model. Should I use forward-stepwise feature selection on these? Even if I don't do it here, this is a learning for me; I can apply it in other situations.
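For reference, scikit-learn provides forward-stepwise selection out of the box. A minimal sketch on synthetic data (the estimator, feature counts, and dataset here are illustrative, not taken from the project):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# synthetic data: only 3 of the 10 features carry signal
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

sfs = SequentialFeatureSelector(LinearRegression(),
                                n_features_to_select=3, direction="forward")
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask over the 10 original features
```

`direction="forward"` implements exactly the add-one-feature-at-a-time procedure described above; each candidate set is scored by cross-validation.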
Hi Vlad,
I am attempting the Phishing Detector with LR. A part of the exercise asks the below:
Plot the test samples along with the decision boundary when trained with index 5 and index 13 parameters.
These parameters have values {-1, 1} and {-1, 0, 1}, and the target values are {-1, 1}. So any scatter plot will show no more than 6 distinct points, many of them overlapping. The decision boundary will also be ambiguous. Can you please suggest how to go about this task?
Thanks,
Atreyi
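One common trick for plotting overlapping discrete points is to add a small random jitter so that coincident samples become visible, then draw the fitted logistic-regression boundary on top. A minimal sketch with synthetic stand-ins for the two features (the distributions and noise below are assumptions, not the actual phishing data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# toy stand-ins for two discrete features: index 5 in {-1, 1}, index 13 in {-1, 0, 1}
X = np.column_stack([rng.choice([-1, 1], 300), rng.choice([-1, 0, 1], 300)])
y = np.where(X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 300) > 0, 1, -1)

clf = LogisticRegression().fit(X, y)
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]
# decision boundary: w1*x1 + w2*x2 + b = 0  =>  x2 = -(w1*x1 + b) / w2

# jitter the discrete coordinates so overlapping samples become visible
X_jit = X + rng.normal(0, 0.05, X.shape)
# scatter X_jit colored by y, then draw the line x2 = -(w1*x1 + b) / w2
```

The boundary itself stays well defined even with few distinct points; the jitter only affects the visualization, never the data used for fitting.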