
MACHINE LEARNING | MAR 24 - APR 15 | JAYANTH

Sheik Mohamed

Member
Alumni
Hello Jayanth,

We looked at dimensionality reduction during data pre-processing using PCA (Principal Component Analysis). I have a question on this part.

Principal components are derived by feeding the independent columns into sklearn's PCA fit and transform in Python in a Jupyter notebook. The cumulative variance helps check the proportion of variance captured once all the independent columns are fed into the PCA model. As I understood from you during class, the component-to-feature mapping can be identified using eigenvalues.

My question is: do we have any technique in Python to find the relation between the components returned by the PCA model and the original features, so that we can drop the columns that have a lesser variance proportion?

Thank you.
Sheik
 
Suppose we have a dataset and some of the independent variables are correlated. Do we need to remove the correlated variables first and then perform the PCA?
How do we check for correlated variables? By p-value only?
That is, do we simply find the p-values and, if a p-value is greater than 0.05, neglect that variable?

Thanks
Mohammad Sharib Khan
 

karandeeparora

New Member
Alumni
Good morning. At the end of session 3 you gave us a bonus question on logistic regression using the health data. Has anybody got the solution? I can't seem to work it out.
 

J Balaram

Member
Alumni
Jayanth,

You explained to us that a regularization function (L1 & L2) can be applied to the cost function in the Decision Tree regression algorithm. What values should be used for the weights in the regularization function?

Thanks
Balaram
 

Jayanth_14

Member
Alumni
Trainer
Hello Jayanth,

We looked at dimensionality reduction during data pre-processing using PCA (Principal Component Analysis). I have a question on this part.

Principal components are derived by feeding the independent columns into sklearn's PCA fit and transform in Python in a Jupyter notebook. The cumulative variance helps check the proportion of variance captured once all the independent columns are fed into the PCA model. As I understood from you during class, the component-to-feature mapping can be identified using eigenvalues.

My question is: do we have any technique in Python to find the relation between the components returned by the PCA model and the original features, so that we can drop the columns that have a lesser variance proportion?

Thank you.
Sheik

It's best to drop the transformed columns only after applying PCA. A column that might not seem interesting before the transformation might be very important after it, and vice versa.
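
A minimal sketch of how this can be inspected in sklearn (the stand-in dataset, the scaling step and the variable names are assumptions for illustration, not the class notebook): pca.explained_variance_ratio_ gives the variance proportion per component, and pca.components_ gives the loadings that relate each component back to the original features.

Code
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris  # hypothetical stand-in dataset
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical independent columns; replace with your own feature dataframe
X, _ = load_iris(return_X_y=True, as_frame=True)

# PCA is sensitive to scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()
components = pca.fit_transform(X_scaled)

# Variance proportion captured by each principal component
print(pca.explained_variance_ratio_)

# Loadings: rows are components, columns are the original features.
# Large absolute values show which features drive each component.
loadings = pd.DataFrame(pca.components_, columns=X.columns,
                        index=[f"PC{i + 1}" for i in range(pca.n_components_)])
print(loadings)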
 

Jayanth_14

Member
Alumni
Trainer
Jayanth,

You explained to us that a regularization function (L1 & L2) can be applied to the cost function in the Decision Tree regression algorithm. What values should be used for the weights in the regularization function?

Thanks
Balaram
So the values of lambda in both Ridge and Lasso aren't that important in themselves. What matters is the accuracy/RMSE we get after toggling between no regularization, L1 and L2 (in sklearn, L1 is Lasso and L2 is Ridge). e.g.

linear regression (vanilla): RMSE 200
linear regression with L1 (Lasso): RMSE 190
linear regression with L2 (Ridge): RMSE 180

I would choose L2 in the above for my production code.
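
A minimal sketch of that comparison (the stand-in dataset and the alpha values are assumptions for illustration; alpha plays the role of lambda in sklearn's Lasso and Ridge):

Code
import numpy as np
from sklearn.datasets import load_diabetes  # hypothetical stand-in dataset
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same linear model with no regularization, L1 (Lasso) and L2 (Ridge)
models = {
    "vanilla": LinearRegression(),
    "L1 (Lasso)": Lasso(alpha=0.1),
    "L2 (Ridge)": Ridge(alpha=1.0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(name, "RMSE:", round(rmse, 2))

Whichever variant gives the lowest test RMSE is the one to carry into production, as in the example above.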
 

Sheik Mohamed

Member
Alumni
It's best to drop the transformed columns only after applying PCA. A column that might not seem interesting before the transformation might be very important after it, and vice versa.

Thank you, Jayanth, for your response. However, my actual question was: how do we know which columns to drop and which ones to keep after the PCA transformation? Thank you for helping out.
 

Jayanth_14

Member
Alumni
Trainer
Suppose we have a dataset and some of the independent variables are correlated. Do we need to remove the correlated variables first and then perform the PCA?
How do we check for correlated variables? By p-value only?
That is, do we simply find the p-values and, if a p-value is greater than 0.05, neglect that variable?

Thanks
Mohammad Sharib Khan
As part of pre-processing you could look into correlated variables, but that's just EDA. You might want to drop or leave out heavily correlated variables, e.g. 'batting_strike_rate' is heavily correlated with 'number_of_boundaries'.

PCA is a completely different technique. It transforms each and every feature into new components.
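
A minimal sketch of that EDA check with pandas (the dataframe contents and the 0.9 threshold are assumptions for illustration):

Code
import pandas as pd

# Hypothetical feature dataframe; replace with your own independent columns
df = pd.DataFrame({
    "batting_strike_rate": [120.0, 135.5, 98.7, 150.2, 110.4],
    "number_of_boundaries": [10, 14, 6, 17, 9],
    "matches_played": [20, 35, 12, 40, 8],
})

# Pairwise Pearson correlations between the independent variables
corr = df.corr()
print(corr)

# Report pairs whose absolute correlation exceeds an (assumed) 0.9 threshold
for a in corr.columns:
    for b in corr.columns:
        if a < b and abs(corr.loc[a, b]) > 0.9:
            print(a, "and", b, "are heavily correlated:", round(corr.loc[a, b], 2))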
 

Jayanth_14

Member
Alumni
Trainer
Thank you, Jayanth, for your response. However, my actual question was: how do we know which columns to drop and which ones to keep after the PCA transformation? Thank you for helping out.

Typically you keep the components that together account for about 90% of the cumulative variance and drop the rest.
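
A minimal sketch of that cut-off in sklearn (the 90% figure is from the reply above; the stand-in dataset and variable names are assumptions for illustration):

Code
import numpy as np
from sklearn.datasets import load_wine  # hypothetical stand-in dataset
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 90%
n_keep = int(np.argmax(cumulative >= 0.90)) + 1
print("Keep", n_keep, "of", len(cumulative), "components")

# Equivalent shortcut: passing a fraction to n_components picks that count automatically
reduced = PCA(n_components=0.90).fit_transform(X_scaled)
print("Reduced shape:", reduced.shape)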
 

_25675

New Member
Alumni
Hello,

I don't get the download option for the Kaggle datasets. Please activate the access for me.

Thanks,
Sneha
 

_26874

Member
Alumni
I too need support to download the Kaggle dataset.
Hi Sneha, did you manage to download it? If yes, please do guide me.

Thanks
Sugathan
 
Hi Jayanth,
I have a query regarding r2_score and R^2.
What is the difference between r2_score and R^2? When I calculate r2_score from sklearn.metrics the value comes out as 0.638250878783, but when I calculate R^2 myself from SSE and SST the value comes out as 0.999.
Please let me know which one is correct and what the difference is.
Code
import numpy as np
from math import sqrt
from sklearn.metrics import r2_score, mean_squared_error

print(r2_score(Y_test, Y_pred))  # value 0.638250878783

rms = sqrt(mean_squared_error(Y_test, Y_pred))  # value 68680.74985167495
print(rms)

sst = np.sum((Y_test - np.mean(Y_test))**2)  # value 53827258367554.0 (total sum of squares)
sse = mean_squared_error(Y_test, Y_pred)     # value 4717045400.1883478 (note: this is the mean, not the sum, of squared errors)
r2 = 1 - (sse / sst)                         # value 0.999

Thanks & Regards
Mohammad Sharib Khan
 

_14589

Member
Alumni
Hi Jayanth,

I would like to ask one favor regarding the submitted project, for further study.

Could you share an ipynb (for the submitted project) with code for plotting graphs that illustrate the results from linear regression, decision tree regression and random forest regression?

Thanks
Cheers/Bho
 