
# MACHINE LEARNING | MAR 24 - APR 15 | JAYANTH

#### Priyanka_Mehta

##### Well-Known Member
Simplilearn Support
Hi Learners,

This thread is for this batch's learners.

#### naresh.dogra

##### New Member
Alumni
Hi,
Please share the drive link for the material discussed in the 1st April class.

Thanks

#### Sheik Mohamed

##### Member
Alumni
Hello Jayanth,

We looked at dimensionality reduction during data pre-processing using PCA (Principal Component Analysis). I have a question on this part.

Principal components are derived by feeding the independent columns into scikit-learn's PCA fit and transform in a Jupyter notebook. The cumulative variance helps check what proportion of the variance of the entire dataset each component accounts for once all the independent columns are fed into the PCA model. As I understood from you during class, the component-to-feature mapping can be identified using eigenvalues.

My question on this part: do we have any technique in Python to find the relation between the components returned from the PCA model and the features, so that we can drop those columns that have a lesser variance proportion?

Thank you.
Sheik

#### Sheik Mohamed

##### Member
Alumni
Hi Team, is this thread really active? No responses yet!

Hi @Priyanka_Mehta , Can you please assist?

##### Member
Alumni
Suppose we have a dataset and some independent variables are correlated. Do we need to remove the correlated variables first and then perform PCA?
How do we check for correlated variables? By p-value only?
We simply find the p-values, and if a p-value is greater than 0.05 we drop that variable.

Thanks

#### karandeeparora

##### New Member
Alumni
Good morning. At the end of session 3 you gave us a bonus question on logistic regression using health data. Has anybody got the solution? I can't seem to work it out.

#### J Balaram

##### Member
Alumni
Jayanth,

You have explained to us that a regularization function (L1 & L2) can be applied to the cost function in the Decision Tree regression algorithm. What will be the values for the weights to be used in the regularization function?

Thanks
Balaram

#### Jayanth_14

##### Member
Alumni
Trainer
> My question on this part: do we have any technique in Python to find the relation between the components returned from the PCA model and the features, so that we can drop those columns that have a lesser variance proportion?

It's useful to drop the transformed columns only after applying PCA. A column that might not seem interesting might be very important after transformation, and vice versa.
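On the mapping part of the question: a minimal sketch (not from the class notebook; the data and feature count are made up) showing that scikit-learn exposes the component-to-feature loadings via `pca.components_`:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 rows, 3 independent columns
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Each row of components_ is one principal component; each column in that
# row is the loading (weight) of an original feature in the component.
print(pca.components_.shape)          # (3, 3): components x original features
print(pca.explained_variance_ratio_)  # variance proportion per component
```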

#### Jayanth_14

##### Member
Alumni
Trainer
> You have explained to us that a regularization function (L1 & L2) can be applied to the cost function in the Decision Tree regression algorithm. What will be the values for the weights to be used in the regularization function?
The exact values of lambda in both Ridge and Lasso aren't that important. What matters are the accuracies/RMSEs we get after toggling between L1 and L2, e.g.:

sklearn LinearRegression (vanilla): RMSE 200
sklearn LinearRegression with L1 (Lasso): RMSE 190
sklearn LinearRegression with L2 (Ridge): RMSE 180

I would choose L2 in the example above for my production code.
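The class names above are pseudocode; in scikit-learn the L1 and L2 variants are the separate `Lasso` and `Ridge` estimators. A sketch on synthetic data (the RMSE numbers will differ from the illustrative 200/190/180):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rmses = {}
for name, model in [("vanilla", LinearRegression()),
                    ("L1 (Lasso)", Lasso(alpha=1.0)),
                    ("L2 (Ridge)", Ridge(alpha=1.0))]:
    model.fit(X_tr, y_tr)
    rmses[name] = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(f"{name}: RMSE {rmses[name]:.1f}")  # pick the lowest for production
```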

#### Sheik Mohamed

##### Member
Alumni
> It's useful to drop the transformed columns only after applying PCA. A column that might not seem interesting might be very important after transformation, and vice versa.

Thank you Jayanth for your response. However, my actual question was how do we know which column to drop and which ones to consider post PCA transformation? Thank you for helping out.

#### Jayanth_14

##### Member
Alumni
Trainer
> Suppose we have a dataset and some independent variables are correlated, so do we need to remove the correlated variables first and then perform PCA? How do we check the correlated variables? By p-value only?
As part of pre-processing you could look into correlated variables (but that's just EDA). You might want to drop or leave out heavily correlated variables; e.g. 'batting strike rate' is heavily correlated with 'number_of_boundaries'.

PCA is a completely different technique. It transforms each and every feature into new components.
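On the "how to check" part: collinearity is usually checked with a correlation matrix (e.g. `DataFrame.corr`) rather than p-values. A sketch with made-up cricket-style columns echoing the example above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
strike_rate = rng.normal(120.0, 15.0, size=50)
# number_of_boundaries constructed to be heavily correlated with strike rate
boundaries = 0.3 * strike_rate + rng.normal(0.0, 1.0, size=50)
df = pd.DataFrame({"batting_strike_rate": strike_rate,
                   "number_of_boundaries": boundaries,
                   "matches_played": rng.integers(1, 100, size=50)})

corr = df.corr()
print(corr.round(2))
# Flag pairs whose |correlation| exceeds a chosen threshold, e.g. 0.9,
# and consider dropping one column of each such pair before modelling.
```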

#### Sheik Mohamed

##### Member
Alumni
Thank you very much.

#### Jayanth_14

##### Member
Alumni
Trainer
> Thank you Jayanth for your response. However, my actual question was: how do we know which columns to drop and which ones to keep post PCA transformation?

Typically you keep the components that together account for about 90% of the cumulative variance, and drop the rest.
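A sketch of that rule of thumb on synthetic data (the retained count depends on the dataset); scikit-learn can also pick the count automatically when `n_components` is a fraction:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X[:, 0] *= 5.0  # give one hypothetical feature a much larger variance

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of leading components reaching 90% cumulative variance
n_keep = int(np.searchsorted(cumvar, 0.90, side="right")) + 1
print(n_keep)

# Equivalent shortcut: let PCA choose the count for a target variance
pca90 = PCA(n_components=0.90).fit(X)
print(pca90.n_components_)  # same count as n_keep
```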

#### Sheik Mohamed

##### Member
Alumni
> Typically you keep the components that together account for about 90% of the cumulative variance, and drop the rest.

OK Jay, thanks a lot. One final question: do the components returned by the PCA model come in any particular order relative to the actual independent features?

Alumni
Hello

thanks,
Sneha

#### _26874

##### Member
Alumni
Hi Sneha, did you manage to download? If yes, please guide me.

Thanks
Sugathan

##### Member
Alumni
Hi Jayanth,
I have a query regarding r2_score and R².
What is the difference between r2_score and R²? When I calculate r2_score from sklearn.metrics, the value comes out as 0.638250878783, but when I calculate R² from SSE and SST the value comes out as 0.999.
Please let me know which one is correct and what the difference between them is.
Code:

    import numpy as np
    from math import sqrt
    from sklearn.metrics import mean_squared_error, r2_score

    print(r2_score(Y_test, Y_pred))                 # 0.638250878783
    rms = sqrt(mean_squared_error(Y_test, Y_pred))  # 68680.74985167495
    print(rms)
    sst = np.sum((Y_test - np.mean(Y_test)) ** 2)   # 53827258367554.0
    sse = mean_squared_error(Y_test, Y_pred)        # 4717045400.1883478
    r2 = 1 - (sse / sst)                            # 0.999

Thanks & Regards
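An editorial note on the discrepancy, with a self-contained sketch (synthetic `Y_test`/`Y_pred`, not the original data): `mean_squared_error` returns a mean, so using it as `sse` while `sst` is a sum shrinks the ratio by a factor of n and pushes the manual R² towards 1. With SSE computed as a sum, the manual formula matches `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
Y_test = rng.normal(100.0, 20.0, size=50)
Y_pred = Y_test + rng.normal(0.0, 10.0, size=50)  # imperfect predictions

sst = np.sum((Y_test - np.mean(Y_test)) ** 2)  # total sum of squares
sse = np.sum((Y_test - Y_pred) ** 2)           # SUM (not mean) of squared errors
r2_manual = 1 - sse / sst

print(r2_manual, r2_score(Y_test, Y_pred))  # the two values agree
```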

#### _14589

##### Member
Alumni
Hi Jayanth,

I'd like to ask a favor regarding the submitted project, for further study.

Could you share an .ipynb (for the submitted project) with code for graph plots illustrating each result from linear regression, decision tree regression, and random forest regression?

Thanks
Cheers/Bho