
Machine Learning Advanced Certification - March 13 - April 17, 2021 batch

CHAN Siu Chung

Active Member
Dear Vaishali,
For outliers, I got the idea from class that observations below Q1 and above Q3 should be ignored.
I read more material showing another idea: the boundaries should be set as
Q1 - 1.5×IQR and Q3 + 1.5×IQR.
The dropout percentage would then be around 2.5% + 2.5% if a normal distribution is assumed for the dataset.

Could you help to confirm this outlier issue?
 
Hello!
I have a question: When I go through the self-learning part (Lessons 3 & 4), they often talk about specific files that they use for the examples. In reality these files don't exist, or at least I can't find them. One example: 4.14 Demo: Linear Regression. The file is supposed to be 'Advertising.csv'. Can I find that file somewhere? I would love to be able to follow along and compute those steps myself. Thank you in advance for anybody's help!
Dominik
 
I also need help with this:

Boston Homes I
DESCRIPTION
A real estate company wants to build homes at different locations in Boston. They have data for historical prices but haven’t decided the actual prices yet. They want to price the homes so that they are affordable to the general public.
Objective:
• Import the Boston data from sklearn and read the description using DESCR.
• Analyze the data and predict the approximate prices for the houses.

The required data set for this project is built into the Python sklearn package.

What does it mean that the data set is built in?
How do I import sklearn? I tried: import sklearn as sk. Is that correct?
And what is the next step?

I don't understand this and I hope someone can help me. I would love to get this exercise done before we meet next time, but I need help.

Thank you,

Dominik
 
While I still hope to get some feedback on my other posts, I think I was able to narrow down what is important for me in this class.

Machine Learning: What I Need To Know:
  • List of algorithms used in Machine Learning
  • Libraries we need to import to use them
  • Functions used for each algorithm
  • How to use the functions
  • How to interpret the results – know if a model is good or not
  • How to improve a model
  • What tools to use to decide if something is relevant or not
  • How to interpret the results of those tools
Algorithm | Libraries | Functions | Use of Func | Model Int | Improve M | Tools | Interpret T
Linear Reg | | | | | | |

I hope this makes sense. :)
 

I forgot to mention: Knowing when to use which model
 
Hello All,

I'm trying to find out which holiday season has more sales. The plan is to create a new column in my dataframe that tells me the holiday type based on the date.

For example, if the date is '28-12-2012' then the new column should be assigned as "Christmas".

Sample dataset:
[screenshot of the sample dataset]

My dictionary:
[screenshot of the holiday_dict dictionary]

To achieve this, I'm trying to compare the "Date" column (type: non-null object) in my dataframe with the dictionary "holiday_dict". If the dates match, the dictionary key should be assigned to the new column. This column can be used in further processing to find the sales in each holiday event; for example, the total sale for Christmas is xxxxxx.xx.

I tried the below code and it's not working as expected. Can someone help me identify the issue here?
[screenshot of the attempted code]

Or is there any other way we can achieve this? Any thoughts or inputs?
Thanks!!
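A minimal sketch of one way to do this with pandas .map(), assuming (from the screenshots) that the Date column holds strings in DD-MM-YYYY format and that holiday_dict maps holiday names to date strings; the sample values below are hypothetical stand-ins:

import pandas as pd

# Hypothetical stand-ins for the dataframe and dictionary shown in the screenshots
df = pd.DataFrame({"Date": ["28-12-2012", "04-07-2012", "15-01-2012"],
                   "Sales": [1500.50, 980.25, 410.00]})
holiday_dict = {"Christmas": "28-12-2012", "Independence Day": "04-07-2012"}

# Invert the dictionary so dates become keys, then map each Date to its holiday name;
# dates with no match become NaN
date_to_holiday = {date: holiday for holiday, date in holiday_dict.items()}
df["Holiday"] = df["Date"].map(date_to_holiday)

# Total sales per holiday event, e.g. the total sale for Christmas
print(df.groupby("Holiday")["Sales"].sum())

One likely reason a direct comparison fails is a dtype or format mismatch (e.g. Timestamps vs. strings); converting both sides with pd.to_datetime before mapping makes the match robust.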
 
Hello!

I am trying to do Project 3, movie ratings. I know the screenshot shows something that is incorrect, but I don't know how to do it. :-/

Problem #1: How do I create the target? I figured the target should be average rating.

Problem #2: It is mentioned that some movies have not been rated – but they all have been rated! That confuses me. Should I worry about this at all?

Problem #3: I don't know what algorithm to use. I think I know how to split the data and I understand this is a regression issue.

Thank you for anybody's help!



[screenshot of the attempted (incorrect) code]
 
I am really trying to solve all three projects, and I need major help! I hope that in the next couple of sessions we can have a look at them, thank you!
 

Amit Sahay_1

Member
Alumni
Hello All, for the income qualification project, I could not find any poverty level column in the data set. Do we need to create a poverty level column based on logic for each 'ID' or family member? Can anyone help with this?
 
Hello!

I am having trouble with Project 1. I loaded the data:

mercedes_train = pd.read_csv("mercedes_train.csv")
mercedes_test = pd.read_csv("mercedes_test.csv")

I don't know how to code:
6. Perform Dimensionality Reduction

and
7. Predict test_df values using XGBoost

For step 6 (Perform Dimensionality Reduction), I suspect we start with:

from sklearn.decomposition import PCA

pca = PCA()

But what then? I don't know how to integrate this and write the code for train or test:

X_train = the features from mercedes_train.csv
y_train = the target column from mercedes_train.csv alone

For step 7 (Predict test_df values using XGBoost), I suspect we start with:

import xgboost as xgb
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

Is that correct?

I don't know where or how to do the train/test split – in step 6 or step 7?

My apologies, I know this must be easy, and when I look at our Iris examples, I can follow along. But with this data set, which is already split, I am really clueless. Any help is much welcomed, as I truly want to be able to submit this in time.

Thank you,

Dominik
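A minimal sketch of one way to start step 6, under assumptions rather than as the official solution: the training file has a target column named 'y' plus feature columns, and any categorical columns have already been encoded to numbers:

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

mercedes_train = pd.read_csv("mercedes_train.csv")

X_train = mercedes_train.drop(columns="y")  # all feature columns
y_train = mercedes_train["y"]               # the target column alone

# Scale the features, then fit PCA on the training data only
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)

pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
X_train_pca = pca.fit_transform(X_train_scaled)
print(X_train_pca.shape)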
 

Amit Sahay_1

Member
Alumni
Hi Vaishali, I am getting an error with the code below. Can you please help with guidance on this?

# convert the dataframe to a list
records = []

for i in range(0, 7501):  # i refers to the row index
    records.append([str(store_data.values[i, j]) for j in range(0, 20)])  # j refers to the column index

print(records)

Error:
IndexError Traceback (most recent call last)
<ipython-input-6-5e8e37069247> in <module>
3
4 for i in range(0,7501): # i refers to the row index
----> 5 records.append([str(store_data.values[i,j]) for j in range(0,20)]) #j refers to the column index
6
7 print(records)

<ipython-input-6-5e8e37069247> in <listcomp>(.0)
3
4 for i in range(0,7501): # i refers to the row index
----> 5 records.append([str(store_data.values[i,j]) for j in range(0,20)]) #j refers to the column index
6
7 print(records)

IndexError: index 7500 is out of bounds for axis 0 with size 7500
 
Hi again!

In the meantime I made some progress, in the form of the following code:

from sklearn.model_selection import train_test_split

X = mercedes_train.drop(columns='y')
y = mercedes_train['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2 , random_state = 21)

print("Shape of the training features is ", X_train.shape)
print("Shape of the training target is ", y_train.shape)
print("Shape of the testing features is ", X_test.shape)
print("Shape of the testing target is ", y_test.shape)

Shape of the training features is (3367, 359)
Shape of the training target is (3367,)
Shape of the testing features is (842, 359)
Shape of the testing target is (842,)
 
I also wanted to post...

from sklearn.decomposition import PCA


from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.fit_transform(X_test)

pca = PCA()

X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.fit_transform(X_test_scaled)

pca.explained_variance_ratio_ (I spare you the details! :))

I wish the numbers were higher, but I want to reach at least 0.75, so I am going with 50!

Cumulative variance of PCs 1-10 = 0.406533
Cumulative variance of PCs 1-20 = 0.548485
Cumulative variance of PCs 1-30 = 0.641701
Cumulative variance of PCs 1-40 = 0.710125
Cumulative variance of PCs 1-50 = 0.76448
 
as well as ...

pca2 = PCA (n_components = 50)


X_train_pca2 = pca2.fit_transform(X_train_scaled)
X_test_pca2 = pca2.fit_transform(X_test_scaled)

array([0.07628714, 0.06179395, 0.05368609, 0.04006838, 0.03712833,
0.03666001, 0.03194647, 0.02558459, 0.02258519, 0.02079329,
0.01915542, 0.0178314 , 0.01631591, 0.01418476, 0.01411336,
0.01374535, 0.01253127, 0.01165091, 0.01124655, 0.0111733 ,
0.01078417, 0.01032414, 0.00989786, 0.00970184, 0.00946372,
0.00914253, 0.00890688, 0.0085615 , 0.00826295, 0.00812991,
0.0076667 , 0.007577 , 0.00725707, 0.00705873, 0.00684972,
0.00664393, 0.00641778, 0.00632845, 0.00627182, 0.00607875,
0.00593005, 0.00574376, 0.00557756, 0.00549602, 0.00544791,
0.00529223, 0.00508359, 0.0050175 , 0.00491525, 0.00484817])

And this is how far I got! I feel that I still have the following to do:

1. PCA on X_test,
2. Build D_train and D_test using PCAed X_train and X_test
3. XGB model train

Is that correct? And what does the code look like? How do I do PCA on X_test? Have I not already done it??


And afterwards: ...

import xgboost as xgb
from sklearn.metrics import r2_score

D_train = xgb.DMatrix(X_train, label = y_train)
D_test = xgb.DMatrix(X_test)

Does the above look right?



And how do I finish the XGBoost?? The below is from our example:

import xgboost as xgb
from sklearn.metrics import r2_score

D_train = xgb.DMatrix(X_train, label = y_train)
D_test = xgb.DMatrix(X_test)

What would be the best numbers here?
param ={"eta" : 0.02, "max_depth" : 4 , "objective" : "multi:softmax" , "num_class" : 3}

The line below does not work. Why?
xgb_model = xgb.train(param, D_train, 20)


I was hoping to wrap it up with:

y_test_pred_xgb = xgb_model.predict(D_test)

from sklearn.metrics import accuracy_score

print("The testing accuracy for the xgb model is:" , accuracy_score(y_test, y_test_pred_xgb))

Thank you in advance for your help!

Dominik
 

CHAN Siu Chung

Active Member
Hi Vaishali,
It was a great pleasure to join your class.
Could you please post the last day's (Day 12) notes and notebooks on Google Drive?
Thank you!
 

Vaishali_26

Well-Known Member
Alumni
Hi Chan,
Yes, the exact condition to detect outliers is the one that you mentioned in your post: any point that lies above Q3 + 1.5(IQR) or below Q1 - 1.5(IQR) is an outlier.
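One small correction on the percentages: for normally distributed data, the Q1 - 1.5×IQR and Q3 + 1.5×IQR fences exclude only about 0.7% of points in total (roughly 0.35% per tail), not 2.5% per tail. A minimal sketch of the rule on a hypothetical pandas Series:

import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])  # hypothetical data; 100 is an outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)  # the points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]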
 

Vaishali_26

Well-Known Member
Alumni
Hi Dominik,

We discussed this query of yours elaborately in class. I hope you have submitted your project by now.
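For readers following along, a minimal sketch of the remaining steps, under assumptions rather than the class's exact solution: X_train_scaled, X_test_scaled, y_train and y_test exist as in Dominik's posts, and the Mercedes target 'y' is continuous, i.e. a regression problem. Two likely culprits in the posted code: 'multi:softmax' expects integer class labels (which would make xgb.train fail on a continuous target), and the scaler/PCA should be fit on the training split only and applied to the test split with transform, not fit_transform:

import xgboost as xgb
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score

# Fit PCA on the training split only; reuse the fitted object on the test split
pca = PCA(n_components=50)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)  # transform, not fit_transform

D_train = xgb.DMatrix(X_train_pca, label=y_train)
D_test = xgb.DMatrix(X_test_pca)

# A regression objective instead of 'multi:softmax'; eta and max_depth are starting
# guesses to tune, not "best numbers"
param = {"eta": 0.02, "max_depth": 4, "objective": "reg:squarederror"}
xgb_model = xgb.train(param, D_train, num_boost_round=200)

# accuracy_score is for classification; r2_score is the usual metric here
y_test_pred = xgb_model.predict(D_test)
print("Test R^2:", r2_score(y_test, y_test_pred))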
 

Vaishali_26

Well-Known Member
Alumni
Hi Amit,
Kindly check the shape of the dataset, and also your import statement.

It should be like below:
1. Import statement:
store_data = pd.read_csv("D:\\SimpliLearn\\Machine+Learning\\store_data.csv", header=None)

2. Shape of the dataset:
(7501, 20)
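The IndexError ('size 7500' against the expected 7501 rows) typically means the first row of the file was consumed as a header, which is why header = None matters. A minimal sketch that reads the file the same way and avoids hard-coding the dimensions (the shortened path is illustrative):

import pandas as pd

store_data = pd.read_csv("store_data.csv", header=None)  # header=None keeps row 0 as data

rows, cols = store_data.shape  # should be (7501, 20)

records = []
for i in range(rows):  # i refers to the row index
    records.append([str(store_data.values[i, j]) for j in range(cols)])  # j refers to the column index

print(len(records))  # 7501 transactions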
 

Vaishali_26

Well-Known Member
Alumni
Hi Dominik,

We have already worked with scikit-learn's built-in datasets, like iris. You will have to import the Boston dataset the same way:

(i.e.) from sklearn.datasets import load_boston
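A minimal sketch of loading the built-in data and fitting a first model. Two caveats: load_boston worked in the scikit-learn versions current for this batch but has since been removed (version 1.2 onward), and the linear regression below is only an illustration, not the required algorithm:

from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

boston = load_boston()
print(boston.DESCR)  # read the dataset description, as the project asks

X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on the test split:", model.score(X_test, y_test))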
 

CHAN Siu Chung

Active Member
You all were wonderful participants! All the best for your future endeavors !!
Thank you so much, Vaishali.
You have given us a very practical approach to understanding the problems, while on the theoretical side you have "reinforced" a lot!
I appreciate very much having you as my mentor!
 