# Data Science Capstone | Rohit | May 07 - May 28 | 2021

#### Support Simplilearn(4685)

##### Moderator
Staff member
Alumni
Hi Learners,

This thread is for you to discuss the queries and concepts related to the Data Science Capstone course only.

Happy Learning !!!

#### RajyaLakshmi Kunchala

##### Member
Good Morning All..

Started working on Health care project.

While doing the activities, I am a little confused about the following two questions. Can you please explain the difference between them?

Week1: Q3. There are integer and float data type variables in this dataset. Create a count (frequency) plot describing the data types and the count of variables.
Week2: Q1. Check the balance of the data by plotting the count of outcomes by their value. Describe your findings and plan future course of action.

For Week 1 Q3, we have to consider all the data.
For Week 2 Q1, we have to consider the data grouped by outcome values.
Ex: BMI.value_counts() based on outcome values

Is my understanding correct?

#### rohitrs_jam

##### Member
Week 1 Q3: Do a value-count plot of the data types reported by df.info().
Week 2 Q1: Consider only the outcome column in the dataset. How many 0's and how many 1's are there? Is the dataset imbalanced?
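The difference between the two plots can be sketched as follows. This is only an illustration under assumptions: the small DataFrame and its column names (`Glucose`, `BMI`, `Age`, `Outcome`) are hypothetical stand-ins for the healthcare dataset, and `df.dtypes.value_counts()` is one possible way to count the types that `df.info()` lists.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the healthcare dataset (column names are assumed).
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Age": [50, 31, 32, 21, 33, 30],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Week 1 Q3: number of columns (variables) per data type.
dtype_counts = df.dtypes.astype(str).value_counts()

# Week 2 Q1: number of rows per value of the outcome column.
outcome_counts = df["Outcome"].value_counts().sort_index()

# Draw both counts as bar charts side by side.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
dtype_counts.plot(kind="bar", ax=ax1, title="Columns per data type")
outcome_counts.plot(kind="bar", ax=ax2, title="Rows per outcome value")
fig.tight_layout()
```

The first chart describes the columns of the whole table, while the second describes the rows of a single column, which is why only the second one says anything about class balance.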

#### ROHIT RANJAN SRIVASTAVA

##### Active Member
Good Morning All..

Started working on Health care project.

While doing the activities, I am a little confused about the following two questions. Can you please explain the difference between them?

Week1: Q3. There are integer and float data type variables in this dataset. Create a count (frequency) plot describing the data types and the count of variables.
Week2: Q1. Check the balance of the data by plotting the count of outcomes by their value. Describe your findings and plan future course of action.

For Week 1 Q3, we have to consider all the data.
For Week 2 Q1, we have to consider the data grouped by outcome values.
Ex: BMI.value_counts() based on outcome values

Is my understanding correct?
Week 1 - Q3: use df.info() (hint).
Week 2 - Q1: Consider only the outcome column of the dataset. Do a count plot: how many 1's and 0's are there? Is the dataset balanced or imbalanced?

#### RajyaLakshmi Kunchala

##### Member
Week 1 Q3: Do a value-count plot of the data types reported by df.info().
Week 2 Q1: Consider only the outcome column in the dataset. How many 0's and how many 1's are there? Is the dataset imbalanced?
Thank you Sir

#### Apoorv Sinha_1

##### Member
For project 1, do we have to remove the BLOCKID column, since all the values in it are null?

#### ROHIT RANJAN SRIVASTAVA

##### Active Member
For project 1, do we have to remove the BLOCKID column, since all the values in it are null?
Yes, you can remove it.

#### Krishan Kumar Sharma_1

##### Member
Hello Everyone,

I have a general query: how can I see the complete output in Python without it getting truncated?

#### Suresh A

##### Member
Hello Sir, for the healthcare project, Insulin and SkinThickness have a very high number of zero values (374 and 227). Is it okay to replace these zero values with the mean? Would it not affect the accuracy of the training data?

#### ROHIT RANJAN SRIVASTAVA

##### Active Member
Hello Everyone,

I have a general query: how can I see the complete output in Python without it getting truncated?
def print_full(x):
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', 2000)
    pd.set_option('display.float_format', '{:20,.2f}'.format)
    pd.set_option('display.max_colwidth', None)
    print(x)
    pd.reset_option('display.max_rows')
    pd.reset_option('display.max_columns')
    pd.reset_option('display.width')
    pd.reset_option('display.float_format')
    pd.reset_option('display.max_colwidth')
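A more compact variant of the same idea is sketched below, assuming pandas is imported as `pd`. It uses `pd.option_context`, which restores the previous settings automatically, even if printing raises an error. The `demo` DataFrame is hypothetical and only there to show the call.

```python
import pandas as pd

def print_full(x):
    """Print a DataFrame or Series without truncation, then restore the settings."""
    with pd.option_context(
        "display.max_rows", None,
        "display.max_columns", None,
        "display.width", 2000,
        "display.max_colwidth", None,
    ):
        print(x)

# Hypothetical demo data: 100 rows would normally be truncated in the output.
demo = pd.DataFrame({"value": range(100)})

rows_before = pd.get_option("display.max_rows")
print_full(demo)
rows_after = pd.get_option("display.max_rows")  # same as before the call
```

Because the options are only changed inside the `with` block, other output in the notebook keeps its normal, truncated display.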

#### Apoorv Sinha_1

##### Member
I am unable to geo-visualize the data for the 1st project. What should I do?

#### ROHIT RANJAN SRIVASTAVA

##### Active Member
Hello Sir, for the healthcare project, Insulin and SkinThickness have a very high number of zero values (374 and 227). Is it okay to replace these zero values with the mean? Would it not affect the accuracy of the training data?
An approach you can take is to use data from similar cases to estimate a replacement value for the missing feature.
To get an idea of how to choose similar cases, plot a correlation matrix. This shows the relationship between every pair of features in the dataset.
To plot this correlation matrix, eliminate all records with those zero values.

Observations: High correlation between skin thickness and BMI, and between insulin and glucose.

Option 1) Try to fill in the missing values using the observation above. Make scatterplots to get an idea of how exactly they are related.
Option 2) The independent variables are highly correlated (multicollinearity). Consider removing some of them (use VIF, the variance inflation factor).
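One possible reading of Option 1 is sketched below, as an assumption rather than the required method: treat the zeros as missing, fit a straight line of SkinThickness against BMI on the complete rows, and use that line to fill the gaps. The sample values are hypothetical, and the choice of a simple linear fit via `np.polyfit` is mine, not something the project prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real dataset has many more rows and columns.
df = pd.DataFrame({
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3],
    "SkinThickness": [35, 29, 0, 23, 35, 0, 32, 0],
})

# Treat zeros as missing values rather than real measurements.
df["SkinThickness"] = df["SkinThickness"].astype(float).replace(0.0, np.nan)

# Correlation is computed on the rows where both values are present.
print(df.corr())

# Fit SkinThickness = slope * BMI + intercept on the complete rows only.
known = df.dropna(subset=["SkinThickness"])
slope, intercept = np.polyfit(known["BMI"], known["SkinThickness"], deg=1)

# Fill each missing value with the prediction from the fitted line.
missing = df["SkinThickness"].isna()
df.loc[missing, "SkinThickness"] = slope * df.loc[missing, "BMI"] + intercept
```

A scatterplot of the two columns before fitting helps to judge whether a straight line is a reasonable description of the relationship at all.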

#### Balavishwakumar S

##### Member
Hello sir, I am doing the 2nd project. The code below works for the other variables when I try to remove zeroes, but it shows an error for SkinThickness. I have also attached the PDF: the code is on page 4 and the error on page 5.

plt.hist(df_health['SkinThickness'])
drop_D=df_health["SkinThickness"].index[df_health["SkinThickness"] == 0].tolist()
df_health=df_health.drop(df_health.index[drop_D])

---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
1 plt.hist(df_health['SkinThickness'])
2 drop_D=df_health["SkinThickness"].index[df_health["SkinThickness"] == 0].tolist()
----> 3 df_health=df_health.drop(df_health.index[drop_D])
4 df_health.describe()

/usr/local/lib/python3.7/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
3938
3939 key = com.values_from_object(key)
-> 3940 result = getitem(key)
3941 if not is_scalar(result):
3942 if np.ndim(result) > 1:

IndexError: index 714 is out of bounds for axis 0 with size 710

#### Attachments

• Untitled.pdf
126 KB · Views: 10

#### ROHIT RANJAN SRIVASTAVA

##### Active Member
Hello sir, I am doing the 2nd project. The code below works for the other variables when I try to remove zeroes, but it shows an error for SkinThickness. I have also attached the PDF: the code is on page 4 and the error on page 5.

plt.hist(df_health['SkinThickness'])
drop_D=df_health["SkinThickness"].index[df_health["SkinThickness"] == 0].tolist()
df_health=df_health.drop(df_health.index[drop_D])

Try one of the following:
1) df_filtered = df[df['SkinThickness'] != 0] (df_filtered keeps only the rows with non-zero values)

2) df.drop(df[df['SkinThickness'] == 0].index, inplace=True)

3) df_test = df.drop(df[df['SkinThickness'] == 0].index)
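A likely cause of the IndexError, assuming rows had already been dropped for other columns, is that the index labels no longer run from 0 to n-1, while `df_health.index[drop_D]` looks the labels up as positions. The sketch below reproduces this on a small hypothetical frame and shows that passing the labels straight to `drop` avoids the problem.

```python
import pandas as pd

# Hypothetical sample data standing in for df_health.
df_health = pd.DataFrame({
    "Glucose": [148, 0, 183, 89, 137],
    "SkinThickness": [35, 29, 0, 23, 0],
})

# An earlier cleaning step leaves gaps in the index labels: 0, 2, 3, 4.
df_health = df_health[df_health["Glucose"] != 0]

# Index labels (not positions) of the rows where SkinThickness is zero.
drop_labels = df_health.index[df_health["SkinThickness"] == 0]

# Using labels as positions fails: label 4 is not a valid position among 4 rows.
try:
    df_health.index[drop_labels.tolist()]
except IndexError as err:
    print("Positional lookup fails:", err)

# drop() expects labels, so they can be passed directly.
cleaned = df_health.drop(drop_labels)
print(cleaned)
```

Calling `reset_index(drop=True)` after each cleaning step would also make labels and positions agree again, at the cost of losing the original row numbers.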

#### Harsh Kohli_1

##### Member
Hi Rohit,

I am working on the First project (Real Estate) and for week 4 tasks a) Run a model at a Nation level. If the accuracy levels and R square are not satisfactory proceed to below step., when I run the Linear Regression and use fit (as below) :

linereg=LinearRegression()
linereg.fit(x_train_scaled,y_train)

I am getting the following error :

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The y_train dataset has no NaN values as below :

y_train.isnull().sum()
0

Can you please guide me how to fix this error ?

My data looks like this :

x_train_scaled
array([[-0.33201039, 0.47192704, -1.24414603, ..., 0.51135423,
-0.32259219, -0.26063075],
[ 0.5625587 , -0.6265974 , -0.11809286, ..., -1.18569352,
-0.2326788 , -0.20721166],
[-0.23035481, -0.6265974 , -0.13481274, ..., 1.02080109,
-0.14518244, 0.12706858],
...,
[ 0.01361858, -1.23688875, 1.03392011, ..., 1.44245815,
-0.67064403, -0.47800536],
[ 3.59189497, 1.20427667, 0.87758589, ..., 1.81578649,
-0.47969893, -0.98158303],
[-0.84028829, 0.2278105 , 1.32059491, ..., -1.2196862 ,
0.53061822, 0.67420335]])

267822 1414.80295
246444 864.41390
245683 1506.06758
279653 1175.28642
247218 1192.58759
...
279212 770.11560
277856 2210.84055
233000 1671.07908
287425 3074.83088
265371 1455.42340
Name: hc_mortgage_mean, Length: 27161, dtype: float64>

y_train.describe()
count 27161.000000
mean 1629.260500
std 617.420278
min 234.650000
25% 1162.179990
50% 1471.288370
75% 1969.768880
max 4462.342290
Name: hc_mortgage_mean, dtype: float64

#### Ahamika Banerjee

##### Active Member
Hello @ROHIT RANJAN SRIVASTAVA sir,

In the 2nd project which is healthcare, week 4, 1st question which is : Create a classification report by analyzing sensitivity, specificity, AUC (ROC curve), etc. Please be descriptive to explain what values of these parameter you have used.

Do we need to create a classification report of every model like logistic reg, KNN, ensemble learning or do we need to create report and ROC of any one model?

#### Harsh Kohli_1

##### Member
Hi Rohit,

Can you please reply to my query (Post #16 above) posted on Saturday May 29th, 2021? I am getting an error and not sure what the issue is ?

#### Apoorv Sinha_1

##### Member
How do I create a heatmap in Tableau? I am getting an error while creating it for project 1.

#### Gaanashree S Patil_1

Alumni
Hi Rohit,

I am working on the First project (Real Estate) and for week 4 tasks a) Run a model at a Nation level. If the accuracy levels and R square are not satisfactory proceed to below step., when I run the Linear Regression and use fit (as below) :

linereg=LinearRegression()
linereg.fit(x_train_scaled,y_train)

I am getting the following error :

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The y_train dataset has no NaN values as below :

y_train.isnull().sum()
0

Can you please guide me how to fix this error ?

When you pass the values to the model's fit function, both x_train_scaled and y_train are inputs, so I suggest you check that x_train_scaled is also free from NaN values.
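A quick way to run that check on a NumPy array is sketched below. The small array is a hypothetical stand-in for `x_train_scaled`; the error message also mentions infinity, so the sketch counts both NaN and infinite entries per column.

```python
import numpy as np

# Hypothetical stand-in for x_train_scaled, with one NaN and one infinity.
x_train_scaled = np.array([
    [0.5, -1.2, np.nan],
    [1.1, np.inf, 0.3],
    [-0.7, 0.4, 0.9],
])

# Count the problematic entries in each column.
nan_per_column = np.isnan(x_train_scaled).sum(axis=0)
inf_per_column = np.isinf(x_train_scaled).sum(axis=0)

# True only if every entry is a finite number.
all_finite = bool(np.isfinite(x_train_scaled).all())

print("NaN per column:", nan_per_column)
print("inf per column:", inf_per_column)
print("All values finite:", all_finite)
```

If a column shows a non-zero count, the matching column of the unscaled training data is the place to look, since scaling passes NaN values through unchanged.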

#### ROHIT RANJAN SRIVASTAVA

##### Active Member
Hello @ROHIT RANJAN SRIVASTAVA sir,

In the 2nd project which is healthcare, week 4, 1st question which is : Create a classification report by analyzing sensitivity, specificity, AUC (ROC curve), etc. Please be descriptive to explain what values of these parameter you have used.

Do we need to create a classification report of every model like logistic reg, KNN, ensemble learning or do we need to create report and ROC of any one model?
Yes. I am sorry for the delay in responding.

#### ROHIT RANJAN SRIVASTAVA

##### Active Member
How do I create a heatmap in Tableau? I am getting an error while creating it for project 1.
You can refer to the recording of the Tableau session, or you can watch the self-learning videos. Sorry for the delay in replying.

#### ROHIT RANJAN SRIVASTAVA

##### Active Member
Hi Rohit,

I am working on the First project (Real Estate) and for week 4 tasks a) Run a model at a Nation level. If the accuracy levels and R square are not satisfactory proceed to below step., when I run the Linear Regression and use fit (as below) :

linereg=LinearRegression()
linereg.fit(x_train_scaled,y_train)

I am getting the following error :

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The y_train dataset has no NaN values as below :

y_train.isnull().sum()
0

Can you please guide me how to fix this error ?

Get rid of the null values whichever way you like, or create your own function to impute the missing values.
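A minimal sketch of such an imputation function follows, assuming the features are still in a pandas DataFrame before scaling. The function name `impute_with_median`, the column names, and the choice of the median as fill value are all illustrative assumptions, not part of the project brief.

```python
import numpy as np
import pandas as pd

def impute_with_median(frame):
    """Return a copy with +/-inf treated as missing and NaN filled by column medians."""
    cleaned = frame.replace([np.inf, -np.inf], np.nan)
    return cleaned.fillna(cleaned.median(numeric_only=True))

# Hypothetical training features with one NaN and one infinite value.
x_train = pd.DataFrame({
    "pop": [1200.0, np.nan, 900.0, 1500.0],
    "rent_mean": [800.0, 950.0, np.inf, 1100.0],
})

x_train_clean = impute_with_median(x_train)
print(x_train_clean)
```

In a real train/test split, the medians would normally be computed on the training data only and then reused to fill the test data, so that no information leaks from the test set.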