Data Science Capstone | Rohit | May 07 - May 28 | 2021

Good morning all,

I have started working on the Healthcare project.

While doing the activities I got a little confused about the following two questions. Can you please explain the difference between them?

Week 1, Q3: There are integer and float data type variables in this dataset. Create a count (frequency) plot describing the data types and the count of variables.
Week 2, Q1: Check the balance of the data by plotting the count of outcomes by their value. Describe your findings and plan future course of action.


For Week 1 Q3 we have to consider all the data.
For Week 2 Q1 we have to consider the data grouped by outcome values.
Ex: BMI.value_counts() based on outcome values

Is my understanding correct?
 
Week 1 Q3: Plot the value counts of the column data types, i.e. the dtypes that df.info() lists.
Week 2 Q1: Consider only the outcome column of the dataset. How many 0's and how many 1's are there? Is the dataset imbalanced?
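A minimal sketch of the Week 1 Q3 idea, assuming the data is already in a DataFrame called df. The column names and values below are made up for illustration and are not the project dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the healthcare dataset (values are made up).
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1],
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167],
    "Outcome": [1, 0, 1, 0],
})

# One entry per data type, holding the number of columns of that type.
dtype_counts = df.dtypes.astype(str).value_counts()
print(dtype_counts)

# Bar chart: data type on the x-axis, number of variables on the y-axis.
ax = dtype_counts.plot(kind="bar")
ax.set_xlabel("Data type")
ax.set_ylabel("Number of variables")
plt.tight_layout()
```

Here the whole frame is used, because the question is about the variables (columns) rather than about any single column's values.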
 
Week 1 Q3: Use df.info() (hint).
Week 2 Q1: Consider only the outcome column of the dataset. Do a count plot: how many 1's and 0's are there? Is the dataset balanced or imbalanced?
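For Week 2 Q1, a rough sketch of the balance check could look like this. The column name Outcome and the ten values are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical outcome column (made-up values, 0 = negative, 1 = positive).
df = pd.DataFrame({"Outcome": [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]})

# Absolute counts and relative shares of each class.
outcome_counts = df["Outcome"].value_counts().sort_index()
outcome_share = df["Outcome"].value_counts(normalize=True).sort_index()
print(outcome_counts)
print(outcome_share)

# Count plot of the two classes.
ax = outcome_counts.plot(kind="bar")
ax.set_xlabel("Outcome")
ax.set_ylabel("Number of records")
plt.tight_layout()
```

If one class has a clearly larger share than the other, the dataset is imbalanced, and that finding is what the "plan future course of action" part of the question asks you to respond to.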
 

Suresh A

Member
Hello Sir, for the healthcare project, Insulin and SkinThickness have a very high number of zero values (374 and 227). Is it okay to replace these zero values with the mean? Would it not affect the accuracy of the training data?
 
Hello Everyone,

I have a general query: how can I see the complete output in Python without it getting truncated?

import pandas as pd

def print_full(x):
    # Temporarily lift pandas' display limits, print x in full, then restore the defaults.
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', 2000)
    pd.set_option('display.float_format', '{:20,.2f}'.format)
    pd.set_option('display.max_colwidth', None)
    print(x)
    pd.reset_option('display.max_rows')
    pd.reset_option('display.max_columns')
    pd.reset_option('display.width')
    pd.reset_option('display.float_format')
    pd.reset_option('display.max_colwidth')
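A possible alternative to the function above, as a sketch: pd.option_context applies the same display options only inside a with block and restores them automatically, even if printing raises an error. The helper name format_full and the 100-row frame are made up for illustration.

```python
import pandas as pd

def format_full(obj):
    """Return the full, untruncated text representation of a pandas object."""
    with pd.option_context(
        "display.max_rows", None,
        "display.max_columns", None,
        "display.width", 2000,
        "display.max_colwidth", None,
    ):
        # The options revert to their previous values when the block exits.
        return repr(obj)

# A frame longer than the default row limit, so it would normally be truncated.
wide = pd.DataFrame({"n": range(100)})
text = format_full(wide)
print(text)
```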
 
Hello Sir, for the healthcare project, Insulin and SkinThickness have a very high number of zero values (374 and 227). Is it okay to replace these zero values with the mean? Would it not affect the accuracy of the training data?
One approach is to use data from similar cases to estimate a replacement value for the missing feature.
To get an idea of how to choose similar cases, plot a correlation matrix, which shows the relationship between every pair of features in the dataset.
Before plotting the correlation matrix, exclude all records with those zero values so that they do not distort the correlations.
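The filtering step can be sketched roughly as below. The eight rows are made-up values, and the column names are assumed to match the project dataset.

```python
import pandas as pd

# Hypothetical sample (made-up values); 0 stands for "not recorded".
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 197],
    "SkinThickness": [35, 29, 0, 23, 35, 0, 32, 45],
    "Insulin": [0, 0, 0, 94, 168, 0, 88, 543],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 30.5],
})

# Columns in which a zero really means a missing measurement.
zero_means_missing = ["SkinThickness", "Insulin"]

# Keep only the records where every one of those columns is non-zero.
complete = df[(df[zero_means_missing] != 0).all(axis=1)]

# Pairwise Pearson correlations on the complete records only.
corr = complete.corr()
print(corr.round(2))
```

The resulting corr frame can be passed to a heatmap function of your choice for plotting.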

Observations: High correlation between skin thickness and BMI, and between insulin and glucose.

Option 1) Try to fill in the missing values using the observations above. Make scatterplots to get an idea of how exactly the features are related.
Option 2) The independent variables are highly correlated (multicollinearity). Consider removing variables (use VIF, the variance inflation factor).
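One way Option 1 could look in practice, purely as a sketch: fill a missing SkinThickness with the median of records in the same BMI band. The two-band split, the use of the median, and the sample values are all assumptions for illustration, not the prescribed method.

```python
import numpy as np
import pandas as pd

# Hypothetical sample (made-up values); 0 stands for "not recorded".
df = pd.DataFrame({
    "SkinThickness": [35, 29, 0, 23, 35, 0, 32, 45],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 30.5],
})

# Treat 0 as missing so it no longer distorts means and medians.
df["SkinThickness"] = df["SkinThickness"].replace(0, np.nan)

# Group records into BMI bands (two equal-sized bands; an arbitrary choice).
bmi_band = pd.qcut(df["BMI"], q=2, labels=False)

# Median skin thickness of the known values within each band.
band_median = df.groupby(bmi_band)["SkinThickness"].transform("median")

# Fill gaps from the band median, then fall back to the overall median
# in case a whole band had no known values.
df["SkinThickness"] = df["SkinThickness"].fillna(band_median)
df["SkinThickness"] = df["SkinThickness"].fillna(df["SkinThickness"].median())
print(df)
```

Compared with a single global mean, this keeps the filled values in line with the related feature, which is the point of looking at the correlations first.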
 
Hello sir, I am doing the 2nd project. The code below works for the other variables when I try to remove zeroes, but it shows an error for SkinThickness. I have also attached the PDF: the code is on page 4 and the error is on page 5.

plt.hist(df_health['SkinThickness'])
drop_D=df_health["SkinThickness"].index[df_health["SkinThickness"] == 0].tolist()
df_health=df_health.drop(df_health.index[drop_D])

---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-7-abcaaddf42d2> in <module>
1 plt.hist(df_health['SkinThickness'])
2 drop_D=df_health["SkinThickness"].index[df_health["SkinThickness"] == 0].tolist()
----> 3 df_health=df_health.drop(df_health.index[drop_D])
4 df_health.describe()

/usr/local/lib/python3.7/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
3938
3939 key = com.values_from_object(key)
-> 3940 result = getitem(key)
3941 if not is_scalar(result):
3942 if np.ndim(result) > 1:

IndexError: index 714 is out of bounds for axis 0 with size 710
 

Attachments

  • Untitled.pdf
    126 KB · Views: 9
Try any one of:
1) df_filtered = df[df['SkinThickness'] != 0]    (df_filtered keeps only the rows with non-zero values)

2) df.drop(df[df['SkinThickness'] == 0].index, inplace=True)

3) df_test = df.drop(df[df['SkinThickness'] == 0].index)
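As for why the original code fails: most likely drop_D holds index labels, while df_health.index[drop_D] looks them up as positions. Once earlier steps have dropped rows, a label such as 714 can be larger than the 710 positions that remain. A small sketch with made-up data that reproduces the situation:

```python
import pandas as pd

# Hypothetical sample (made-up values).
df_health = pd.DataFrame({
    "Glucose": [148, 0, 183, 89, 137],
    "SkinThickness": [35, 29, 0, 23, 0],
})

# An earlier cleaning step removes a row and leaves a gap in the index labels.
df_health = df_health[df_health["Glucose"] != 0]  # labels are now 0, 2, 3, 4

# Index labels of the rows to remove (labels 2 and 4).
zero_labels = df_health.index[df_health["SkinThickness"] == 0]

# df_health.index[[2, 4]] would read those labels as positions; position 4
# does not exist in a 4-row frame, which is what raises the IndexError.
# drop() expects labels, so pass them directly instead.
df_health = df_health.drop(zero_labels)
print(df_health)
```

This also explains why the same pattern worked for the first variables: before any rows are dropped, labels and positions happen to be identical.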
 
Hi Rohit,

I am working on the first project (Real Estate). For the week 4 task "a) Run a model at a Nation level. If the accuracy levels and R square are not satisfactory proceed to below step.", I run Linear Regression and call fit as below:

linereg=LinearRegression()
linereg.fit(x_train_scaled,y_train)

I am getting the following error :

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The y_train dataset has no NaN values as below :

y_train.isnull().sum()
0

Can you please guide me how to fix this error ?

My data looks like this :

x_train_scaled
array([[-0.33201039, 0.47192704, -1.24414603, ..., 0.51135423,
-0.32259219, -0.26063075],
[ 0.5625587 , -0.6265974 , -0.11809286, ..., -1.18569352,
-0.2326788 , -0.20721166],
[-0.23035481, -0.6265974 , -0.13481274, ..., 1.02080109,
-0.14518244, 0.12706858],
...,
[ 0.01361858, -1.23688875, 1.03392011, ..., 1.44245815,
-0.67064403, -0.47800536],
[ 3.59189497, 1.20427667, 0.87758589, ..., 1.81578649,
-0.47969893, -0.98158303],
[-0.84028829, 0.2278105 , 1.32059491, ..., -1.2196862 ,
0.53061822, 0.67420335]])

y_train.head
267822 1414.80295
246444 864.41390
245683 1506.06758
279653 1175.28642
247218 1192.58759
...
279212 770.11560
277856 2210.84055
233000 1671.07908
287425 3074.83088
265371 1455.42340
Name: hc_mortgage_mean, Length: 27161, dtype: float64>

y_train.describe()
count 27161.000000
mean 1629.260500
std 617.420278
min 234.650000
25% 1162.179990
50% 1471.288370
75% 1969.768880
max 4462.342290
Name: hc_mortgage_mean, dtype: float64
 

Ahamika Banerjee

Active Member
Hello @ROHIT RANJAN SRIVASTAVA sir,

In the 2nd project (healthcare), week 4, the 1st question is: Create a classification report by analyzing sensitivity, specificity, AUC (ROC curve), etc. Please be descriptive to explain what values of these parameter you have used.

Do we need to create a classification report for every model (logistic regression, KNN, ensemble learning), or do we need to create a report and ROC curve for any one model?
 
Hi Rohit,

Can you please reply to my query (Post #16 above) posted on Saturday May 29th, 2021? I am getting an error and not sure what the issue is ?
 

Gaanashree S Patil_1

Administrator
Alumni
When you pass values to the model's fit function, both x_train and y_train are inputs, so I suggest you also check that x_train (here x_train_scaled) is free of NaN values.
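A quick way to run that check on a NumPy array such as x_train_scaled is sketched below. The small array is made up for illustration.

```python
import numpy as np

# Hypothetical scaled feature matrix with one NaN and one infinity.
x_train_scaled = np.array([
    [-0.33, 0.47, np.nan],
    [0.56, -0.63, -0.21],
    [np.inf, -0.63, 0.13],
])

# Number of NaN and infinite entries per column.
nan_per_column = np.isnan(x_train_scaled).sum(axis=0)
inf_per_column = np.isinf(x_train_scaled).sum(axis=0)
print("NaN per column:", nan_per_column)
print("inf per column:", inf_per_column)

# Rows that contain at least one non-finite value.
bad_rows = ~np.isfinite(x_train_scaled).all(axis=1)
print("rows with NaN/inf:", np.flatnonzero(bad_rows))
```

Scaling does not remove NaN values, so gaps in the unscaled features will still be present in the scaled array.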
 
Hello @ROHIT RANJAN SRIVASTAVA sir,

In the 2nd project (healthcare), week 4, the 1st question is: Create a classification report by analyzing sensitivity, specificity, AUC (ROC curve), etc. Please be descriptive to explain what values of these parameter you have used.

Do we need to create a classification report for every model (logistic regression, KNN, ensemble learning), or do we need to create a report and ROC curve for any one model?
Yes. I am sorry for the delay in responding.
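For any single model, the metrics named in the question can be computed along these lines. This is only a sketch in plain NumPy: the labels, scores and the 0.5 threshold are made up, and scikit-learn's confusion_matrix, roc_auc_score and classification_report give the same kind of numbers.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities for the positive class.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])

# Turn probabilities into class predictions (0.5 is an assumed threshold).
y_pred = (y_score >= 0.5).astype(int)

# Confusion-matrix cells.
tp = int(((y_true == 1) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())
tn = int(((y_true == 0) & (y_pred == 0)).sum())
fp = int(((y_true == 0) & (y_pred == 1)).sum())

sensitivity = tp / (tp + fn)  # share of actual positives that are caught
specificity = tn / (tn + fp)  # share of actual negatives that are caught

# AUC as the probability that a random positive outscores a random negative
# (ties count as half).
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
diff = pos[:, None] - neg[None, :]
auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} auc={auc:.3f}")
```

Repeating the same calculation per model makes the models directly comparable on the same test set.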
 
Get rid of the null values in whichever way you prefer, or write your own function to impute the missing values.
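One simple option, as a sketch: fill each feature's gaps with the median learned from the training split, and apply the same values to the test split before scaling. The column names and numbers below are hypothetical, and scikit-learn's SimpleImputer can do the same job inside a pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical unscaled feature frames with gaps (made-up columns and values).
x_train = pd.DataFrame({
    "pop": [1200.0, np.nan, 3400.0, 2100.0],
    "rent_mean": [950.0, 1100.0, np.nan, 870.0],
})
x_test = pd.DataFrame({
    "pop": [np.nan, 1800.0],
    "rent_mean": [1020.0, np.nan],
})

# Learn the fill values from the training split only.
train_medians = x_train.median()

# Apply the same fill values to both splits, so no test data leaks into training.
x_train_filled = x_train.fillna(train_medians)
x_test_filled = x_test.fillna(train_medians)

print(x_train_filled)
print(x_test_filled)
```

After this step the frames can be scaled and passed to fit without the NaN error.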
 