# DATA SCIENCE WITH R | Nov 28 - Jan 10 | Sabyasanchi (2020)

#### Arun Hosh

Hi friends,
Re: Health cost project

For Question 4, which approach is best: a linear model, or just categorizing and summing the charges?

#QUESTION 4
#To properly utilize the costs, the agency has to analyze the severity of the hospital costs by age and gender for proper allocation of resources.

#### Ananya Sharma_1

Hi Ananya - Actually, your point is very good. I did the same and got the summary below for the model. Are you able to interpret the results and advise?

Call:
lm(formula = LOS ~ ., data = trainingSet1)

Residuals:
   Min     1Q Median     3Q    Max
-2.873 -0.873 -0.872  0.128 36.128

Coefficients:
             Estimate Std. Error t value            Pr(>|t|)
(Intercept)  2.871942   0.248774  11.544 <0.0000000000000002 ***
AGE         -0.018145   0.024068  -0.754               0.451
FEMALE       0.001483   0.333596   0.004               0.996
race2       -0.506172   1.241672  -0.408               0.684
race3        1.126575   2.999127   0.376               0.707
race4        0.623710   1.734672   0.360               0.719
race5       -0.871942   1.741984  -0.501               0.617
race6       -0.727522   2.119271  -0.343               0.732
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.986 on 336 degrees of freedom
Multiple R-squared: 0.004317, Adjusted R-squared: -0.01643
F-statistic: 0.2081 on 7 and 336 DF, p-value: 0.9835

Arun, I haven't prepared my model yet, so I can't comment properly on your result.
I think both of your R^2 values are low, and all the independent variables have quite high p-values (AGE has the lowest at 0.451, still well above 0.05) and low estimate values, so you could change some of these variables and see if you get better results.

Also, as you mentioned, we don't have to treat outliers in this case, so by looking at your output values you can say whether the variables are affecting the dependent variable or not.
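
A minimal sketch of pulling the key numbers out of a `summary(lm)` object so they can be read one at a time. The data frame `toy` and all its values are simulated purely for illustration (they are not the course data); only the column names AGE, FEMALE and LOS come from the thread.

```r
# Simulated stand-in data (assumption): LOS is generated independently of the
# predictors, so the predictors should carry little or no signal.
set.seed(42)
n <- 344
toy <- data.frame(
  AGE    = sample(0:17, n, replace = TRUE),
  FEMALE = rbinom(n, 1, 0.5)
)
toy$LOS <- rpois(n, lambda = 3)

fit <- lm(LOS ~ AGE + FEMALE, data = toy)
s   <- summary(fit)

# Per-variable p-values: a variable is usually called significant when p < 0.05.
p_values    <- s$coefficients[, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]

# R-squared: share of the variance in LOS explained by the model (0 to 1).
r2     <- s$r.squared
adj_r2 <- s$adj.r.squared

# Overall F-test: does the model do better than an intercept-only model?
f   <- s$fstatistic
f_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# The same calculation applied to the F-statistic posted above
# (0.2081 on 7 and 336 DF).
posted_f_p <- pf(0.2081, 7, 336, lower.tail = FALSE)

print(round(c(r2 = r2, adj_r2 = adj_r2, f_p = f_p, posted_f_p = posted_f_p), 4))
```

In the posted output, R-squared is about 0.004 and the overall p-value is 0.9835, so that model explains almost none of the variation in LOS.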

#### Ananya Sharma_1

I did this using the latter option you mentioned, because sir also explained it that way.

#### Ananya Sharma_1

Hi Ananya - This is the reply that I received from the Simplilearn team regarding outlier treatment for this data set.

"Please note that outliers can be considered only if the features are of the numerical data type. We do not consider outliers when it comes to categorical variables. Please note that Race and Gender are categorical variables and Age is a numerical variable. We can consider outliers when it comes to Age. But in this case study, you don't have to remove the outliers.

From the output, you need to look at the individual p-values and decide if the features are significant or not and are playing a major role in predicting the output variable."

Please read and let me know.
I did an outlier analysis on age because I wanted to improve the efficiency of the model (I'm quite unsure whether this works or not).

#### Arun Hosh

Thanks Ananya, you are giving me more insights, and I believe your knowledge of statistical interpretation is very good.
I am a bit weak in this part.
Can you help or guide me on how to interpret the summary output of the linear model (using the R^2 value, F-statistic or p-value)? I really want to understand it. Can you point me to anything that would help (a link, a PPT, an ebook, etc.)?

#### Arun Hosh

Hi Ananya - Also, in the health care project there is a "FEMALE" column in the data set. While building a linear model, how do we treat this binary data?

#### Arun Hosh

Ananya, your thought is good. Actually, I am not sure we can categorize age: if we categorize it, will it still come under a linear model?
Also, there is an approach I can think of from your comment: if we can omit 5% of the data as outliers, then we can remove the least frequent values of age, I mean the ones that are not significant in the model or make the least change to the dependent variable. Do you think it will work?

#### Arun Hosh

Hi Ananya,
I used the same option and got this result. Did you also find the same result? (Just want to see if we are both on the same page, haha.)

  age_Group         Gender_Cat Severity_Age_expenditure
  <fct>             <chr>                         <int>
1 Age group(0-1)    Male                         408356  (highest value)
2 Age group (>10)   Female                       317568
3 Age group(0-1)    Female                       306350
4 Age group (>10)   Male                         203045
5 Age group (6-10)  Male                          77212
6 Age group (2-5)   Male                          46778
7 Age group (2-5)   Female                        25569
8 Age group (6-10)  Female                         1160
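
A table like the one above can be produced along these lines. This is only a sketch in base R: the data frame `health` holds invented values, the age bins mirror the labels in the table, and treating FEMALE == 1 as "Female" is an assumption about how the column is coded.

```r
# Toy stand-in for the course data set (values invented for illustration).
health <- data.frame(
  AGE    = c(0, 1, 3, 7, 12, 17, 0, 15),
  FEMALE = c(0, 1, 0, 1, 1, 0, 1, 0),
  TOTCHG = c(2500, 1800, 900, 1200, 3100, 4000, 2200, 1500)
)

# Bin age into the four groups used in the table: 0-1, 2-5, 6-10, >10.
health$age_group <- cut(
  health$AGE,
  breaks = c(-Inf, 1, 5, 10, Inf),
  labels = c("0-1", "2-5", "6-10", ">10")
)

# Assumption: FEMALE is coded 1 for female and 0 for male.
health$gender <- ifelse(health$FEMALE == 1, "Female", "Male")

# Sum the charges for every age group / gender combination,
# then sort so the most expensive combination comes first.
by_group <- aggregate(TOTCHG ~ age_group + gender, data = health, FUN = sum)
by_group <- by_group[order(-by_group$TOTCHG), ]
print(by_group)
```

Sorting by the summed charges puts the costliest age and gender combination in the first row, which is what the "highest value" note in the table points at.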

#### Ananya Sharma_1

I don't have any knowledge of statistics; I'm doing all this analysis based on what sir taught in class.

#### Ananya Sharma_1

I am completely new to this course, so I'm not very sure about my way of moving ahead with these problems. Currently I'm just following what sir told us in class.
Sir also had one binary variable, 'View', in his dataset; he converted it into a dummy variable and proceeded with model building.
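
A small sketch of the dummy-variable idea for a binary column. The data frame `toy` is invented, and the 0 = male / 1 = female coding is an assumption. A 0/1 column is already a dummy variable, so `lm()` can take it as it is; converting it to a factor gives the same fit with a readable label.

```r
# Invented toy data; only the column names come from the thread.
toy <- data.frame(
  FEMALE = c(0, 1, 1, 0, 1, 0, 0, 1),
  RACE   = c(1, 2, 1, 3, 2, 1, 3, 1),
  TOTCHG = c(2500, 1800, 900, 1200, 3100, 4000, 2200, 1500)
)

# Option 1: use the 0/1 column directly.
fit_numeric <- lm(TOTCHG ~ FEMALE, data = toy)

# Option 2: convert it to a factor; lm() then builds the dummy column itself.
toy$GENDER <- factor(toy$FEMALE, levels = c(0, 1), labels = c("Male", "Female"))
fit_factor <- lm(TOTCHG ~ GENDER, data = toy)

# Both fits give the same intercept and the same gender effect.
same_fit <- isTRUE(all.equal(unname(coef(fit_numeric)), unname(coef(fit_factor))))
print(same_fit)

# A column with more than two categories (such as RACE) should be a factor,
# so that one dummy column is created per non-reference level.
race_dummies <- model.matrix(~ factor(RACE), data = toy)
print(colnames(race_dummies))
```

The coefficient on the dummy is the estimated difference in the outcome between the two groups, with the group coded 0 as the reference.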

#### Harvinder Kaur

When I try to run the recorded live class sessions, the "network recording converter" tool asks for a site URL. Can anyone share the site URL?

#### Arun Hosh

Hi All,

Hospital project, 3rd question.

#QUESTION 3
#To make sure that there is no malpractice, the agency needs to analyze if the race of the patient is related to the hospitalization costs.

I ran ANOVA and got the result below:

> summary(race_annova)
                        Df    Sum Sq  Mean Sq F value Pr(>F)
as.factor(health$RACE)   5 1.859e+07  3718656   0.244  0.943
Residuals              493 7.524e+09 15260687

Below is my interpretation:
#ANOVA results show that race definitely has a relation to the hospitalization cost; the high p-value shows that the hospitalization cost is definitely not equally distributed among the various race categories

Is this the right interpretation? Please comment.
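
For reference when reading such a table: the null hypothesis of a one-way ANOVA is that the mean cost is the same in every race group, and it is rejected only when Pr(>F) is small (commonly below 0.05). A value of 0.943 is far above that, so the output above gives no evidence that race is related to cost. A minimal sketch of running the test and reading the p-value, on simulated data (the data frame and the 0.05 cut-off are assumptions for illustration):

```r
# Simulated data: three race groups drawn from the same cost distribution.
set.seed(1)
health <- data.frame(
  RACE   = rep(1:3, each = 20),
  TOTCHG = rnorm(60, mean = 2500, sd = 800)
)

# One-way ANOVA of cost by race; RACE must be treated as a category.
race_aov <- aov(TOTCHG ~ as.factor(RACE), data = health)
print(summary(race_aov))

# Pull out the p-value of the race term (first row of the ANOVA table).
p_value <- summary(race_aov)[[1]][["Pr(>F)"]][1]

if (p_value < 0.05) {
  message("Small p-value: mean cost appears to differ between race groups.")
} else {
  message("Large p-value: no evidence that mean cost differs between race groups.")
}
```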

#### Harvinder Kaur

I am trying to play the live class recordings and have installed Network Recording Player. Do I need to convert the downloaded session to MP4? While doing so, it asks for a site URL, and in that field it asks me to enter meetingcenter.webex.com. Please suggest what "meetingcenter" should be here.

#### Shashank AV

Hello All,
I'm new here, so forgive me if I say something stupid. I need some help with the project. I am doing the Health care analysis one. The 1st question says "to find the age category of people who frequently visit the hospital and has the maximum expenditure.". I read the data set, selected the data, arranged it by age and the hospital discharge costs (TOTCHG), and grouped by age. Then I summarized TOTCHG by age group. For "frequently visit the hospital", do I just take the count? Can someone help? Thanks.

Later edit:
I used the aggregate function and the count function, but in two separate lines of code. How do I join the aggregate and the count in one line of code?

aggregate(TOTCHG~AGE,my_df,sum)
count(my_df,AGE)

I would like to show the count and the aggregate in one table to make it neater.
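
One possible way to get the count and the sum into a single table, sketched in base R so no extra package is needed. The values in `my_df` are invented, and the output column names `visits` and `total_charges` are just illustrative choices.

```r
# Invented toy data with the two columns used in the question.
my_df <- data.frame(
  AGE    = c(0, 0, 0, 1, 17, 17),
  TOTCHG = c(1000, 1500, 2000, 800, 3000, 2500)
)

# Total charges per age and number of records (visits) per age.
totals <- aggregate(TOTCHG ~ AGE, data = my_df, FUN = sum)
visits <- aggregate(TOTCHG ~ AGE, data = my_df, FUN = length)
names(totals)[2] <- "total_charges"
names(visits)[2] <- "visits"

# Join the two summaries on AGE so both appear in one table,
# most frequent age first.
result <- merge(visits, totals, by = "AGE")
result <- result[order(-result$visits), ]
print(result)
```

The first row then answers "who visits most often", and sorting by `total_charges` instead answers "who has the maximum expenditure".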

#### Arun Hosh

Hi Harvinder, to find the frequency of the ages we can simply make a histogram, or you can use the count function in the "plyr" package.
To calculate which age group has the maximum expenditure, categorize age into 4 or 5 categories using the cut function, then group by age category and summarize by the sum of total charges.
This will give you the result.

#### Shashank AV

Hello,
For the final question in the Healthcare cost analysis, "To perform a complete analysis, the agency wants to find the variable that mainly affects hospital costs.",
is it enough to find the correlation?
cr = cor(my_df)
and then just display it,
and then do a corrplot? Or do we make a heatmap?
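
A rough sketch of the correlation route, using only base R. The data in `my_df` is simulated, and TOTCHG is deliberately built from LOS so that one variable stands out; the real data set may look quite different. Note that `cor()` needs numeric columns and only measures linear, pairwise association, so a regression model can be a useful cross-check.

```r
# Simulated stand-in data (assumption): TOTCHG is driven mainly by LOS.
set.seed(7)
n <- 100
my_df <- data.frame(
  AGE    = sample(0:17, n, replace = TRUE),
  FEMALE = rbinom(n, 1, 0.5),
  LOS    = rpois(n, 3) + 1
)
my_df$TOTCHG <- 900 * my_df$LOS + rnorm(n, sd = 300)

# Correlation matrix; rows with missing values are dropped.
cr <- cor(my_df, use = "complete.obs")

# Rank the other variables by the strength of their correlation with cost.
cost_cor <- cr["TOTCHG", colnames(cr) != "TOTCHG"]
ranked   <- sort(abs(cost_cor), decreasing = TRUE)
print(round(ranked, 3))

# Base-R heat map of the correlation matrix (no clustering, no rescaling).
heatmap(cr, Rowv = NA, Colv = NA, symm = TRUE, scale = "none")
```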

#### Arpita Mitra_1

Hi Arun, I used a linear regression model for Question 4.

#### Arun Hosh

Hi Arpita - How did the model perform? Can you share the summary output and how you manipulated the data?

#### Arun Hosh

Try making a heat map of the correlations.

#### Arpita Mitra_1

Hi Arun, the p-value is low in this model. I have attached my Q4 summary here for your reference.

Hi All,
Hope this thread is still active.
Due to certain challenges, I am still working on the Health Project.
I am getting the error below while using group_by, despite having library(dplyr) at the very top of my code:

no applicable method for 'group_by_' applied to an object of class "c('integer', 'numeric')"
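
One likely cause of this message is that `group_by()` received a single column (an integer vector) as its first argument instead of the data frame. The sketch below reproduces that situation and shows a working form; the data frame `health` and its values are invented, and the exact wording of the error depends on the installed dplyr version.

```r
library(dplyr)

# Invented toy data with the two columns discussed in the thread.
health <- data.frame(
  AGE    = c(0L, 0L, 1L, 17L, 17L),
  TOTCHG = c(1000, 1500, 800, 3000, 2500)
)

# Passing a bare column (an integer vector) instead of a data frame triggers
# a "no applicable method" error; tryCatch() captures the message.
err <- tryCatch(group_by(health$AGE), error = function(e) conditionMessage(e))
print(err)

# Working form: the data frame goes in first, the column name second.
by_age <- health %>%
  group_by(AGE) %>%
  summarise(visits = n(), total_charges = sum(TOTCHG))
print(by_age)
```

If the first argument is already a data frame, it is worth checking with `class()` that an earlier step has not overwritten it with a vector.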

#### Mukund Ramasubramanian

I am getting 'NULL' as the output.

Code:

first_name<-c('First1', 'FIrst2', 'First3')
second_name<-c('Second1','Second2','Second3')
roll_no<-c('1','2','3')
class1<-data.frame(roll_no,first_name,second_name)
class1
class(class1)
str(class1)
summary(class1)
class1$roll_no
levels(class1$roll_no)

Output:

> class1<-data.frame(roll_no,first_name,second_name)
> class1
  roll_no first_name second_name
1       1     First1     Second1
2       2     FIrst2     Second2
3       3     First3     Second3
> class(class1)
[1] "data.frame"
> str(class1)
'data.frame': 3 obs. of 3 variables:
 $ roll_no    : chr "1" "2" "3"
 $ first_name : chr "First1" "FIrst2" "First3"
 $ second_name: chr "Second1" "Second2" "Second3"
> summary(class1)
   roll_no           first_name        second_name
 Length:3           Length:3           Length:3
 Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character
> levels(class$roll_no)
NULL
> levels(class1$roll_no)
NULL
> summary(class1)
   roll_no           first_name        second_name
 Length:3           Length:3           Length:3
 Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character
> class1$roll_no
[1] "1" "2" "3"
> levels(class1$roll_no)
NULL
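
The `str()` output shows `roll_no` as `chr`, and `levels()` only returns something for factors, so `NULL` is expected for a character column. Since R 4.0.0, `data.frame()` no longer converts strings to factors by default. A minimal sketch of two ways to get levels (the names `lv_chr`, `lv_fct` and `class2` are introduced here only for illustration):

```r
first_name  <- c("First1", "FIrst2", "First3")
second_name <- c("Second1", "Second2", "Second3")
roll_no     <- c("1", "2", "3")

# Strings stay as character columns (the default since R 4.0.0,
# spelled out here so the behaviour is the same on older versions).
class1 <- data.frame(roll_no, first_name, second_name, stringsAsFactors = FALSE)
lv_chr <- levels(class1$roll_no)   # character column: no levels attribute
print(lv_chr)

# Option 1: convert the one column to a factor.
class1$roll_no <- factor(class1$roll_no)
lv_fct <- levels(class1$roll_no)
print(lv_fct)

# Option 2: ask data.frame() to convert every string column to a factor.
class2 <- data.frame(roll_no, first_name, second_name, stringsAsFactors = TRUE)
print(levels(class2$first_name))
```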