DATA SCIENCE WITH R | Nov 28 - Jan 10 | Sabyasanchi (2020)

Ananya Sharma_1

Active Member
I use RStudio. After one or two successful runs, the console stops showing output even though the code is correct. To get it working again I have to restart RStudio, and then the same problem comes back.
Can anybody suggest why this happens and how to fix it?
 

ajinkyapandharpatte

Member
Alumni
assignment :
create a dummy column Date with data in the format '02/15/2020 20:31:15'. and create another column name with values like "Mr John James" and " John James"


1 - Extract the month out of the date in the format "Jan" "Feb" etc and create a separate column

#my_try

data1 <- read.csv('USA_cars_datasets.csv') #imported data


date <- data1[1,4] #extracting the date cell
date

sp <- substring(date,1,10) #keeping only the date part (R strings are indexed from 1)
sp

date_format12 <- as.Date(sp,"%m/%d/%Y") #parsing the date; %Y is the 4-digit year
date_format12
result_date <- format(date_format12, "%d-%b-%Y") #changing format to dd-Mon-yyyy
result_date #format changed

data1[1,16] <- result_date ##data added to cell successfully

It worked. Please post if anybody has done this in a shorter way, thanks.
 

Sabyasachi Tripathy

Customer
Customer
assignment :
create a dummy column Date with data in the format '02/15/2020 20:31:15'. and create another column name with values like "Mr John James" and " John James"


1 - Extract the month out of the date in the format "Jan" "Feb" etc and create a separate column

#my_try

data1 <- read.csv('USA_cars_datasets.csv') #imported data


date <- data1[1,4] #Extracting date cell
date

sp <- substring (date,0,10) #extracting date & time separately
sp

date_format12 <- as.Date(sp,"%m/%d/%y") #Changing format to d/Month/2020
date_format12
result_date <- format(date_format12, "%d-%b-%Y")
result_date #Format changed

data1[1,16] <- result_date ##data added to cell successfully

it worked , please post if anybody done this in other short way, thanks
This is good. Try to read the data as a timestamp instead of using substring.
 

ajinkyapandharpatte

Member
Alumni
##Assignment 2 #" 2 - Extract the time out of it and create
#a separate column with values " afternoon , morning , evening ,
#night " based on the hour .


library(lubridate)
data1 <- read.csv('USA_cars_datasets.csv')


xy <- "24/03/1992 20:31:15"
class(xy)
xy3 <- dmy_hms(xy)


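xy2<- format(xy3,format = "%H:%M:%S") ##Time extracted as text
xy2
hr <- hour(xy3) ##Hour as a number, so it can be compared
hr

result2 <- if(hr < 12) { #if else for the part of the day
"Morning"
}else if(hr < 17) {
"Afternoon"
}else if(hr < 21) { #21 as the evening/night cut-off is an assumption
"Evening"
}else {
"Night"
}

data1[1,17] <- result2 #"Evening" written to the data frame for 20:31:15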
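
The if/else chain handles one timestamp at a time. A minimal sketch of the same idea for a whole column, using base R's cut(). The sample timestamps, the column names and the cut-offs (12, 17, 21) are assumptions for illustration, not part of the assignment:

```r
# Hypothetical sample data in the assignment's date format
df <- data.frame(
  Date = c("02/15/2020 08:05:00", "02/15/2020 13:40:10",
           "02/15/2020 20:31:15", "02/16/2020 23:10:00"),
  stringsAsFactors = FALSE
)

# Parse the whole column at once and pull out the hour as a number
ts <- as.POSIXct(df$Date, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
hr <- as.integer(format(ts, "%H"))

# Bucket the hours: [0,12) morning, [12,17) afternoon, [17,21) evening, [21,24) night
df$day_part <- cut(hr, breaks = c(0, 12, 17, 21, 24), right = FALSE,
                   labels = c("morning", "afternoon", "evening", "night"))
df
```

Because cut() is vectorised, no loop is needed, and the labels can be changed in one place.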
 

Pallavi_93

Active Member
This is good . Try to read the data as a time stamp instead of using substring .

my try

car_df <- read_excel("USA_cars_datasets_assign.xlsx")
names(car_df)[4]<-'date' #class(car_df$date) ---->"character

date_split<-strsplit(car_df$date,split=' ') #class(date_split ) ----> "list" # is.vector(date_split[[1]]) ---> TRUE
date_split<- as.Date(date_split[[1]][1],format="%m/%d/%Y") #class(date_split) ------> Date
res_data <- strsplit(format(date_split,'%b-%d-%Y'),split='-')
month_car <- res_data[[1]][1] #class(res_data)----> list of character vectors
car_df[1,16]<-month_car

The code above works for one particular cell of the data frame, but how do I apply it to the entire date column? Do we need to write a custom function, or run a loop?
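
Neither a loop nor a custom function should be needed here, since as.Date() and format() accept whole vectors. A minimal sketch, assuming the column holds text like '02/15/2020 20:31:15' (the three sample rows are made up):

```r
# Hypothetical stand-in for the date column of car_df
car_df <- data.frame(
  date = c("02/15/2020 20:31:15", "11/03/2020 08:05:00", "07/04/2020 12:00:00"),
  stringsAsFactors = FALSE
)

# as.Date() parses the leading date part and ignores the trailing time
parsed <- as.Date(car_df$date, format = "%m/%d/%Y")

# month.abb is a built-in vector "Jan".."Dec"; indexing it avoids locale issues with "%b"
car_df$month <- month.abb[as.integer(format(parsed, "%m"))]
car_df
```

format(parsed, "%b") would also work, but its output depends on the system locale.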
 

Arun Hosh

Active Member
Hi All, did anyone start or complete the health care project?
I started and got some results for the 2nd question. I am now working on the 3rd question (shown below):
 To make sure that there is no malpractice, the agency needs to analyze if the race of the patient is related to the hospitalization costs.

For this I used group_by and summarized the data as the sum of the total cost for each race.
My doubt: do I need to compute a correlation between race and hospitalization cost to analyze the relation between them?
 

amit_319

Member
Hi All,

As mentioned in self learning tutorial 1 there is a tutorial for Statistics.
This is said to be prerequisite course for Statistics named (Statistics for Data Science) , where can I find that course material.

Regards,

Amit
 

Arun Hosh

Active Member
Hi All , Did anyone start \ complete the project..... health care
I did start & get some results for the 2 question , while working on 3rd question ( shown below )
 To make sure that there is no malpractice, the agency needs to analyze if the race of the patient is related to the hospitalization costs.

for this I made a group_by & summarized the data based on the sum of the total cost for each individual races....
My doubt is do I need to do a correlation for both the Race & hospitalization cost.... to analayze the relation between them...



Dear All ,

Did any one start with projects, I have worked with 4 questions
 

Arun Hosh

Active Member
Hi All - For the health care project , regarding the 5th question

# QUESTION 5
# Since the length of stay is the crucial factor for inpatients, the agency wants to find if the length of stay can be predicted from age, gender, and race.


how to handle outliers here or do we need to check outliers & how to check outliers in this.. plz help meeeee...
 

Ananya Sharma_1

Active Member
Hi All,

As mentioned in self learning tutorial 1 there is a tutorial for Statistics.
This is said to be prerequisite course for Statistics named (Statistics for Data Science) , where can I find that course material.

Regards,

Amit
As far as I know, there are 2 courses in Self learning videos named; Statistics for Data Science Part 1 & Statistics for Data Science Part2 . These are under the course Data Science with R only.
 

Ananya Sharma_1

Active Member
Hi All , Did anyone start \ complete the project..... health care
I did start & get some results for the 2 question , while working on 3rd question ( shown below )
 To make sure that there is no malpractice, the agency needs to analyze if the race of the patient is related to the hospitalization costs.

for this I made a group_by & summarized the data based on the sum of the total cost for each individual races....
My doubt is do I need to do a correlation for both the Race & hospitalization cost.... to analayze the relation between them...
Sir told that we can find mean/ sum of total charge & see if any particular race has been charged very high..
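
Race is a category and cost is a number, so a Pearson correlation between the raw race codes and TOTCHG is hard to interpret. One common alternative, sketched here on made-up data, is to compare group means and run a one-way ANOVA. The health data frame below is simulated; only the column names RACE and TOTCHG come from the thread:

```r
set.seed(1)

# Simulated stand-in for the hospital data (values are not real)
health <- data.frame(
  RACE   = factor(sample(1:3, 90, replace = TRUE)),
  TOTCHG = round(rexp(90, rate = 1 / 2500))
)

# Mean charge per race, as suggested in class
aggregate(TOTCHG ~ RACE, data = health, FUN = mean)

# One-way ANOVA: do the mean charges differ between races?
fit <- aov(TOTCHG ~ RACE, data = health)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value  # a large p-value means no evidence that charges differ by race
```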
 

Ananya Sharma_1

Active Member
Hi All - For the health care project , regarding the 5th question

# QUESTION 5
# Since the length of stay is the crucial factor for inpatients, the agency wants to find if the length of stay can be predicted from age, gender, and race.


how to handle outliers here or do we need to check outliers & how to check outliers in this.. plz help meeeee...
I have not done this yet. You can check recordings, sir discussed this issue in class
 

Arun Hosh

Active Member
What should be concluded if the correlation comes out as -0.7867? Are the variables affecting each other?
Actually, I ran a model for LOS against the other variables and got the summary below. I am not able to interpret this result. Please help me read it correctly.

Residuals:
     Min       1Q   Median       3Q      Max
-12.8701  -0.3463  -0.1663   0.3050  15.1572

Coefficients: (1 not defined because of singularities)
                              Estimate  Std. Error t value              Pr(>|t|)
(Intercept)                -3.32451110  0.56765802  -5.857          0.0000000123 ***
AGE                        -0.11558628  0.09458443  -1.222                0.2226
FEMALE                      0.12761003  0.21492428   0.594                0.5531
RACE                        0.06850460  0.19844679   0.345                0.7302
TOTCHG                      0.00092100  0.00003794  24.276  < 0.0000000000000002 ***
APRDRG                      0.00671019  0.00074522   9.004  < 0.0000000000000002 ***
age_GroupAge group (2-5)   -6.28570501  1.33594514  -4.705          0.0000038587 ***
age_GroupAge group (6-10)  -2.26017291  1.04062565  -2.172                0.0306 *
age_GroupAge group (>10)    0.24776494  1.39713442   0.177                0.8594
Gender_CatMale                      NA          NA      NA                    NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.787 on 304 degrees of freedom
Multiple R-squared: 0.6611, Adjusted R-squared: 0.6522
F-statistic: 74.13 on 8 and 304 DF, p-value: < 0.00000000000000022
 

Arun Hosh

Active Member
I ran a heat map , attached the same. plz check & advise the results
 

Attachments

  • heat_map health_corr.png

Arun Hosh

Active Member
I am not even sure if the community forum is still active......
Regarding health care project
# QUESTION 5
# Since the length of stay is the crucial factor for inpatients, the agency wants to find if the length of stay can be predicted from age, gender, and race.


1. How do I treat the age data? (It runs from 0 to 17; I take it as continuous, which seems fine.)
2. How do I treat the gender data? (It has only 0 and 1, and I believe 0 is male and 1 is female. Do I need to change it to character, or can I keep the values as 0 and 1 when I run the model?)
3. Race also runs from 0 to 6. How do I treat this data?
Does multiple linear regression really work for this data to predict LOS from age, gender and race?

I need some help...
 

amit_319

Member
# 1- Extract all the elements whose number of characters is more than 5
for this assignment , I was able to write the piece of code below

assign_vec <- c('Ravi','Satish','Ashutosh','Subhashree','Elizabeth','Vishwakarma')
char_vec <- nchar(assign_vec)
char_vec
for( i in char_vec){
if( i > 5 )
print(i)
}

i got output
[1] 6
[1] 8
[1] 10
[1] 9
[1] 11

Please suggest how I can get the corresponding element based on the number of characters.
I think I am doing something wrong in the loop or the print statement, please advise.


You are creating a new vector char_vec which holds only the character counts, so the loop prints only the counts.
You can use the code below:
assign_v <-c('Ravi','Satish','Ashutosh','Subhashree','Elizabeth','Vishwakarma')

for (val in assign_v){
if(nchar(val) >5){print(val)}
}

Hope this is helpful.
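
The loop can also be replaced by logical indexing, since nchar() works on the whole vector at once. A short sketch using the same names as the posts above:

```r
assign_vec <- c('Ravi', 'Satish', 'Ashutosh', 'Subhashree', 'Elizabeth', 'Vishwakarma')

# nchar() returns one count per element; the comparison gives TRUE/FALSE per element
long_names <- assign_vec[nchar(assign_vec) > 5]
long_names
```

This keeps the elements themselves rather than their lengths, which is what the assignment asks for.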
 

Ananya Sharma_1

Active Member
I am not even sure if the community forum is still active......
Regarding health care project
# QUESTION 5
# Since the length of stay is the crucial factor for inpatients, the agency wants to find if the length of stay can be predicted from age, gender, and race.


1, how do I treat the age data ( its from 0 to 17 & i gets it continious & fine )
2, how to treat the gender data ( data has only 0 & 1 & I beleive 0 is male & 1 is female, how do I treat this data,, do I need to change it to character or I can still hold the values as 0 & 1 while I run the model )
3, race is also from 0 - 6 ( how to treat this data ).
Does multiple linear regression really work for this data to predict LOS from age , gender & race....

I need some help...
You can divide age into some groups, then make dummy variables of age, race & gender & proceed further with model building after doing proper analysis of these independent columns...
Yes, linear regression will work here
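
A minimal sketch of the dummy-variable step, on simulated data. In R, declaring a column as a factor is usually enough, because lm() then builds the dummy columns itself. The data and the race coding 1 to 6 are assumptions; a 0/1 column such as FEMALE can stay numeric:

```r
set.seed(7)

# Simulated stand-in for the hospital data (values are not real)
health <- data.frame(
  AGE    = sample(0:17, 200, replace = TRUE),
  FEMALE = sample(0:1, 200, replace = TRUE),
  RACE   = sample(1:6, 200, replace = TRUE),
  LOS    = rpois(200, 3)
)

# Treat race as a category, not as a number from 1 to 6
health$RACE <- factor(health$RACE)

fit <- lm(LOS ~ AGE + FEMALE + RACE, data = health)
names(coef(fit))  # RACE2..RACE6 are the dummies; RACE1 is the baseline
```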
 

Ananya Sharma_1

Active Member
You can divide age into some groups, then make dummy variables of age, race & gender & proceed further with model building after doing proper analysis of these independent columns...
Yes, linear regression will work here
Oh, I forgot that you have already divided age into groups..
Also may I know, why have you not included age group 0-2 in your analysis?
 

Ananya Sharma_1

Active Member
Actually , i ran a model for LOS against others & found the below summary.. I am not able to interpret this result.. plz help to read this correctly.....

Residuals:
Min 1Q Median 3Q Max
-12.8701 -0.3463 -0.1663 0.3050 15.1572
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.32451110 0.56765802 -5.857 0.0000000123 ***
AGE -0.11558628 0.09458443 -1.222 0.2226
FEMALE 0.12761003 0.21492428 0.594 0.5531
RACE 0.06850460 0.19844679 0.345 0.7302
TOTCHG 0.00092100 0.00003794 24.276 < 0.0000000000000002 ***
APRDRG 0.00671019 0.00074522 9.004 < 0.0000000000000002 ***
age_GroupAge group (2-5) -6.28570501 1.33594514 -4.705 0.0000038587 ***
age_GroupAge group (6-10) -2.26017291 1.04062565 -2.172 0.0306 *
age_GroupAge group (>10) 0.24776494 1.39713442 0.177 0.8594
Gender_CatMale NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.787 on 304 degrees of freedom
Multiple R-squared: 0.6611, Adjusted R-squared: 0.6522
F-statistic: 74.13 on 8 and 304 DF, p-value: < 0.00000000000000022

Arun, what I think is we have to build model just including LOS, AGE, FEMALE & RACE. For q.6 we will include all columns, find the correlation either by 'cor' function or by heatmap or we can use both 'cor' & 'corrplot' functions for finding which variable is affecting 'hospital costs' more!

Arun have you made dummy variables of Race & gender before making your model? (As I can see dummy variable of only Age..)

From your current model analysis & considering all independent variables for your model, I find the following observations:
1. Your R^2 value is quite good. If you want to reduce the difference between R^2 and adjusted R^2, you can reconsider the columns 'FEMALE', 'RACE' and 'AGE_Group(>10)', as these have a high Pr value and also a low 'Estimate' value.
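
The row 'Gender_CatMale NA NA NA NA' and the note '1 not defined because of singularities' typically mean that one predictor is an exact linear combination of the others. A small simulated sketch (the variable names mirror the summary, the data is made up) in which a gender factor duplicates the 0/1 FEMALE column:

```r
set.seed(42)
n <- 100

# FEMALE as 0/1, plus a factor that carries exactly the same information
FEMALE     <- rbinom(n, 1, 0.5)
Gender_Cat <- factor(ifelse(FEMALE == 1, "Female", "Male"))
LOS        <- 3 + 0.5 * FEMALE + rnorm(n)

fit <- lm(LOS ~ FEMALE + Gender_Cat)
coef(fit)  # Gender_CatMale is NA: it equals 1 - FEMALE, so lm() drops it
```

Keeping only one of the two gender columns in the model removes the NA row.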
 

Ananya Sharma_1

Active Member
I am not able to get outlier values from the following code:
boxplot(h_data2$LOS ~ h_data2$AGE,main='Looking for outliers',col=c('orange','yellow'),sub=paste('Outliers: ', boxplot.stats(h_data2$AGE)$out))
Here, h_data2 is my modified dataset which has all the input records.

I also tried doing this;
boxplot(h_data2$LOS~ h_data2$AGE,main='Looking for outliers',col=c('orange','yellow'),sub=paste('Outliers: ', boxplot.stats(h_data2$AGE,do.out = TRUE)))

can anyone tell me how to get outlier values?
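
In the first call, boxplot.stats(h_data2$AGE)$out looks for outliers in AGE as a whole, not in LOS within each age, which may be why nothing useful shows up. A sketch of one way to get the per-group outlier values, on a small made-up h_data2 (only the column names come from the post):

```r
# Hypothetical data: two ages, each with one unusually long stay
h_data2 <- data.frame(
  AGE = c(rep(0, 8), rep(1, 8)),
  LOS = c(2, 2, 3, 3, 2, 3, 2, 12,
          1, 2, 2, 1, 2, 1, 2, 9)
)

# plot = FALSE returns the boxplot statistics without drawing anything
bp <- boxplot(LOS ~ AGE, data = h_data2, plot = FALSE)

# bp$out holds the outlier values, bp$group the box each one belongs to
outlier_tbl <- data.frame(AGE = bp$names[bp$group], LOS = bp$out)
outlier_tbl
```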
 

Pallavi_93

Active Member
Dear All ,

Did any one start with projects, I have worked with 4 questions

Hi Arun,

for health care project for first question i worked this way ..if any other ways are there,can you help me here,

library(readxl)  # read_excel()
library(dplyr)   # filter(), select()

hospital_costs <- read_excel("1555054100_hospitalcosts.xlsx")
age_count <- as.data.frame(table(hospital_costs$AGE))  # dplyr's count() needs a data frame, so table() is used here
# the counts show which age visits the hospital most often: age '0' has the highest count, 307
names(age_count)<-c('age','count')

tot_sum <- aggregate(hospital_costs$TOTCHG,by= list(hospital_costs$AGE),sum)
names(tot_sum) <- c('agegroup','expendituresum')  # name the columns before selecting them
## total expenditure for age group "0"
select(filter(tot_sum,agegroup==0),agegroup,expendituresum)
 

Ananya Sharma_1

Active Member
Using the second code (which I mentioned above), I'm getting plot like the one attached below:

Here, for h_data2$Group ==1, I have highlighted two values, for blue outlier, the h_data$LOS value is 7, but I have 5 records in my dataset with h_data$LOS value equal to 7 . How will I know, which record among these 5 records is the correct outlier value? I'm attaching a glance of my dataset here for your reference..
upload_2021-1-5_17-15-38.png
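
A boxplot point is an outlier relative to its own group, so records with the same LOS in other groups are not necessarily outliers, while every record at that value inside the group is. A sketch of flagging the actual rows, on made-up data with five records of LOS 7 spread over two groups (the column names Group and LOS follow the post):

```r
# Hypothetical data: LOS 7 appears twice in group 1 and three times in group 2
h_data2 <- data.frame(
  Group = c(rep(1, 10), rep(2, 6)),
  LOS   = c(2, 2, 3, 2, 3, 2, 2, 3, 7, 7,
            6, 7, 8, 7, 6, 7)
)

# For each group, mark the values that boxplot.stats() reports as outliers.
# ave() returns 0/1 here, so convert back to TRUE/FALSE.
h_data2$is_outlier <- as.logical(
  ave(h_data2$LOS, h_data2$Group,
      FUN = function(x) x %in% boxplot.stats(x)$out)
)

h_data2[h_data2$is_outlier, ]  # only the two 7s in group 1 are flagged
```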
 

Attachments

  • P2.jpg

Pallavi_93

Active Member
## Hospital project, 2nd question

count_dia<- as.data.frame(table(hospital_costs$APRDRG))
View(count_dia)
class(count_dia)
names(count_dia)<- c('diagnosis','countD')
sort_dia<- count_dia[order(count_dia$countD),]

diag_sum<- aggregate(hospital_costs$TOTCHG,by =list(hospital_costs$APRDRG),sum)
View(diag_sum)
names(diag_sum)<- c('diagnosis','total_sum')
diag_sum <- diag_sum[order(-diag_sum$total_sum),]

select(filter(diag_sum,diag_sum$diagnosis==640),diagnosis,total_sum)

# Can somebody please explain: here I used the count and aggregate functions.
But how do I analyze this using cut()? If we use cut() to put age into buckets, the results differ.
Can anyone please explain how to use cut() and how to aggregate based on the buckets?
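
A minimal sketch of cut() followed by aggregate(). The ten sample rows and the bucket boundaries are assumptions chosen for illustration. Results differ from the per-age totals simply because each bucket adds up several ages:

```r
# Hypothetical sample with the same column names as the project data
hospital_costs <- data.frame(
  AGE    = c(0, 0, 1, 3, 5, 7, 10, 12, 15, 17),
  TOTCHG = c(1000, 1500, 800, 1200, 900, 2000, 1100, 3000, 2500, 4000)
)

# cut() turns a numeric column into a factor of intervals:
# (-Inf,1], (1,5], (5,10], (10,Inf]
hospital_costs$age_group <- cut(hospital_costs$AGE,
                                breaks = c(-Inf, 1, 5, 10, Inf),
                                labels = c("0-1", "2-5", "6-10", ">10"))

# Total charge and number of visits per bucket
bucket_sum   <- aggregate(TOTCHG ~ age_group, data = hospital_costs, FUN = sum)
bucket_count <- table(hospital_costs$age_group)
bucket_sum
```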
 

Arun Hosh

Active Member
Using the second code (which I mentioned above), I'm getting plot like the one attached below:

Here, for h_data2$Group ==1, I have highlighted two values, for blue outlier, the h_data$LOS value is 7, but I have 5 records in my dataset with h_data$LOS value equal to 7 . How will I know, which record among these 5 records is the correct outlier value? I'm attaching a glance of my dataset here for your reference..
View attachment 13235

Hi Ananya - I think outlier treatment is not required for this data set. Age is meaningful data and most of the records lie in the 0-1 age range, so I believe this data doesn't need outlier treatment. You may still need to check for NA values.
Please share your comments.
 

Arun Hosh

Active Member
## Hospital project, 2nd question

count_dia<- as.data.frame(table(hospital_costs$APRDRG))
View(count_dia)
class(count_dia)
names(count_dia)<- c('diagnosis','countD')
sort_dia<- count_dia[order(count_dia$countD),]

diag_sum<- aggregate(hospital_costs$TOTCHG,by =list(hospital_costs$APRDRG),sum)
View(diag_sum)
names(diag_sum)<- c('diagnosis','total_sum')
diag_sum <- diag_sum[order(-diag_sum$total_sum),]

select(filter(diag_sum,diag_sum$diagnosis==640),diagnosis,total_sum)

# please some body explain me, here i used count function and used aggregate function accordingly.
but using cut() how to analyze, i mean if we use cut function and made into buckets(age) then those results are differ.
can anyone please explain me how to use cut() and how to aggregate based on buckets


Hi Pallavi ,
Below is what I used
#QUESTION 2
#In order of severity of the diagnosis and treatments and to find out the expensive treatments,
#the agency wants to find the diagnosis related group that has maximum hospitalization and expenditure
#group the highest diagnostic group with total cost
diagnosis_expenditure <- health %>% group_by(APRDRG) %>% summarise(diag_exp = sum(TOTCHG))
diagnosis_expenditure
length(unique(health$APRDRG))
#sort the highest diagnostic in descending order
highest_diag_expenditure <- diagnosis_expenditure[order(diagnosis_expenditure$diag_exp , decreasing = TRUE),]
highest_diag_expenditure
#Below code shows the diagnostic group 640 has the highest expenditure 436822
highest_diag_expenditure1 <- highest_diag_expenditure[which.max(highest_diag_expenditure$diag_exp),c('APRDRG','diag_exp')]
highest_diag_expenditure1
 

Arun Hosh

Active Member
How to delete multiple rows which don't share any common condition? Or any other alternative to get rid of these rows from dataset?

Hi Ananya - I found this, hope this helps you

You cannot actually delete a row in place, but you can get a copy of the data frame without some rows by using a negative index. This is called subsetting in R. To drop rows, provide their row numbers as a negative index and assign the result back: mydataframe <- mydataframe[-c(row_index_1, row_index_2), ] where mydataframe is the data frame.
 

Arun Hosh

Active Member
Hi Arun,

for health care project for first question i worked this way ..if any other ways are there,can you help me here,

hospital_costs <- read_excel("1555054100_hospitalcosts.xlsx")
age_count <- count(hospital_costs$AGE)
# count function given me which age group count is high, age group '0' has high count of 307 who visits hospital frequently
names(age_count)<-c('age','count')

tot_sum <- aggregate(hospital_costs$TOTCHG,by= list(hospital_costs$AGE),sum)
## using aggregate function got total expenditure for age group "0"
select(filter(tot_sum,tot_sum$agegroup==0),agegroup,expendituresum)


This looks cool. I actually categorized age into categories using the cut function then used a groupby & summarize
 

Arun Hosh

Active Member
Arun, what I think is we have to build model just including LOS, AGE, FEMALE & RACE. For q.6 we will include all columns, find the correlation either by 'cor' function or by heatmap or we can use both 'cor' & 'corrplot' functions for finding which variable is affecting 'hospital costs' more!

Arun have you made dummy variables of Race & gender before making your model? (As I can see dummy variable of only Age..)

From your current model analysis & considering all independent variables for your model, I find the following observations:
1. Your R^2 value is quite good, if you want to reduce the difference between actual R^2 & adjusted R^2, then you can reconsider the columns 'FEMALE ' , 'RACE', 'AGE_Group(>10) (as these have high Pr value & also low 'Estimate value') , '


Great point , do you mean to convert the categorical variables to dummy ones.... is that what you mean...
 

Ananya Sharma_1

Active Member
Hi Ananya - I found this, hope this helps you

You cannot actually delete a row, but you can access a dataframe without some rows specified by negative index. This process is also called subsetting in R language. To delete a row, provide the row number as index to the Dataframe. The syntax is shown below: mydataframe [ -c ( row_index_1 , row_index_2 ),] where. mydataframe is the dataframe.
Yes, through subsetting we can delete rows.
But I have around 50 rows, and writing down each row index is impractical. What should be done in this case?
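
One option is to collect the unwanted row numbers (or IDs) in a vector once, for example from which() on several conditions or from a saved list, and drop them all in one step. A sketch on made-up data, where df, id and drop_idx are hypothetical names:

```r
# Hypothetical data frame with ten rows
df <- data.frame(id = 101:110,
                 LOS = c(2, 3, 40, 1, 2, 55, 3, 2, 1, 60))

# Row numbers to remove; in practice this vector could hold all 50 indices
drop_idx <- c(3, 6, 10)

# Keep every row whose position is NOT in drop_idx.
# Unlike df[-drop_idx, ], this also behaves correctly when drop_idx is empty.
df_clean <- df[!(seq_len(nrow(df)) %in% drop_idx), ]
df_clean
```

If the rows are identified by a key rather than a position, df[!(df$id %in% bad_ids), ] follows the same pattern.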
 

Ananya Sharma_1

Active Member
Hi Ananya - I think the outlier treatment is not required in this data set, age is a meaningful data & we see most of the data lie on 0-1 age range..... i beleive this data don't need outlier treatment.... yet you may need to check the NA.
Plz share your comment
I divided the whole age column into 6 groups and then used a boxplot to find the outliers of each group, for the best prediction (I'm referring to q5). You remember sir also removed outliers of bedroom and bathroom (they are also categorical data) and said to do this for every independent column. That's why I am looking for outliers.

Please share your thoughts too.
 

Arun Hosh

Active Member
Yes, I remember sir told in the class to convert categorical data into dummy variables.

Hi Ananya - Actually, your point is a good one. I did the same and got the summary below for the model. Are you able to interpret the results and advise?

Call:
lm(formula = LOS ~ ., data = trainingSet1)

Residuals:
   Min     1Q Median     3Q    Max
-2.873 -0.873 -0.872  0.128 36.128

Coefficients:
             Estimate Std. Error t value            Pr(>|t|)
(Intercept)  2.871942   0.248774  11.544 <0.0000000000000002 ***
AGE         -0.018145   0.024068  -0.754               0.451
FEMALE       0.001483   0.333596   0.004               0.996
race2       -0.506172   1.241672  -0.408               0.684
race3        1.126575   2.999127   0.376               0.707
race4        0.623710   1.734672   0.360               0.719
race5       -0.871942   1.741984  -0.501               0.617
race6       -0.727522   2.119271  -0.343               0.732
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.986 on 336 degrees of freedom
Multiple R-squared: 0.004317, Adjusted R-squared: -0.01643
F-statistic: 0.2081 on 7 and 336 DF, p-value: 0.9835
 

Arun Hosh

Active Member
I divided the whole age column into 6 groups & then used boxplot for finding outlier of each group for optimum prediction (i'm refering to q5.). You remember sir also removed outliers of bedroom & bathroom (they are also categorical data) & said; do this for every independent column. That's why I am finding outliers..

please you too share your thoughts.
Hi Ananya - This is the reply that I received from the Simplilearn team regarding outlier treatment for this data set.

"
Please note that outliers can be considered only if the features are of the numerical data type. We do not consider outliers when it comes to categorical variables. Please note that Race and Gender are categorical variables and Age is a numerical variable. We can consider Outliers when it comes to age. But in this case study, you don't have to remove the outliers.

From the output, you need to look at the individual P values and decide if the features are significant or not and are playing a major role in predicting the output variable."

Please read and let me know.
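
A minimal sketch of reading the individual p-values programmatically rather than from the printed table. The data here is simulated and the 0.05 threshold is a conventional choice, not something stated in the reply:

```r
set.seed(3)

# Simulated stand-in for the training set (values are not real)
d <- data.frame(
  AGE    = sample(0:17, 150, replace = TRUE),
  FEMALE = sample(0:1, 150, replace = TRUE),
  LOS    = rpois(150, 3)
)

fit <- lm(LOS ~ AGE + FEMALE, data = d)

# The coefficient table is a matrix; the last column holds the p-values
coefs    <- summary(fit)$coefficients
p_values <- coefs[, "Pr(>|t|)"]
names(p_values)[p_values < 0.05]  # predictors that look significant at 5%

# Overall F-test p-value, the number printed on the last line of summary()
f <- summary(fit)$fstatistic
overall_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
overall_p
```

When every predictor has a large p-value and the overall p-value is also large, the model gives little evidence that these features predict LOS.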
 