Data Science with R | Sonal Ghanshani

Discussion in 'Big Data and Analytics' started by Nishant_Singh, Jun 1, 2019.

  1. Nishant_Singh

    Nishant_Singh Well-Known Member
    Simplilearn Support

    Joined:
    Aug 1, 2018
    Messages:
    239
    Likes Received:
    36
    #1
  2. Shaikh Mohd Shakeal

    Joined:
    May 14, 2019
    Messages:
    3
    Likes Received:
    0
    Hello Team,

    I am part of Batch 12, and we completed session 2 on 2nd June 2019. However, a few concepts from session 3 were also covered. Could the session 3 concepts be repeated in the 8th June 2019 session, as some parts of them are still unclear?

    Thanks,
    Shakeal
     
    #2
  3. Manish Kumar Keshari

    Joined:
    May 16, 2019
    Messages:
    1
    Likes Received:
    0
    #3
  4. Meenali Dussa

    Meenali Dussa Member

    Joined:
    May 24, 2019
    Messages:
    4
    Likes Received:
    0
    Please help in constructing the code for the questions below:

    Generate a 5 x 5 matrix ‘mat2’ with elements 1:5 on the diagonal and all other elements 0.
    Add another column with elements 6:10 to mat2, making it a 5 x 6 matrix.
     
    #4
  5. Meenali Dussa

    Meenali Dussa Member

    Joined:
    May 24, 2019
    Messages:
    4
    Likes Received:
    0
    Can we use the select, filter, arrange, and mutate functions on a dataset other than mtcars?
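
The dplyr verbs work on any data frame, not only mtcars. A minimal sketch, assuming the dplyr package is installed and using the built-in `iris` data; the column choices and the `ratio` column are illustrative only:

```r
# Sketch: the same dplyr verbs applied to iris instead of mtcars.
library(dplyr)

iris_summary <- iris %>%
  select(Species, Sepal.Length, Petal.Length) %>%   # keep three columns
  filter(Species != "setosa") %>%                   # drop one species
  mutate(ratio = Petal.Length / Sepal.Length) %>%   # add a derived column
  arrange(desc(ratio))                              # largest ratio first

head(iris_summary)
```

The same pipeline should work on a data frame read from a file, as long as the column names match.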
     
    #5
  6. Meenali Dussa

    Meenali Dussa Member

    Joined:
    May 24, 2019
    Messages:
    4
    Likes Received:
    0
    Can I have an update on the above?
     
    #6
  7. Ashok Narayan

    Ashok Narayan New Member

    Joined:
    May 23, 2019
    Messages:
    1
    Likes Received:
    0

    mat2 <- diag(c(1:5),5,5); mat2

         [,1] [,2] [,3] [,4] [,5]
    [1,]    1    0    0    0    0
    [2,]    0    2    0    0    0
    [3,]    0    0    3    0    0
    [4,]    0    0    0    4    0
    [5,]    0    0    0    0    5


    mat2 <- cbind(mat2,c(6:10));mat2

         [,1] [,2] [,3] [,4] [,5] [,6]
    [1,]    1    0    0    0    0    6
    [2,]    0    2    0    0    0    7
    [3,]    0    0    3    0    0    8
    [4,]    0    0    0    4    0    9
    [5,]    0    0    0    0    5   10
     
    #7
  8. indhuabinaya

    indhuabinaya Member

    Joined:
    Sep 4, 2019
    Messages:
    3
    Likes Received:
    0
    Hi Sonal ma'am
     
    #8
  9. indhuabinaya

    indhuabinaya Member

    Joined:
    Sep 4, 2019
    Messages:
    3
    Likes Received:
    0
    > rec <- read.table("C:/Users/User/Desktop/r_test/test001/recurimentdata.txt" header = TRUE,fill = TRUE)
    Error: unexpected symbol in "rec <- read.table("C:/Users/User/Desktop/r_test/test001/recurimentdata.txt" header" [once again error]
     
    #9
  10. indhuabinaya

    indhuabinaya Member

    Joined:
    Sep 4, 2019
    Messages:
    3
    Likes Received:
    0
    > SRM <- read.table("C:/Users/User/Desktop/r_test/test001/SALESREPORTFORTHEMONTH.txt" header = TRUE,fill = TRUE)
    Error: unexpected symbol in "SRM <- read.table("C:/Users/User/Desktop/r_test/test001/SALESREPORTFORTHEMONTH.txt" header"
    [I tried another internal data file]
     
    #10
  11. Abhinay_7

    Abhinay_7 Member

    Joined:
    Aug 18, 2019
    Messages:
    2
    Likes Received:
    0
    # Is using tapply on multiple columns incorrect? The lengths are the same, but I am getting an error.

    > length(rsc$debtinc)
    [1] 1500
    > length(rsc$creddebt)
    [1] 1500
    > tapply(rsc[,c("creddebt","debtinc" )], rsc$branch, mean)
    Error in tapply(rsc[, c("creddebt", "debtinc")], rsc$branch, mean) :
      arguments must have same length
     
    #11
  12. Abhinay_7

    Abhinay_7 Member

    Joined:
    Aug 18, 2019
    Messages:
    2
    Likes Received:
    0
    > virginica <- filter(iris, Species == "virginica")
    > head(virginica)
    [1] Sepal.Length Sepal.Width  Petal.Length Petal.Width  Species
    <0 rows> (or 0-length row.names)

    Not able to filter iris dataset
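
The thread does not record a cause for the empty result. A minimal diagnostic sketch in base R, assuming (this is a guess, not confirmed anywhere in the thread) that the session's `iris` object had been modified or that a different `filter` function was being picked up. The name `fresh_iris` is illustrative:

```r
# Take a fresh copy of the built-in data, in case `iris` was overwritten
# earlier in the session (an assumption; the thread does not say).
fresh_iris <- datasets::iris

# Confirm the exact spelling of the values being matched.
levels(fresh_iris$Species)

# Base-R equivalent of the filter step.
virginica <- subset(fresh_iris, Species == "virginica")
nrow(virginica)
```

If dplyr is in use, calling `dplyr::filter(...)` with the explicit namespace rules out a name clash with another package's `filter`.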
     
    #12
  13. Shabaan Shaikh

    Shabaan Shaikh New Member

    Joined:
    Sep 7, 2019
    Messages:
    1
    Likes Received:
    0
    #13
  14. Sonal Ghanshani_1

    Sonal Ghanshani_1 Active Member
    Alumni Customer

    Joined:
    Oct 31, 2018
    Messages:
    25
    Likes Received:
    6
    #14
    Shabaan Shaikh likes this.
  15. Sonal Ghanshani_1

    Sonal Ghanshani_1 Active Member
    Alumni Customer

    Joined:
    Oct 31, 2018
    Messages:
    25
    Likes Received:
    6
    Hi,

    tapply works on one continuous column grouped by one categorical column at a time, so passing two columns at once raises this error.

    regards,
    Sonal Ghanshani
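
A short sketch of the point above, using the built-in `mtcars` data because the poster's `rsc` data frame is not available. Here `mpg` and `hp` stand in for `creddebt` and `debtinc`, and `cyl` stands in for `branch`; `aggregate()` is shown as one possible way to handle several columns, not as the course's prescribed method:

```r
# tapply(): one numeric vector, one grouping variable.
mean_mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Several numeric columns at once: aggregate() with cbind() on the left.
means_by_cyl <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)

# Or apply tapply() to each column in turn.
means_matrix <- sapply(mtcars[, c("mpg", "hp")],
                       function(col) tapply(col, mtcars$cyl, mean))

mean_mpg_by_cyl
means_by_cyl
means_matrix
```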
     
    #15
  16. Sonal Ghanshani_1

    Sonal Ghanshani_1 Active Member
    Alumni Customer

    Joined:
    Oct 31, 2018
    Messages:
    25
    Likes Received:
    6
    hi,

    A comma is missing after the file name. Please try:

    rec <- read.table("C:/Users/User/Desktop/r_test/test001/recurimentdata.txt", header = TRUE, fill = TRUE)

    regards,
    Sonal Ghanshani
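
A self-contained sketch of the corrected call. Since the poster's file is not available, it writes a small temporary file first; the file contents and column names here are made up for illustration:

```r
# Write a small whitespace-separated file to a temporary location
# (a stand-in for the poster's file, whose contents are not shown).
tmp <- tempfile(fileext = ".txt")
writeLines(c("id name score",
             "1 asha 90",
             "2 ravi 85"), tmp)

# Arguments are separated by commas: the file path first, then the options.
rec <- read.table(tmp, header = TRUE, fill = TRUE)
str(rec)
```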
     
    #16
  17. Ayush Heda

    Ayush Heda New Member

    Joined:
    Aug 29, 2019
    Messages:
    1
    Likes Received:
    0
    Hi Sonal,

    Can you check whether this code is right? It is for Project 5, Web Analytics.

    av1 <- aov(p5$Uniquepageviews ~ p5$Visits, data=p5)
    summary(av1) #Ques2


    av2 <- aov(p5$Exits ~ p5$Bounces, data=p5)
    summary(av2) #Ques3


    av3 <- aov(p5$Timeinpage ~ p5$Exits, data=p5)
    summary(av3) #Ques4

    Thank you.
    Regards,
    Ayush
     
    #17
  18. anindya dutta

    anindya dutta New Member

    Joined:
    Aug 23, 2019
    Messages:
    1
    Likes Received:
    0
    Hi Sonal,
    I will not be able to attend the 19 Oct session. I am uploading my answers to the Project 9 questions; could you please check them and let me know? I also made an alternate solution where I changed age to numeric in question 5 and reran the rest of the code. Shall I upload that as well?
     

    Attached Files:

    #18
    Last edited: Oct 16, 2019
  19. Akshay Madan

    Akshay Madan Member

    Joined:
    Aug 31, 2019
    Messages:
    2
    Likes Received:
    0
    Hi Sonal,
    I am attaching my work on Project 9. However, I have some doubts about the 6th question.
    When we use the linear regression model in the 6th question, the summary gives values for every level of each factor variable.
    To choose which variable has the most effect, do we have to see which variable has most of its levels' p-values below 0.05?
    Also, please let me know which of my answers are incorrect, and I will try to solve them again.
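
One way to judge a categorical predictor as a whole, rather than level by level, is an F-test per model term. A hedged sketch using the built-in `warpbreaks` data (not the project data); using `anova()` here is a suggestion and not something the thread prescribes:

```r
# Both predictors in warpbreaks (wool, tension) are factors.
fit <- lm(breaks ~ wool + tension, data = warpbreaks)

# summary() reports one p-value per factor level, relative to the baseline level.
level_tests <- summary(fit)$coefficients
level_tests

# anova() reports one F-test per term, i.e. per variable as a whole.
# These are sequential tests, so the order of terms in the formula matters.
term_tests <- anova(fit)
term_tests[, "Pr(>F)", drop = FALSE]
```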
     

    Attached Files:

    #19
  20. Arun Rawat

    Arun Rawat Member

    Joined:
    Aug 26, 2019
    Messages:
    3
    Likes Received:
    0
    Hi Sonal,

    Please review the code.
    ##########################################################################################
    # 5. Since, the length of stay is the crucial factor for inpatients, the agency
    # wants to find if the length of stay can be predicted from age, gender, and race.
    # model <- lm(continuous variable ~ continuous + categorical)
    # if a categorical variable is stored as num or int, wrap it in as.factor()
    #Code
    model_Los <-lm(LOS ~ as.factor(AGE) +FEMALE +RACE, data = hospital)
    model_Los <-lm(LOS ~ AGE +FEMALE +RACE, data = hospital)
    model_Los <-lm(LOS ~ as.factor(AGE) +as.factor(FEMALE) +as.factor(RACE), data = hospital)
    model_Los
    ##########################################################################################
    # 6. To perform a complete analysis, the agency wants to find the variable that
    # mainly affects the hospital costs.
    # To find the variables that mainly affect the
    # total costs, construct a linear model with all the variables as the causing variables.
    # Dependent variable: TOTCHG Independent variables: All other variables
    # Code
    head(hospital)
    model_totchg <-lm(TOTCHG ~ as.factor(AGE) +as.factor(FEMALE) + as.factor(RACE) +as.factor(APRDRG), data = hospital)
    model_totchg
     
    #20
    Last edited: Oct 16, 2019
  21. Sonal Ghanshani_1

    Sonal Ghanshani_1 Active Member
    Alumni Customer

    Joined:
    Oct 31, 2018
    Messages:
    25
    Likes Received:
    6
    hello,

    #5. model_Los <-lm(LOS ~ as.factor(AGE) +FEMALE +RACE, data = hospital) - correct

    For #6. model_totchg <-lm(TOTCHG ~ as.factor(AGE) +as.factor(FEMALE) + as.factor(RACE) +as.factor(APRDRG), data= hospital)

    If a variable is already a factor, there is no need to use as.factor. Also, the question asks for a complete analysis, so we need to incorporate all the variables (all 6).

    regards,
    Sonal Ghanshani
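
A minimal sketch of a model that uses every other column as a predictor. The data frame below is a synthetic stand-in with random values (the project file is not reproduced in the thread); only the column names follow the posts. Keeping AGE and LOS numeric is an assumption made for this sketch:

```r
set.seed(1)
n <- 60

# Synthetic stand-in for the project's hospital data; all values are random.
hospital <- data.frame(
  AGE    = sample(0:17, n, replace = TRUE),
  FEMALE = factor(sample(0:1, n, replace = TRUE)),
  LOS    = sample(0:10, n, replace = TRUE),
  RACE   = factor(sample(1:3, n, replace = TRUE)),
  APRDRG = factor(sample(c(640, 753, 754), n, replace = TRUE))
)
hospital$TOTCHG <- 500 + 700 * hospital$LOS + rnorm(n, sd = 200)

# FEMALE, RACE and APRDRG are already factors, so no as.factor() is needed.
# The `.` stands for every column other than the response.
model_totchg <- lm(TOTCHG ~ ., data = hospital)
summary(model_totchg)
```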


     
    #21
  22. Sonal Ghanshani_1

    Sonal Ghanshani_1 Active Member
    Alumni Customer

    Joined:
    Oct 31, 2018
    Messages:
    25
    Likes Received:
    6
    Hello Akshay,

    The code and the inferences are correct. Please go ahead and upload the project.

    regards,
    Sonal Ghanshani
     
    #22
  23. Sonal Ghanshani_1

    Sonal Ghanshani_1 Active Member
    Alumni Customer

    Joined:
    Oct 31, 2018
    Messages:
    25
    Likes Received:
    6

    Hello Ayush,

    #Ques2. The Visits need to be converted to factor variable.

    #Ques3. Find the probable factors in the dataset which could affect the exits. Multiple linear regression needs to be applied,
    with Exits as the dependent variable and all others as independent variables.

    #Ques4. Find the variables which possibly have an effect on the time on page,
    with time on page as the dependent variable and all others as independent variables.

    regards,
    Sonal Ghanshani
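
The three suggestions above can be sketched as follows. The `p5` data frame here is synthetic, with random values and only the column names that appear in the thread; the real project data may have more columns:

```r
set.seed(42)
n <- 80

# Synthetic stand-in for the web analytics data; all values are random.
p5 <- data.frame(
  Visits     = sample(1:4, n, replace = TRUE),
  Bounces    = rpois(n, 2),
  Exits      = rpois(n, 3),
  Timeinpage = round(runif(n, 5, 300))
)
p5$Uniquepageviews <- p5$Visits + rpois(n, 1)

# Ques2: Visits is treated as a grouping factor. With data = p5 supplied,
# columns are referenced by name rather than as p5$...
av1 <- aov(Uniquepageviews ~ factor(Visits), data = p5)
summary(av1)

# Ques3: Exits as the dependent variable, all other columns as predictors.
lm_exits <- lm(Exits ~ ., data = p5)
summary(lm_exits)

# Ques4: time on page as the dependent variable, all other columns as predictors.
lm_time <- lm(Timeinpage ~ ., data = p5)
summary(lm_time)
```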
     
    #23
  24. Sonal Ghanshani_1

    Sonal Ghanshani_1 Active Member
    Alumni Customer

    Joined:
    Oct 31, 2018
    Messages:
    25
    Likes Received:
    6
    Hello Anindya,

    The project is correct. It is done succinctly. Just one point.

    Hops$RACE <- as.factor(Hops$RACE)

    Hops$FEMALE <- as.factor(Hops$FEMALE)

    These need not be repeated; they can be done once at the beginning.

    Please do upload this.

    regards,
    Sonal Ghanshani
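
One possible way to do the conversion a single time, right after loading the data. The tiny `Hops` data frame and the `factor_cols` name are made up for illustration:

```r
# Tiny stand-in data frame with made-up values.
Hops <- data.frame(RACE = c(1, 2, 1), FEMALE = c(0, 1, 1), LOS = c(2, 5, 1))

# Convert the categorical columns once, immediately after reading the data,
# so later model calls can use RACE and FEMALE directly.
factor_cols <- c("RACE", "FEMALE")
Hops[factor_cols] <- lapply(Hops[factor_cols], as.factor)
str(Hops)
```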
     
    #24
  25. Arun Rawat

    Arun Rawat Member

    Joined:
    Aug 26, 2019
    Messages:
    3
    Likes Received:
    0
     
    #25
  26. Arun Rawat

    Arun Rawat Member

    Joined:
    Aug 26, 2019
    Messages:
    3
    Likes Received:
    0
    #Project Hospital Cost#
    rm(list=ls())
    getwd()
    hospital <- read.csv(file.choose())
    head(hospital)
    # Attribute Description
    # Age Age of the patient discharged
    # Female A binary variable that indicates if the patient is female
    # Los Length of stay in days
    # Race Race of the patient (specified numerically)
    # Totchg Hospital discharge costs
    # Aprdrg All Patient Refined Diagnosis Related Groups
    nrow(hospital)
    summary(hospital)
    colSums(is.na(hospital))
    hospital <- na.omit(hospital)
    colSums(is.na(hospital))
    str(hospital)
    # RACE, FEMALE and APRDRG are categorical variables
    hospital$RACE <- as.factor(hospital$RACE)
    hospital$FEMALE <- as.factor(hospital$FEMALE)
    unique(hospital$APRDRG)
    hospital$APRDRG_Factor <- as.factor(hospital$APRDRG)
    str(hospital)
    summary(hospital)
    # Creating age bins#
    hospital$age_bins <- ifelse((hospital$AGE < 1), "infant",
    ifelse(hospital$AGE < 3, 'toddler',
    ifelse(hospital$AGE < 11, 'child',
    'adolescent')))
    hospital$age_bins <- as.factor(hospital$age_bins)
    str(hospital)
    head(hospital)
    View(hospital)
    # 'data.frame': 500 obs. of 8 variables:
    # $ AGE : int 17 17 17 17 17 17 17 16 16 17 ...
    # $ FEMALE : Factor w/ 2 levels "0","1": 2 1 2 2 2 1 2 2 2 2 ...
    # $ LOS : int 2 2 7 1 1 0 4 2 1 2 ...
    # $ RACE : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
    # $ TOTCHG : int 2660 1689 20060 736 1194 3305 2205 1167 532 1363 ...
    # $ APRDRG : int 560 753 930 758 754 347 754 754 753 758 ...
    # $ APRDRG_Factor: Factor w/ 63 levels "21","23","49",..: 32 51 62 55 52 28 52 52 51 55 ...
    # $ age_bins : Factor w/ 4 levels "adolescent","child",..: 1 1 1 1 1 1 1 1 1 1 ...
    head(hospital)
    View(hospital)
    # 1. To record the patient statistics, the agency wants to find the age
    # category of people who frequent the hospital and have the maximum expenditure.
    # Code:
    hist(hospital$AGE)
    summary(as.factor(hospital$AGE))
    #Age:         0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
    #Frequency: 307  10   1   3   2   2   2   3   2   2   4   8  15  18  25  29  29  38
    # Result: From the graph that is displayed, we can see that infants (AGE = 0) have the maximum
    # frequency of hospital visits, going above 300. The summary of the AGE attribute gives
    # the numerical output (after converting the age from numeric to factor), and we
    # can see that there are 307 entries for those in the range of 0-1 year.
    #########################################################################################################
    # b. To find the age category with the maximum expenditure,
    # we need to add the expenditure for each age, and find the maximum value
    # from the sum. We will use the aggregate function to add the values of
    # total expenditure according to the values of age.
    # With AGE
    aggregate(TOTCHG ~ AGE, FUN = sum, data = hospital)
    hospital.df <- aggregate(TOTCHG ~ AGE, FUN = sum, data = hospital)
    hospital.df
    hospital.df[(hospital.df$TOTCHG == max(hospital.df$TOTCHG)),]
    #AGE TOTCHG
    #1 0 676962
    # age_bins
    hospital.age_bins <-aggregate(TOTCHG ~ age_bins, FUN = sum, data = hospital)
    hospital.age_bins
    max(aggregate(TOTCHG ~ age_bins, FUN = sum, data = hospital)$TOTCHG)
    hospital.age_bins[(hospital.age_bins$TOTCHG == max(hospital.age_bins$TOTCHG)),]
    # age_bins TOTCHG
    #3 infant 676962
    # Result: From the result we can see that the infant category (AGE = 0) has maximum hospital costs
    # as well (in accordance with the number or frequency of visit). Following the infants,
    # 15 and 17 year old individuals have high hospitalization costs.
    ###########################################################################################################
    # 2. In order of severity of the diagnosis and treatments and to find out the expensive
    # treatments, the agency wants to find the diagnosis related group that has maximum
    # hospitalization and expenditure.
    summary(hospital$APRDRG_Factor)
    which.max(summary(hospital$APRDRG_Factor)) # which diagnostic group is repeating maximum number of times
    hospital.dg <- aggregate(TOTCHG ~ APRDRG_Factor, FUN = sum, data = hospital)
    hospital.dg
    hospital.dg[which.max(hospital.dg$TOTCHG),] # diagnosis group with the highest total cost
    # Result: From the results we can see that the category 640 has the
    # maximum entries of hospitalization, by a huge contrast (266 of 500 entries),
    # and also has the highest total hospitalization cost (437978)
    ################################################################################################
    #hist(hospital$LOS)
    #summary(as.factor(hospital$LOS))
    ##aggregate(TOTCHG ~ LOS, FUN = sum, data = hospital)
    #hospital.LOS <- aggregate(TOTCHG ~ LOS, FUN = sum, data = hospital)
    #hospital.LOS
    #hospital.LOS[(hospital.df$TOTCHG == max(hospital.df$TOTCHG)),]
    #######################################################################################
    # 3. To make sure that there is no malpractice, the agency needs to analyze if the
    # race of the patient is related to the hospitalization costs.
    # If there is any effect of RACE on TOTCHG
    # To analyze, first we need to convert the Race variable to factors and perform a summary of the variable. This
    # will help to find how many patients belonging to the different groups were admitted.
    # Then, to verify if the races made an impact on the costs, performing an ANOVA with the
    # following variables:
    # Defining Hypothesis
    # H0: Race has no impact on the costs
    # H1: Race has an impact on the costs
    # ANOVA dependent variable: TOTCHG
    # Categorical/grouping variable: RACE
    # Missing values: 1 NA value, use na.omit to remove the NA value
    #
    # Code:
    str(hospital$RACE)
    str(hospital$TOTCHG)
    model <- aov(TOTCHG ~ RACE, data = hospital) # numerical ~ categorical variable
    summary(model)
    alpha = 0.05
    pvalue = 0.943
    pvalue < alpha # TRUE would mean the p-value is below alpha, so we would reject the null hypothesis
    # Result: The p-value is very high, indicating no evidence of a relation between
    # the race of the patient and the hospital cost.
    summary(hospital$RACE)
    # Race 1 2 3 4 5 6
    # Frequency 484 6 1 3 3 2
    # From the summary we can also see that:
    # the data has 484 patients of Race 1 out of the 500 entries. This will affect the
    # results of ANOVA as well, since the number of observations is very much skewed.
    # In conclusion, there is not enough data to verify if the race of patient is related
    # to the hospitalization cost.
    ################################################################################################
    # 4. To properly utilize the costs, the agency has to analyze the severity of the
    # hospital costs by age and gender for proper allocation of resource
    # Defining Hypothesis
    # H0: Race and age have no impact on the costs
    # H1: Race and age have an impact on the costs
    # ANOVA dependent variable: TOTCHG
    #code:
    colSums(is.na(hospital))
    str(hospital)
    str(hospital$RACE)
    str(hospital$AGE)
    str(hospital$TOTCHG)
    model_cost <- aov(TOTCHG ~ RACE + as.factor(AGE), data = hospital) # numerical ~ categorical variables
    summary(model_cost)
    pvalue_race = 0.94112
    pvalue_age = 0.00413
    pvalue_race < alpha
    pvalue_age < alpha
    # Result:
    # From the ANOVA test we can see that
    # race has no relation to the hospitalization cost, whereas age
    # is related to the hospitalization cost.
    ##########################################################################################
    # 5. Since, the length of stay is the crucial factor for inpatients, the agency
    # wants to find if the length of stay can be predicted from age, gender, and race.
    #Code
    model_Los <-lm(LOS ~ as.factor(AGE) +FEMALE +RACE, data = hospital)
    model_Los <-lm(LOS ~ AGE +FEMALE +RACE, data = hospital)
    model_Los <-lm(LOS ~ as.factor(AGE) +as.factor(FEMALE) +as.factor(RACE), data = hospital)
    model_Los
    ##########################################################################################
    # 6. To perform a complete analysis, the agency wants to find the variable that
    # mainly affects the hospital costs.
    # Code
    head(hospital)
    model_totchg <-lm(TOTCHG ~ as.factor(AGE) +as.factor(FEMALE) + as.factor(RACE) +as.factor(APRDRG), data = hospital)
    model_totchg
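
The ANOVA steps in the script above type the p-value in by hand. A hedged sketch of reading it from the fitted model instead, on a synthetic data frame with random values (`demo`, `anova_table` and `reject_null` are illustrative names, not part of the project):

```r
set.seed(7)

# Synthetic stand-in: three groups drawn from the same distribution.
demo <- data.frame(
  RACE   = factor(rep(1:3, each = 20)),
  TOTCHG = rnorm(60, mean = 2000, sd = 500)
)
model <- aov(TOTCHG ~ RACE, data = demo)

# summary() on an aov fit returns a list; its first element is the ANOVA table.
anova_table <- summary(model)[[1]]
pvalue <- anova_table[["Pr(>F)"]][1]   # first row is the RACE term

alpha <- 0.05
reject_null <- pvalue < alpha          # TRUE means reject the null hypothesis
pvalue
reject_null
```

Reading the value from the table avoids copying errors and keeps the comparison in step with the model if the data changes.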
     
    #26
  27. Pooja Trivedi

    Pooja Trivedi Member

    Joined:
    Aug 12, 2019
    Messages:
    2
    Likes Received:
    0
    Hi Sonal Ma'am,

    Please check my code and let me know if any changes are required or if I have done everything correctly.
    Thank you for your time.

    Regards,
    Pooja.
     

    Attached Files:

    #27
  28. Pooja Trivedi

    Pooja Trivedi Member

    Joined:
    Aug 12, 2019
    Messages:
    2
    Likes Received:
    0
    Sonal ma'am once you confirm I will upload it to LMS.
     
    #28
  29. Sonal Ghanshani_1

    Sonal Ghanshani_1 Active Member
    Alumni Customer

    Joined:
    Oct 31, 2018
    Messages:
    25
    Likes Received:
    6
    Hello Pooja,

    Please upload it. It is correct.

    regards,
    Sonal Ghanshani
     
    #29
  30. Arsalen Syed

    Arsalen Syed New Member

    Joined:
    Sep 20, 2019
    Messages:
    1
    Likes Received:
    0
  31. Vikas Kumar_18

    Vikas Kumar_18 Well-Known Member
    Simplilearn Support Alumni

    Joined:
    Dec 17, 2018
    Messages:
    159
    Likes Received:
    20
    Hi Arsalen,

    Please mention the dataset names as well so that Sonal can upload them to the Google Drive. However, the datasets are also available in our LMS: LMS > Data Science With R > Self Learning > scroll to the bottom > Datasets folder. There you will find the datasets used in our self-learning and live sessions. I hope this helps.


    Regards,
    Vikas kumar
     
    #31
  32. Akshay Madan

    Akshay Madan Member

    Joined:
    Aug 31, 2019
    Messages:
    2
    Likes Received:
    0
    Hi Sonal,
    I wanted to know if you could share the hints for all the projects in the drive. We have all completed one project already and would like to complete the rest at our own pace.
     
    #32