
DS with R | Mar 31 - Apr 22 | Samridhi

Status
Not open for further replies.


Richa Rao
Alumni
Hi Everyone,

Please find attached the assignment questions on data manipulation using the apply and dplyr function families. Please post your answers here.

Best Regards,
Samridhi

#Questions on Data Manipulation
library(dplyr)
head(iris)
?iris
#Find the sum of each column and confirm if the sum is greater than 800 or not

s=apply(iris[,-5],2,sum)
s
for(i in s)
{
if(i>800)
print("sum is greater than 800")
else
print("sum is not greater than 800")
}

#Hint: create a custom function to find the sum and compare it with 800,
#and apply the function on each numerical column

#Find the sum / mean / median of Sepal Length species-wise

head(iris)
levelwise_sum= tapply(iris$Sepal.Length, list(iris$Species), sum)
levelwise_mean= tapply(iris$Sepal.Length, list(iris$Species), mean)
levelwise_median= tapply(iris$Sepal.Length, list(iris$Species), median)
# OR, with dplyr:
iris%>%group_by(Species)%>%summarise(median(Sepal.Length, na.rm=T))
iris%>%group_by(Species)%>%summarise(mean(Sepal.Length, na.rm=T))
iris%>%group_by(Species)%>%summarise(sum(Sepal.Length, na.rm=T))

#For all the flowers having sepal width > 3.0, find the number of flowers in each species
summarise(group_by(filter(iris, Sepal.Width>3.0), Species), number=n())

#Count how many different petal widths are there in each species.
head(iris)

iris%>%group_by(Species)%>%summarise(n_distinct(Petal.Width))

#Create 2 dataframes from Iris:
# Df1: variables containing the keyword Width + Species column
df1= iris%>%select(contains("Width"), Species)
df1
# Df2: variables containing the keyword Length + Species column
df2= iris%>%select(contains("Length"), Species)
df2

library(hflights)
#How many flights are not cancelled? Hint: use the Cancelled variable

?hflights
head(hflights,5)
summarise(filter(hflights, Cancelled==0), number=n())

#Combine year month and day variables to create a date column
p=hflights%>%mutate(date_column=paste(Year, Month, DayofMonth, sep="/"))
head(p)
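As a side note, paste() produces a character column, not a real date; if an actual Date is wanted, base R's as.Date() can parse the pasted string (a small sketch, independent of hflights):

```r
# Build a date string from year/month/day parts and parse it into a Date
y <- 2011; m <- 1; d <- 15
date_string <- paste(y, m, d, sep = "-")   # "2011-1-15"
date_value  <- as.Date(date_string)        # parsed with the default %Y-%m-%d format

format(date_value, "%Y-%m-%d")             # "2011-01-15"
```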

#Find the maximum AirTime for all flights whose Departure delay is not NA

l=hflights%>%group_by(TailNum)%>%filter(!is.na(DepDelay))%>%summarise(max_airtime=max(AirTime, na.rm=T))
head(l,20)

#Find per-carrier mean of arrival delays and arrange them in increasing / decreasing order

hflights%>%group_by(UniqueCarrier)%>%summarise(mean_value=mean(ArrDelay,na.rm=T))%>%arrange(desc(mean_value))
hflights%>%group_by(UniqueCarrier)%>%summarise(mean_value=mean(ArrDelay,na.rm=T))%>%arrange((mean_value))

#How many airplanes only flew to one destination from Houston?
nrow(filter(hflights, Origin=="HOU"))
summarise(filter(hflights, Origin=="HOU"), n_row=n())
# Hint: each tail number represents 1 airplane.

[Samridhi, can you please suggest an answer to the question "#Find the sum of each column and confirm if the sum is greater than 800 or not" using function creation? I tried, but it was not working; it was taking into account only 1 column.]
[Also, the answer for this question: #How many airplanes only flew to one destination from Houston?]
 

Samridhi Dutta

Well-Known Member
Alumni
Trainer
At 5% significance level:

For Left tail test:

-1.64485363

For right tail test:

1.644853627

For 2 tail test:

-1.95996398 & 1.959963985

At 1% significance level:

Left tail test:
-2.32634787

Right tail test:
2.326347874

2 Tail test:
-2.5758293 & 2.575829304

At 0.1% significance level:

Left tail test:
-3.09023231

Right tail test:
3.090232306

2 Tail test:
-3.29052673 & 3.290526731
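These critical values come straight from the standard normal quantile function, so they can be verified in R with qnorm():

```r
# Critical z-values for the significance levels listed above
qnorm(0.05)    # -1.644854  (left tail, 5%)
qnorm(0.95)    #  1.644854  (right tail, 5%)
qnorm(0.975)   #  1.959964  (two-tail, 5%: 2.5% in each tail)
qnorm(0.995)   #  2.575829  (two-tail, 1%)
qnorm(0.999)   #  3.090232  (right tail, 0.1%)
qnorm(0.9995)  #  3.290527  (two-tail, 0.1%)
```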

Hi Richa,

Henceforth, for the benefit of other students, let us please upload the excel sheet on which we perform our calculations. It need not be formal, just a little organised and numbered, to facilitate the learning of all.

The answers here are right :)

Regards,
Samridhi
 


Richa Rao
Alumni
Hi Richa,

Henceforth, for the benefit of other students, let us please upload the excel sheet on which we perform our calculations. It need not be formal, just a little organised and numbered, to facilitate the learning of all.

The answers here are right :)

Regards,
Samridhi
Sure. :)
 

Samridhi Dutta

Well-Known Member
Alumni
Trainer

Hi Richa,

Find the sum of each column and confirm if the sum is greater than 800 or not
For this question, create a custom-defined function sum_comparison(x) in which the sum of the vector x is compared with 800 and TRUE/FALSE is returned. Apply this function on all the numeric columns of the data frame using apply, lapply or sapply.
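A minimal sketch of that approach (using the function name sum_comparison described above):

```r
# Custom function: TRUE if the column sum exceeds 800, FALSE otherwise
sum_comparison <- function(x) {
  sum(x) > 800
}

# Apply it to every numeric column of iris (column 5 is the Species factor)
apply(iris[, -5], 2, sum_comparison)
# equivalently: sapply(iris[, -5], sum_comparison)
```

Only Sepal.Length (sum 876.5) exceeds 800, so the first element is TRUE and the rest are FALSE.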

How many airplanes only flew to one destination from Houston?

hflights%>%filter(Origin=="HOU")%>%group_by(TailNum)%>%summarise(n=n_distinct(Dest))%>%filter(n==1)%>%summarise(n())

Regards,
Samridhi
 


Richa Rao
Alumni
Hi Everyone,

Please find attached the assignment on Hypothesis Testing I and II. Please attempt the questions and submit in the community forum.

Best Regards
Samridhi

Hi Samridhi/All,

Please help in checking the answers and concepts of the basic stats assignment. I just want to confirm whether my understanding of these concepts is correct.

Also, for the T-inverse, the T.INV.2T function is not available in my Excel, so I have solved the one question involving the inverse T using NORMINV only.
 

Attachments

  • assignment_2.zip
    10.2 KB · Views: 12
  • Basic_stats1-assignment.zip
    9.2 KB · Views: 12

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi Guys,

Please find attached the class notes of Linear Regression and Logistic regression discussed today. Please try your hands on Data Science Project_4 from LMS->Resources (Projects). This is a project on logistic regression. The learners will be expected to do this project in the class. Practice it at your end, try the coding at home, and let's discuss in the class tomorrow (Let's keep the var names same as in the given code and in lower case to keep consistency).

Regards,
Samridhi
 

Attachments

  • Class_Notes_21Apr_Linear Regression.pdf
    186.3 KB · Views: 20
  • Class_Notes_21Apr_Logistic Regression.pdf
    336.6 KB · Views: 20

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi guys,

Sensitivity, specificity and ROC curve are complex concepts. So I have tried to summarize the concepts for you. Please go through the attached document and image of ROC Curve, and let me know if there are any queries.
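The core definitions can be stated in a few lines of R (the confusion-matrix counts here are made up purely for illustration):

```r
# A hypothetical 2x2 confusion matrix
TP <- 80; FN <- 20   # actual positives: predicted positive / predicted negative
TN <- 90; FP <- 10   # actual negatives: predicted negative / predicted positive

sensitivity <- TP / (TP + FN)   # true positive rate (TPR) = 0.8
specificity <- TN / (TN + FP)   # true negative rate       = 0.9
fpr         <- 1 - specificity  # false positive rate (FPR)
```

A good ROC threshold is one with a high TPR and a low FPR, which is exactly what the attached graph lets you read off.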

Correction:
Guys, there is a small mistake in the given pdf, in the question:
"In the attached ROC graph, find the threshold to minimize FPR while maximizing TPR. What is the best threshold in the above graph?"
The answer should be 0.6 and not 0.7.

Best Regards
Samridhi
 

Attachments

  • Sensitivity_Specificity_ROC Curve.pdf
    331 KB · Views: 14
  • ROC_curve.png
    ROC_curve.png
    42.7 KB · Views: 13
Last edited:

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi guys,

Please find attached the notes discussed today in the class (22nd Apr).

Best Regards
Samridhi
 

Attachments

  • Class_Notes_22_Apr_Classification.pdf
    353.2 KB · Views: 17

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi guys,

As assignment please try your hands on Project 2 and Project 4 and submit here in the community forum.

For the final project, you need to submit either of the Healthcare, Insurance, Internet or Retail project on the coming sunday. I would expect all of you to choose your final project and work in this week and ask your questions here in community forum, so that if something requires discussion, then we are able to spend time in the class on Saturday.

I understand it's complex, but practice and patience are the keys to success :)!

Regards,
Samridhi
 
Last edited:

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Here are the model selection guidelines which will be useful for you:

Classification

Linear problem: Logistic Regression or SVM (linear)

Non-linear problem: K-NN, Naive Bayes, Decision Tree, SVM (radial) or Random Forest

Then which one should you choose in each case? We'll do model selection with k-fold cross-validation.

From a business point of view, you would rather use:

- Logistic Regression or Naive Bayes when you want to rank the predictions by their probability, e.g. when targeting customers for marketing campaigns and a probability ranking is needed. For this type of business problem, use Logistic Regression if the problem is linear and Naive Bayes if it is non-linear.

- SVM to predict which segment a customer belongs to. Segments can be any kind of segments, for example market segments identified with clustering.

- Decision Tree when you want a clear interpretation of the model results.

- Random Forest when you are just looking for high performance with less need for interpretation.

Regression

Linear problem: Linear Regression

Non-linear problem: Polynomial Regression, SVR, Decision Tree or Random Forest. Then which one should you choose among these four? Use k-fold cross-validation and pick the model that shows the best results.
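As an illustration of the k-fold idea, here is a hand-rolled 10-fold cross-validation of a decision tree on iris (rpart ships with R; in practice a helper package such as caret automates this loop):

```r
library(rpart)  # decision trees; part of the standard R distribution

set.seed(123)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(iris)))  # random fold label per row

# Fit on k-1 folds, measure accuracy on the held-out fold, repeat k times
cv_acc <- sapply(1:k, function(i) {
  train <- iris[folds != i, ]
  test  <- iris[folds == i, ]
  fit   <- rpart(Species ~ ., data = train)
  pred  <- predict(fit, test, type = "class")
  mean(pred == test$Species)
})

mean(cv_acc)  # average CV accuracy; compute the same for each candidate model
```

The model with the best average CV accuracy (and the smallest spread across folds) wins.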
 

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi guys,

For regression (continuous DV), I request all of you to try your hands on the Boston data and try implementing decision trees, random forest, SVR (linear), SVR (radial) and k-fold cross-validation. The method is similar to the one discussed in the class. You can verify your code against the attached reference.

Let me know in case of queries.

Regards
Samridhi
 

Attachments

  • Regression_SVR_DecTrees_RF_kfoldCV.pdf
    198.8 KB · Views: 14

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi guys,
I am attaching the document again after making the above mentioned correction.

Regards
Samridhi
 

Attachments

  • Sensitivity_Specificity_ROC.pdf
    329.6 KB · Views: 7
  • ROC_curve.png
    ROC_curve.png
    42.7 KB · Views: 10


Richa Rao
Alumni
Hi Samridhi,

Please review the Flight delay assignment which we were supposed to do in the breakout session of 21st April.

I have a few doubts; please help in clarifying them:

1. When I did a step-down on the model and checked for collinearity, there were many independent variables which were highly significant but had a collinearity (VIF) value of more than 10. I was not sure whether to include or exclude those independent variables from the logistic regression formula, as a 3-star p-value is significantly good and needs to be considered, but at the same time a VIF value above 5 is not good.

2. After the ROC curve, once we have obtained the threshold value, we predict on the test data. Should we calculate the accuracy right then and proceed further with the area-under-curve calculation? In class, the accuracy was calculated after the area-under-curve calculation, which got me confused. Why would we calculate the accuracy after the AUC?

3. In this particular problem, I tested with different threshold values and got an accuracy of 79% and an area under the curve of 70.6%. How do we know whether this area or accuracy is good? Is there any standard value to compare with, or do we have to try different models like random forest and decision tree and then take the most accurate result into account?

Please check the attached code and let me know whether my understanding is correct.
 

Attachments

  • hflights_logistic_regression.txt
    2.3 KB · Views: 9

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi Richa,

I shall check the code at a convenient time, but here are the quick answers to help you improve your judgement.

Please review the Flight delay assignment which we were supposed to do in the breakout session of 21st April.

I have a few doubts; please help in clarifying them:

1. When I did a step-down on the model and checked for collinearity, there were many independent variables which were highly significant but had a collinearity (VIF) value of more than 10. I was not sure whether to include or exclude those independent variables from the logistic regression formula, as a 3-star p-value is significantly good and needs to be considered, but at the same time a VIF value above 5 is not good.

- Multicollinearity is not allowed in logistic regression. We have to check the correlations and make sure to keep only one of the correlated variables, even if the accuracy goes down slightly. A VIF > 5 is not allowed, even for significant variables.


2. After the ROC curve, once we have obtained the threshold value, we predict on the test data. Should we calculate the accuracy right then and proceed further with the area-under-curve calculation? In class, the accuracy was calculated after the area-under-curve calculation, which got me confused. Why would we calculate the accuracy after the AUC?

- The AUC has nothing to do with the selection of the threshold value. The accuracy depends on the threshold you select based on the ROC curve, whereas the area under the curve indicates the area covered by the model across all threshold values. An AUC of 100% indicates that there is at least one threshold value where TPR = 1 and FPR = 0; we never obtain a 100% perfect model. The AUC of the worst model is 50%. An AUC >= 70% is considered a good model.

3. In this particular problem, I tested with different threshold values and got an accuracy of 79% and an area under the curve of 70.6%. How do we know whether this area or accuracy is good? Is there any standard value to compare with, or do we have to try different models like random forest and decision tree and then take the most accurate result into account?

- An AUC >= 70% is considered a good model. You need to try different models and choose the model with the best accuracy and the least variance error.
 
Last edited:


Richa Rao
Alumni
Hi Samridhi/All,

While doing random forest / decision tree, is it advisable/mandatory to always make an ROC curve and decide the threshold value before proceeding with prediction on the test data?
 

_26313

Member
Alumni
Hi Guys,

Please find attached the class notes of Linear Regression and Logistic regression discussed today. Please try your hands on Data Science Project_4 from LMS->Resources (Projects). This is a project on logistic regression. The learners will be expected to do this project in the class. Practice it at your end, try the coding at home, and let's discuss in the class tomorrow (Let's keep the var names same as in the given code and in lower case to keep consistency).

Regards,
Samridhi
I am unable to open the "Assignment Statistics.pdf" file here. It keeps on giving an error about corruption.

I was also getting the same error. I just opened the document in Excel format and it opened. Better give it a try.
 

_26313

Member
Alumni
I was trying the project with the solution and am not able to load "library(arules)" as given in the solution. Can you please tell me how this can be loaded?
 

Samridhi Dutta

Well-Known Member
Alumni
Trainer
I was trying the project with the solution and am not able to load "library(arules)" as given in the solution. Can you please tell me how this can be loaded?

You need to try only Project 2 and Project 4 as assignments. Projects 1 and 3 are association mining (the arules package) and clustering, which we will cover at the weekend.

For your final project, you have to choose one of the Insurance, Health, Internet or Retail projects.

Regards,
Samridhi
 

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi Samridhi/All,

While doing random forest / decision tree, is it advisable/mandatory to always make an ROC curve and decide the threshold value before proceeding with prediction on the test data?

Hi Richa,

No, it's not mandatory at all. In fact, unless we have a specific business requirement of giving higher priority to sensitivity over specificity or vice-versa, we will not vary the threshold.

However, finding the area under the curve using the ROC curve helps us understand the performance of the model. A good model will have at least one cut-off with a high TPR and a low FPR value, and hence the area under the curve will be high.

With regards to your question on whether it is advisable: the primary use of the ROC curve is sensitivity and specificity analysis, so it depends on the business requirement. Also, AUC is a good metric for comparing classification model performance.
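The AUC itself has a simple interpretation: the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. A base-R sketch with made-up scores (no package such as pROC needed, though pROC computes this for you):

```r
# Hypothetical predicted probabilities and true class labels
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)
label <- c(1,   1,   0,   1,   1,    0,   0,   1,   0,   0)

pos <- prob[label == 1]
neg <- prob[label == 0]

# Compare every positive score with every negative score (ties count as 1/2)
wins <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
auc  <- mean(wins)
auc  # 0.8 for this toy data
```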

Regards,
Samridhi
 


Richa Rao
Alumni
Thanks Samridhi.
 

Samridhi Dutta

Well-Known Member
Alumni
Trainer

Hi Richa,

I have reviewed the code; it's correct. Following are additional inputs:
1. The final model you select should have only significant variables. All the insignificant variables need to be removed from the model equation manually.
2. The model with the lowest AIC, with as few variables as possible, is the best model.
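Point 2 can be automated with base R's step(), which searches for the lowest-AIC model. An illustrative sketch on the built-in mtcars data (not the flight data):

```r
# Full logistic model, then stepwise AIC-based reduction
full    <- glm(am ~ mpg + wt + hp, data = mtcars, family = binomial)
reduced <- step(full, trace = 0)   # drops terms while AIC keeps improving

AIC(full)
AIC(reduced)      # never worse than the starting model
summary(reduced)  # the terms that survived the search
```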

Regards,
Samridhi
 

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi Samridhi/All,

Please help in checking the answers and concepts of the basic stats assignment. I just want to confirm whether my understanding of these concepts is correct.

Also, for the T-inverse, the T.INV.2T function is not available in my Excel, so I have solved the one question involving the inverse T using NORMINV only.
Hi Richa,

There are some mistakes in this. I shall post the solution.

Regards
Samridhi
 

Madhu K_M

Member
Alumni
Hi

I am getting an error - Error: could not find function "prp".
I wonder which package I need to load?

library(rpart)
head(training)
model = rpart(as.factor(churn)~., training)
prp(model)
 

Madhu K_M

Member
Alumni
I am getting the same kind of error with the solution for Project 2 also.

library(textir) ## needed to standardize the data
library(class) ## needed for knn
credit <- read.csv("germancredit.csv")
credit$Default <- factor(credit$Default)
credit$history = factor(credit$history,
levels=c("A30","A31","A32","A33","A34"))
levels(credit$history) = c("good","good","poor","poor","terrible")
credit$foreign <- factor(credit$foreign, levels=c("A201","A202"),
labels=c("foreign","german"))
credit$rent <- factor(credit$housing=="A151")
credit$purpose <- factor(credit$purpose,
levels=c("A40","A41","A42","A43","A44","A45","A46","A47","A48","A49","A410"))
levels(credit$purpose) <-
c("newcar","usedcar",rep("goods/repair",4),"edu",NA,"edu","biz","biz")

## for demonstration, cut the dataset to these variables
credit <- credit[,c("Default","duration","amount","installment","age",
"history", "purpose","foreign","rent")]
credit[1:3,]
summary(credit) # check out the data

x <- normalize(credit[,c(2,3,4)])
x <- scale(credit[,c(2,3,4)])
error:
> x <- normalize(credit[,c(2,3,4)])
Error: could not find function "normalize"
> x <- scale(credit[,c(2,3,4)])
I am using R version 3.3.2.
 

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi

I am getting an error - Error: could not find function "prp".
I wonder which package I need to load?

library(rpart)
head(training)
model = rpart(as.factor(churn)~., training)
prp(model)

Hi Madhu,
Please google "prp package in R"; you need to load the package to which this function belongs.
The prp function belongs to the rpart.plot library, and you'll have to install and load it.

Regards
Samridhi
 

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi Madhu,

You need to define the normalize function the way we defined it in class. Please refer to the classification class notes and check how we defined it.

Regards
Samridhi
 

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi Everyone,

Please find attached the assignment on Hypothesis Testing I and II. Please attempt the questions and submit in the community forum.

Best Regards
Samridhi
Hi All,

Attaching the solutions for the statistics assignments.

Regards,
Samridhi
 

Attachments

  • Assignment_statistics_solutions.zip
    16 KB · Views: 8

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi Guys,

Please find attached the class notes of Association Rules Mining discussed today.

Best Regards,
Samridhi
 

Attachments

  • Class_notes_28_Apr_AssociationRules.pdf
    320.8 KB · Views: 9

_27016

Member
Alumni
Hi Ms Samridhi,

PFA answer sheet for my project on Health care.

I have not been able to grasp the usage of R yet, so these are only concept-based answers based on my understanding at present.

Regards
Sanjay Aggarwal
 

Attachments

  • Answers.pdf
    353.3 KB · Views: 9

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi Ms Samridhi,

PFA answer sheet for my project on Health care.

I have not been able to grasp the usage of R yet, so these are only concept-based answers based on my understanding at present.

Regards
Sanjay Aggarwal

Hi Sanjay,

You will have to do the project in R as well, and submit it in the Projects tab on the LMS. Try this today and let's discuss your queries tomorrow.

In order to do the project in R, you may download all the class notes posted on the community forum and copy-paste them into RStudio. Then it will be easy for you to refer to the notes if you forget a function name or its usage.

Regards,
Samridhi
 

_27016

Member
Alumni
I have added the code as explained today. However, I need to go through it from the start and learn coding.
 

Attachments

  • Answer1.0.pdf
    356.4 KB · Views: 7

Samridhi Dutta

Well-Known Member
Alumni
Trainer
I have added the code as explained today. However, I need to go through it from the start and learn coding.
Hi,

I would request you to please upload and submit the code along with documentation in the Projects tab of the LMS.

Let me know if I can help you during your learning and practice :)

Regards,
Samridhi
 

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi Guys,

Please find attached today's class notes on clustering.

All the best to everybody for a very bright career in Data Science.

Best Regards,
Samridhi
 

Attachments

  • Class_Notes_29_Apr_Clustering.pdf
    342.4 KB · Views: 7

Samridhi Dutta

Well-Known Member
Alumni
Trainer

Hi,

Here is the normalize function. You can add it to your code, and then it will not throw an error.

normalize <- function(x){
  return((x - min(x)) / (max(x) - min(x)))
}
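A quick check of what it does (it rescales a numeric vector to the [0, 1] range):

```r
normalize <- function(x){
  return((x - min(x)) / (max(x) - min(x)))
}

normalize(c(2, 4, 6, 10))  # 0.00 0.25 0.50 1.00
```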

Regards
Samridhi
 


Richa Rao
Alumni
Hi Samridhi,

In all the examples till now, we have dealt with a dependent variable having two factors. What if the dependent variable is categorical with multiple levels? Can we use all the other classification techniques (like decision tree, logistic, naive Bayes, random forest, KNN) in that case also?
 

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Hi Samridhi,

In all the examples till now, we have dealt with a dependent variable having two factors. What if the dependent variable is categorical with multiple levels? Can we use all the other classification techniques (like decision tree, logistic, naive Bayes, random forest, KNN) in that case also?
Hi Richa,

Yes, if the categorical DV has multiple levels, then all these models can still be used.

Try it on the iris data set with Species as the DV. Use the following command:
<modelname>(as.factor(Species)~., iris)
where <modelname> can be rpart / randomForest / knn / naiveBayes (plain glm is limited to two classes; for a multi-level logistic model, use multinomial logistic regression, e.g. nnet::multinom).
To test the accuracy, make a confusion matrix.
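For instance, with rpart (one of the models listed; the same pattern works for the others):

```r
library(rpart)

# Three-level DV: Species has setosa / versicolor / virginica
model <- rpart(as.factor(Species) ~ ., data = iris)
pred  <- predict(model, iris, type = "class")

# Confusion matrix: rows = predicted class, columns = actual class
table(predicted = pred, actual = iris$Species)
mean(pred == iris$Species)  # overall (training) accuracy
```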

Best Regards,
Samridhi
 


Richa Rao
Alumni
Hi Samridhi,

I need your help on a practice project that I am doing.

This project is about identifying fraudulent insurance claims made by customers of a health insurance company.
What hypotheses can we assume while working on this kind of project? I mean, I want to use an ifelse condition to segregate fraud from non-fraud claims. What hypotheses should be taken into consideration to classify them?
 


Richa Rao
Alumni
Hi Samridhi,

I can think of only two hypotheses, of which one seems to be an outlier condition:

1. The premium should not be more than the monthly income.
2. A male person claiming any female-related disease like breast cancer, and vice-versa (this appears to be an outlier to me).

Can you help with a few more, please?
 


Richa Rao
Alumni
Unique ID, Policy Type, Age, Sex, Job, Salary, Premium, Weight, Height, BMI, Date of Injury, Reason of Injury
 

Samridhi Dutta

Well-Known Member
Alumni
Trainer
Unique ID, Policy Type, Age, Sex, Job, Salary, Premium, Weight, Height, BMI, Date of Injury, Reason of Injury

Following are the hypotheses:
Hypothesis 1: Premium and Policy Type are not related
Hypothesis 2: Premium and Age are not related
Hypothesis 3: Premium and Sex are not related
Hypothesis 4: Premium and Job are not related
... and the same way for all the independent variables.

A hypothesis is a supposition made on the basis of limited evidence that needs to be validated through a scientific experiment. We can validate the above hypotheses using correlations / ANOVA / t-tests / linear models.

The hypotheses you have given are not correct; those are only facts/observations in the given sample data. Hypothesis testing is a part of inferential statistics: we conduct the scientific experiment (i.e. apply one of the above tests) on the sample and make inferences about the population based on the probability value. (Reminder: if the probability score is less than 5%, we reject the null hypothesis and conclude that the two distributions are associated.)
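A sketch of how two of these hypotheses could be tested, on simulated data (the claims data frame and its effect sizes below are made up purely for illustration; replace them with the real data set):

```r
set.seed(1)
claims <- data.frame(
  Age        = sample(18:70, 200, replace = TRUE),
  PolicyType = factor(sample(c("Basic", "Gold"), 200, replace = TRUE))
)
# Simulate a premium that actually depends on age
claims$Premium <- 100 + 2 * claims$Age + rnorm(200, sd = 10)

# Hypothesis 2 (Premium and Age are not related): correlation test
cor.test(claims$Premium, claims$Age)$p.value   # < 0.05 here, so reject the null

# Hypothesis 1 (Premium and Policy Type are not related): one-way ANOVA
summary(aov(Premium ~ PolicyType, data = claims))
```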

Let me know if you have more clarifications on this.

Regards
Samridhi
 