4.2.1 K-Fold Cross-Validation
Training and Testing
## This sets the cross-validation method with k=5 folds
method <- trainControl(method = "cv", number = 5)
## Fit a regression model and use k-fold CV to evaluate performance
crossmodelfull <- train(as.factor(htn) ~ age + bmi + ecghr,
                        data = crossdata,
                        method = "glm",
                        trControl = method)
crossmodel1 <- train(as.factor(htn) ~ age + bmi,
                     data = crossdata,
                     method = "glm",
                     trControl = method)
crossmodel2 <- train(as.factor(htn) ~ age + ecghr,
                     data = crossdata,
                     method = "glm",
                     trControl = method)
crossmodel3 <- train(as.factor(htn) ~ bmi + ecghr,
                     data = crossdata,
                     method = "glm",
                     trControl = method)
print(crossmodelfull)
Generalized Linear Model
2403 samples
3 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1923, 1923, 1922, 1922, 1922
Resampling results:
Accuracy Kappa
0.7045244 0.4066879
print(crossmodel1)
Generalized Linear Model
2403 samples
2 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1923, 1923, 1922, 1922, 1922
Resampling results:
Accuracy Kappa
0.705369 0.4088601
print(crossmodel2)
Generalized Linear Model
2403 samples
2 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1922, 1922, 1922, 1923, 1923
Resampling results:
Accuracy Kappa
0.679987 0.3581111
print(crossmodel3)
Generalized Linear Model
2403 samples
2 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1922, 1923, 1923, 1922, 1922
Resampling results:
Accuracy Kappa
0.5705423 0.137524
Results
Interpretation of results:
No pre-processing: we did not scale or transform the data before fitting the models.
The resampling method used to evaluate each model was cross-validation with 5 folds.
The sample size for each training set was approximately 1923 (four fifths of the 2403 observations).
Accuracy: the proportion of observations the model classifies correctly. The higher the Accuracy, the more closely a model's predictions match the actual observations.
Comparing the Accuracy of the 4 models, our FULL model and Model 1 (age + bmi) are essentially tied at about 0.705, ahead of the other two, so we proceed with the FULL model.
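The per-fold mechanics that caret performs behind the scenes can be sketched in base R. This is an illustrative sketch on hypothetical toy data (`toy`, `folds`, and `acc` are made-up names, not objects from the analysis above), and caret's actual implementation differs in detail:

```r
# Manual 5-fold cross-validation sketch on toy data
set.seed(42)
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(toy$x))

k <- 5
folds <- sample(rep(1:k, length.out = n))  # assign each row to one of k folds
acc <- numeric(k)
for (i in 1:k) {
  # fit on the k-1 training folds, predict on the held-out fold
  fit <- glm(y ~ x, data = toy[folds != i, ], family = binomial)
  p <- predict(fit, newdata = toy[folds == i, ], type = "response")
  acc[i] <- mean((p > 0.5) == toy$y[folds == i])  # accuracy on held-out fold
}
mean(acc)  # cross-validated accuracy, analogous to caret's Accuracy column
```

Averaging the k held-out accuracies is exactly how the single Accuracy figure in the output above is produced.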
crossmodelfull$resample
Accuracy Kappa Resample
1 0.6770833 0.3493091 Fold1
2 0.7041667 0.4048201 Fold2
3 0.7276507 0.4550777 Fold3
4 0.7110187 0.4212294 Fold4
5 0.7027027 0.4030031 Fold5
Above are the Accuracy and Kappa values for each of the five folds from our FULL model.
crossmodelfull$finalModel
Call: NULL
Coefficients:
(Intercept) age bmi ecghr
-7.455689 0.089976 0.064819 0.009751
Degrees of Freedom: 2402 Total (i.e. Null); 2399 Residual
Null Deviance: 3327
Residual Deviance: 2781 AIC: 2789
Our final model, on the log-odds scale, is:
logit(htn) = -7.4557 + 0.0899(age) + 0.0648(bmi) + 0.0098(ecghr)
For each 1-year increase in Age, the odds of a person having hypertension increase by approximately 9% (e^0.0899 ≈ 1.094).
For each 1 kg/m² increase in BMI (Body Mass Index), the odds of a person having hypertension increase by approximately 7% (e^0.0648 ≈ 1.067).
For each 1 bpm increase in Ecghr (ECG heart rate), the odds of a person having hypertension increase by approximately 1% (e^0.0098 ≈ 1.010).
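Because these are log-odds coefficients, the percentage changes come from exponentiating them. A quick check in base R, with the coefficients copied from the output above:

```r
# Convert log-odds coefficients to odds ratios and percent changes
coefs <- c(age = 0.089976, bmi = 0.064819, ecghr = 0.009751)
odds_ratios <- exp(coefs)
pct_change <- round(100 * (odds_ratios - 1), 1)
pct_change  # age 9.4, bmi 6.7, ecghr 1.0 (percent increase in odds per unit)
```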
4.2.2 Holdout Cross-Validation
Training and Testing
crossdata <- read_dta("C:/Users/buste/OneDrive/Desktop/Modeling/analysis1.dta") %>%
  dplyr::select(subjid, hba1c, age, totchol, htn, bmi, fpg, ecghr, nbmedhhincome) %>%
  na.omit()
## This sets the cross-validation method to holdout with an 80/20 split
random_sample <- createDataPartition(crossdata$htn,
                                     p = 0.8, list = FALSE)
# generating training dataset from the random_sample
train <- crossdata[random_sample, ]
# generating testing dataset from rows which are not included in the random_sample
test <- crossdata[-random_sample, ]
Confirm the splitting is correct.
dim(train)
[1] 1923 9
dim(test)
[1] 480 9
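createDataPartition draws a stratified random sample, which is why the split lands on 1923/480 rather than an exact 80/20 cut. The same idea can be sketched with base R's sample() (unstratified, purely for illustration; `idx` is an illustrative name):

```r
# Unstratified 80/20 holdout split sketch in base R
set.seed(1)
n <- 2403                                # rows in crossdata
idx <- sample(n, size = floor(0.8 * n))  # 80% of row indices for training
n_train <- length(idx)                   # 1922 training rows
n_test  <- n - n_train                   # 481 testing rows
```

Stratified sampling (what caret does) additionally preserves the proportion of htn = 0 and htn = 1 in both subsets, which matters for imbalanced outcomes.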
Now let's fit our model on the training data and examine the summary.
holdoutmodel1 <- glm(as.factor(htn) ~ age + bmi + ecghr, data = train, family = binomial)
summary(holdoutmodel1)
Call:
glm(formula = as.factor(htn) ~ age + bmi + ecghr, family = binomial,
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5310 -0.9409 0.4429 0.9356 2.1257
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.559911 0.526347 -14.363 < 2e-16 ***
age 0.092267 0.005200 17.745 < 2e-16 ***
bmi 0.062651 0.007801 8.031 9.66e-16 ***
ecghr 0.010743 0.005329 2.016 0.0438 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2663.2 on 1922 degrees of freedom
Residual deviance: 2212.6 on 1919 degrees of freedom
AIC: 2220.6
Number of Fisher Scoring iterations: 4
Results
From the summary of the model, ecghr is only marginally significant (p ≈ 0.044) and contributes little to the predictions, so we remove it to simplify the model.
holdoutmodel2 <- glm(as.factor(htn) ~ age + bmi, data = train, family = binomial)
Now let’s try to analyze our model by making some predictions.
train$residuals <- residuals(holdoutmodel2)
# note: predict() on a glm returns the linear predictor (log-odds) by default
train$predicted <- predict(holdoutmodel2, train)
rmse_train <- sqrt(mean(train$residuals ** 2))
# confusion matrix
# misclassification error - training data
pre1 <- ifelse(train$predicted > 0.5, 1, 0)
pretable <- table(Prediction = pre1,
                  Actual = train$htn)
pretable
Actual
Prediction 0 1
0 748 442
1 178 555
1 - sum(diag(pretable)) / sum(pretable)
[1] 0.3224129
test$predicted <- predict(holdoutmodel2, test)
test$residuals <- test$htn - test$predicted
rmse_test <- sqrt(mean(test$residuals ** 2))
# confusion matrix
# misclassification error - testing data
post1 <- ifelse(test$predicted > 0.5, 1, 0)
posttable <- table(Prediction = post1,
                   Actual = test$htn)
posttable
Actual
Prediction 0 1
0 174 104
1 53 149
1 - sum(diag(posttable)) / sum(posttable)
[1] 0.3270833
We threshold our predictions and then create a confusion matrix to compare the number of true/false positives and negatives for both the training data and the testing data. Our misclassification error rate is approximately 32% in both sets, meaning that our model correctly predicted the outcome for roughly 68% of the subjects.
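The misclassification rates are just one minus the sum of the confusion matrix diagonal over the table total. Recomputing them from the printed tables above:

```r
# Training confusion matrix from above (rows = Prediction, cols = Actual)
pretable <- matrix(c(748, 178, 442, 555), nrow = 2,
                   dimnames = list(Prediction = c(0, 1), Actual = c(0, 1)))
err_train <- 1 - sum(diag(pretable)) / sum(pretable)  # 0.3224129

# Testing confusion matrix from above
posttable <- matrix(c(174, 53, 104, 149), nrow = 2,
                    dimnames = list(Prediction = c(0, 1), Actual = c(0, 1)))
err_test <- 1 - sum(diag(posttable)) / sum(posttable)  # 0.3270833
```

The similar error rates on training (32.2%) and testing (32.7%) data suggest the model is not badly overfit.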
4.2.3 Leave-One-Out Cross-Validation
Training and Testing
crossdata <- read_dta("C:/Users/buste/OneDrive/Desktop/Modeling/analysis1.dta") %>%
  select(ecghr, age, bmi, htn) %>%
  na.omit()
## Sets method of cross-validation to use leave-one-out
method <- trainControl(method = "LOOCV")
## Example models created to demonstrate leave-one-out
crossmodelfull <- train(as.factor(htn) ~ age + bmi + ecghr,
                        data = crossdata,
                        method = "glm",
                        trControl = method)
crossmodel1 <- train(as.factor(htn) ~ age + bmi,
                     data = crossdata,
                     method = "glm",
                     trControl = method)
crossmodel2 <- train(as.factor(htn) ~ age + ecghr,
                     data = crossdata,
                     method = "glm",
                     trControl = method)
crossmodel3 <- train(as.factor(htn) ~ bmi + ecghr,
                     data = crossdata,
                     method = "glm",
                     trControl = method)
print(crossmodelfull)
Generalized Linear Model
2646 samples
3 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ...
Resampling results:
Accuracy Kappa
0.7093726 0.4132869
print(crossmodel1)
Generalized Linear Model
2646 samples
2 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ...
Resampling results:
Accuracy Kappa
0.7071051 0.4090227
print(crossmodel2)
Generalized Linear Model
2646 samples
2 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ...
Resampling results:
Accuracy Kappa
0.685941 0.366857
print(crossmodel3)
Generalized Linear Model
2646 samples
2 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ...
Resampling results:
Accuracy Kappa
0.5684051 0.1202561
Results
Interpretation of results:
No pre-processing: we did not scale or transform the data before fitting the models.
The resampling method used to evaluate each model was leave-one-out cross-validation.
The sample size for each training set was 2645 (all observations except the single one held out).
Accuracy: the proportion of observations the model classifies correctly. The higher the Accuracy, the more closely a model's predictions match the actual observations.
Comparing the Accuracy of the 4 models, our FULL model produces the highest Accuracy and is therefore the best model to use.
crossmodelfull$finalModel
Call: NULL
Coefficients:
(Intercept) age bmi ecghr
-7.33430 0.08857 0.06191 0.01125
Degrees of Freedom: 2645 Total (i.e. Null); 2642 Residual
Null Deviance: 3655
Residual Deviance: 3072 AIC: 3080
Our final model, on the log-odds scale, is:
logit(htn) = -7.33430 + 0.08857(age) + 0.06191(bmi) + 0.01125(ecghr)
For each 1-year increase in Age, the odds of a person having hypertension increase by approximately 9% (e^0.08857 ≈ 1.093).
For each 1 kg/m² increase in BMI (Body Mass Index), the odds of a person having hypertension increase by approximately 6% (e^0.06191 ≈ 1.064).
For each 1 bpm increase in Ecghr (ECG heart rate), the odds of a person having hypertension increase by approximately 1% (e^0.01125 ≈ 1.011).
It can be noted that k-fold cross-validation and leave-one-out cross-validation are very similar when it comes to R code.
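That similarity is no accident: LOOCV is simply k-fold cross-validation with k equal to the number of observations, so each "fold" is a single row. A base-R sketch on hypothetical toy data (illustrative names only) makes the correspondence explicit:

```r
# Leave-one-out cross-validation sketch on toy data
set.seed(7)
n <- 60
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(toy$x))

hits <- logical(n)
for (i in 1:n) {                 # one "fold" per observation
  fit <- glm(y ~ x, data = toy[-i, ], family = binomial)  # fit on n-1 rows
  p <- predict(fit, newdata = toy[i, , drop = FALSE], type = "response")
  hits[i] <- (p > 0.5) == toy$y[i]   # was the held-out row classified right?
}
mean(hits)  # LOOCV accuracy
```

The only change from the k-fold sketch is that the loop runs n times instead of k, which is also why LOOCV is far more expensive on large datasets.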