<- read_dta("C:/Users/mtara/Desktop/Graduate School/STA6257/analysis1.dta") %>%
crossdata ::select(subjid,hba1c,age,totchol,htn,bmi,fpg,ecghr,nbmedhhincome) %>%
dplyrna.omit()
Methods
In this section, we will look at how we can compare different cross-validation algorithms and choose the best one.
4.1 Data
We are working with data from the Jackson Heart Study. The goal of this study was to investigate the causes of cardiovascular, renal, and respiratory diseases in African Americans. Data was collected from approximately 5300 participants living in Jackson, Mississippi. (Jackson Heart Study)
Individuals in this study provided a thorough medical and social history and participated in a range of physical and biochemical tests and diagnostic procedures during a baseline examination that occurred during calendar years 2000-2004 and two follow-up examinations in 2005-2008 and 2009-2012 (Jackson Heart Study).
For this analysis, we chose 3 data elements from the dataset as our independent variables - age, bmi and ecghr. The dependent variable that we want to be able to predict is htn. THe variable definitions are as follows:
- Age: Age of subject at visit 1. Measured in years.
- BMI: Body Mass Index of subject at visit 1. Measured in kg/m.
- ecghr: Heart Rate of subject at visit 1. Measured in bpm.
- htn: Hypertension Status of subject at visit 1. A value of 1 indicates the presence of hypertension. A value of 0 indicates no presence of hypertension.
A closer look at the data revealed some missing and NA values. These were addressed by including the na.omit in our select statement.
<- ggplot(crossdata, aes(bmi)) +
plot1 geom_histogram(aes(fill = as.factor(htn)), color = "black", binwidth = 2)
<- ggplot(crossdata, aes(age)) +
plot2 geom_histogram(aes(fill = as.factor(htn)), color = "black", binwidth = 2)
<- ggplot(crossdata, aes(ecghr)) +
plot3 geom_histogram(aes(fill = as.factor(htn)), color = "black", binwidth = 2)
+ theme(legend.position="bottom") plot1
+ theme(legend.position="bottom") plot2
+ theme(legend.position="bottom") plot3
#grid.arrange(grobs= list(plot1, plot2, plot3),
# ncol=2, nrow=2,
# top = ("Histograms"))
The histograms show that the distribution of bmi (Body Mass Index), age and ecghr (Heart Rate) based on htn (Hypertension Status) is symmetric. This indicates that the data we will use in our models are evenly distributed around the central value. In addition, our plots suggest only age is associated with increased prevalence of hypertension.
4.2 Statistical-Methods
We compare three different cross-validation methods below and choose the best one.
4.2.1 K-Fold Cross-Validation
Training and Testing
## This sets the cross-validation method with k=5 folds
<- trainControl(method = "cv", number = 5)
method
## #fit a regression model and use k-fold CV to evaluate performance
<- train(as.factor(htn) ~ age + bmi + ecghr,
crossmodelfull data = crossdata,
method = "glm",
trControl = method)
<- train(as.factor(htn) ~ age + bmi,
crossmodel1 data = crossdata,
method = "glm",
trControl = method)
<- train(as.factor(htn) ~ age + ecghr,
crossmodel2 data = crossdata,
method = "glm",
trControl = method)
<- train(as.factor(htn) ~ bmi + ecghr,
crossmodel3 data = crossdata,
method = "glm",
trControl = method)
print(crossmodelfull)
Generalized Linear Model
2403 samples
3 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1923, 1923, 1922, 1922, 1922
Resampling results:
Accuracy Kappa
0.7045244 0.4066879
print(crossmodel1)
Generalized Linear Model
2403 samples
2 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1923, 1923, 1922, 1922, 1922
Resampling results:
Accuracy Kappa
0.705369 0.4088601
print(crossmodel2)
Generalized Linear Model
2403 samples
2 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1922, 1922, 1922, 1923, 1923
Resampling results:
Accuracy Kappa
0.679987 0.3581111
print(crossmodel3)
Generalized Linear Model
2403 samples
2 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 1922, 1923, 1923, 1922, 1922
Resampling results:
Accuracy Kappa
0.5705423 0.137524
Results
Interpretation of results:
No pre-processing. We did not scale the data before fitting the models.
The re-sampling method we used to evaluate the model was cross-validation with 5 folds.
The sample size for each training set was approximately 1923.
Accuracy: This is a measure of the correlation between the predictions made by the model and the actual observations. The higher the Accuracy, the more closely a model can predict the actual observations.
By comparing the Accuracy metric of the 4 models, we can see the our FULL model produces the highest Accuracy rate and is therefore the best model to use.
$resample crossmodelfull
Accuracy Kappa Resample
1 0.6770833 0.3493091 Fold1
2 0.7041667 0.4048201 Fold2
3 0.7276507 0.4550777 Fold3
4 0.7110187 0.4212294 Fold4
5 0.7027027 0.4030031 Fold5
Above are the predictions for each fold from our FULL model.
$finalModel crossmodelfull
Call: NULL
Coefficients:
(Intercept) age bmi ecghr
-7.455689 0.089976 0.064819 0.009751
Degrees of Freedom: 2402 Total (i.e. Null); 2399 Residual
Null Deviance: 3327
Residual Deviance: 2781 AIC: 2789
Our final model is:
ht̂n = -7.4557 + 0.0899{age} + 0.0648{bmi} + 0.0098{ecghr}
For each 1-year increase in Age, the likelihood of a person having hypertension increases by approximately 9%.
For each 1 kg/m increase in BMI (Body Mass Index), the likelihood of a person having hypertension increases by approximately 6%.
For each 1 bpm increase in Ecghr (Heart Rate), the likelihood of a person having hypertension increases by slightly less than 1%.
4.2.2 Holdout Cross-Validation
Training and Testing
<- read_dta("C:/Users/buste/OneDrive/Desktop/Modeling/analysis1.dta") %>%
crossdata ::select(subjid,hba1c,age,totchol,htn,bmi,fpg,ecghr,nbmedhhincome) %>%
dplyrna.omit()
## This sets the cross-validation method to holdout with an 80/20 split
<- createDataPartition(crossdata $ htn,
random_sample p = 0.8, list = FALSE)
# generating training dataset from the random_sample
<- crossdata[random_sample, ]
train
# generating testing dataset from rows which are not included in the random_sample
<- crossdata[-random_sample, ] test
Confirm the splitting is correct.
dim(train)
[1] 1923 9
dim(test)
[1] 480 9
Now let’s try to analyze our model by making some predictions
<- glm(as.factor(htn) ~ age + bmi + ecghr, data = train, family = binomial)
holdoutmodel1
summary(holdoutmodel1)
Call:
glm(formula = as.factor(htn) ~ age + bmi + ecghr, family = binomial,
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5310 -0.9409 0.4429 0.9356 2.1257
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.559911 0.526347 -14.363 < 2e-16 ***
age 0.092267 0.005200 17.745 < 2e-16 ***
bmi 0.062651 0.007801 8.031 9.66e-16 ***
ecghr 0.010743 0.005329 2.016 0.0438 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2663.2 on 1922 degrees of freedom
Residual deviance: 2212.6 on 1919 degrees of freedom
AIC: 2220.6
Number of Fisher Scoring iterations: 4
Results
From the summary of the model it is evident that ecghr has no significant role in predictions, so we can remove it.
<- glm(as.factor(htn) ~ age + bmi, data = train, family = binomial) holdoutmodel2
Now let’s try to analyze our model by making some predictions.
$residuals <- residuals(holdoutmodel2)
train$predicted <- predict(holdoutmodel2,train)
train<- sqrt(mean(train$residuals ** 2))
rmse_train
# confusion Matrix
# $Misclassification error -Training data
<-ifelse(train$predicted > 0.5, 1, 0)
pre1<-table(Prediction = pre1,
pretableActual = train$htn)
pretable
Actual
Prediction 0 1
0 748 442
1 178 555
1 - sum(diag(pretable)) / sum(pretable)
[1] 0.3224129
$predicted <- predict(holdoutmodel2,test)
test$residuals <- test$htn - test$predicted
test<- sqrt(mean(test$residuals ** 2))
rmse_test
# confusion Matrix
# $Misclassification error -Testing data
<-ifelse(test$predicted > 0.5, 1, 0)
post1<-table(Prediction = post1,
posttableActual = test$htn)
posttable
Actual
Prediction 0 1
0 174 104
1 53 149
1 - sum(diag(posttable)) / sum(posttable)
[1] 0.3270833
We round up our results and then create a confusion matrix to compare the number of true/false positives and negatives for both the training data and the testing data. Our misclassification error rate is approximately 30%. This means that our model correctly predicted the outcome for 70% of the subjects.
4.4.3 Leave-One-Out Cross-Validation
Training and Testing
<- read_dta("C:/Users/buste/OneDrive/Desktop/Modeling/analysis1.dta") %>%
crossdata select(ecghr,age,bmi,htn) %>%
na.omit()
## Sets method of cross-validation to use leave-one-out
<- trainControl(method = "LOOCV")
method
## Example model created to demonstrate leave-one-out
<- train(as.factor(htn) ~ age + bmi + ecghr,
crossmodelfull data = crossdata,
method = "glm",
trControl = method)
<- train(as.factor(htn) ~ age + bmi,
crossmodel1 data = crossdata,
method = "glm",
trControl = method)
<- train(as.factor(htn) ~ age + ecghr,
crossmodel2 data = crossdata,
method = "glm",
trControl = method)
<- train(as.factor(htn) ~ bmi + ecghr,
crossmodel3 data = crossdata,
method = "glm",
trControl = method)
print(crossmodelfull)
Generalized Linear Model
2646 samples
3 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ...
Resampling results:
Accuracy Kappa
0.7093726 0.4132869
print(crossmodel1)
Generalized Linear Model
2646 samples
2 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ...
Resampling results:
Accuracy Kappa
0.7071051 0.4090227
print(crossmodel2)
Generalized Linear Model
2646 samples
2 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ...
Resampling results:
Accuracy Kappa
0.685941 0.366857
print(crossmodel3)
Generalized Linear Model
2646 samples
2 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ...
Resampling results:
Accuracy Kappa
0.5684051 0.1202561
Results
Interpretation of results:
No pre-processing. We did not scale the data before fitting the models.
The re-sampling method we used to evaluate the model was leave-one-out cross-validation.
The sample size for each training set was approximately 2645.
Accuracy: This is a measure of the correlation between the predictions made by the model and the actual observations. The higher the Accuracy, the more closely a model can predict the actual observations.
By comparing the Accuracy metric of the 4 models, we can see the our FULL model produces the highest Accuracy rate and is therefore the best model to use.
$finalModel crossmodelfull
Call: NULL
Coefficients:
(Intercept) age bmi ecghr
-7.33430 0.08857 0.06191 0.01125
Degrees of Freedom: 2645 Total (i.e. Null); 2642 Residual
Null Deviance: 3655
Residual Deviance: 3072 AIC: 3080
Our final model is: ht̂n = -7.33430 + 0.08857{age} + 0.06191{bmi} + 0.01125{ecghr}
For each 1-year increase in Age, the likelihood of a person having hypertension increases by approximately 8.8%.
For each 1 kg/m increase in BMI (Body Mass Index), the likelihood of a person having hypertension increases by approximately 6%.
For each 1 bpm increase in Ecghr (Heart Rate), the likelihood of a person having hypertension increases by approximatly 1%.
It can be noted that k-fold cross-validation and leave-one-out cross-validation are very similar when it comes to R code.