Methods

In this section, we will look at how we can compare different cross-validation algorithms and choose the best one.

4.1 Data

We are working with data from the Jackson Heart Study. The goal of this study was to investigate the causes of cardiovascular, renal, and respiratory diseases in African Americans. Data was collected from approximately 5300 participants living in Jackson, Mississippi. (Jackson Heart Study)

Individuals in this study provided a thorough medical and social history and participated in a range of physical and biochemical tests and diagnostic procedures during a baseline examination that occurred during calendar years 2000-2004 and two follow-up examinations in 2005-2008 and 2009-2012 (Jackson Heart Study).

crossdata <- read_dta("C:/Users/mtara/Desktop/Graduate School/STA6257/analysis1.dta") %>%
  dplyr::select(subjid,hba1c,age,totchol,htn,bmi,fpg,ecghr,nbmedhhincome) %>%
  na.omit()

For this analysis, we chose 3 data elements from the dataset as our independent variables - age, bmi and ecghr. The dependent variable that we want to be able to predict is htn. THe variable definitions are as follows:

Age: Age of subject at visit 1. Measured in years.
BMI: Body Mass Index of subject at visit 1. Measured in kg/m.
ecghr: Heart Rate of subject at visit 1. Measured in bpm.
htn: Hypertension Status of subject at visit 1. A value of 1 indicates the presence of hypertension. A value of 0 indicates no presence of hypertension.

A closer look at the data revealed some missing and NA values. These were addressed by including the na.omit in our select statement.

plot1 <- ggplot(crossdata, aes(bmi)) +
  geom_histogram(aes(fill = as.factor(htn)), color = "black", binwidth = 2)

plot2 <- ggplot(crossdata, aes(age)) +
  geom_histogram(aes(fill = as.factor(htn)), color = "black", binwidth = 2)

plot3 <- ggplot(crossdata, aes(ecghr)) +
  geom_histogram(aes(fill = as.factor(htn)), color = "black", binwidth = 2)

plot1 + theme(legend.position="bottom")

plot2 + theme(legend.position="bottom")

plot3 + theme(legend.position="bottom")

#grid.arrange(grobs= list(plot1, plot2, plot3),
#             ncol=2, nrow=2,
#             top = ("Histograms"))

The histograms show that the distribution of bmi (Body Mass Index), age and ecghr (Heart Rate) based on htn (Hypertension Status) is symmetric. This indicates that the data we will use in our models are evenly distributed around the central value. In addition, our plots suggest only age is associated with increased prevalence of hypertension.

4.2 Statistical-Methods

We compare three different cross-validation methods below and choose the best one.

4.2.1 K-Fold Cross-Validation

Training and Testing

## This sets the cross-validation method with k=5 folds
method <- trainControl(method = "cv", number = 5)

## #fit a regression model and use k-fold CV to evaluate performance
crossmodelfull <- train(as.factor(htn) ~ age + bmi + ecghr,
                        data = crossdata,
                        method = "glm",
                        trControl = method)

crossmodel1 <- train(as.factor(htn) ~ age + bmi,
                        data = crossdata,
                        method = "glm",
                        trControl = method)

crossmodel2 <- train(as.factor(htn) ~ age + ecghr,
                        data = crossdata,
                        method = "glm",
                        trControl = method)

crossmodel3 <- train(as.factor(htn) ~ bmi + ecghr,
                        data = crossdata,
                        method = "glm",
                        trControl = method)

print(crossmodelfull)

Generalized Linear Model 

2403 samples
   3 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1923, 1923, 1922, 1922, 1922 
Resampling results:

  Accuracy   Kappa    
  0.7045244  0.4066879

print(crossmodel1)

Generalized Linear Model 

2403 samples
   2 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1923, 1923, 1922, 1922, 1922 
Resampling results:

  Accuracy  Kappa    
  0.705369  0.4088601

print(crossmodel2)

Generalized Linear Model 

2403 samples
   2 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1922, 1922, 1922, 1923, 1923 
Resampling results:

  Accuracy  Kappa    
  0.679987  0.3581111

print(crossmodel3)

Generalized Linear Model 

2403 samples
   2 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1922, 1923, 1923, 1922, 1922 
Resampling results:

  Accuracy   Kappa   
  0.5705423  0.137524

Results

Interpretation of results:

No pre-processing. We did not scale the data before fitting the models.
The re-sampling method we used to evaluate the model was cross-validation with 5 folds.
The sample size for each training set was approximately 1923.
Accuracy: This is a measure of the correlation between the predictions made by the model and the actual observations. The higher the Accuracy, the more closely a model can predict the actual observations.

By comparing the Accuracy metric of the 4 models, we can see the our FULL model produces the highest Accuracy rate and is therefore the best model to use.

crossmodelfull$resample

   Accuracy     Kappa Resample
1 0.6770833 0.3493091    Fold1
2 0.7041667 0.4048201    Fold2
3 0.7276507 0.4550777    Fold3
4 0.7110187 0.4212294    Fold4
5 0.7027027 0.4030031    Fold5

Above are the predictions for each fold from our FULL model.

crossmodelfull$finalModel


Call:  NULL

Coefficients:
(Intercept)          age          bmi        ecghr  
  -7.455689     0.089976     0.064819     0.009751  

Degrees of Freedom: 2402 Total (i.e. Null);  2399 Residual
Null Deviance:      3327 
Residual Deviance: 2781     AIC: 2789

Our final model is:

ht̂n = -7.4557 + 0.0899{age} + 0.0648{bmi} + 0.0098{ecghr}

For each 1-year increase in Age, the likelihood of a person having hypertension increases by approximately 9%.

For each 1 kg/m increase in BMI (Body Mass Index), the likelihood of a person having hypertension increases by approximately 6%.

For each 1 bpm increase in Ecghr (Heart Rate), the likelihood of a person having hypertension increases by slightly less than 1%.

4.2.2 Holdout Cross-Validation

Training and Testing

crossdata <- read_dta("C:/Users/buste/OneDrive/Desktop/Modeling/analysis1.dta")  %>%
  dplyr::select(subjid,hba1c,age,totchol,htn,bmi,fpg,ecghr,nbmedhhincome) %>%
  na.omit()

## This sets the cross-validation method to holdout with an 80/20 split
random_sample <- createDataPartition(crossdata $ htn,
                                p = 0.8, list = FALSE)

# generating training dataset from the random_sample
train  <- crossdata[random_sample, ]
 
# generating testing dataset from rows which are not included in the random_sample
test <- crossdata[-random_sample, ]

Confirm the splitting is correct.

dim(train)

[1] 1923    9

dim(test)

[1] 480   9

Now let’s try to analyze our model by making some predictions

holdoutmodel1 <- glm(as.factor(htn) ~ age + bmi + ecghr, data = train, family = binomial)

summary(holdoutmodel1)


Call:
glm(formula = as.factor(htn) ~ age + bmi + ecghr, family = binomial, 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5310  -0.9409   0.4429   0.9356   2.1257  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.559911   0.526347 -14.363  < 2e-16 ***
age          0.092267   0.005200  17.745  < 2e-16 ***
bmi          0.062651   0.007801   8.031 9.66e-16 ***
ecghr        0.010743   0.005329   2.016   0.0438 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2663.2  on 1922  degrees of freedom
Residual deviance: 2212.6  on 1919  degrees of freedom
AIC: 2220.6

Number of Fisher Scoring iterations: 4

Results

From the summary of the model it is evident that ecghr has no significant role in predictions, so we can remove it.

holdoutmodel2 <- glm(as.factor(htn) ~ age + bmi, data = train, family = binomial)

Now let’s try to analyze our model by making some predictions.

train$residuals <- residuals(holdoutmodel2)
train$predicted <- predict(holdoutmodel2,train)
rmse_train <- sqrt(mean(train$residuals ** 2))

# confusion Matrix
# $Misclassification error -Training data
pre1<-ifelse(train$predicted > 0.5, 1, 0)
pretable<-table(Prediction = pre1,
            Actual = train$htn)
pretable

          Actual
Prediction   0   1
         0 748 442
         1 178 555

1 - sum(diag(pretable)) / sum(pretable)

[1] 0.3224129

test$predicted <- predict(holdoutmodel2,test)
test$residuals <- test$htn - test$predicted
rmse_test <- sqrt(mean(test$residuals ** 2))

# confusion Matrix
# $Misclassification error -Testing data
post1<-ifelse(test$predicted > 0.5, 1, 0)
posttable<-table(Prediction = post1,
            Actual = test$htn)
posttable

          Actual
Prediction   0   1
         0 174 104
         1  53 149

1 - sum(diag(posttable)) / sum(posttable)

[1] 0.3270833

We round up our results and then create a confusion matrix to compare the number of true/false positives and negatives for both the training data and the testing data. Our misclassification error rate is approximately 30%. This means that our model correctly predicted the outcome for 70% of the subjects.

4.4.3 Leave-One-Out Cross-Validation

Training and Testing

crossdata <- read_dta("C:/Users/buste/OneDrive/Desktop/Modeling/analysis1.dta")  %>%
  select(ecghr,age,bmi,htn) %>%
    na.omit()
## Sets method of cross-validation to  use leave-one-out
method <- trainControl(method = "LOOCV")


## Example model created to demonstrate leave-one-out
crossmodelfull <- train(as.factor(htn) ~ age + bmi + ecghr,
                        data = crossdata,
                        method = "glm",
                        trControl = method)

crossmodel1 <- train(as.factor(htn) ~ age + bmi,
                     data = crossdata,
                     method = "glm",
                     trControl = method)

crossmodel2 <- train(as.factor(htn) ~ age + ecghr,
                     data = crossdata,
                     method = "glm",
                     trControl = method)

crossmodel3 <- train(as.factor(htn) ~ bmi + ecghr,
                     data = crossdata,
                     method = "glm",
                     trControl = method)

print(crossmodelfull)

Generalized Linear Model 

2646 samples
   3 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ... 
Resampling results:

  Accuracy   Kappa    
  0.7093726  0.4132869

print(crossmodel1)

Generalized Linear Model 

2646 samples
   2 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ... 
Resampling results:

  Accuracy   Kappa    
  0.7071051  0.4090227

print(crossmodel2)

Generalized Linear Model 

2646 samples
   2 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ... 
Resampling results:

  Accuracy  Kappa   
  0.685941  0.366857

print(crossmodel3)

Generalized Linear Model 

2646 samples
   2 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ... 
Resampling results:

  Accuracy   Kappa    
  0.5684051  0.1202561

Results

Interpretation of results:

No pre-processing. We did not scale the data before fitting the models.

The re-sampling method we used to evaluate the model was leave-one-out cross-validation.

The sample size for each training set was approximately 2645.

Accuracy: This is a measure of the correlation between the predictions made by the model and the actual observations. The higher the Accuracy, the more closely a model can predict the actual observations.

By comparing the Accuracy metric of the 4 models, we can see the our FULL model produces the highest Accuracy rate and is therefore the best model to use.

crossmodelfull$finalModel


Call:  NULL

Coefficients:
(Intercept)          age          bmi        ecghr  
   -7.33430      0.08857      0.06191      0.01125  

Degrees of Freedom: 2645 Total (i.e. Null);  2642 Residual
Null Deviance:      3655 
Residual Deviance: 3072     AIC: 3080

Our final model is: ht̂n = -7.33430 + 0.08857{age} + 0.06191{bmi} + 0.01125{ecghr}

For each 1-year increase in Age, the likelihood of a person having hypertension increases by approximately 8.8%.

For each 1 kg/m increase in BMI (Body Mass Index), the likelihood of a person having hypertension increases by approximately 6%.

For each 1 bpm increase in Ecghr (Heart Rate), the likelihood of a person having hypertension increases by approximatly 1%.

It can be noted that k-fold cross-validation and leave-one-out cross-validation are very similar when it comes to R code.