Overfitting | STA6257 Cross Validation

4.2.2 Holdout Cross-Validation

Training and Testing

crossdata <- read_dta("C:/Users/buste/OneDrive/Desktop/Modeling/analysis1.dta")  %>%
  dplyr::select(subjid,hba1c,age,totchol,htn,bmi,fpg,ecghr,nbmedhhincome) %>%
  na.omit()

## This sets the cross-validation method to holdout with an 80/20 split
random_sample <- createDataPartition(crossdata $ htn,
                                p = 0.8, list = FALSE)

# generating training dataset from the random_sample
train  <- crossdata[random_sample, ]
 
# generating testing dataset from rows which are not included in the random_sample
test <- crossdata[-random_sample, ]

Confirm the splitting is correct.

dim(train)

[1] 1923    9

dim(test)

[1] 480   9

Now let’s try to analyze our model by making some predictions

holdoutmodel1 <- glm(as.factor(htn) ~ age + bmi + ecghr, data = train, family = binomial)

summary(holdoutmodel1)


Call:
glm(formula = as.factor(htn) ~ age + bmi + ecghr, family = binomial, 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5310  -0.9409   0.4429   0.9356   2.1257  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.559911   0.526347 -14.363  < 2e-16 ***
age          0.092267   0.005200  17.745  < 2e-16 ***
bmi          0.062651   0.007801   8.031 9.66e-16 ***
ecghr        0.010743   0.005329   2.016   0.0438 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2663.2  on 1922  degrees of freedom
Residual deviance: 2212.6  on 1919  degrees of freedom
AIC: 2220.6

Number of Fisher Scoring iterations: 4

Results

From the summary of the model it is evident that ecghr has no significant role in predictions, so we can remove it.

holdoutmodel2 <- glm(as.factor(htn) ~ age + bmi, data = train, family = binomial)

Now let’s try to analyze our model by making some predictions.

train$residuals <- residuals(holdoutmodel2)
train$predicted <- predict(holdoutmodel2,train)
rmse_train <- sqrt(mean(train$residuals ** 2))

# confusion Matrix
# $Misclassification error -Training data
pre1<-ifelse(train$predicted > 0.5, 1, 0)
pretable<-table(Prediction = pre1,
            Actual = train$htn)
pretable

          Actual
Prediction   0   1
         0 748 442
         1 178 555

1 - sum(diag(pretable)) / sum(pretable)

[1] 0.3224129

test$predicted <- predict(holdoutmodel2,test)
test$residuals <- test$htn - test$predicted
rmse_test <- sqrt(mean(test$residuals ** 2))

# confusion Matrix
# $Misclassification error -Testing data
post1<-ifelse(test$predicted > 0.5, 1, 0)
posttable<-table(Prediction = post1,
            Actual = test$htn)
posttable

          Actual
Prediction   0   1
         0 174 104
         1  53 149

1 - sum(diag(posttable)) / sum(posttable)

[1] 0.3270833

We round up our results and then create a confusion matrix to compare the number of true/false positives and negatives for both the training data and the testing data. Our misclassification error rate is approximately 30%. This means that our model correctly predicted the outcome for 70% of the subjects.

4.4.3 Leave-One-Out Cross-Validation

Training and Testing

crossdata <- read_dta("C:/Users/buste/OneDrive/Desktop/Modeling/analysis1.dta")  %>%
  select(ecghr,age,bmi,htn) %>%
    na.omit()
## Sets method of cross-validation to  use leave-one-out
method <- trainControl(method = "LOOCV")


## Example model created to demonstrate leave-one-out
crossmodelfull <- train(as.factor(htn) ~ age + bmi + ecghr,
                        data = crossdata,
                        method = "glm",
                        trControl = method)

crossmodel1 <- train(as.factor(htn) ~ age + bmi,
                     data = crossdata,
                      method = "glm",
                     trControl = method)

crossmodel2 <- train(as.factor(htn) ~ age + ecghr,
                     data = crossdata,
                     method = "glm",
                     trControl = method)

crossmodel3 <- train(as.factor(htn) ~ bmi + ecghr,
                     data = crossdata,
                     method = "glm",
                     trControl = method)

print(crossmodelfull)

Generalized Linear Model 

2646 samples
   3 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ... 
Resampling results:

  Accuracy   Kappa    
  0.7093726  0.4132869

print(crossmodel1)

Generalized Linear Model 

2646 samples
   2 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ... 
Resampling results:

  Accuracy   Kappa    
  0.7071051  0.4090227

print(crossmodel2)

Generalized Linear Model 

2646 samples
   2 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ... 
Resampling results:

  Accuracy  Kappa   
  0.685941  0.366857

print(crossmodel3)

Generalized Linear Model 

2646 samples
   2 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 2645, 2645, 2645, 2645, 2645, 2645, ... 
Resampling results:

  Accuracy   Kappa    
  0.5684051  0.1202561

Results

Interpretation of results:

No pre-processing. We did not scale the data before fitting the models.

The re-sampling method we used to evaluate the model was leave-one-out cross-validation.

The sample size for each training set was approximately 2645.

Accuracy: This is a measure of the correlation between the predictions made by the model and the actual observations. The higher the Accuracy, the more closely a model can predict the actual observations.

By comparing the Accuracy metric of the 4 models, we can see the our FULL model produces the highest Accuracy rate and is therefore the best model to use.

crossmodelfull$finalModel


Call:  NULL

Coefficients:
(Intercept)          age          bmi        ecghr  
   -7.33430      0.08857      0.06191      0.01125  

Degrees of Freedom: 2645 Total (i.e. Null);  2642 Residual
Null Deviance:      3655 
Residual Deviance: 3072     AIC: 3080

Our final model is: ht̂n = -7.33430 + 0.08857{age} + 0.06191{bmi} + 0.01125{ecghr}

For each 1-year increase in Age, the likelihood of a person having hypertension increases by approximately 8.8%.

For each 1 kg/m increase in BMI (Body Mass Index), the likelihood of a person having hypertension increases by approximately 6%.

For each 1 bpm increase in Ecghr (Heart Rate), the likelihood of a person having hypertension increases by approximatly 1%.

It can be noted that k-fold cross-validation and leave-one-out cross-validation are very similar when it comes to R code.