Daniel Bohorquez, Jason Heiserman, Mary Tarabocchia
November 2022
Overfitting occurs when the model finds patterns in the training data that are caused by random chance (noise) rather than by the true underlying relationship.
Underfitting occurs when the model cannot “learn” the underlying trend of the data.
K-Fold Cross-Validation
How does it work?
The data set is divided into k equally sized parts (folds)
With each iteration, the model is trained on k-1 of the subsets and tested on the remaining held-out subset
The average error across all k trials is computed in order to find the best model; a minimal sketch follows below
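A minimal base-R sketch of the procedure (the toy data frame, the lm() model, and mean squared error as the error metric are illustrative assumptions, not the course data):

## Toy data: 100 points with a linear trend plus noise (illustrative only)
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)

## Randomly assign each row to one of k = 5 folds
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))

## Train on the k-1 remaining folds, test on the held-out fold
errors <- sapply(1:k, function(i) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  fit   <- lm(y ~ x, data = train)
  mean((test$y - predict(fit, newdata = test))^2)  ## fold-level MSE
})

mean(errors)  ## average error across all k trials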
Pros and Cons
Holdout Validation (Train/Test Split)
What is it? How does it work?
The original data set is split into a training set and a testing set, typically in a ratio of 80:20 or 70:30
Data is randomly shuffled before it is split
Performance estimates have high variance if the training data set is not representative of the entire data set
Ideal if you’re in a hurry to train and test a model and have a large data set; a minimal sketch follows below
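A minimal base-R sketch of a random 80:20 holdout split (toy data and an lm() model assumed for illustration):

## Toy data (illustrative only)
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)

## Shuffle by sampling 80% of the row indices at random for training
idx   <- sample(nrow(df), size = 0.8 * nrow(df))
train <- df[idx, ]
test  <- df[-idx, ]

fit <- lm(y ~ x, data = train)
mean((test$y - predict(fit, newdata = test))^2)  ## test-set MSE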
Leave-One-Out Cross-Validation (LOOCV)
How does it work?
The test data set contains only one element from the original data set
The training data set consists of all the remaining elements from the original data set
Pros and Cons
Can be very time-consuming, depending on the size of the data set
May result in better models, since this method uses the largest possible number of samples for training
No need to shuffle the data, since all possible combinations of train/test data sets will be generated; a minimal sketch follows below
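caret's trainControl() supports LOOCV directly; a minimal sketch on toy data (the data and the lm model are illustrative assumptions):

library(caret)

## Toy data (illustrative only)
set.seed(1)
df <- data.frame(x = rnorm(30))
df$y <- 2 * df$x + rnorm(30)

## n iterations, each holding out a single observation for testing
ctrl <- trainControl(method = "LOOCV")
loomodel <- train(y ~ x, data = df, method = "lm", trControl = ctrl)
loomodel$results  ## RMSE averaged over the n single-observation test sets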
Stratified KFold Cross-Validation
With standard KFold cross-validation, there’s a chance that we end up with imbalanced subsets (e.g., folds in which one class is over- or under-represented)
This can cause the training to be biased, which results in inaccurate models
In stratification, the data is rearranged to ensure that each subset is a good representation of the entire data set, as sketched below.
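caret's createFolds() does this automatically when the outcome is a factor: sampling is done within each class, so every fold roughly preserves the class proportions. A minimal sketch with a deliberately imbalanced toy outcome:

library(caret)

## Imbalanced toy outcome: 20% cases, 80% controls (illustrative only)
set.seed(1)
y <- factor(c(rep("case", 20), rep("control", 80)))

folds <- createFolds(y, k = 5)          ## list of test-set indices, one per fold
sapply(folds, function(i) table(y[i]))  ## each fold keeps roughly a 20:80 mix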
Leave-P-Out Cross-Validation
The model is trained on n-p data points and tested on the remaining p data points
The same process is repeated for all possible combinations of p points from the original data
Results of each iteration are averaged to determine accuracy; a brute-force sketch follows below
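A brute-force base-R sketch (toy data and an lm() model assumed); because there are choose(n, p) combinations, this is only practical for small data sets:

## Toy data: kept tiny because choose(n, p) grows rapidly (illustrative only)
set.seed(1)
df <- data.frame(x = rnorm(10))
df$y <- 2 * df$x + rnorm(10)

p <- 2
test_sets <- combn(nrow(df), p)  ## every possible p-row test set

errors <- apply(test_sets, 2, function(idx) {
  fit <- lm(y ~ x, data = df[-idx, ])  ## train on the n-p remaining rows
  mean((df$y[idx] - predict(fit, newdata = df[idx, ]))^2)  ## test on the p rows
})

mean(errors)  ## average over all choose(n, p) iterations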
Monte Carlo Cross-Validation
Also known as repeated random sub-sampling validation
Training and testing data are determined through a series of random splits
For each split, the model is fit to the training data, and predictive accuracy is assessed using the testing data.
The results are then averaged over the splits to determine accuracy
Some observations may never be selected for the testing subsample, whereas others may be selected more than once; a minimal sketch follows below
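A minimal base-R sketch using repeated 80:20 splits (the toy data, the lm() model, and the choice of 25 splits are illustrative assumptions); caret offers the same idea via trainControl(method = "LGOCV"):

## Toy data (illustrative only)
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)

## 25 independent random splits; each split refits the model
errors <- replicate(25, {
  idx <- sample(nrow(df), size = 0.8 * nrow(df))
  fit <- lm(y ~ x, data = df[idx, ])
  mean((df$y[-idx] - predict(fit, newdata = df[-idx, ]))^2)  ## this split's test MSE
})

mean(errors)  ## results averaged over the splits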
library(caret)  ## caret provides trainControl() and train()

## Set up the cross-validation method with k=5 folds
## (named "ctrl" to avoid confusion with train()'s own method argument)
ctrl <- trainControl(method = "cv", number = 5)

## Fit a glm (effectively logistic regression, since htn is coerced to a
## factor) and use k-fold CV to evaluate performance
crossmodelfull <- train(as.factor(htn) ~ age + bmi + ecghr,
                        data = crossdata,
                        method = "glm",
                        trControl = ctrl)
Cross-validation can be used to compare the performance of different predictive modeling procedures
Example: Optical Character Recognition
Which method works best? Support Vector Machine (SVM) or K-Nearest Neighbors (KNN)?
Use cross-validation to objectively compare the two methods, as in the sketch below
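A sketch of such a comparison with caret on the built-in iris data (the data set and the svmRadial/knn method choices are illustrative; svmRadial additionally requires the kernlab package). Sharing one set of fold indices guarantees both models are scored on identical splits:

library(caret)
set.seed(1)

## One set of CV folds, reused by both models for a fair comparison
folds <- createFolds(iris$Species, k = 5, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds)

svm_fit <- train(Species ~ ., data = iris, method = "svmRadial", trControl = ctrl)
knn_fit <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl)

## Side-by-side summary of resampled accuracy across the shared folds
summary(resamples(list(SVM = svm_fit, KNN = knn_fit)))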
Cross-validation can also be used in variable selection
Example: Cancer Research
Determine which subset of features should be used to produce the best predictive model, as sketched below
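A sketch of CV-driven feature selection using caret's recursive feature elimination on iris (illustrative; the rfFuncs helpers wrap random forests, so the randomForest package is assumed):

library(caret)
set.seed(1)

## 5-fold CV scores each candidate subset size during the elimination
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
subsets <- rfe(x = iris[, 1:4], y = iris$Species,
               sizes = 1:4,  ## candidate feature-subset sizes to compare
               rfeControl = ctrl)

subsets$optVariables  ## the feature subset behind the best cross-validated model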