Daniel Bohorquez, Jason Heiserman, Mary Tarabocchia
November 2022
Overfitting occurs when the model finds patterns in the training data that are caused by random chance (noise) rather than by the true underlying relationship.
Underfitting occurs when the model cannot “learn” the underlying trend of the data.
K-Fold Cross-Validation
How does it work?
The data set is divided into k equally sized parts (folds)
With each iteration, the model is trained on k-1 of the subsets and tested on the remaining held-out subset
The average error across all k trials is computed in order to find the best model; a minimal sketch follows below
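A minimal base-R sketch of the procedure (the toy data frame, the lm() model, and mean squared error as the error metric are illustrative assumptions, not the course data):

## Toy data: 100 points with a linear trend plus noise (illustrative only)
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)

## Randomly assign each row to one of k = 5 folds
k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))

## Train on the k-1 remaining folds, test on the held-out fold
errors <- sapply(1:k, function(i) {
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  fit   <- lm(y ~ x, data = train)
  mean((test$y - predict(fit, newdata = test))^2)  ## fold-level MSE
})

mean(errors)  ## average error across all k trials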
Pros and Cons
Holdout Validation (Train/Test Split)
What is it? How does it work?
The original data set is split into a training set and a testing set, typically in a ratio of 80:20 or 70:30
Data is randomly shuffled before it is split
Performance estimates have high variance if the training data set is not representative of the entire data set
Ideal if you’re in a hurry to train and test a model and have a large data set; a minimal sketch follows below
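A minimal base-R sketch of a random 80:20 holdout split (toy data and an lm() model assumed for illustration):

## Toy data (illustrative only)
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)

## Shuffle by sampling 80% of the row indices at random for training
idx   <- sample(nrow(df), size = 0.8 * nrow(df))
train <- df[idx, ]
test  <- df[-idx, ]

fit <- lm(y ~ x, data = train)
mean((test$y - predict(fit, newdata = test))^2)  ## test-set MSE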
Leave-One-Out Cross-Validation (LOOCV)
How does it work?
The test data set contains only one element from the original data set
The training data set consists of all the remaining elements from the original data set
Pros and Cons
Can be very time-consuming, depending on the size of the data set
May result in better models, since this method uses the largest possible number of samples for training
No need to shuffle the data, since all possible combinations of train/test data sets will be generated; a minimal sketch follows below
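caret's trainControl() supports LOOCV directly; a minimal sketch on toy data (the data and the lm model are illustrative assumptions):

library(caret)

## Toy data (illustrative only)
set.seed(1)
df <- data.frame(x = rnorm(30))
df$y <- 2 * df$x + rnorm(30)

## n iterations, each holding out a single observation for testing
ctrl <- trainControl(method = "LOOCV")
loomodel <- train(y ~ x, data = df, method = "lm", trControl = ctrl)
loomodel$results  ## RMSE averaged over the n single-observation test sets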
Stratified KFold Cross-Validation
With standard KFold cross-validation, there’s a chance that we end up with imbalanced subsets (e.g., folds in which one class is over- or under-represented)
This can cause the training to be biased, which results in inaccurate models
In stratification, the data is rearranged to ensure that each subset is a good representation of the entire data set, as sketched below.
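caret's createFolds() does this automatically when the outcome is a factor: sampling is done within each class, so every fold roughly preserves the class proportions. A minimal sketch with a deliberately imbalanced toy outcome:

library(caret)

## Imbalanced toy outcome: 20% cases, 80% controls (illustrative only)
set.seed(1)
y <- factor(c(rep("case", 20), rep("control", 80)))

folds <- createFolds(y, k = 5)          ## list of test-set indices, one per fold
sapply(folds, function(i) table(y[i]))  ## each fold keeps roughly a 20:80 mix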
Leave-P-Out Cross-Validation
The model is trained on n-p data points and tested on the remaining p data points
The same process is repeated for all possible combinations of p points from the original data
Results of each iteration are averaged to determine accuracy; a brute-force sketch follows below
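A brute-force base-R sketch (toy data and an lm() model assumed); because there are choose(n, p) combinations, this is only practical for small data sets:

## Toy data: kept tiny because choose(n, p) grows rapidly (illustrative only)
set.seed(1)
df <- data.frame(x = rnorm(10))
df$y <- 2 * df$x + rnorm(10)

p <- 2
test_sets <- combn(nrow(df), p)  ## every possible p-row test set

errors <- apply(test_sets, 2, function(idx) {
  fit <- lm(y ~ x, data = df[-idx, ])  ## train on the n-p remaining rows
  mean((df$y[idx] - predict(fit, newdata = df[idx, ]))^2)  ## test on the p rows
})

mean(errors)  ## average over all choose(n, p) iterations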
Monte Carlo Cross-Validation
Also known as repeated random sub-sampling validation
Training and testing data are determined through a series of random splits
For each split, the model is fit to the training data, and predictive accuracy is assessed using the testing data.
The results are then averaged over the splits to determine accuracy
Some observations may never be selected for the testing subsample, whereas others may be selected more than once; a minimal sketch follows below
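A minimal base-R sketch using repeated 80:20 splits (the toy data, the lm() model, and the choice of 25 splits are illustrative assumptions); caret offers the same idea via trainControl(method = "LGOCV"):

## Toy data (illustrative only)
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)

## 25 independent random splits; each split refits the model
errors <- replicate(25, {
  idx <- sample(nrow(df), size = 0.8 * nrow(df))
  fit <- lm(y ~ x, data = df[idx, ])
  mean((df$y[-idx] - predict(fit, newdata = df[-idx, ]))^2)  ## this split's test MSE
})

mean(errors)  ## results averaged over the splits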
library(caret)  ## caret provides trainControl() and train()

## Set up the cross-validation method with k=5 folds
## (named "ctrl" to avoid confusion with train()'s own method argument)
ctrl <- trainControl(method = "cv", number = 5)

## Fit a glm (effectively logistic regression, since htn is coerced to a
## factor) and use k-fold CV to evaluate performance
crossmodelfull <- train(as.factor(htn) ~ age + bmi + ecghr,
                        data = crossdata,
                        method = "glm",
                        trControl = ctrl)
Cross-validation can be used to compare the performance of different predictive modeling procedures
Example: Optical Character Recognition
Which method works best? Support Vector Machine (SVM) or K-Nearest Neighbors (KNN)?
Use cross-validation to objectively compare the two methods, as in the sketch below
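A sketch of such a comparison with caret on the built-in iris data (the data set and the svmRadial/knn method choices are illustrative; svmRadial additionally requires the kernlab package). Sharing one set of fold indices guarantees both models are scored on identical splits:

library(caret)
set.seed(1)

## One set of CV folds, reused by both models for a fair comparison
folds <- createFolds(iris$Species, k = 5, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", index = folds)

svm_fit <- train(Species ~ ., data = iris, method = "svmRadial", trControl = ctrl)
knn_fit <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl)

## Side-by-side summary of resampled accuracy across the shared folds
summary(resamples(list(SVM = svm_fit, KNN = knn_fit)))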
Cross-validation can also be used in variable selection
Example: Cancer Research
Determine which subset of features should be used to produce the best predictive model, as sketched below
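A sketch of CV-driven feature selection using caret's recursive feature elimination on iris (illustrative; the rfFuncs helpers wrap random forests, so the randomForest package is assumed):

library(caret)
set.seed(1)

## 5-fold CV scores each candidate subset size during the elimination
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
subsets <- rfe(x = iris[, 1:4], y = iris$Species,
               sizes = 1:4,  ## candidate feature-subset sizes to compare
               rfeControl = ctrl)

subsets$optVariables  ## the feature subset behind the best cross-validated model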