Train-Test Split

When building and evaluating models, there are several ways to partition the available dataset:

Train/test on the same dataset: we train (fit) our model using all of the data in our dataset, and then validate the model using a portion of that same dataset.
- This tends to give high training accuracy but low out-of-sample accuracy, as it rewards overfitting.
Train/test split: divide the dataset into two mutually exclusive parts, one used only for model training and the other used only in the test phase.
- This gives a more realistic estimate of out-of-sample accuracy, because the model is evaluated on data it has not seen during training.
- Be careful with how the division is done. The results depend heavily on which data ends up in the training set and which in the test set.
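K-fold cross-validation: this is a train/test split repeated over K iterations. The dataset is divided into K parts (folds), and at each iteration a different fold (1/K of the dataset) is chosen as the test set while the remaining folds are used for training. The K results are then averaged.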
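
The train/test split and K-fold approaches above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `train_test_split` and `k_fold_splits`, the `test_fraction` and `seed` parameters, and the toy dataset are all made up for this example, and a library such as scikit-learn would normally be used instead.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set.

    Hypothetical helper for illustration; returns (train, test).
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's data is not modified
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_splits(rows, k=5, seed=0):
    """Yield K (train, test) pairs; each row is in the test set exactly once."""
    if not 2 <= k <= len(rows):
        raise ValueError("k must be between 2 and the number of rows")
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    # Fold i takes every k-th row starting at position i.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


# Toy dataset of 20 rows (assumed for the example).
data = list(range(20))

train, test = train_test_split(data, test_fraction=0.25, seed=42)
print(len(train), len(test))  # 15 5

for fold_train, fold_test in k_fold_splits(data, k=5, seed=42):
    # A model would be fitted on fold_train and scored on fold_test here,
    # and the K scores averaged afterwards.
    print(len(fold_train), len(fold_test))
```

Shuffling before splitting is one way to reduce the dependence on row order mentioned above, and K-fold goes further by letting every row serve as test data once.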

On Train/Dev/Test split

It is also possible to split the dataset three ways: