When building and evaluating models, there are several approaches to segmenting the available dataset:

  • Train/test on the same dataset: we train/fit the model on all of the data and then validate it on a portion of that same data.
    • This yields high in-sample accuracy but low out-of-sample accuracy, since it rewards overfitting.
  • Train/test split: divide the dataset into one part used only for training the model and another part used only in the test phase.
    • This gives a more realistic estimate of out-of-sample accuracy.
    • Be careful how the split is made: the results depend heavily on which samples end up in each set.
  • K-fold cross-validation: a train/test split repeated over K iterations, where at each iteration a different 1/K of the dataset serves as the test set; the K scores are then averaged (see the sketch after this list).
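
A minimal sketch of the last two approaches, assuming scikit-learn and its bundled iris dataset as a stand-in for real data:

```python
# Train/test split and K-fold cross-validation with scikit-learn;
# the dataset and classifier here are only placeholders.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Train/test split: hold out 20% of the samples for evaluation only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Out-of-sample accuracy:", model.score(X_test, y_test))

# K-fold cross-validation: K=5 iterations, each using a different
# 1/K of the data as the test set; the scores are then averaged.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```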

On Train/Dev/Test split

It is also possible to split the dataset three ways (see the sketch after this list):

  • Train: used to fit the model parameters
  • Dev/Validation: used to tune hyperparameters, such as the number of neurons per layer
  • Test: only used at the end to evaluate the neural network
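
A minimal sketch of a three-way split, again assuming scikit-learn; the 60/20/20 ratio and the iris data are only illustrative:

```python
# Train/dev/test split built from two successive train_test_split calls.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off the training set (60%), then split the remaining 40%
# evenly into dev (20%) and test (20%).
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)

# Parameters are fit on the train set, hyperparameters (e.g. layer sizes)
# are tuned against the dev set, and the test set is used only once,
# at the very end, to report the final score.
```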