You have probably heard of cross-validation, a common technique in the data science process for guarding against overfitting and, quite often, for tuning model parameters. Overfitting is when the model does well on training data but fails drastically on test data. The reason could be one of the following:
The model memorizes the specifics of the training data instead of learning patterns that generalize to the test data.
The train data and test data are significantly different from each other, i.e. they have not been drawn from the same population.
We will focus on the second issue. Why is it a problem? If you have participated in Kaggle-like competitions, you know how they work. You are given a training data set and a test data set. You train your model on the training data, predict on the test data and upload the predictions to Kaggle to get your rank.
What we typically do is split the training data into a train set and a validation set. The validation data gives you an idea of how your model will perform on the test data. Now imagine that your train data and test data come from different populations. You won't see the same results on validation and test data. You see the problem here?
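The whole point of validation data is to estimate how the model will perform on test data. If train and test are not identically distributed, validation and test results will diverge. One way around this is to choose the validation set so that it resembles the test data as closely as possible, which takes three steps: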
1. Build a classifier to distinguish between training and test data
Combine your train and test data into a single data set, keeping only the feature columns (the test data has no target for the original problem). Create a response variable, say isTest, and set it to 0 for all the rows from the training data and 1 for all the rows from the test data. Your task is now to build a classification model that distinguishes between training and test rows. This could be any classification model, such as logistic regression or a random forest.
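To make this step concrete, here is a minimal sketch in plain Python. It is only an illustration: the function names (label_and_combine, fit_logistic, predict_proba) and the toy one-feature data are made up for this example, and the small gradient-descent logistic regression stands in for whatever classifier you would normally take from a library.

```python
import math
import random


def label_and_combine(train_rows, test_rows):
    """Stack train and test rows and build isTest (0 = train, 1 = test)."""
    features = list(train_rows) + list(test_rows)
    is_test = [0] * len(train_rows) + [1] * len(test_rows)
    return features, is_test


def sigmoid(z):
    # Logistic function written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(model, row):
    """Probability that a row belongs to the test data."""
    weights, bias = model
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


def fit_logistic(features, labels, learning_rate=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n_rows, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(features, labels):
            error = predict_proba((weights, bias), row) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n_rows for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_rows
    return weights, bias


# Toy data (an assumption for illustration): one feature, with the test rows
# shifted to the right of the train rows.
random.seed(0)
train_rows = [[random.gauss(0.0, 1.0)] for _ in range(200)]
test_rows = [[random.gauss(2.0, 1.0)] for _ in range(200)]

features, is_test = label_and_combine(train_rows, test_rows)
model = fit_logistic(features, is_test)
```

In a real project you would fit the classifier with your usual modelling library; the part that matters here is the isTest labelling of the combined rows.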
2. Sort the predicted probabilities of training data in decreasing order
Once the model is built, use it to predict on the training data. This gives you, for each training row, the fitted probability of being a test row. Sort these probabilities in decreasing order, i.e. the row with the highest probability of being classified as test data comes first.
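The sorting itself can be sketched in a few lines of Python. The function name rank_train_rows and the five probabilities are hypothetical; in practice the probabilities would come from the classifier fitted in step 1.

```python
def rank_train_rows(test_probabilities):
    """Return training-row indices ordered from most to least test-like."""
    return sorted(
        range(len(test_probabilities)),
        key=lambda i: test_probabilities[i],
        reverse=True,
    )


# Hypothetical fitted probabilities P(isTest = 1) for five training rows.
probabilities = [0.12, 0.87, 0.45, 0.91, 0.30]
order = rank_train_rows(probabilities)
print(order)  # [3, 1, 2, 4, 0]
```

Sorting indices rather than the probabilities themselves keeps the link back to the original training rows, which is what the next step needs.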
3. Take the starting few rows as your validation set
The first few rows are now the rows of training data that most closely resemble the test data. Take the top portion, say 30%, as your validation set, and use the remaining rows as the train data to train your model.
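Putting the sort and the split together, a minimal Python sketch could look like this. The function name adversarial_split, the validation_fraction parameter and the sample probabilities are assumptions made for illustration; the 30% default simply mirrors the figure used above.

```python
def adversarial_split(train_rows, test_probabilities, validation_fraction=0.3):
    """Split training rows so the most test-like rows form the validation set."""
    if len(train_rows) != len(test_probabilities):
        raise ValueError("need exactly one probability per training row")
    # Indices of training rows, most test-like first.
    order = sorted(
        range(len(train_rows)),
        key=lambda i: test_probabilities[i],
        reverse=True,
    )
    n_validation = int(round(len(train_rows) * validation_fraction))
    validation = [train_rows[i] for i in order[:n_validation]]
    remaining = [train_rows[i] for i in order[n_validation:]]
    return remaining, validation


# Hypothetical training rows and their fitted P(isTest = 1).
rows = ["row%d" % i for i in range(10)]
probabilities = [0.05, 0.80, 0.10, 0.95, 0.20, 0.60, 0.15, 0.70, 0.30, 0.25]

new_train, validation = adversarial_split(rows, probabilities)
print(validation)  # ['row3', 'row1', 'row7']
```

The remaining 70% of rows (new_train) are then used to fit the actual model for the original prediction task.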
Now the accuracy metrics on your validation set should be similar to those on the test data. If your model works well on the validation data, it should work well on the test data. If you are interested in the implementation of what we just talked about, head over to this post for part 2, where the code is written in R.
Did you find the article useful? If you did, share your thoughts in the comments. Share this post with people who you think would enjoy reading it. Let's talk more about data science.