Split Your Data
You have some data and are ready to build a model. Hooray! Before you build a model, you use scikit-learn’s train_test_split. That’s a good start, but I’ve seen data scientists make lots of mistakes with splitting data. Here are the ways it most often goes wrong, and how to avoid each one.
Set Your Seed
You split your data, build your model, review your results, and everything looks great. You run your code again and everything changes! Set your seed for train_test_split (and while you are at it, any other methods you use where a seed can be set) so that your work is reproducible. This allows you and anyone else to run your code and get the same results.
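Here is a minimal sketch of what that looks like. The toy arrays and the seed value 42 are made up for illustration; random_state is the train_test_split parameter that fixes the shuffle.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 features, binary target (made up for illustration).
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

SEED = 42  # arbitrary; any fixed integer works

# random_state fixes the shuffle, so the split is identical on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

# Splitting again with the same seed reproduces the same test rows.
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)
print(np.array_equal(X_test, X_test_2))
```

Keeping the seed in one named constant makes it easy to pass the same value to any other step that accepts one.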
Check for the Same Entity
Another way your data split can go wrong is when you have an entity - a customer, a patient, a subject, etc. - show up in your data multiple times. For example, you are working on hospital admissions data, and some patients are admitted multiple times during the window of time your data covers. These aren’t duplicate records, but if your model trains on records for a patient and then sees that same patient in the test data, your model may already know the “answer” for that patient.
The solution to this is to split your data by entity, so that each entity ends up either entirely in training data or entirely in test data. For our example with patients, we could randomly split into train and test by patient, rather than by admission record. This ensures that our test metrics are trustworthy, because the model hasn’t “cheated” by already seeing the answer for the entity in training.
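One way to do this, sketched below, is scikit-learn’s GroupShuffleSplit, which keeps every row of a group on the same side of the split. The patient IDs, features, and targets here are invented placeholders, and this is only one of several reasonable approaches.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical admissions data: one row per admission, several per patient.
patient_ids = np.array([101, 101, 102, 103, 103, 103, 104, 105, 105, 106])
X = np.arange(len(patient_ids)).reshape(-1, 1)  # placeholder feature
y = np.array([0, 1, 0, 1, 1, 0, 0, 1, 0, 1])    # placeholder target

# GroupShuffleSplit assigns whole groups (patients) to one side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Sanity check: no patient should appear on both sides.
train_patients = set(patient_ids[train_idx])
test_patients = set(patient_ids[test_idx])
print(train_patients & test_patients)
```

Note that test_size here is a fraction of the groups (patients), not of the rows, so the row-level split may not be exactly 70/30.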
Feature Engineer After Splitting
The order in which you do things matters. Perform your data splitting before you begin feature engineering. Your model should know nothing about your test data. If you are going to engineer features using derived values, such as a mean, calculate those values after the split, using only your training data, and then apply them unchanged to your test data.
Violating this is another form of data leakage. Your metrics on the test data can look better than they really should, because your model has incorporated some information from the test data. When you deploy this model on data that truly has never been seen before, this often results in a drop in performance.
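As a rough illustration, here is the split-first pattern with scikit-learn’s StandardScaler, where the mean and standard deviation are the derived values. The data is synthetic and the scaler stands in for whatever feature engineering you are actually doing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = rng.integers(0, 2, size=100)

# 1. Split first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Learn the derived values (mean and std) from the training data only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Apply the training statistics to the test data; never refit on it.
X_test_scaled = scaler.transform(X_test)
```

The thing to avoid is calling fit or fit_transform on the full dataset before splitting, since the test rows would then influence the mean and standard deviation.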
Note that this also includes data augmentation. For example, if you are creating more training data for an image classifier by rotating and cropping the images, do this to your training data after the split has occurred, not before.
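A small sketch of that ordering, using random arrays as stand-in images and a single 90-degree rotation as the only augmentation (a real pipeline would likely use an image library and more transforms):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 fake 8x8 grayscale "images" with binary labels, for illustration only.
rng = np.random.default_rng(0)
images = rng.random((20, 8, 8))
labels = rng.integers(0, 2, size=20)

# Split the original images first.
train_images, test_images, train_labels, test_labels = train_test_split(
    images, labels, test_size=0.25, random_state=42
)

# Augment the training images only: add a rotated copy of each one.
rotated = np.rot90(train_images, k=1, axes=(1, 2))
aug_train_images = np.concatenate([train_images, rotated])
aug_train_labels = np.concatenate([train_labels, train_labels])

# The test set stays exactly as it came out of the split.
print(aug_train_images.shape, test_images.shape)
```

If you augmented before splitting, a rotated copy of a training image could land in the test set, which is the same leakage problem in a different form.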
Time Is Important
Using time series data in your modeling can be tricky. Usually with time series data, your model incorporates some prior values of that data. For example, you are predicting the number of orders placed today given the number of orders for each of the seven prior days along with some other information. When splitting this data, you do not want to perform a random split. If you do, some of the values you are asked to predict in your test data already appear as features in your training data. Say the row for October 7th lands in your test data: the October 7th order count is also a feature in the row for October 8th, which is in your training data.
Instead, split your data based on time. The most recent data should be your test set. This mimics what will happen when you start using your model in the real world - your model doesn’t know anything about future values, and instead only knows about what has occurred up through training. This keeps you out of the situation where your metrics on the test data look great but your model performs poorly when used, because it had already seen the test values during training.
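Here is a minimal sketch of a time-based split. The dates, order counts, and cutoff are all invented for illustration; the point is only that every training row comes before every test row.

```python
import numpy as np

# Hypothetical daily order counts for October 1st through October 30th.
dates = np.arange("2023-10-01", "2023-10-31", dtype="datetime64[D]")
orders = np.random.default_rng(0).integers(80, 120, size=len(dates))

# Sort by time (already sorted here, but it is safer to be explicit).
order_idx = np.argsort(dates)
dates, orders = dates[order_idx], orders[order_idx]

# Arbitrary cutoff: everything from October 24th onward is the test set.
cutoff = np.datetime64("2023-10-24")
train_mask = dates < cutoff

train_dates, test_dates = dates[train_mask], dates[~train_mask]
train_orders, test_orders = orders[train_mask], orders[~train_mask]

print(train_dates.max(), test_dates.min())
```

If you build lag features such as the prior seven days of orders, the same cutoff applies to the rows. For cross-validation on ordered data, scikit-learn also provides TimeSeriesSplit, which may be worth a look.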
Summary
I hope this guidance helps you to avoid some of the common errors made with splitting your data. Happy modeling!