Every data scientist is familiar with experimentation.
You know the drill. You get a dataset, load it into a Jupyter notebook, explore it, preprocess the data, fit a baseline model or two, and then train an initial model, such as XGBoost. The first time around, maybe you skip hyperparameter tuning and include 20 features. Then, you check your error metrics.
They look okay, but perhaps your model is overfitting a bit. So you decide to tune some regularization parameters (e.g., max depth) to reduce the model’s complexity and run it again.
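A minimal sketch of that first pass and the follow-up run might look like the following. The dataset, the 20 features, and the specific hyperparameter values are stand-ins, not part of the original workflow:

```python
# Sketch only: a synthetic dataset stands in for "a dataset with 20 features".
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# First run: default hyperparameters, no tuning.
model = XGBClassifier(random_state=42)
model.fit(X_train, y_train)
print("train acc:", accuracy_score(y_train, model.predict(X_train)))
print("test acc: ", accuracy_score(y_test, model.predict(X_test)))

# Second run: rein in overfitting with some regularization (values are illustrative).
model = XGBClassifier(max_depth=3, reg_lambda=5.0, random_state=42)
model.fit(X_train, y_train)
print("test acc (regularized):", accuracy_score(y_test, model.predict(X_test)))
```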
You see a little improvement over the last run, but perhaps you also want to:
- Add more features
- Perform feature selection and remove some features
- Try a different scaler for your features
- Tune different/more hyperparameters
As the number of different tests you want to run grows, it becomes harder and harder to remember which combinations of your “experiments” actually yielded the best results. You can only run a notebook so many times, print out the results, and copy/paste them into a Google Doc before you get frustrated.
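The ad hoc version of that tracking tends to look something like the sketch below: a hand-rolled loop over a few options, with results collected into a table you then paste somewhere. The parameter grid, scalers, and metric here are illustrative placeholders, not a prescribed setup:

```python
# Rough sketch of manual experiment tracking: every new idea adds another row,
# and nothing ties the numbers back to the exact code or features that produced them.
import itertools
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

results = []  # your hand-rolled "experiment log"

for max_depth, scaler in itertools.product([3, 6, 9], [StandardScaler(), MinMaxScaler()]):
    # ... scale the features, fit the model, compute metrics (omitted in this sketch) ...
    test_score = 0.0  # placeholder for whatever metric you actually computed
    results.append({
        "max_depth": max_depth,
        "scaler": type(scaler).__name__,
        "test_score": test_score,
    })

print(pd.DataFrame(results))  # the table that ends up copy/pasted into a doc
```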