Building machine learning models is a bit like cooking: too little seasoning and the dish is bland; too much and it’s overpowering. The goal? That perfect balance – just enough complexity to capture the flavour of the data, but not so much that it overwhelms.
In this post, we’ll dive into two of the most common pitfalls in model development: overfitting and underfitting. Whether you’re training your first model or tuning your hundredth, keeping these concepts in check is key to building models that actually work in the real world.
Overfitting
What is overfitting?
Overfitting is a common issue in data science models. It happens when a model learns the training data too well, picking up patterns and noise specific to that data. As a result, it cannot predict well on unseen data.
Why is overfitting an issue?
- Poor performance: The model is not able to generalise well. The patterns it has detected during training do not apply to the rest of the data. You get the impression that the model is working great based on the training error, when in fact the test and real-world errors tell a much less optimistic story.
- Predictions with high variance: The model’s performance is unstable and its predictions are unreliable. Small changes in the input data cause large swings in the predictions.
- Training a complex and expensive model: Building and running a complex model in production is expensive and resource-intensive. If a simpler model performs just as well, it’s more efficient to use it instead.
- Risk of losing business trust: Data scientists who are overly optimistic when experimenting with new models may overpromise results to business stakeholders. If overfitting is discovered only after the model has been presented, it can significantly damage credibility and make it difficult to regain trust in the model’s reliability.
How to identify overfitting
- Cross-validation: During cross-validation, the input data is split into several folds (sets of training and testing data). Different folds of the input data should give similar testing error results. A large gap in performance across folds may indicate model instability or data leakage, both of which can be symptoms of overfitting.
- Keep track of the training, testing and generalisation errors. The error once the model is deployed (the generalisation error) should not deviate significantly from the errors you already know. If you want to go the extra mile, consider implementing a monitoring alert that fires when the deployed model’s performance deviates significantly from the validation-set error.
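As a concrete sketch of both checks, the snippet below (assuming scikit-learn is available; the dataset is synthetic and purely illustrative) trains an unconstrained decision tree and compares its training score against cross-validated scores. A large gap between the two is the classic overfitting signature:

```python
# Sketch: spotting overfitting via cross-validation and the train/CV gap.
# Assumes scikit-learn; the data below is synthetic and purely illustrative.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.5, size=200)  # signal + noise

# An unconstrained tree memorises the training data, noise included.
tree = DecisionTreeRegressor(random_state=0)
cv_scores = cross_val_score(tree, X, y, cv=5, scoring="r2")
train_score = tree.fit(X, y).score(X, y)

print(f"training R^2: {train_score:.2f}")       # near-perfect
print(f"mean CV R^2:  {cv_scores.mean():.2f}")  # noticeably lower
print(f"CV R^2 spread across folds: {cv_scores.std():.2f}")
```

The gap between the training score and the cross-validated scores, together with any large spread across folds, is exactly the symptom described in the bullets above.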
How to mitigate or prevent overfitting
- Remove features: Too many features might “guide” the model too much, resulting in a model that is not able to generalise well.
- Increase training data: With more examples to learn from, the model generalises better and becomes less sensitive to outliers and noise.
- Increase regularisation: Regularisation techniques penalise inflated coefficients, which prevents the model from fitting too closely to the training data.
- Adjust hyper-parameters: Hyper-parameter settings that allow too much flexibility (e.g. very deep trees) can result in a model that is not able to generalise well.
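To illustrate the regularisation point above, here is a sketch (assuming scikit-learn; the degree-12 expansion and the alpha value are illustrative choices, not recommendations) comparing coefficient sizes with and without an L2 penalty:

```python
# Sketch: L2 regularisation (ridge) taming inflated polynomial coefficients.
# Assumes scikit-learn; degree=12 and alpha=1.0 are illustrative choices.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, size=30)

degree = 12  # deliberately over-flexible for only 30 data points
ols = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
ridge = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1.0)).fit(X, y)

max_ols = np.abs(ols.named_steps["linearregression"].coef_).max()
max_ridge = np.abs(ridge.named_steps["ridge"].coef_).max()
print(f"largest |coefficient| without regularisation: {max_ols:.1f}")
print(f"largest |coefficient| with ridge:             {max_ridge:.1f}")
```

The unregularised fit chases the noise with huge, mutually cancelling coefficients; the ridge penalty keeps them small, producing a smoother, more stable curve.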
Underfitting
What is underfitting?
Underfitting happens when the model or its features are too simplistic to capture the underlying patterns in the data. It also results in poor predictions on unseen data.
Why is underfitting problematic?
- Poor performance: The model performs poorly on the training data, and therefore also on test and real-world data.
- Predictions with high bias: The model systematically misses the underlying relationship, so its predictions are unreliable.
How to identify underfitting
- Both training and test errors will be high.
- Generalisation error will be high, and possibly close to the training error.
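A minimal sketch of this diagnosis (assuming scikit-learn; the quadratic dataset is synthetic and illustrative) fits a straight line to curved data and shows that the error is high on the training and test sets alike:

```python
# Sketch: diagnosing underfitting, where training and test error are BOTH high.
# Assumes scikit-learn; the data is synthetic (quadratic) for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.1, size=200)  # quadratic, not linear

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)  # too simple for this data

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE: {train_mse:.2f}")  # high: the model misses the pattern
print(f"test MSE:  {test_mse:.2f}")   # similarly high
```

Unlike overfitting, there is no train/test gap to speak of: the model is simply wrong everywhere.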
How to fix underfitting
- Enhance features: Introduce new or more sophisticated features (e.g. interaction effects, polynomial terms, seasonality terms) that capture more complex patterns in the underlying data.
- Increase training data: With more examples to learn from, the model generalises better and becomes less sensitive to outliers and noise.
- Reduce regularisation strength: When the regularisation penalty is too strong, all coefficients are shrunk towards zero, so the model cannot prioritise any feature and fails to learn important patterns.
- Adjust hyper-parameters: An intrinsically complex model with poorly chosen hyper-parameters may not capture all the complexity in the data. Tuning them can be valuable (e.g. adding more trees to a random forest).
- If none of the above fixes the underlying issue, it might be worthwhile tossing the model and replacing it with one that can capture more complex patterns in the data.
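As a sketch of the “enhance features” fix (assuming scikit-learn; the data and the degree-2 choice are illustrative), adding a polynomial term turns a badly underfit linear model into one that captures the signal:

```python
# Sketch: fixing underfitting by enhancing features (adding polynomial terms).
# Assumes scikit-learn; the data and degree are illustrative choices.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(100, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.1, size=100)  # quadratic signal

# A straight line underfits a quadratic relationship...
linear = LinearRegression().fit(X, y)
# ...while adding a degree-2 polynomial feature captures it.
quadratic = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(X, y)

print(f"linear R^2:    {linear.score(X, y):.2f}")     # poor, even on training data
print(f"quadratic R^2: {quadratic.score(X, y):.2f}")  # close to 1
```

The model family is unchanged; only the feature set became expressive enough for the data, which is often the cheapest fix to try first.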
Summary
Machine learning isn’t magic: it’s a balancing act between too much and too little. Overfit your model and it becomes a perfectionist that can’t handle new situations. Underfit it and it misses the point entirely.
The best models live in the sweet spot: generalising well, learning enough, but not too much. By understanding and managing overfitting and underfitting, you’re not just improving metrics, you’re building trust, reducing risk, and creating solutions that last beyond the training set.


