I Pitted XGBoost Against Logistic Regression on 358 Matches. The Boring Model Won.

There’s a reflex most of us share on a new modelling problem: reach for the model that wins. These days that’s gradient boosting, and the reflex is usually right: XGBoost earns its reputation on a staggering range of problems.

So when I lined up five classifiers on the same task and the one-line linear model beat the Kaggle champion, the result was the kind that surprises exactly nobody who has shipped models on real data, and almost everybody still learning.

Five classifiers, same task, same features: predict whether an international match ends in a home win, draw, or away win. The contenders ran from a humble logistic regression up through a random forest, KNN, a small neural network, and XGBoost.

The simplest one won. More interesting than the win is the reason for it, and that reason is one of the most useful ideas in applied machine learning. Here’s the experiment, the result, and the theory that cracks it open.

The setup

This came out of building a suite of eleven World Cup models, where I needed a result classifier and wanted to know which family to trust. Each model saw the same three features for 358 historical internationals — the 2010–2022 World Cups plus the 2020 and 2024 Euros: the strength gap between the teams, their combined strength, and a knockout flag. The target is the three-way result.

I scored them with 5-fold cross-validation, and the primary metric is log-loss, not accuracy. That choice does a lot of work in this article, so it’s worth being explicit about it up front. Accuracy only asks whether the top-ranked class was correct. Log-loss grades the entire probability vector and punishes confident mistakes hard:

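from sklearn.model_selection import cross_val_predict
from sklearn.metrics import log_loss, accuracy_score

# model is any scikit-learn classifier; X holds the three features, y the three-way result
# Out-of-fold class probabilities: every match is scored by a model that never saw it
proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")
# argmax returns column indices, so the accuracy call assumes y is encoded as 0, 1, 2
print(log_loss(y, proba), accuracy_score(y, proba.argmax(1)))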

For a forecasting model whose entire job is to emit calibrated probabilities, log-loss is the honest scorecard and accuracy is a sanity check. The number to keep in your pocket is ln(3) ≈ 1.099 — the log-loss you’d get by shrugging and predicting a uniform 1/3 across the three classes. Beat 1.099 and your model knows something. Score above it and you’d have been better off guessing.
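The baseline is easy to check by hand. Below is a minimal sketch in plain Python; `mean_log_loss` is a hypothetical helper written for this illustration (not scikit-learn's function), and the six outcomes are made-up labels rather than real matches.

```python
import math

def mean_log_loss(true_labels, probabilities):
    """Average of -ln(probability assigned to the class that actually occurred)."""
    losses = [-math.log(p[label]) for label, p in zip(true_labels, probabilities)]
    return sum(losses) / len(losses)

# Made-up outcomes, encoded as 0 = home win, 1 = draw, 2 = away win.
outcomes = [0, 2, 1, 0, 0, 2]
# The "shrug" forecast: one third on every class, for every match.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * len(outcomes)

baseline = mean_log_loss(outcomes, uniform)
print(round(baseline, 3), round(math.log(3), 3))  # 1.099 1.099
```

Whatever the outcomes are, the uniform forecast always scores ln(3), which is what makes it a fixed reference line for every model in the table.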

The result

There are two things in the results below that should bother you.

The first is the podium: a plain logistic regression posted the best log-loss, and XGBoost — the model that wins Kaggle competitions — came last. The second is stranger and easy to skim past. XGBoost didn’t just lose; it scored above 1.099, the uniform-guessing baseline. A model with a respectable-looking 48% accuracy was, by the metric that actually matters here, worse than a coin with three sides.

Cross-validated log-loss by model. Image by author

Model	CV log-loss (lower is better)	CV accuracy
Logistic regression	1.001	54%
Random Forest	1.011	56%
KNN	1.013	53%
Neural network	1.115	52%
XGBoost	1.169	48%

Both of these facts have the same root cause, and it’s the most useful idea in this whole article.

Why the boring model won: bias and variance

The clean way to think about this is the bias–variance decomposition. A model’s expected out-of-sample error splits into three parts:

Error = Bias² + Variance + Irreducible noise

Bias is error from wrong assumptions — too rigid a model misses real structure in the data.
Variance is error from sensitivity to the particular training sample — too flexible a model fits noise that won’t recur next time.
Irreducible noise is the genuine randomness of the thing you’re predicting. In football it’s enormous: a single deflected shot decides a knockout tie. No model touches this term, which is why even the best classifier here sits near 50% accuracy.

The whole game is the trade between the first two. High-capacity models, such as boosted trees or neural nets, buy low bias by being flexible enough to bend to almost any shape in the data. The bill for that flexibility is variance, and it only comes due when you don’t have enough data to pin the model down.

And that’s exactly our situation. With 358 examples split across a three-way target, you have roughly 120 matches per class on average. An XGBoost ensemble, meanwhile, has thousands of effective parameters spread across its trees. There simply isn’t enough signal to discipline all of them, so they latch onto quirks that happen to appear in one cross-validation fold and vanish in the next. That’s textbook overfitting, and it explains the first bother: cross-validation is doing its job by catching the flexible models red-handed on data they haven’t seen.

So why did XGBoost score worse than random guessing rather than just landing mid-table? This is where the choice of log-loss pays off. The penalty for a single example is −ln(p_true_class), and it’s brutally convex.

Predict the eventual outcome at a hedged 0.5 and you eat −ln(0.5) = 0.69. Predict it at a confident-but-wrong 0.1 and you eat −ln(0.1) = 2.30, more than three times the pain for being sure and wrong. An over-flexible model on small data doesn’t just make errors; it makes them with conviction, issuing sharp 60–70% probabilities and getting enough of them wrong that the convex penalty pushes its average loss above the 1.099 of the timid 1/3-1/3-1/3 baseline.
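The arithmetic can be sketched in a few lines. The overconfident model here is hypothetical: it is assumed to put 0.70 on its pick, be right 45% of the time, and split the remaining 0.30 evenly, so 0.15 lands on the true class whenever it misses. None of these numbers come from the experiment.

```python
import math

def penalty(p_true):
    """Log-loss for one example, given the probability placed on what happened."""
    return -math.log(p_true)

hedged = penalty(0.5)
confident_wrong = penalty(0.1)
print(round(hedged, 2), round(confident_wrong, 2))

# Hypothetical overconfident model (assumed numbers, see the lead-in).
hit_rate = 0.45
overconfident = hit_rate * penalty(0.70) + (1 - hit_rate) * penalty(0.15)

# The uniform forecast always places 1/3 on the true class.
uniform = penalty(1 / 3)
print(round(overconfident, 3), round(uniform, 3))
```

Under these assumptions the overconfident model averages about 1.20 against the baseline’s 1.099: it picks the right result far more often than chance and still loses to the shrug.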

The proper name for this failure is confident miscalibration, and it’s the signature of too much model for too little data. Whatever XGBoost gained on the bold calls that landed couldn’t pay back what its overconfidence cost everywhere else.

Why logistic regression in particular

Knowing that the flexible models would struggle is only half the story. The linear model didn’t just avoid the trap — it was, for this problem, the correct tool. Two structural facts make that so:

The true relationship is close to linear in the log-odds. Most of what predicts a result is “how big is the strength gap,” and the probability of winning rises smoothly and monotonically with it — exactly the functional form logistic regression assumes. When a model’s inductive bias matches the data-generating process, you need far less data to estimate it well. The trees, by contrast, have to discover that smooth curve out of piecewise-constant splits, spending precious data to approximate something logistic regression gets for free.
Three features, weak interactions. Trees and nets earn their keep by hunting down interactions among many features. With only three features and little interaction between them, there’s nothing for that machinery to find — so it adds variance without adding any signal to show for it.

There’s a rule of thumb from classical statistics worth carrying around: you want on the order of 10–20 observations per parameter for stable estimates.

Logistic regression estimates a handful of coefficients against 358 matches — comfortably inside that budget. A boosted ensemble is orders of magnitude over it. The mismatch was baked in before a single model trained.
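A back-of-the-envelope version of that budget, as a rough sketch. The ensemble settings (100 rounds, depth 6, one tree per class per round) are assumptions for illustration, not the configuration used in the experiment, and the leaf count is an upper bound rather than an effective parameter count.

```python
n_matches = 358
n_features = 3
n_classes = 3

# Multinomial logistic regression: one weight per feature plus an intercept, per class.
logreg_params = n_classes * (n_features + 1)

# Hypothetical boosted ensemble (assumed settings, not the article's actual setup):
# 100 rounds, one tree per class per round, depth 6, so at most 2**6 leaves per tree.
rounds, depth = 100, 6
boosted_leaf_values_max = rounds * n_classes * 2 ** depth

print(logreg_params, round(n_matches / logreg_params, 1))
print(boosted_leaf_values_max, round(n_matches / boosted_leaf_values_max, 3))
```

That works out to roughly 30 matches per coefficient for the linear model, against a small fraction of a match per possible leaf value for the assumed ensemble.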

How to read the scoreboard honestly

Before drawing conclusions from that table, two cautions about reading it — because the same small dataset that sank XGBoost also makes the numbers noisier than they look.

The first is the metric’s own variance. With 358 matches, each of the five folds holds out only ~72 games, so the CV score itself wobbles. The gaps among logistic regression, random forest, and KNN — 1.001 vs. 1.011 vs. 1.013 — are well inside that wobble. They’re effectively tied.
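To get a feel for the size of that wobble, here is a rough standard-error sketch. The per-match log-loss standard deviation of 0.4 is an assumed value for illustration only; it was not measured on the match data, and the calculation treats matches as independent.

```python
import math

n_matches, n_folds = 358, 5
fold_size = n_matches / n_folds

# Assumption for illustration only: per-match log-loss has a standard deviation near 0.4.
assumed_loss_sd = 0.4

# Standard error of a mean is sd / sqrt(n).
se_one_fold = assumed_loss_sd / math.sqrt(fold_size)
se_pooled = assumed_loss_sd / math.sqrt(n_matches)

gap_top_three = 1.013 - 1.001  # KNN minus logistic regression, from the table
print(round(fold_size, 1), round(se_one_fold, 3), round(se_pooled, 3), round(gap_top_three, 3))
```

Under that assumption, even the pooled estimate over all 358 matches carries an uncertainty of about 0.02, larger than the 0.012 separating the top three models.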

What’s robust and repeatable is the two ends of the table: the simple linear model is reliably at the top, and the most flexible models reliably at the bottom. Read the podium, not the photo finish.

The second is the accuracy column, which you should resist over-reading entirely. Three-way football results are intrinsically hard because the draw is a real third outcome with no strong predictor — historically about 27% of these matches drew, and draws are nearly impossible to call in advance from team strength alone.

A model that knew each team’s true win probability still couldn’t push accuracy much past the high 50s, because the irreducible-noise term is so large. Seen that way, logistic regression’s 54% isn’t mediocre; it’s near the practical ceiling for this feature set. The real differentiator between models was never how often they top-picked the winner; it was calibration, which is precisely what log-loss measures and accuracy hides. So: lead with the proper scoring rule; keep accuracy as a gut check.

Could the trees be rescued? With discipline, yes.

None of this is an indictment of XGBoost. It’s a statement about configuration relative to data size, and the same algorithm, handled differently, could close most of the gap. The lever is regularization: accepting a little more bias in exchange for less variance.

For XGBoost: shallower trees (max_depth=2–3), a stronger min_child_weight, subsample and colsample_bytree below 1, an L2 penalty (lambda), a low learning rate with early stopping on a validation fold, and fewer rounds.
For logistic regression: scikit-learn’s default L2 penalty (set through C, the inverse of the regularization strength) is already doing quiet regularization in the background, which is part of why it’s so stable straight out of the box.
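As a starting point, the XGBoost knobs listed above might be set like this. Every value is an assumption chosen for illustration, not a tuned result from the experiment; the parameter names are those of the xgboost scikit-learn wrapper, and the import is guarded so the sketch still runs without the library installed.

```python
# Assumed values for illustration; none of these numbers come from the article's experiment.
regularized_params = {
    "max_depth": 2,               # shallow trees
    "min_child_weight": 5,        # demand more evidence before creating a leaf
    "subsample": 0.7,             # row sampling per tree
    "colsample_bytree": 0.7,      # feature sampling per tree
    "reg_lambda": 5.0,            # L2 penalty on leaf weights
    "learning_rate": 0.03,        # small steps
    "n_estimators": 300,          # upper bound on rounds; early stopping can end sooner
    "early_stopping_rounds": 20,  # only takes effect when fit() receives an eval_set
}

try:
    from xgboost import XGBClassifier
    model = XGBClassifier(objective="multi:softprob", **regularized_params)
except ImportError:
    model = None  # xgboost is not installed; the dictionary above is still the point
```

Because early stopping needs a validation set passed to fit(), a model configured this way can’t be dropped straight into cross_val_predict; either remove early_stopping_rounds and tune n_estimators directly, or hold out a validation slice inside each fold.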

Tuned hard enough, a regularized gradient-boosting model would likely match logistic regression here. But notice that “match the one-liner after careful tuning” is itself the lesson, not a counterexample to it.

(The caveat in the other direction: very large, over-parameterized models can re-enter a “double descent” regime where error falls again past the interpolation threshold — but that lives at data and parameter scales far beyond 358 matches.)

So how would you know, empirically, when the trees are finally worth it? Plot a learning curve: held-out log-loss against training-set size, for each model.

Two patterns are diagnostic. A high-bias model like logistic regression plateaus early — more data barely helps, because the bias floor dominates. A high-variance model like XGBoost starts worse but keeps improving as data grows, because extra examples are exactly what tame its variance. The point where the two curves cross is the data budget at which the flexible model starts to win.
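A minimal sketch of how such curves could be computed with scikit-learn’s learning_curve. The data is synthetic (generated with make_classification, not the match dataset), and GradientBoostingClassifier stands in for XGBoost so the example needs only scikit-learn; the shapes it produces say nothing about the real experiment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data: 358 rows, 3 features, 3 classes. Not the match data.
X, y = make_classification(n_samples=358, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "gradient boosting": GradientBoostingClassifier(random_state=0),  # stand-in for XGBoost
}

curves = {}
for name, model in models.items():
    sizes, _, test_scores = learning_curve(
        model, X, y, cv=5, scoring="neg_log_loss",
        train_sizes=np.linspace(0.3, 1.0, 5), shuffle=True, random_state=0)
    # The scorer returns negated log-loss, so flip the sign and average across folds.
    curves[name] = (sizes, -test_scores.mean(axis=1))
    print(name, dict(zip(sizes.tolist(), curves[name][1].round(3).tolist())))
```

Plotting each model’s held-out log-loss against its training sizes gives the picture described above; on real data, the interesting question is whether the flexible model’s curve is still falling at the right-hand edge.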

On 358 international matches we’re sitting clearly to the left of that crossover. Feed the same XGBoost tens of thousands of club matches with richer features — xG, rest days, lineups — and it would very likely overtake. Same algorithm, different data regime, opposite conclusion. That contingency is the point.

The bottom line: Choose the model with your data

Model complexity should match the data, not the hype. On big, messy, feature-rich problems, gradient boosting and deep nets routinely dominate — that’s why they’re famous, and why the reflex to reach for them is usually a good one.

But on a small, clean, low-dimensional problem like this, the reflex is wrong, and the discipline is to start simple, establish a strong baseline, measure with a proper scoring rule, and add complexity only when held-out data says it earned its place. Logistic regression isn’t the consolation prize here. Given the data, it’s the right answer.

This discipline — start simple, validate honestly with log-loss and calibration, scale complexity deliberately — runs through the modeling chapters of Soccer Analytics with Machine Learning (O’Reilly, 2026 – fresh from the press!): logistic regression and classification in Chapter 5, the tree-based methods (XGBoost included) and exactly when their extra firepower pays off in Chapter 6.

So before you reach for the biggest model on your next project, ask two questions: how much data do you actually have, and how will you know if the complexity helped? Sometimes the line of best fit is also the finish line.
