Overfitting
In machine learning, overfitting is the condition in which a model has learned its training data too well. It has memorized the specific examples rather than extracting the underlying pattern. On the training set, the overfit model performs perfectly. On new data, it fails, often catastrophically, because what it learned was the noise of the particular dataset rather than the signal of the general problem. The diagnostic is simple: if your model performs significantly better on data it has seen than on data it hasn’t, the model is overfit. It has mistaken the map for the territory.
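The diagnostic is easy to see in miniature. Here is a sketch in Python, with everything about it (the sine-wave "territory", the sample sizes, the polynomial degrees) chosen purely for illustration: two models fit to the same twenty noisy points, one with roughly the right capacity and one with enough parameters to memorize the noise.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Illustrative toy setup: the underlying "territory" is a sine wave,
# and every observation of it carries some noise.
rng = np.random.default_rng(0)

def sample(n):
    x = np.sort(rng.uniform(0, 6, size=n))
    return x, np.sin(x) + rng.normal(scale=0.3, size=n)

x_train, y_train = sample(20)   # the data the model gets to see
x_test, y_test = sample(200)    # the data it will be judged on

for degree in (3, 15):
    p = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = np.mean((p(x_train) - y_train) ** 2)
    test_mse = np.mean((p(x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")

# Typical result: the degree-15 model fits the training points far
# better and the held-out points worse. The gap between the two
# numbers is the diagnostic: seen data versus unseen data.
```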
The interesting thing about overfitting is that it is not a failure of effort. It is a failure of the wrong kind of success. The model did exactly what it was asked to do. It minimized the loss function on the available data. The problem is that minimizing the loss function on the available data is not the same as learning the thing the data was supposed to teach, and the difference between those two objectives is invisible from inside the training set. From inside, the model looks like it has learned. From outside, the model has memorized.
There is a specific version of this that I think explains more about institutional behavior, strategic planning, and the way people navigate their own lives than any other concept I have encountered in a technical field.
The Maginot Line was the most sophisticated fortification system ever constructed. It was built between 1929 and 1936 along France’s eastern border with Germany. It was a direct product of the French experience in the First World War, where the Western Front had stabilized into a system of trenches and the war had been decided, eventually, by attrition along a fixed line. The French general staff studied the data carefully. The data said: the next war will be fought along a fixed defensive line, and the side with better fortifications will have the advantage. The Maginot Line was the optimal solution to this problem. It was also the optimal solution to the wrong problem, because the next war was not fought along a fixed line. The Germans went around it.
This is not a story about French stupidity. The French general staff contained some of the most analytically capable military minds in Europe. The problem was not intelligence. The problem was that they had a very detailed, very high-resolution model of the last war, and the last war was the only data in their training set. They overfit. The model performed perfectly on the training data (the conditions of 1914-1918) and failed on out-of-distribution data (the conditions of 1940). The failure was not that they failed to learn from experience. The failure was that they learned from experience too well, too specifically, with too little regularization.
The United States has done this in sequence, and the sequence is instructive because it shows that the problem is not one of intelligence or information but of the structure of learning from experience itself.
Korea was fought, in its early months, as though it were a reprise of the Pacific island-hopping campaign: American forces expected a conventional engagement against a smaller power that could be overwhelmed with superior firepower and logistics. The Chinese intervention in November 1950 was out-of-distribution data. The model had not been trained on a land war against an opponent willing to absorb enormous casualties in human-wave assaults across frozen terrain. MacArthur’s advance to the Yalu was the overfit prediction: it performed perfectly on the training data of World War II and failed on the test set of a war with different strategic constraints.[^1]
Vietnam was shaped, at the strategic level, by a single overfit lesson from Korea: do not push north, or China will intervene. The entire American strategy was constrained by the assumption that escalation into North Vietnam would trigger Chinese entry into the war, because that is exactly what had happened at the Yalu River in 1950. The assumption determined the rules of engagement, the geographic boundaries of the conflict, and the fundamental strategic posture of fighting a limited war in the south rather than striking the source of the insurgency in the north. The problem was that the assumption was wrong. China and Vietnam had a relationship nothing like the one between China and Korea. The Sino-Vietnamese relationship was already deteriorating, shaped by centuries of mutual suspicion and competing territorial interests that would erupt into open war in 1979, four years after Saigon fell. The model trained on Korea said: China will defend its communist neighbor. The actual data said: China had no intention of fighting another land war in Southeast Asia for Vietnam’s sake. But the data was never consulted, because the lesson from Korea was so vivid, so high-resolution, so seared into institutional memory that it functioned not as a hypothesis to be tested but as a constraint to be obeyed. The US fought the entire war inside a boundary drawn by the last war’s model, and the boundary was fiction.

The irony is that China did eventually enter a war in Vietnam. In 1979 it invaded, not to defend Vietnam but to punish it for invading Cambodia and aligning too closely with the Soviet Union, which by then was China’s rival, not its ally. The Sino-Soviet split had been widening since the late 1950s. By the time the US was constraining its entire strategy around the fear of Chinese intervention on Hanoi’s behalf, Beijing and Moscow were barely speaking, and China’s interests in Southeast Asia had more to do with containing Soviet influence than with defending communist solidarity. The overfit model predicted the right actor, the right geography, and the right event, and got the direction exactly backwards.
The Gulf War in 1991 is the exception that proves the rule, and it is worth understanding why it worked. After Vietnam, the US military spent the better part of two decades rebuilding its doctrine from first principles. AirLand Battle, the operational concept that defined the 1991 campaign, was not an extrapolation from Vietnam or Korea. It was designed for a specific scenario: a conventional war on open terrain against a Soviet-style armored force, the kind of flat, mechanized engagement where air superiority and coordinated maneuver would be decisive. The doctrine was built for the European theater and happened to map almost perfectly onto the Iraqi desert. The Gulf War succeeded not because the military had learned from its previous wars but because it had, for once, stopped trying to fight them. It looked at the actual terrain, the actual enemy, the actual force structure, and designed accordingly. It was the regularized model. It generalized because it was built from the data in front of it rather than from the memory of the last failure.
The problem is what happened next. The military extracted a lesson from the Gulf War: technology and speed win wars. This was the correct lesson for that specific war. It was also a catastrophically overfit lesson, because the next engagement in the same country, twelve years later, would involve the same technological superiority producing the same rapid conventional victory, followed by a decade of insurgency that the model had no capacity to predict, because the model had been trained on the hundred-hour ground war and not on what comes after.
Iraq in 2003 is the clearest case. The conventional war lasted three weeks. The insurgency lasted eight years. The model that produced the three-week victory was trained on 1991. The data from 1991 said: the Iraqi military collapses under technological pressure, the regime falls, and the operation concludes. The model performed perfectly on the training data. The test data included an occupied population, a disbanded military with no employment, sectarian fault lines that the regime had been suppressing, and a regional power structure that had its own objectives. None of this was in the training set. The model had memorized the Gulf War and was now being asked to generalize to a completely different problem that happened to share the same geography.
Afghanistan repeated the pattern with slight variation. The initial campaign in 2001 was trained on the Gulf War model of rapid, technology-driven victory, and the twenty-year occupation that followed was the out-of-distribution data that the model could not process.
The pattern is consistent enough to be structural rather than incidental. In each case:
- The institution has recent, vivid, high-resolution experience of a specific conflict.
- The institution extracts a model from that experience.
- The model performs well on data that resembles the training set.
- The next conflict differs from the training set in precisely the dimensions the model does not capture.
- The institution applies the model anyway, because the model is the best available summary of what it has learned, and what it has learned is what it has experienced, and what it has experienced is not what is happening now.
The failure is never a failure to learn. It is always a failure of generalization from learning. And the failure has a specific structure that is worth being precise about: the model is biased. Not biased in the colloquial sense of prejudiced, but in the statistical sense. It systematically overweights the features of the training data. The French general staff did not just happen to build a static defensive line. They were biased toward static defense because static defense was the dominant feature of their training set. The American command in Vietnam did not just happen to fear Chinese intervention. They were biased toward that fear because Chinese intervention was the most vivid, most costly feature of their Korean experience. The bias is not random error. It is a systematic distortion introduced by the structure of what the model has seen, and it pushes every prediction in the direction of the past.
This is the key distinction. The problem with experiential learning is not that it is noisy. It is that it is biased, and biased in a direction that is invisible to the learner, because the learner’s confidence in the model is proportional to the vividness of the experience that produced it. The more traumatic the training data, the stronger the bias, and the more certain the institution is that the bias is wisdom.
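The distinction between noise and bias is worth seeing concretely. A toy sketch, assuming a world that follows a simple linear rule whose slope changes after training; the specific numbers are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

# Old regime ("the last war"): y responds strongly to x.
x_old = rng.uniform(0, 10, size=200)
y_old = 2.0 * x_old + rng.normal(scale=1.0, size=200)

# Fit a slope on the old regime (least squares through the origin).
slope = (x_old @ y_old) / (x_old @ x_old)

# New regime ("the next war"): the relationship has weakened,
# but the model keeps the parameter it learned.
x_new = rng.uniform(0, 10, size=200)
y_new = 0.5 * x_new + rng.normal(scale=1.0, size=200)
errors = slope * x_new - y_new

print(f"learned slope:          {slope:.2f}")
print(f"mean error on new data: {errors.mean():+.2f}")
print(f"over-predictions:       {(errors > 0).mean():.0%}")

# Noise would give a mean error near zero and over-predictions near
# 50%. Here essentially every prediction is too high, and too high in
# the same direction: toward the old regime. That one-sidedness is
# bias in the statistical sense; averaging does not make it cancel.
```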
In machine learning, the standard remedy for this kind of bias is regularization: a penalty on model complexity that forces the model to learn simpler, more general patterns rather than memorizing the specific features of the training data. L2 regularization penalizes large parameter weights, pushing the model toward smoother solutions. Dropout randomly disables neurons during training, preventing the network from relying too heavily on any single feature. Both techniques work by deliberately degrading the model’s performance on the training data in order to correct for the bias toward it.
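For concreteness, here is the L2 version in miniature, with the same caveats as before: the polynomial features, the penalty values, and the data are all illustrative, and dropout is omitted because it needs a network rather than a toy regression.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = np.sort(rng.uniform(0, 6, size=n))
    return x, np.sin(x) + rng.normal(scale=0.3, size=n)

x_train, y_train = sample(20)
x_test, y_test = sample(200)

def features(x, degree=12):
    # Polynomial features, rescaled to [-1, 1] for numerical stability.
    return np.vander(x / 3.0 - 1.0, degree + 1, increasing=True)

X_train, X_test = features(x_train), features(x_test)

for lam in (0.0, 1e-3, 1.0):
    # Ridge solution: w = (X^T X + lam * I)^{-1} X^T y.
    # lam = 0 is plain least squares; larger lam shrinks the weights.
    k = X_train.shape[1]
    w = np.linalg.solve(X_train.T @ X_train + lam * np.eye(k),
                        X_train.T @ y_train)
    train_mse = np.mean((X_train @ w - y_train) ** 2)
    test_mse = np.mean((X_test @ w - y_test) ** 2)
    print(f"lam={lam:<6g} train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")

# Typical result: a modest penalty makes the training fit slightly
# worse and the test fit better. That is the deliberate degradation
# described above, traded for generalization.
```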
The equivalent in strategic thinking is first-principles reasoning. Not starting from zero, but correcting for the bias. Instead of asking “what did we learn last time,” you ask “what does the data in front of us actually say.” The first question produces a biased model. The second corrects for the bias by forcing the model to engage with the current distribution rather than replaying the previous one. The Gulf War worked because AirLand Battle was a bias correction: the military looked at the actual problem rather than letting Vietnam’s features dominate the prediction. The Iraq War failed because the bias correction was abandoned, and the Gulf War’s features were allowed to dominate instead. First-principles thinking is not the absence of experience. It is the discipline of not letting experience do your thinking for you, which is harder than it sounds, because experience feels like the most reliable thing you have, and the bias it introduces feels indistinguishable from knowledge.[^2]
I think this generalizes in a way that I find both obvious once stated and difficult to act on.
The person with ten years of experience in an industry has a very detailed model of that industry. The model is trained on ten years of data. If the industry is stable, the model generalizes well and the experience is genuinely valuable. If the industry is changing, the model is overfit on the last decade and the ten years of experience become ten years of increasingly specific knowledge about conditions that no longer hold. The person with no experience and a willingness to look at the current data is, in this specific situation, better positioned than the expert. Not because the expert is wrong about the past, but because the past is not the present, and the expert’s model cannot distinguish between the two.
This is an uncomfortable claim because it runs against the most basic intuition about how learning works. The intuition says: experience teaches. More experience teaches more. The person who has been through it knows something the person who hasn’t does not. And this is true, as far as it goes. The problem is that it doesn’t go far enough. Experience teaches what happened. It does not teach what will happen. The transfer from one to the other depends entirely on whether the underlying distribution has changed, and the person most confident that it hasn’t is the person with the most training data from the old distribution, which is the person with the most experience, which is exactly the person least likely to notice the shift.
The empiricist who looks at the data in front of them and asks what it says, without filtering it through a model trained on the last war, is performing a kind of regularization. They are deliberately refusing to let their prior experience dominate the current observation. This is uncomfortable and feels like a waste of hard-won knowledge. It is also, I think, almost always the better approach when the environment is changing, which, in the domains that matter most, is almost always.
I don’t want to overstate this. Priors are useful. Experience is not worthless. The point is narrower: that the confidence people place in experiential knowledge is systematically miscalibrated, because the vividness and detail of the experience makes it feel more generalizable than it is. A detailed memory of the last war feels like wisdom. A willingness to look at the current terrain with fresh eyes feels like naivete. The feeling is exactly backwards. The detailed memory is the overfit model. The fresh eyes are the regularized one. And the overfit model will always feel more authoritative, because it has more parameters, and more parameters feel like more knowledge, even when what they actually represent is more noise.
Footnotes
[^1]: The Chinese intervention at Chosin was so far outside the model’s training distribution that the initial American response was essentially: this cannot be happening. The intelligence existed. The warnings had been issued. The model could not process them because the model had no category for a Chinese army willing to march through sub-zero temperatures with inadequate supply lines. The data was available. The model rejected it as noise.
[^2]: There is a structural reason institutions resist first-principles thinking: it threatens the credibility hierarchy. The person who says “I have seen this before” has status. The person who says “forget what you’ve seen, look at what’s in front of you” is asking everyone in the room to give up the thing that makes them valuable, which is their experience. The incentive structure of most organizations rewards pattern-matching from experience and punishes first-principles analysis, which means the organizations overfit by design.