Stupid-simple ML·Part 3 of 5

Why training data is everything

A model is a mirror of what it was shown. Show it junk, get junk. There is no shortcut.

Here is an underrated truth: the model is not the smart part. The data is the smart part.

A model trained on careful, clean, well-curated examples tends to give you careful, clean, useful answers. A model trained on the unfiltered chaos of the open internet will sometimes confidently quote conspiracy theories at you. The architecture matters less than the headlines suggest. The data matters enormously.

Why this is hard

Curating training data is unglamorous, expensive, and slow. You have to read it. You have to label it. You have to throw out the bad parts. You have to argue, in committee, about what counts as "bad." Many of the headline-grabbing AI breakthroughs of the last decade have been, at least in part, breakthroughs in data curation dressed up as breakthroughs in models.

A model is a mirror of what it was shown. Polish the mirror.

What this looks like in practice

  • A medical AI trained mostly on photos of skin lesions on light-skinned patients will be worse at diagnosing lesions on dark-skinned patients. Not because the model is racist. Because the data was thin.
  • A coding assistant trained on public GitHub code will be great at JavaScript and Python, mediocre at Rust, and useless at your company's internal DSL, which it has never seen.
  • A language model trained only on text up to 2020 will not know about anything that happened in 2024. It cannot. The information was not there.
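
One way to make the first example concrete is to score a model per group instead of overall. The sketch below is a minimal illustration, not anyone's real evaluation pipeline: the record fields (`group`, `label`, `prediction`) and the numbers are hypothetical and hand-made to show the shape of the problem.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return {group: (accuracy, example_count)} for a list of evaluation records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["group"]] += 1
        if r["prediction"] == r["label"]:
            correct[r["group"]] += 1
    return {g: (correct[g] / total[g], total[g]) for g in total}

# Hypothetical, hand-made records (not real data):
# 8 examples from a well-covered group, 2 from a thinly covered one.
records = (
    [{"group": "light", "label": 1, "prediction": 1}] * 7
    + [{"group": "light", "label": 1, "prediction": 0}] * 1
    + [{"group": "dark", "label": 1, "prediction": 1}] * 1
    + [{"group": "dark", "label": 1, "prediction": 0}] * 1
)

# Overall accuracy hides the difference between groups.
overall = sum(r["prediction"] == r["label"] for r in records) / len(records)
per_group = accuracy_by_group(records)

print(f"overall: {overall:.2f}")
for group, (acc, n) in per_group.items():
    print(f"{group}: accuracy {acc:.2f} on {n} examples")
```

In this toy data the overall score looks respectable while the smaller group does much worse. A single headline number can hide exactly the gap the skin-lesion example describes.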

What you can do about it

If you ever build something with AI — even something small — spend more time on the data than feels reasonable. Look at it. Read it. Notice what's missing. The model can only be as careful as the dataset that taught it.
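
"Look at it" can start very small. The sketch below is one possible first pass, assuming a hypothetical dataset shaped as a list of `{"text", "label"}` records: it counts examples per label, empty entries, and repeated texts. The `audit` function and the sample data are invented for illustration, and a real audit would still involve reading the examples yourself.

```python
from collections import Counter

def audit(examples):
    """Summarize a list of {"text": str, "label": str} examples."""
    texts = [e["text"].strip() for e in examples]
    label_counts = Counter(e["label"] for e in examples)
    # Count repeated texts, ignoring case and skipping empty entries.
    text_counts = Counter(t.lower() for t in texts if t)
    return {
        "total": len(examples),
        "label_counts": dict(label_counts),
        "empty": sum(1 for t in texts if not t),
        "duplicates": sum(c - 1 for c in text_counts.values() if c > 1),
    }

# Hypothetical sample data, made up for this sketch.
examples = [
    {"text": "Great product", "label": "positive"},
    {"text": "great product", "label": "positive"},
    {"text": "Terrible", "label": "negative"},
    {"text": "   ", "label": "positive"},
]

report = audit(examples)
for key, value in report.items():
    print(f"{key}: {value}")
```

Even on four rows this surfaces a lopsided label balance, a blank entry, and a duplicate. Those are the kinds of gaps that are easy to miss until the model repeats them back to you.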

The bottom line

Models are downstream of data. When a model surprises you — for good or ill — the explanation is almost always in the training set. The model is doing exactly what you would expect, given what it was shown.

Next: hallucinations. Why they happen. Why the technical fix is harder than the marketing implies.