Thursday, July 24, 2025

LLMs, Hallucinations, and the Myth of Machine Truth

Reading a book called All the Knowledge in the World: The Extraordinary History of the Encyclopaedia by Simon Garfield. In it he quotes Descartes:

Even if all knowledge could be found in books, where it is mixed in with so many useless things and confusingly heaped in such large volumes, it would take longer to read those books than we have to live in this life and more effort to select the useful things than to find them oneself.

It called to mind the vast amount of data that LLMs today are trained on, and the phenomenon of hallucinations. So, what better way to consider the notion further than to 'ask' an LLM?


Steve:
Even if all knowledge is available to LLMs, it's buried in noise, contradiction, and massive volume. Hallucinations aren't glitches—they're a result of confusing, unreliable data and the fact that LLMs generate outputs based on pattern recognition, not truth.

ChatGPT:

You're hitting a critical point that cuts through the hype around LLMs: these systems are not truth machines—they’re pattern machines trained on a messy, inconsistent, and often outright wrong corpus of human knowledge.

Your quote (which echoes Descartes, Bacon, and even Montaigne in tone) captures the old problem of information glut—now weaponized at scale. The core problems you're highlighting:

1. Volume ≠ Value

Even if all the world’s knowledge is encoded in books (or in a training dataset), most of it is:

  • Redundant
  • Contradictory
  • Context-dependent
  • Flat-out wrong

LLMs inherit all of this. And they don’t "know" any of it in the sense that a human understands truth. They infer statistical relationships. So when contradictions or noise are common (say, in alternative medicine, political history, or pseudoscientific literature), the model mirrors that fog.

2. Training Data Is Confused Because We Are

If humans can't agree on what's true, the LLM can't either. It just reflects the average or dominant patterns in the data it's seen. If 60% of sources say X, and 40% say Y, it might say X, but that doesn't mean X is correct—it just means X showed up more often or more convincingly phrased.
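To make that concrete, here is a minimal toy sketch in Python (illustrative only; a real LLM learns token-level statistics, not explicit claim counts). It assumes a made-up corpus in which 60% of sources assert X and 40% assert Y, and shows that the answer is driven by frequency, not by truth:

import random
from collections import Counter

# Toy "training corpus": each entry is a claim made by one source.
# 60% of sources assert X, 40% assert Y. Nothing here encodes which is true.
corpus = ["X"] * 60 + ["Y"] * 40

def toy_answer(corpus, temperature=1.0):
    """Pick an answer in proportion to how often it appears in the corpus.

    Frequency, not correctness, drives the choice. This loosely mimics how a
    pattern-matching model favours the dominant claim in its training data.
    """
    counts = Counter(corpus)
    claims = list(counts)
    # Temperature flattens or sharpens the distribution, but never consults
    # any notion of truth.
    weights = [count ** (1.0 / temperature) for count in counts.values()]
    return random.choices(claims, weights=weights, k=1)[0]

# Ask the same question many times: X dominates simply because it is common.
answers = Counter(toy_answer(corpus) for _ in range(1000))
print(answers)   # roughly Counter({'X': ~600, 'Y': ~400})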

Add in deceptive formatting (e.g. scammy blog posts made to look like peer-reviewed science), social bias, outdated consensus, and outright propaganda, and you have a training set that's more swamp than signal.

3. "Reliable" Sources Are Not Always Reliable

Even when instructed to check against so-called reputable sources (Wikipedia, scientific journals, news outlets), there's a meta-problem:

  • What if the reputable source is wrong, or politically influenced, or later retracted?
  • How does the model recognize institutional drift—where an organization was reliable once but isn't now?

Worse: LLMs often fail to differentiate between how things are said and what is being said. A confident, well-written fabrication can score higher than a sloppy but accurate piece.


Bottom Line

LLMs are only as good as the signal-to-noise ratio of their data and the questions we ask. Hallucinations aren’t glitches; they’re symptoms of a system doing what it’s designed to do:

predict plausible next words based on an unfiltered tangle of human knowledge, ignorance, and bullshit.

Want a smarter model? Fix the upstream problem:

  • Clean data
  • Transparent sources
  • Epistemic weighting (some facts matter more than others; see the sketch after this list)
  • Domain-specific fine-tuning
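On that third point, "epistemic weighting" just means letting source quality count for more than source frequency. A minimal sketch, assuming hypothetical sources with hand-assigned reliability scores (assigning those scores honestly is the real, unsolved part):

from collections import defaultdict

# Hypothetical sources with hand-assigned reliability weights (0..1).
sources = [
    {"claim": "X", "reliability": 0.9},   # e.g. a well-replicated study
    {"claim": "X", "reliability": 0.8},
    {"claim": "Y", "reliability": 0.3},   # e.g. an anonymous blog post
    {"claim": "Y", "reliability": 0.3},
    {"claim": "Y", "reliability": 0.3},
]

def weighted_verdict(sources):
    """Aggregate claims by summed reliability instead of raw counts."""
    scores = defaultdict(float)
    for s in sources:
        scores[s["claim"]] += s["reliability"]
    return max(scores, key=scores.get), dict(scores)

verdict, scores = weighted_verdict(sources)
print(verdict, scores)   # X wins (1.7 vs 0.9) even though Y is more common

Here the minority claim wins because its sources carry more weight, which is the opposite of the raw frequency behaviour sketched earlier.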

Until then, the smartest use of an LLM is often as a sparring partner, not a guru. Use it to sharpen your thinking, not replace it.
