I recognize that I'm a hoarder. With that in mind, months ago I found a way to collect on my computer a repository of information by and about the founders of the US.
The result was tens of thousands of .json files. Messy. Way back then I converted them to markdown files using software called Pandoc. Still messy.
Skip to today. I asked Claude if we can fix it. Python is great for this, if you know Python. Claude does great with Python, a fortunate accident from the early days of training large language models.
There's a movie coming out soon, Young Washington. In anticipation of seeing the movie, I read the book it is based on. And, it's the reason for revisiting all those files downloaded so long ago. Most of the files are single diary entries.
Here's what that mess looked like before we cleaned it up:
### [Diary entry: 26 May 1773]
Date:
URL: https://founders.archives.gov/documents/Washington/01-03-02-0003-0010-0026
@{title=[Diary entry: 26 May 1773];
permalink=https://founders.archives.gov/documents/Washington/01-03-02-0003-0010-0026;
project=Washington Papers; authors=System.Object[];
recipients=System.Object[]; date-from=1773-05-26; date-to=1773-05-26; content=26.
Din'd at Elizabeth Town, & reachd New York in the Evening wch. I spent at
Hull's Tavern. Lodg'd at a Mr. Farmers. }.fullText
Like I said, messy. With Claude's help, I now have clean markdown files, entries stripped of the superfluous bullshit. It makes me wonder how historians of today can use this and so many other repositories of digitized references.
How the Build Actually Went
A technical account, in plain English.
The project started simply enough: take a pile of JSON from the Founders Online "Washington Papers" collection — thousands of little files, each one a single document with a title, a date, an author list, and a body — and collapse them into one Markdown file per year. Treat Washington as the default author but name anyone else. Tag each piece with what it was: diary entry, general orders, letter, and so on. Reasonable on its face. The kind of thing that looks like a thirty-line script until you actually open the files.
The First Surprise Was Hiding in the Bytes
Before writing a single line of code, I had Claude decode three sample files. The third one refused — a raw 0x92 where a UTF-8 reader expected nothing of the sort. That byte is a Windows-1252 curly apostrophe. The corpus wasn't UTF-8 at all; it was cp1252, and a naive json.load would either crash partway through thousands of files or quietly turn every apostrophe and em-dash into a �. That one discovery shaped the whole foundation: the loader tries UTF-8 first, falls back to cp1252, and place names like Cox's Fort and Hilly come through intact instead of mangled.
From there the first version was straightforward — a type classifier reading the titles, author-and-recipient logic so letters read "X to Y," dates parsed and made human, everything grouped by year and sorted. Tested on three samples. Produced clean files.
Then We Ran It for Real
A few hundred files, with python rather than python3 — the kind of detail that quietly matters. Two of the year files came back. And that's where the actually interesting problem showed up, one the samples could never have revealed.
Founders Online stores every diary month twice: once as a month-long container document, and again chopped into one document per day. Same words, both times. Worse, the two copies aren't equal — only the month container carries the editors' footnotes identifying who "Mr. French" was and so on. The choice wasn't cosmetic. It was: clean primary text with no scholarly apparatus, or rich annotation welded into undifferentiated month-blocks.
The Decision That Settled It
The answer came from reframing the purpose. The output is a corpus for LLMs to ingest — concise but comprehensive, and never mind whether it looks nice to a human. That meant the boilerplate byline-and-link on every entry stopped being polish and became dead weight. The duplication stopped being a style question and became a straightforward defect: feeding a model the same day twice just burns context and muddies retrieval.
The final design: default to the clean per-day entries, drop the duplicate month containers, but leave a --diary-mode switch so the annotated version is still accessible without editing any code. The deduplication logic decides container-versus-child from the document IDs themselves — a file is a "month" if another file's ID extends it — which is far steadier than guessing from titles, and leaves letters and orders completely untouched.
Who Did What
The wry version: I asked questions and ran commands, but the asks are where the real decisions lived. The encoding problem came out of my files. The duplication only surfaced because I ran it at a scale the samples couldn't reach and then actually looked at the output. The final shape of the thing was determined entirely by the framing — "this is for LLMs, the pretty stuff is unnecessary" — without which Claude would have happily kept polishing a human-readable artifact I didn't want.
Claude wrote the code. I supplied the diagnosis and the direction.
No comments:
Post a Comment