I love looking up words on Wiktionary in my free time. Looking up “bouncy castle” was particularly fruitful because I learned this fairly common phrase is an anagram of “cyclobutanes”, a comparatively rare word I’d never heard before.

I figured it must be possible to semi-automatically find more entertaining anagrams; you’d just need a word list and preferably some kind of commonality metric.
My first port of call for a word list was naturally /usr/share/dict/words, but unfortunately on my system (where it's provided by wamerican), it's fairly small - in fact, it doesn't even contain “cyclobutanes”.
Blessedly though, Google seems to have forgotten it’s still hosting a common good: the Google Books Ngram Viewer dataset - every single ngram from every book Google has OCRed and how many times they’ve seen it. Now, this is a dataset of ngrams, not words, but a 1-gram basically means the same thing as “word”, so you should just not worry about it.
## Massaging the Data
The dataset is split into about 25 files with one CSV for each letter of the alphabet, so I threw together a script to download all of the CSVs.
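Roughly, the idea is something like this (a minimal sketch, assuming the 20120701 “version 2” 1-gram file naming on Google’s public bucket - the exact URL pattern is an assumption here). It just prints one URL per letter shard, ready to pipe into something like `xargs -n1 curl -O`:

```rust
// Print one download URL per letter shard of the 1-gram dataset.
// The bucket path and the 20120701 version string are assumptions,
// not guaranteed to match the files actually used.
fn main() {
    for letter in 'a'..='z' {
        println!(
            "http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-{}.gz",
            letter
        );
    }
}
```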
Additionally, the data isn’t in quite the right format yet. The CSVs actually record how many times a word was seen in a given year, and how many pages and volumes it was seen on. This makes them way bigger than they need to be for my purposes, which makes computations way slower, so I had to fix that too. At first I wrote another Python script to do this, but my naïve algorithm was frankly way too slow, so I scrapped it and used Rust instead. I also sorted the words by frequency so that I could find anagrams with the highest frequency ratio by selecting a word from the top of the list and then iterating up from the bottom.
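The collapsing-and-sorting step looks roughly like this (a simplified sketch, assuming tab-separated rows of word, year, match count, … read from stdin; the exact columns vary between dataset versions, and the real program has to cope with much larger files):

```rust
use std::collections::HashMap;
use std::io::{self, BufRead, Write};

fn main() -> io::Result<()> {
    // Collapse per-year rows into one total count per word.
    let mut totals: HashMap<String, u64> = HashMap::new();
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line?;
        let mut fields = line.split('\t');
        let word = match fields.next() {
            Some(w) => w.to_string(),
            None => continue,
        };
        // Skip the year field and take the match count.
        let count: u64 = fields.nth(1).and_then(|c| c.parse().ok()).unwrap_or(0);
        *totals.entry(word).or_insert(0) += count;
    }

    // Most frequent words first, so pairs with a big frequency gap can be
    // found by walking in from both ends of the list.
    let mut sorted: Vec<(String, u64)> = totals.into_iter().collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1));

    let stdout = io::stdout();
    let mut out = stdout.lock();
    for (word, count) in sorted {
        writeln!(out, "{},{}", word, count)?;
    }
    Ok(())
}
```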
Something I discovered very quickly was that my code shouldn’t consider a word an anagram of itself (is a word an anagram of itself?), and that Google’s dataset isn’t perfect. Because the data comes from OCRed books, and OCR isn’t infallible, most of the words with very low frequencies were actually misspellings or misreadings of other words, so I skipped those too. I also threw out words under 10 letters long because those are boring.
My method of determining whether two words were anagrams was naïve (this is a theme) - sort the letters of both words and check if the results are identical.
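A minimal sketch of that check, plus the rule that a word doesn’t count as an anagram of itself:

```rust
// Sort the letters of both words and check whether the results are identical.
fn is_anagram(a: &str, b: &str) -> bool {
    let sorted = |w: &str| {
        let mut letters: Vec<char> = w.chars().collect();
        letters.sort_unstable();
        letters
    };
    // A word doesn't count as an anagram of itself.
    a != b && sorted(a) == sorted(b)
}

fn main() {
    // Space dropped here purely for the demo.
    assert!(is_anagram("bouncycastle", "cyclobutanes"));
    assert!(!is_anagram("regulation", "regulation"));
    println!("anagram checks pass");
}
```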
And that’s pretty much it! My program started spitting out multiple anagrams every second.
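Put together, the search order looks roughly like this (a sketch with illustrative names, not the actual source): pick a common word near the top of the frequency-sorted list, then scan upwards from the rare end so the first hit is the partner with the biggest frequency gap.

```rust
// Sorted-letters check from the previous sketch.
fn is_anagram(a: &str, b: &str) -> bool {
    let key = |w: &str| {
        let mut letters: Vec<char> = w.chars().collect();
        letters.sort_unstable();
        letters
    };
    a != b && key(a) == key(b)
}

// Walk the frequency-sorted list (most common first): for each sufficiently
// long common word, scan upwards from the rare end and keep the first match.
fn find_pairs(words: &[(String, u64)], min_len: usize) -> Vec<(String, String)> {
    let mut pairs = Vec::new();
    for (i, (common, _)) in words.iter().enumerate() {
        if common.chars().count() < min_len {
            continue;
        }
        for (rare, _) in words[i + 1..].iter().rev() {
            if is_anagram(common, rare) {
                pairs.push((common.clone(), rare.clone()));
                break; // keep only the rarest partner for this word
            }
        }
    }
    pairs
}

fn main() {
    // Tiny toy list standing in for the real frequency-sorted data.
    let words = vec![
        ("regulation".to_string(), 1_000_000u64),
        ("urogenital".to_string(), 5_000),
    ];
    for (common, rare) in find_pairs(&words, 10) {
        println!("{} | {}", common, rare);
    }
}
```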

It still produces “anagrams” between a word and a misspelling, but there are few enough to filter out manually - which I have to do anyway, because computers are as yet incapable of objectively measuring humour.
## Results
First and foremost, this gem:
| Eins | Zwei |
| --- | --- |
| legislator | allegorists |
## Common/uncommon Pairs
| Uno | Dos |
| --- | --- |
| regulation | urogenital |
| oscillation | colonialist |
| generations | nitrogenase |
| enlargement | greenmantle |
| incorporate | procreation |
| petrochemicals | cephalometric |
| disharmony | hydramnios |
| paternalistic | antiparticles |
| intoxicate | excitation |
| relationships | rhinoplasties |
| residential | estrildinae |
| nationalised | desalination |
| Gasherbrum | hamburgers |
| parenthesis | interphases |
| complicates | ectoplasmic |
| absolution | isobutanol |
| insatiable | banalities |
## Antonym Pairs
| Un | Deux |
| --- | --- |
| compressed | decompress |
| marginally | alarmingly |
| desecration | considerate |
## Misc.
| 一 | 二 |
| --- | --- |
| assimilation | islamisation |
| cathedrals | hard castle |
| tapestries | striptease |
| inconsiderate | containerised |
| streamlined | derailments |
| supersonic | percussion |
| chattering | ratcheting |
| mountaineer | enumeration |
| algorithms | logarithms |
| recruitment | current time |
| interpreter | reinterpret |
| reductions | discounter |
The source code is available here. I also included the sorted word/frequency data as a CSV.