I love looking up words on Wiktionary in my free time. Looking up “bouncy castle” was particularly fruitful because I learned this fairly common phrase is an anagram of “cyclobutanes”, a comparatively rare word I’d never heard before.

I figured it must be possible to semi-automatically find more entertaining anagrams; you’d just need a word list and preferably some kind of commonality metric.
My first port of call for a word list was naturally /usr/share/dict/words, but unfortunately on my system (where it's provided by wamerican), it's fairly small - in fact, it doesn't even contain “cyclobutanes”.
Blessedly though, Google seems to have forgotten it’s still hosting a common good: the Google Books Ngram Viewer dataset - every single ngram from every book Google has OCRed and how many times they’ve seen it. Now, this is a dataset of ngrams, not words, but a 1-gram basically means the same thing as “word”, so you should just not worry about it.
## Massaging the Data
The dataset is split into about 25 files with one CSV for each letter of the alphabet, so I threw together a script to download all of the CSVs.
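Roughly, the idea is something like this (a minimal sketch, assuming the 20120701 “version 2” 1-gram file naming on Google’s public bucket - the exact URL pattern is an assumption here). It just prints one URL per letter shard, ready to pipe into something like `xargs -n1 curl -O`:

```rust
// Print one download URL per letter shard of the 1-gram dataset.
// The bucket path and the 20120701 version string are assumptions,
// not guaranteed to match the files actually used.
fn main() {
    for letter in 'a'..='z' {
        println!(
            "http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-{}.gz",
            letter
        );
    }
}
```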
Additionally, the data isn’t in quite the right format yet. The CSVs actually record how many times a word was seen in a given year, and how many pages and volumes it was seen on. This makes them way bigger than they need to be for my purposes, which makes computations way slower, so I had to fix that too. At first I wrote another Python script to do this, but my naïve algorithm was frankly way too slow, so I scrapped it and used Rust instead. I also sorted the words by frequency so that I could find anagrams with the highest frequency ratio by selecting a word from the top of the list and then iterating up from the bottom.
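The collapsing-and-sorting step looks roughly like this (a simplified sketch, assuming tab-separated rows of word, year, match count, … read from stdin; the exact columns vary between dataset versions, and the real program has to cope with much larger files):

```rust
use std::collections::HashMap;
use std::io::{self, BufRead, Write};

fn main() -> io::Result<()> {
    // Collapse per-year rows into one total count per word.
    let mut totals: HashMap<String, u64> = HashMap::new();
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line?;
        let mut fields = line.split('\t');
        let word = match fields.next() {
            Some(w) => w.to_string(),
            None => continue,
        };
        // Skip the year field and take the match count.
        let count: u64 = fields.nth(1).and_then(|c| c.parse().ok()).unwrap_or(0);
        *totals.entry(word).or_insert(0) += count;
    }

    // Most frequent words first, so pairs with a big frequency gap can be
    // found by walking in from both ends of the list.
    let mut sorted: Vec<(String, u64)> = totals.into_iter().collect();
    sorted.sort_by(|a, b| b.1.cmp(&a.1));

    let stdout = io::stdout();
    let mut out = stdout.lock();
    for (word, count) in sorted {
        writeln!(out, "{},{}", word, count)?;
    }
    Ok(())
}
```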
Something I discovered very quickly was that my code shouldn’t consider a word an anagram of itself (is a word an anagram of itself?), and that Google’s dataset isn’t perfect. Because the data comes from OCRed books, and OCR isn’t infallible, most of the words with very low frequencies were actually misspellings or misreadings of other words, so I skipped those too. I also threw out words under 10 letters long because those are boring.
My method of determining whether two words were anagrams was naïve (this is a theme) - sort the letters of both words and check if the results are identical.
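A minimal sketch of that check, plus the rule that a word doesn’t count as an anagram of itself:

```rust
// Sort the letters of both words and check whether the results are identical.
fn is_anagram(a: &str, b: &str) -> bool {
    let sorted = |w: &str| {
        let mut letters: Vec<char> = w.chars().collect();
        letters.sort_unstable();
        letters
    };
    // A word doesn't count as an anagram of itself.
    a != b && sorted(a) == sorted(b)
}

fn main() {
    // Space dropped here purely for the demo.
    assert!(is_anagram("bouncycastle", "cyclobutanes"));
    assert!(!is_anagram("regulation", "regulation"));
    println!("anagram checks pass");
}
```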
And that’s pretty much it! My program started spitting out multiple anagrams every second.
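Put together, the search order looks roughly like this (a sketch with illustrative names, not the actual source): pick a common word near the top of the frequency-sorted list, then scan upwards from the rare end so the first hit is the partner with the biggest frequency gap.

```rust
// Sorted-letters check from the previous sketch.
fn is_anagram(a: &str, b: &str) -> bool {
    let key = |w: &str| {
        let mut letters: Vec<char> = w.chars().collect();
        letters.sort_unstable();
        letters
    };
    a != b && key(a) == key(b)
}

// Walk the frequency-sorted list (most common first): for each sufficiently
// long common word, scan upwards from the rare end and keep the first match.
fn find_pairs(words: &[(String, u64)], min_len: usize) -> Vec<(String, String)> {
    let mut pairs = Vec::new();
    for (i, (common, _)) in words.iter().enumerate() {
        if common.chars().count() < min_len {
            continue;
        }
        for (rare, _) in words[i + 1..].iter().rev() {
            if is_anagram(common, rare) {
                pairs.push((common.clone(), rare.clone()));
                break; // keep only the rarest partner for this word
            }
        }
    }
    pairs
}

fn main() {
    // Tiny toy list standing in for the real frequency-sorted data.
    let words = vec![
        ("regulation".to_string(), 1_000_000u64),
        ("urogenital".to_string(), 5_000),
    ];
    for (common, rare) in find_pairs(&words, 10) {
        println!("{} | {}", common, rare);
    }
}
```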

It still produces “anagrams” between a word and a misspelling, but there are few enough to filter out manually - which I have to do anyway, because computers are as yet incapable of objectively measuring humour.
## Results
First and foremost, this gem:
| Eins | Zwei |
| --- | --- |
| legislator | allegorists |
## Common/uncommon Pairs
| Uno | Dos |
| --- | --- |
| regulation | urogenital |
| oscillation | colonialist |
| generations | nitrogenase |
| enlargement | greenmantle |
| incorporate | procreation |
| petrochemicals | cephalometric |
| disharmony | hydramnios |
| paternalistic | antiparticles |
| intoxicate | excitation |
| relationships | rhinoplasties |
| residential | estrildinae |
| nationalised | desalination |
| Gasherbrum | hamburgers |
| parenthesis | interphases |
| complicates | ectoplasmic |
| absolution | isobutanol |
| insatiable | banalities |
## Antonym Pairs
| Un | Deux |
| --- | --- |
| compressed | decompress |
| marginally | alarmingly |
| desecration | considerate |
## Misc.
| 一 | 二 |
| --- | --- |
| assimilation | islamisation |
| cathedrals | hard castle |
| tapestries | striptease |
| inconsiderate | containerised |
| streamlined | derailments |
| supersonic | percussion |
| chattering | ratcheting |
| mountaineer | enumeration |
| algorithms | logarithms |
| recruitment | current time |
| interpreter | reinterpret |
| reductions | discounter |
The source code is available here. I also included the sorted word/frequency data as a CSV.