Four letter wordistics

Ah know many of ya have a penchant for four letter words judging by the comments ya leave, so this post will be of interest.

There are (26)^4 = 456,976 possible four letter words, although we English speakers have only got round to using a tiny fraction of ’em.

It’s fairly easy to tell which combinations are real, legal words and which are plain nonsense but only if ya are a fluent speaker who is steeped in the arcane rules of the language (i before e except after c etc). If ya don’t know them rules, forget it.

Now Bill Bialek and a buddy at Princeton University in New Jersey reckon there is another way to spot legal four letter words, using a particular kind of statistical analysis. He’s taken the entire corpus of four letter words used in the novels of Jane Austen (and a colorful collection it ain’t) and studied the statistical relationship between the letters. He’s looked not just at consecutive letters but at all pairwise correlations within the words.

He’s then worked out the information content of these letter combinations, their entropy, and created a kinda map of this information landscape, in which the likelier a letter combination is, the lower its energy. Within this landscape there are local energy minima: words in which any single letter change will increase the energy. Turns out that almost two thirds of legal words lie at these local minima.
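
For them that like to tinker, here’s a rough Python sketch of the local minimum idea. It ain’t Bialek’s actual model: as a crude stand-in, the energy of a word is just minus the log of smoothed pair frequencies, summed over all six pairs of letter positions. The names fit_pair_counts, energy, neighbours and is_local_minimum are made up for illustration, and the two word corpus is a toy, not Jane Austen.

```python
import math
import string
from collections import Counter
from itertools import combinations

ALPHABET = string.ascii_lowercase
PAIRS = list(combinations(range(4), 2))  # all 6 pairs of letter positions


def fit_pair_counts(words):
    """Count how often each letter pair occurs at each pair of positions."""
    counts = {pair: Counter() for pair in PAIRS}
    for w in words:
        for i, j in PAIRS:
            counts[(i, j)][(w[i], w[j])] += 1
    return counts, len(words)


def energy(word, counts, n_words):
    """Assumed toy energy: -sum of log smoothed pair frequencies."""
    total = 0.0
    for i, j in PAIRS:
        c = counts[(i, j)][(word[i], word[j])]
        p = (c + 1) / (n_words + 26 * 26)  # add-one smoothing
        total -= math.log(p)
    return total


def neighbours(word):
    """Every word that differs from `word` in exactly one letter."""
    for pos in range(4):
        for letter in ALPHABET:
            if letter != word[pos]:
                yield word[:pos] + letter + word[pos + 1:]


def is_local_minimum(word, counts, n_words):
    """True if every single letter change strictly raises the energy."""
    e = energy(word, counts, n_words)
    return all(energy(n, counts, n_words) > e for n in neighbours(word))


# Toy corpus (an assumption, purely for demonstration).
counts, n = fit_pair_counts(["love"] * 5 + ["word"] * 3)
print(is_local_minimum("love", counts, n))  # True
print(is_local_minimum("lord", counts, n))  # False: "word" sits lower
```

With a real corpus ya would swap the toy list for the actual four letter words and their frequencies. The pair counting stays the same.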

So a pretty good way of determining whether a four letter word is legal or not is to see whether it sits at a local minimum.

These “stable” words, says Bialek, have the property that any single letter spelling error can be corrected by relaxing to the nearest local energy minimum. A handy kinda spell checker.
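
Here’s a minimal sketch of that relaxing trick in Python, for what it’s worth. The relax function is hypothetical: it just keeps making whichever single letter change lowers the energy most, and stops when none does. The toy_energy function is a loud assumption too. It is a stand-in that counts mismatched letters against a made-up two word vocabulary, not the statistical energy from the paper.

```python
import string


def relax(word, energy):
    """Steepest descent over single letter changes.

    Stops at a word where no single letter change lowers the energy,
    which is a local minimum of `energy`.
    """
    current = word
    while True:
        best, best_e = current, energy(current)
        for pos in range(len(current)):
            for letter in string.ascii_lowercase:
                cand = current[:pos] + letter + current[pos + 1:]
                e = energy(cand)
                if e < best_e:
                    best, best_e = cand, e
        if best == current:
            return current
        current = best


def toy_energy(word, vocab=("love", "word")):
    """Assumed stand-in energy: fewest mismatched letters to any vocab word."""
    return min(sum(a != b for a, b in zip(word, v)) for v in vocab)


print(relax("lovd", toy_energy))  # love
print(relax("ward", toy_energy))  # word
```

Any energy function that takes a word and returns a number can be dropped in for toy_energy, since relax never looks inside it.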

But Bialek’s most interesting speculation is that if it were possible to construct the energy landscape of legal sentences, they would all lie at local energy minima.

So grammar checkers in future could work without any knowledge of the rules of English but simply by relaxing to the local minima.

Ref: arxiv.org/abs/0801.0253: Toward a Statistical Mechanics of Four Letter Words