i before e except after…w?

The mnemonic ‘i before e except after c’ is something we’ve probably all encountered at one point or another, and it can be a useful trick for figuring out awkward spellings. However, an episode of QI I watched recently claimed the rule has more exceptions than adherents: words containing ‘cie’ actually outnumber those containing ‘cei’, rendering the latter half of the rule useless. This got me interested in two things: 1) just how useless are we talking? and 2) is it possible to come up with any modifications to the rule which aren’t useless?

To do this, first I gathered a list of English words and loaded it up in R. The source is a txt file of over 350,000 words. With this in hand, it’s simple to use grep to extract all words containing an ‘ei’/‘ie’ pair:

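# Download the word list (one lowercase word per line) and keep a local copy
words <- RCurl::getURL('https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt')
cat(words, file = "dict.txt")
# Read it back as a character vector, one element per word
dict <- readLines("dict.txt")
# value = TRUE returns the matching words themselves rather than their indices
ie_words <- grep("ie", dict, value = TRUE)
ei_words <- grep("ei", dict, value = TRUE)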

Note that each word can only feature once in each list. As such, while ‘weightiest’ appears in both the ie_words and ei_words lists, the word ‘zeitgeist’ only appears once in ei_words despite containing ‘ei’ twice. This will lead to some undercounting, but I expect it to be minimal.
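To make the undercounting concrete, here is a minimal sketch that compares the once-per-word count from grep with a count of every occurrence. The four-word `sample_words` vector is a hypothetical stand-in for the full word list, and `ei_occurrences` is a name introduced purely for illustration.

```r
# Hypothetical sample standing in for the full word list
sample_words <- c("weightiest", "zeitgeist", "believe", "receive")

# grep() reports each matching word once, however many times the pair occurs
ei_words <- grep("ei", sample_words, value = TRUE)

# gregexpr() locates every occurrence, so 'zeitgeist' contributes two;
# it returns -1 for a word with no match, hence the m > 0 filter
ei_hits <- gregexpr("ei", sample_words, fixed = TRUE)
ei_occurrences <- sum(sapply(ei_hits, function(m) sum(m > 0)))

length(ei_words)   # number of words containing 'ei' at least once
ei_occurrences     # total number of 'ei' occurrences across the sample
```

The gap between the two numbers is the undercount; on the real list it could be measured the same way by swapping `sample_words` for `dict`.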

i before e…

So far, the rule is serving its purpose: if you’re struggling to order an ‘ei’/‘ie’ pair in a word, the odds are roughly three to one that the ‘i’ goes first.

…except after c

So far, not so interesting. The QI episode I mentioned only raised an issue with the ‘except after c’ part of the rule. I checked this in the same manner as before, comparing the number of words containing ‘cei’ with those containing ‘cie’.

Oh. Well, that doesn’t really look any different at all. So much so that I had to check that R didn’t just spit out the same plot both times. It didn’t. It turns out if an ‘ei’/‘ie’ pair follows a ‘c’, it’s slightly less likely that the ‘i’ goes first than in the general case, but the difference is so marginal that it makes this addendum to the rule completely useless. You still have roughly three to one odds that the ‘i’ goes first.
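The comparison can be sketched as a small helper that works for both the general case and the ‘after c’ case. The function name `ie_share` and the sample words are assumptions made for illustration; the sample is hand-picked and says nothing about the real proportions.

```r
# Hypothetical, hand-picked sample; the real analysis used the full word list
sample_words <- c("believe", "field", "friend", "weird",
                  "science", "species", "fancier", "receive")

# Share of '<prefix>ie' words among all '<prefix>ie' and '<prefix>ei' words.
# A word containing both orderings is counted once in each list.
ie_share <- function(words, prefix = "") {
  n_ie <- length(grep(paste0(prefix, "ie"), words))
  n_ei <- length(grep(paste0(prefix, "ei"), words))
  n_ie / (n_ie + n_ei)
}

overall_share <- ie_share(sample_words)        # any 'ie' vs any 'ei'
after_c_share <- ie_share(sample_words, "c")   # 'cie' vs 'cei'
```

Running `ie_share(dict)` and `ie_share(dict, "c")` on the full list would give the two proportions being compared here.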

except after…?

With that aspect of the rule rubbished, is there any letter where the rule tends not to hold? Exactly as before, I counted the words in which an ‘ei’/‘ie’ pair follows each letter of the alphabet (‘^’ denotes words beginning with either ‘ei’ or ‘ie’).

In almost all cases, if you’re faced with uncertainty, the odds will be in favour of putting the ‘i’ before the ‘e’. There are, however, a few letters which seem to favour the ‘ei’ ordering. In some cases (‘i’ and ‘a’) these exceptions don’t represent very many words; however, there are over 100 words with ‘^ei’ or ‘eei’ (the latter mostly a double ‘e’ followed by ‘-ing’ or ‘-ism’), and just shy of 200 words with ‘wei’.
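A per-letter tally like this can be sketched by looping over each preceding letter plus the ‘^’ anchor. The sample words and the names `counts` and `ei_favoured` are hypothetical and chosen only to keep the example small.

```r
# Hypothetical sample standing in for the full word list
sample_words <- c("weight", "weird", "view", "either", "seeing",
                  "believe", "receive", "science", "species")

# '^' anchors the pair to the start of the word; letters is built into R
prefixes <- c("^", letters)

# One row per prefix, with the number of words containing
# '<prefix>ie' and '<prefix>ei' respectively
counts <- t(sapply(prefixes, function(p) {
  c(ie = length(grep(paste0(p, "ie"), sample_words)),
    ei = length(grep(paste0(p, "ei"), sample_words)))
}))

# Prefixes where 'ei' words outnumber 'ie' words in this sample
ei_favoured <- rownames(counts)[counts[, "ei"] > counts[, "ie"]]
```

Because the prefix is pasted straight into the pattern, ‘^’ is treated as a regular-expression anchor rather than a literal character, which is what makes the start-of-word case work.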

So, perhaps the rule might be better phrased as “i before e, except after w or e, or at the beginning of the word”. Somewhat less catchy though. The long form of the original rule also states that you favour the ‘ei’ order when it’s pronounced like ‘A’. As far as I’m aware there is no regular expression for pronunciations (yet), so I’ll have to settle for interrogating the short form of the rule. It should be noted, however, that the ‘wei’ words feature a lot of variations on the word ‘weight’, meaning they still adhere to the long form of the original rule.
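Pronunciation may be out of reach, but the rephrased exception itself does fit in a single regular expression. This is only a sketch: the pattern name `revised_exception` and the sample words are made up for illustration.

```r
# Hypothetical sample words
sample_words <- c("weight", "either", "seeing", "believe", "receive", "view")

# 'ei' at the start of a word, or directly after 'w' or 'e'
revised_exception <- "(^|[we])ei"

# TRUE where a word contains an 'ei' covered by the rephrased exception
covered <- grepl(revised_exception, sample_words)
names(covered) <- sample_words
covered
```

Here ‘receive’ comes back FALSE, since its ‘ei’ follows a ‘c’, which the rephrased rule no longer treats as an exception.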

[x] before [y]

Using the same approach as before, can we elicit any other general rules for spelling? To do this I counted the words containing each pair of vowels and considered which vowel typically comes first. The results are outlined in the heatmap below. As before, each word can only be included once in each list; so the word ‘queueing’ will be counted for each of ‘ue’, ‘eu’, and ‘ei’, but will only count once towards the occurrences of ‘ue’, despite that pair featuring twice in the word. Again, undercounting, but I never said I wasn’t lazy.

The heatmap below should be read as follows. When the letter pair specified by a row and column appears next to each other in a word, red means the letter reading down the column most frequently comes first, blue means the letter reading across the row most frequently comes first, and yellow means either ordering is equally likely.
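The numbers behind a heatmap of this kind can be sketched as a matrix of ordered-pair counts, turned into the share of words favouring one ordering. The names `pair_counts` and `row_first_share` and the sample words are assumptions for illustration, and the orientation here (row letter first) is a choice made for the sketch rather than a claim about how the original plot was built.

```r
# Hypothetical sample standing in for the full word list
sample_words <- c("beard", "beam", "algae", "groan", "koala", "chaos", "queueing")

vowels <- c("a", "e", "i", "o", "u")

# pair_counts[x, y] = number of words containing x immediately followed by y
# (each word counts at most once per pair)
pair_counts <- sapply(vowels, function(second) {
  sapply(vowels, function(first) {
    length(grep(paste0(first, second), sample_words, fixed = TRUE))
  })
})

# Share of words putting the row letter first, out of both orderings of the
# pair; NaN where neither ordering occurs in the sample
row_first_share <- pair_counts / (pair_counts + t(pair_counts))
```

A value near 1 means the row letter usually comes first, a value near 0 means the column letter does, and 0.5 means the orderings are balanced. A matrix like this could then be handed to a base plotting function such as `image()` to colour the cells.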

Some observations from this:

  • e before a: beard, beam, beautiful vs archaeology, algae
  • e before o: dungeon, deodorant vs foregoes, coerce
  • e and u are fairly balanced, which is a shame because I think this pair may cause more confusion than some others, e.g. feudal vs fuel
  • i before a: alleviate, megalomania vs brain, assail
  • o before a: groan, koala vs aorta, chaos
  • i before o (this is influenced by -ion words): multiplication, position vs steroid, groin
  • o before u: sound, mountain vs buoy, quote

What can we get from all this? It seems that if you have a pair of neighbouring vowels in a word and one of them is an ‘a’, you should usually opt to put the ‘a’ second. And forget ‘i before e’, as ‘i’ tends to come before all vowels except ‘u’. So, perhaps instead of ‘i before e’ we should be teaching ‘i before all vowels, except sometimes u’. As before, I don’t think it’ll catch on.

Below is a heatmap for all letter pairings. Of course, consonant pairs by their nature should have a more discernible ordering than vowels. Here white means there were fewer than ten occurrences of this combination of letters (regardless of ordering), e.g. there were fewer than ten words featuring either the combination ‘qb’ or ‘bq’.
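The ‘white’ cut-off can be sketched as a helper that returns NA whenever a pair is too rare to say anything about. The function `ordering_share`, its `min_total` argument and the sample words are hypothetical; the default of ten mirrors the threshold described above, and the demo lowers it only because the sample is tiny.

```r
# Hypothetical sample standing in for the full word list
sample_words <- c("sound", "mountain", "buoy", "quote", "queue")

# Share of words putting `first` before `second`, or NA when the two
# orderings together appear in fewer than `min_total` words.
# A word containing both orderings counts towards the total twice.
ordering_share <- function(words, first, second, min_total = 10) {
  n_fs <- length(grep(paste0(first, second), words, fixed = TRUE))
  n_sf <- length(grep(paste0(second, first), words, fixed = TRUE))
  if (n_fs + n_sf < min_total) return(NA_real_)
  n_fs / (n_fs + n_sf)
}

share_ou <- ordering_share(sample_words, "o", "u", min_total = 3)
share_qb <- ordering_share(sample_words, "q", "b", min_total = 3)
```

Most plotting functions leave NA cells blank, which is one way the white squares could come about.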

Make of that what you will. I just thought it looked nice.