Phase 2 is now complete, so this is a good time to review the history of the project and look ahead.
In Phase 1, which I started around 1997, I downloaded Roger Mitton's dictionary list derived from the Oxford Advanced Learner's Dictionary of 1974, at that time the only such list accessible to scholars without payment. I wrote a piece of software which would search the pronunciation field and mark up all homophone pairs. It would then substitute a dummy character for both symbols of the minimal pair and search the list again for any additional homophone pairs that the process had created. This created a set of fairly complete lists, and these were built up gradually, all 190 vowel lists and 276 consonant lists being made and uploaded to the web site by 2002. In that year I delivered my paper "Don't ask the admiral to show you his pinnace" at Birmingham University in the lecture series to mark Tim Johns's retirement, which I look back on as my best presentation.
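The dummy-character search described above can be sketched roughly as follows. This is a modern illustration in Python, not the original software. The function name, the one-character-per-sound transcriptions and the `*` placeholder are all assumptions made for the example. It also differs from the original in two ways: it masks one position at a time, so that words differing in two places are not paired, and it returns every pair in a set of homophones rather than only the first.

```python
from collections import defaultdict

DUMMY = "*"  # placeholder symbol; assumed never to occur in a transcription


def find_minimal_pairs(lexicon, sound_a, sound_b):
    """Return sorted (word_with_a, word_with_b) pairs that differ only in
    having sound_a where the other word has sound_b.

    lexicon is a list of (spelling, pronunciation) tuples in which each
    character of the pronunciation stands for one sound (an assumption).
    """
    # Masked pronunciation -> (words that had sound_a, words that had sound_b)
    groups = defaultdict(lambda: (set(), set()))
    for word, pron in lexicon:
        for i, symbol in enumerate(pron):
            if symbol == sound_a or symbol == sound_b:
                # Replace this one occurrence with the dummy character.
                masked = pron[:i] + DUMMY + pron[i + 1:]
                side = 0 if symbol == sound_a else 1
                groups[masked][side].add(word)

    # Words that collide after masking have become "homophones",
    # which means they were minimal pairs before masking.
    pairs = set()
    for with_a, with_b in groups.values():
        for word_a in with_a:
            for word_b in with_b:
                pairs.add((word_a, word_b))
    return sorted(pairs)


# Toy data with invented transcriptions ("I" as in ship, "i" as in sheep).
toy_lexicon = [
    ("ship", "SIp"), ("sheep", "Sip"),
    ("bit", "bIt"), ("beat", "bit"), ("beet", "bit"),
    ("bid", "bId"),
]
print(find_minimal_pairs(toy_lexicon, "I", "i"))
```

With the toy data, bid finds no partner because no word in the list has the other vowel in the same frame.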
For several reasons the lists still needed some editing, and this constituted Phase 2. Since the computer algorithm would only identify the first minimal pair of any set, not the further pairs that would arise from homophones of either of the terms, extra pairs were now added manually where needed. Often, because the computer used ASCII ordering instead of alphabetical ordering, it would take a proper name to make a pair ahead of a common noun; for instance Leigh would be picked out ahead of lea. Often the process identified a plural pair, eg cheetahs/heaters, but not the corresponding singular, cheetah/heater. This was because the entry for heater and all similar words includes a symbol for potential linking -r which made the pronunciation look different to the computer. In addition there were some transcription errors in the dictionary, inevitable when one person has had to carry out the work unaided.
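Two of the problems above lend themselves to a short illustration in Python. The sketch below assumes that the linking-r symbol is a final `R` in the pronunciation field and uses invented toy transcriptions. Both helper functions are hypothetical and were not part of the original program.

```python
LINKING_R = "R"  # assumed marker for a potential linking r at the end of a word


def strip_linking_r(pron):
    """Remove a final linking-r marker so that heater can match cheetah."""
    return pron[:-1] if pron.endswith(LINKING_R) else pron


def dictionary_order(word):
    """Sort key: alphabetical regardless of case, lower-case forms first on ties."""
    return (word.casefold(), word != word.lower())


words = ["lea", "Leigh"]

# Plain ASCII ordering puts every capital letter before every small letter,
# so the proper name is met first.
print(sorted(words))

# Alphabetical ordering lets the common noun come first.
print(sorted(words, key=dictionary_order))

# Toy transcription: the marked and unmarked endings now compare as equal.
print(strip_linking_r("hit@R") == "hit@")
```

Applying a step like `strip_linking_r` before the comparison would let singular pairs such as cheetah/heater surface alongside their plurals, at the cost of discarding information that matters in connected speech.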
As each list was edited, I added extra information that might be relevant. Since many pronunciation errors arise directly or indirectly from the vagaries of spelling, I noted which spellings of the target sounds occurred in the lists. I compared the articulation of the contrasted sounds. I also commented on which nationalities might have a problem with the contrast.
Where pairs arose in which one of the words was a taboo word, either a common swearword or a word for a bodily function, I flagged this up. The reason for this is to alert teachers and learners to occasions where they might accidentally make listeners laugh without knowing why. I also commented on common expressions or song lyrics which exploited the contrasting sounds, and made a list of any 'interesting' pairs, ie those in which the two words, normally two syllables or longer, belonged to different parts of speech or which conjured up an amusing image.
Within the lists I separated into groups those words which were simply inflections of the same headword. This allowed me to measure the range of the semantic contrasts that any given pair carried, how many separate meanings might be compared. From this I calculated a figure which I have called the semantic loading. I am not sure yet how this might affect the measurement of confusability, but the numbers are recorded and available for further work.
In my Birmingham lecture I commented on two problem areas: contrasting a sound with a null (bank/back or feel/eel), and contrasting a consonant with a vowel (screen/serene). While realising that these had doubtful status as minimal pairs, I nevertheless tried to build up lists. The lists of sound/null contrasts are well in hand though tedious to assemble. The lists of vowel/consonant pairs will take a good deal longer.
I have also put tracking code on all the lists which tells me from which countries my visitors come. This will enable me, as part of Phase 3 of the project, to incorporate information about which contrasts seem most important to which speakers.
But most importantly I now have data about the distribution of minimal pairs which will allow me to investigate what I have called the O'Connor Conjecture, the idea that the language will reduce the number of contrasting pairs where the difference is small and the distinction is difficult but allow it to rise where the difference is large and the distinction is easy. In other words, is language self-repairing? It is almost an article of faith among linguists that language does not need any regulation from an 'Academy'. My data might provide some evidence to support that position, though my first look at the figures suggests that this is unlikely.
The whole of Phase 2 has occupied me for about seven years. This was not, of course, seven years of full-time work. Had I been working on it full time and been paid for it, the whole job could have been done within a year or so. I certainly could not recommend anyone to follow in my footsteps if they want to get rich.
There is plenty more to do, of course. Editorial decisions were made 'on the hoof' during Phase 2 and now there are many inconsistencies to be ironed out. For instance the dictionary includes many contracted forms such as he'll and they're. Early lists admitted these, but latterly I excluded them on the grounds that you could not then logically exclude other nouns including proper nouns with contractions: the key'll, Kay's, etc. I have also wrestled with the actual content of the dictionary. It contained some archaic terms, and I have probably not been fully consistent in accepting or rejecting them. I have also started to inject a few modern terms which students are likely to encounter, bling, blog, and so on. In most cases these modifications have been personal and arbitrary, the result of being a one-man research team.
But at least I hope I have created an ordered collection of data that teachers can exploit in their own way and that others can enjoy. It has been hard work at times, but it has also been fun, and that is the spirit in which I offer it.
John Higgins, Shaftesbury, December 2010.