A guide to quantities

(originally issued as a handout for students on the MSc in CALL and ESL at Stirling)

John Higgins, Stirling, January 1999.

When you work with language it is important to gain a sense of how much language there is in typical chunks. If you are working with computers and language it is also important to understand the relationship between the way data is stored on different media and the size of texts to be stored. This paper is intended to supply a few facts and to get you thinking about language and quantities.

Computer storage

A bit is one choice or decision, Yes or No, On or Off, 1 or 0. It can be stored in many ways, as a setting of a switch, a symbol on paper, a charge in an impurity in silicon, a minute magnetised or unmagnetised area on tape or disk, a hole or lack of a hole in punched card or paper tape, or the choice of a white or coloured record card in a card index. Two bits give you four choices:

OFF OFF
OFF ON
ON OFF
ON ON

Three bits allow eight choices:

OFF OFF OFF
OFF OFF ON
OFF ON OFF
OFF ON ON
ON OFF OFF
ON OFF ON
ON ON OFF
ON ON ON

and so on.
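The doubling pattern can be checked mechanically. The short Python sketch below is only an illustration (the function name `choices` is invented for it): it lists every ON/OFF combination for a given number of bits and shows that each extra bit doubles the number of choices.

```python
from itertools import product

def choices(n_bits):
    """Return every ON/OFF combination for n_bits switches, in counting order."""
    return [" ".join(combo) for combo in product(("OFF", "ON"), repeat=n_bits)]

if __name__ == "__main__":
    # The eight combinations available with three bits.
    for line in choices(3):
        print(line)
    # Each extra bit doubles the number of choices: 2, 4, 8, ... 256.
    for n in (1, 2, 3, 8):
        print(n, "bits:", len(choices(n)), "choices")
```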

Eight bits allow 256 choices, and this is called a byte. The 256 choices allow generously for the symbols we use in writing the Roman alphabet to be coded, so one byte is effectively the equivalent of one character. The actual coding is most often based on a code called ASCII, standing for American Standard Code for Information Interchange and pronounced /ˈæskɪ/. This code matches the numbers from 32 to 126 to characters, including space, the ten digits, various punctuation signs, and the upper and lower case alphabets. (The numbers from 0 to 31, together with 127, are reserved for other purposes, chiefly control codes.)
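As a rough illustration, Python's built-in `chr` and `ord` functions expose this number-to-character matching directly. The sketch below (the name `ascii_table` is invented for the example) assumes the printable range ends at 126, since 127 is normally the non-printing "delete" code.

```python
def ascii_table(first=32, last=126):
    """Map each code number in the range to its ASCII character."""
    return {code: chr(code) for code in range(first, last + 1)}

if __name__ == "__main__":
    table = ascii_table()
    print(repr(table[32]))         # the space character
    print(table[65], table[97])    # upper case A and lower case a
    print(ord("0"), ord("9"))      # the ten digits occupy codes 48 to 57
    # A plain English word occupies one byte per character.
    print(len("language".encode("ascii")))
```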

The numbers from 128 to 255, previously used for checking, are handled in a code called “extended ASCII”, not quite as widespread in use as standard ASCII; these numbers represent accented characters, currency symbols, maths symbols, and box-drawing characters. A word-processed file created on one computer system using extended ASCII and then read and edited on another using a different code (such as the ANSI system used by Windows) may well look slightly different; for the British, the commonest manifestation of this is the way poorly written software sometimes transforms the pound sign into something else.
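The pound sign problem can be reproduced in a few lines. The sketch below is an illustration only, and it makes an assumption the text does not state: that "extended ASCII" means the original IBM PC code page (called `cp437` in Python) and that "ANSI" means the Western European Windows code page (`cp1252`). The helper name `misread` is invented for the example.

```python
import unicodedata

POUND = "\u00a3"  # the pound sign

def misread(text, written_as, read_as):
    """Encode text in one single-byte code and decode the same bytes in another."""
    return text.encode(written_as).decode(read_as)

if __name__ == "__main__":
    # The same character is stored as a different number in each code.
    print(POUND.encode("cp437"))    # one byte, value 156
    print(POUND.encode("cp1252"))   # one byte, value 163
    # Reading the bytes with the wrong code turns the pound sign into another character.
    print(unicodedata.name(misread(POUND, "cp437", "cp1252")))
    print(unicodedata.name(misread(POUND, "cp1252", "cp437")))
```

The character names are printed rather than the characters themselves so that the output does not depend on how the screen in use happens to be coded.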

There now exists a much more elaborate coding system called Unicode which allows the coding of a huge range of non-Roman writing systems and special characters, though characters then generally need two or more bytes of storage each, depending on the encoding used.
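How many bytes a Unicode character needs depends on which encoding is chosen. As a hedged illustration, the sketch below compares two common encodings, UTF-8 and UTF-16; the choice of these two and the function name `byte_lengths` are assumptions made for the example, not something the handout specifies.

```python
import unicodedata

def byte_lengths(char):
    """Return the storage needed for one character in two common Unicode encodings."""
    return {
        "utf-8": len(char.encode("utf-8")),
        # The "-le" variant is used so that the byte-order mark is not counted.
        "utf-16": len(char.encode("utf-16-le")),
    }

if __name__ == "__main__":
    # a, the pound sign, e with an acute accent, and a Chinese character
    for char in ("a", "\u00a3", "\u00e9", "\u4e2d"):
        print(unicodedata.name(char), byte_lengths(char))
```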

For convenience we count bytes in thousands, or millions, or thousands of millions, namely kilobytes (Kb), megabytes (Mb), and gigabytes (Gb). Remember, though, that these “thousands” are the so-called binary thousands, 2 raised to the power of 10, namely 1024. A kilobyte is, therefore, slightly more than a decimal thousand bytes. 64 K is not 64,000 but 65,536 in decimal.
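The arithmetic of binary thousands is easy to verify. The following is a minimal sketch; the constant and function names are invented for the illustration.

```python
KILOBYTE = 2 ** 10   # 1,024 bytes
MEGABYTE = 2 ** 20   # 1,048,576 bytes
GIGABYTE = 2 ** 30   # 1,073,741,824 bytes

def kilobytes_to_bytes(kilobytes):
    """Convert binary kilobytes to an exact number of bytes."""
    return kilobytes * KILOBYTE

if __name__ == "__main__":
    print(kilobytes_to_bytes(64))   # 65536, not 64000
    print(KILOBYTE - 1000)          # 24 bytes more than a decimal thousand
    print(MEGABYTE, GIGABYTE)
```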

Files

We store our computer data in files. The size of these files varies greatly. The smallest permitted file size in DOS is 2 kilobytes (2048 bytes) even though the information within the file may only amount to a couple of bytes. (Some hard disk systems may have a much larger minimum.) There is no upper limit on the size of a file other than the physical limits of the medium.
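The gap between what a file contains and the space it occupies comes from rounding up to whole storage units. The sketch below assumes the 2-kilobyte unit quoted above and, following the idea of a "smallest permitted size", treats even an empty file as occupying one unit; both are assumptions for illustration, and real systems differ. The function name `size_on_disk` is invented.

```python
import math

def size_on_disk(content_bytes, allocation_unit=2048):
    """Round a file's contents up to a whole number of allocation units."""
    # Assumption: every file, however small, occupies at least one unit.
    units = max(1, math.ceil(content_bytes / allocation_unit))
    return units * allocation_unit

if __name__ == "__main__":
    for content in (2, 2048, 2049, 10000):
        print(content, "bytes of content occupy", size_on_disk(content), "bytes")
```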

Floppy disks have a capacity of about 1.4 megabytes. In the past this was adequate for most file storing and program installation, since individual files were hardly ever bigger than this. Nowadays, with more elaborate software and large graphic and sound files, this size is too small and is being supplemented by a range of other storage media such as Zip drives (98 Mb or larger), CD-ROM (680 Mb) and DVD (4.5 Gb).

But what do these measures correspond to?

Text

A word is a very variable measure, ranging from one character up to about fifteen characters (disregarding freakishly long words), but the average length of words in written text is fairly stable. In simple text for children it averages 4.2 letters, and even in serious academic writing it is normally no more than 4.9 letters. Add one space per word and an element for punctuation, and you arrive at roughly six characters per word. Indeed tests of typing speed define a word as six keyboard actions.
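Both measures, letters per word and total characters per word, can be taken from any sample of text. The sketch below is a rough illustration only: it treats anything separated by spaces as a word, and the function names and the sample sentence are invented for the example.

```python
def letters_per_word(text):
    """Average number of alphabetic letters per space-separated word."""
    words = text.split()
    letters = sum(1 for ch in text if ch.isalpha())
    return letters / len(words)

def characters_per_word(text):
    """Average storage per word, counting letters, spaces and punctuation."""
    words = text.split()
    return len(text) / len(words)

if __name__ == "__main__":
    sample = "The cat sat on the mat."   # invented sample, far simpler than real text
    print(round(letters_per_word(sample), 2))
    print(round(characters_per_word(sample), 2))
```

Run on a longer passage of ordinary prose, the second figure gives a direct check on the working estimate of six characters per word.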

Sentences are rather more variable than words. Simple narrative, as in children's fiction, is around 10 to 15 words per sentence, while the sentences of academic writing are usually in the range of 20 to 50, with a good deal of individual stylistic variation. Fiction and newspaper text usually have sentences at the shorter end of this range, especially where there is a good deal of direct speech in inverted commas. The average length of sentences (counting headlines as sentences) used in The Guardian newspaper in 1991, for instance, was 21 words, but with a standard deviation of 19 words (showing that some very long sentences occurred).
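A mean and standard deviation of this kind can be computed with Python's standard `statistics` module. The sketch below is a crude illustration, not the method used for the Guardian figures: it assumes every full stop, exclamation mark or question mark ends a sentence, so abbreviations would be miscounted. The function names and the sample passage are invented.

```python
import re
import statistics

def sentence_lengths(text):
    """Split text crudely at . ! ? and return the word count of each sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def summarise(text):
    """Return the mean sentence length and its standard deviation, in words."""
    lengths = sentence_lengths(text)
    return statistics.mean(lengths), statistics.pstdev(lengths)

if __name__ == "__main__":
    sample = ("It rained. We stayed indoors all day and read. "
              "Nobody minded, since the books were good and the fire was warm.")
    print(sentence_lengths(sample))
    mean, deviation = summarise(sample)
    print(round(mean, 1), round(deviation, 1))
```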

Pages vary in size and in font size, but typically contain 25 to 35 lines, each with between 9 and 14 words. A page from a book other than a children's book is typically around 350 words, and a left/right page spread, what the eye has available to scan, is therefore around 700 words.

Books also vary enormously. A short work of fiction or humour might be about 100 pages, say 35,000 words. Alice in Wonderland, for instance, contains just under 27,000 words. A typical novel would be closer to 60,000, with a blockbuster measuring 120,000 or more. Mark Twain's Tom Sawyer, for instance, has 72,000 words, while Dickens's A Tale of Two Cities has 137,000 words. The editorial content (i.e. excluding advertisements) of one day's issue of a modern broadsheet newspaper such as The Daily Telegraph would be roughly 30,000 words.

One floppy diskette, therefore, could hold the textual content of two blockbuster novels, or about ten books of the length of Alice in Wonderland. Notice this is only the text. As soon as one includes other kinds of information, such as illustrations or exact page layout, the size of the file increases. Taking a plain text of 208 Kb and saving it as an identical Microsoft Word document immediately increases its size to 240 Kb. After editing, the file size increases rapidly (since Microsoft Word stores information about all the changes you have made in case you want to unmake them).
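The floppy disk comparison is a matter of simple division. The sketch below assumes the working figures given earlier, namely six bytes per word and a capacity of 1.4 binary megabytes; the names used are invented for the illustration.

```python
BYTES_PER_WORD = 6                      # letters plus a space and some punctuation
FLOPPY_BYTES = 14 * 1024 * 1024 // 10   # about 1.4 binary megabytes

def words_that_fit(capacity_bytes, bytes_per_word=BYTES_PER_WORD):
    """How many words of plain text fit into the given capacity."""
    return capacity_bytes // bytes_per_word

def copies_that_fit(capacity_bytes, words_in_book, bytes_per_word=BYTES_PER_WORD):
    """How many whole books of the given length fit into the given capacity."""
    return capacity_bytes // (words_in_book * bytes_per_word)

if __name__ == "__main__":
    print(words_that_fit(FLOPPY_BYTES))             # words of plain text per floppy
    print(copies_that_fit(FLOPPY_BYTES, 120_000))   # blockbuster novels
    print(copies_that_fit(FLOPPY_BYTES, 27_000))    # books the length of Alice
```

On these assumptions the sketch gives two blockbusters and nine books of the length of Alice in Wonderland, in line with the rough figures above.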

How much written text is there around us? An individual university lecturer's office, with books and files on 30 feet of shelving, and a filing cabinet with handouts, might contain about 90 million words, or 550 Mb. A small branch of a county library would probably hold about eight gigabytes. A university library with half a million books corresponds to roughly 200 Gb. If we were to write down everything we said or heard in conversation or heard on radio or television in the course of one day it would amount to between 20,000 and 60,000 words, corresponding to between 120 Kb and 350 Kb of storage. A typical lifetime's listening and speaking is somewhere between 2 and 5 gigabytes of language. A lifetime's reading for a literate person who has followed some courses of study, reads a daily paper (but not every word) and some fiction at bedtime, is probably in the same range.
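Estimates like these all rest on the same conversion from words to bytes. As a minimal sketch, assuming the working figure of six bytes per word and binary kilobytes (the function name `words_to_kilobytes` is invented), the daily conversation figures can be checked as follows.

```python
BYTES_PER_WORD = 6   # working estimate: letters plus a space and some punctuation
KILOBYTE = 1024      # binary kilobyte

def words_to_kilobytes(word_count, bytes_per_word=BYTES_PER_WORD):
    """Estimate the plain-text storage, in kilobytes, for a number of words."""
    return word_count * bytes_per_word / KILOBYTE

if __name__ == "__main__":
    # A day's speaking and listening, at the low and high estimates.
    for words in (20_000, 60_000):
        print(words, "words is about", round(words_to_kilobytes(words)), "Kb")
```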

One statistic to be wary of, by the way, is the estimate of vocabulary size, the number of different words a writer uses. When somebody tells you how many words Shakespeare or Jane Austen used in all their collected works, it is worth asking how they define a word. Is sing-sings-singing-sang-sung one word or five? Is round one word or five (a round hole, a round of golf, to round the Horn, a scarf round my neck, come round after dinner)? A count in which the different forms, such as sing-sings-singing-sang-sung, have been counted as only one word is said to be lemmatised (lemma being the technical term for the head-word of a dictionary entry). A count in which the distinct senses of a word like round have been reckoned separately is said to be semanticised. Always check how each count has been done if there is any reason to suppose that two figures are not comparing like with like.
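The difference between a raw count and a lemmatised count can be shown with a toy example. The sketch below is purely illustrative: the tiny lemma table is hand-made for this one verb, whereas a real lemmatiser needs a full dictionary, and the function names are invented.

```python
# Hand-made table mapping inflected forms to their head-word (illustration only).
LEMMAS = {"sings": "sing", "singing": "sing", "sang": "sing", "sung": "sing"}

def count_word_forms(words):
    """Count distinct word forms, ignoring differences of capitalisation."""
    return len({w.lower() for w in words})

def count_lemmas(words, lemmas=LEMMAS):
    """Count distinct head-words, folding known inflected forms together."""
    return len({lemmas.get(w.lower(), w.lower()) for w in words})

if __name__ == "__main__":
    text = "sing sings singing sang sung".split()
    print(count_word_forms(text))   # 5 distinct forms
    print(count_lemmas(text))       # 1 lemma
```

A semanticised count cannot be sketched so simply, since separating the senses of a word like round requires knowledge of the context in which each instance occurs.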

Rates: how fast do we process words?

Speaking: the standard for serious monologue, such as a Radio 3 talk, is 120 words per minute. Gabbled conversation is about 250 wpm. Natural speech will contain a mixture of rates as there may be many inserted pauses, and speakers may speed up and slow down as they try to take or yield conversational turns.

Listening: obviously there can be no independent rate for listening; one listens at the same speed as the speaker speaks. However, the speed of speech is likely to have an effect on comprehension, and there are rates which affect even the native speaker's concentration and ability to follow. I do not know of any research that has tested the speed variable independently while measuring EFL comprehension skills.

Reading: rates vary enormously, and since reading is not a "real-time" activity (one does not have to read at the same time as the writing takes place or read every word of a text or even read the words in the same order), it is hard to measure reading accurately. There are devices (called tachistoscopes) which try to do this by running a window over a text at a variable speed. The general opinion is that reading speeds which are much below an average speaking speed (say 180 wpm) are inefficient and suggest that the reader is tied to some kind of mouthing procedure, reading the words silently. Good readers will use a range of speeds, from say 180 wpm for careful study, through 300 wpm for entertainment (reading a novel or newspaper feature), to 600 wpm or more for skimming a newspaper. Speeds much faster than these suggest that one is only scanning, i.e. missing out most of the text and only reading enough to make up one's mind if one has found something one is looking for.
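Rates like these translate directly into reading times. The sketch below is a simple illustration which assumes a steady rate throughout a text, something real readers do not maintain; the function name `reading_minutes` is invented, and the 60,000-word novel is the typical figure suggested earlier.

```python
def reading_minutes(word_count, words_per_minute):
    """Time in minutes to read a text at a steady rate."""
    return word_count / words_per_minute

if __name__ == "__main__":
    novel = 60_000   # a typical novel, in words
    for label, rate in (("careful study", 180), ("entertainment", 300), ("skimming", 600)):
        hours = reading_minutes(novel, rate) / 60
        print(label, "at", rate, "wpm:", round(hours, 1), "hours")
```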

Writing: one needs to distinguish between copying (physically writing without generating the message) and writing as communication. A working minimum for a trained copy typist would be around 50 words per minute, with the more accomplished reaching speeds of over 100 wpm. In the days before computer typesetting, highly trained (and highly paid) linotype operators working for national newspapers used to reach speeds of over 160 wpm with almost complete accuracy, and were beautiful to watch, showing the kind of graceful movement and control that one associates with a concert pianist. Most people do very little copying; when they write they create the message at the same time. The ideal they strive to reach is to write as they think, and the speed of thinking is presumably close to that of the inner voice one hears in one's head, probably around 150 wpm. In practice they achieve about 30 wpm or less, which leads to a kind of permanent frustration when they “can't get the ideas down fast enough”. This frustration is also very strong when one is taking notes from a lecture, and leads to the development of various short-cut procedures in order to record meanings in fewer words than the speaker used.

Sizes in relation to text activities

When we select texts for certain kinds of learning activity, we need to be careful not to put texts into a Procrustean bed, making them unnatural by cutting them or extending them. The constraints applied by teaching time or the need to work with a group of learners together make us favour texts of around 200 words for low-level classes or 500 words for more advanced learners. These are the texts that we use to put in context the new language of the day and to form the basis for any exercise work we plan to do. We may want learners to spend some class time reading them silently, but we will not want this to take much more than five minutes since otherwise there will be too wide a discrepancy between the finishing times of the best and worst students. We want these texts to be self-contained and interesting. We seldom stop to ask, "What kind of 200-word text in real life is self-contained and interesting?"

The same constraints apply to the texts one puts into computer programs. Taking the ones we have covered on this course, it is worth asking what kinds of text fit each template. The ECLIPSE and PINPOINT screens accept around 140 words. What kinds of text are that length? Possible answers are anecdotes, short encyclopaedia entries, recipes, and weather forecasts. The DOUBLE-UP screen is limited to around 40 words. This in turn means that the only authentic complete texts that can be put in are proverbs, short jokes, and the kind of instructions or warnings that might appear on notices or packaging. SEQUITUR, on the other hand, is not limited by a specific screen size and typically needs texts of 150 to 800 words. Again it is worth asking what kind of text can best be used.

John Higgins, Shaftesbury, March 2008