Statistics

This page introduces some basic concepts for the quantitative analysis of corpora, and also offers an introductory tutorial into R, a programming language for statistical analysis and graphics.

1 Basics and examples

Written by Tabea Harris

1.1 Types/tokens

In corpus linguistics we differentiate types from tokens. A type is a unique word form, whereas a token is a single occurrence of a word form; the number of tokens is therefore the total number of words in a corpus or a sample. For instance, in COCA the word soliloquy occurs 283 times in total, i.e. for one type there are 283 tokens. Whether the word form sings counts as a type of its own, or is subsumed under the lemma sing, depends on the research question. Consider the example sentence below. How many types and tokens can you identify?

(1) The red lantern and the white lantern illuminated the room and gave it a reddish tinge.

When working with historical corpora, be aware of spelling variations. The English language was not fully standardized until the Early Modern English period. So if you encounter the noun colour in Middle English, you may notice a number of different forms such as coleour, cullour, or colowr, among many others. These would not count as different types, but are variations of one type.
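
One way to treat such spellings as a single type is to map every attested variant onto one normalized form before counting. The following Python sketch illustrates the idea under stated assumptions: the variant table is a small, hand-made example built from the forms mentioned above, and the token list is invented.

```python
from collections import Counter

# Hypothetical, hand-made mapping from attested spellings to one normalized
# form; a real project would need a much larger, carefully checked table.
VARIANTS = {
    "coleour": "colour",
    "cullour": "colour",
    "colowr": "colour",
}

def normalize(token):
    """Lower-case a token and replace known spelling variants."""
    token = token.lower()
    return VARIANTS.get(token, token)

# Invented token list for illustration only.
tokens = ["colour", "coleour", "cullour", "colowr", "red"]
type_counts = Counter(normalize(t) for t in tokens)
print(type_counts)  # Counter({'colour': 4, 'red': 1})
```

After normalization the five tokens belong to two types instead of five.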

One basic method of corpus statistics is the calculation of the type/token ratio, i.e. the relation between the number of types and the number of tokens. The type/token ratio gives some indication of how varied the vocabulary of a corpus is. It is calculated by dividing the number of types by the number of tokens in a corpus. When multiplied by 100, the type/token ratio can be expressed as a percentage. Let us take the example sentence above to illustrate. It contains 12 different types (counting The and the as one type) and 16 tokens in total. Dividing 12 by 16 yields a type/token ratio of 0.75, or 75%. The higher the ratio, the more varied the vocabulary, because a high ratio signals a large number of types in relation to the overall number of tokens. However, the basic ratio becomes unreliable for large corpora: the longer a text, the more often words with a very high frequency (for instance, the article the) recur, which pushes the ratio down, so that corpora of different sizes cannot be compared directly. To circumvent this problem, a standardized type/token ratio is used: the corpus is divided into equally sized parts, and a type/token ratio is calculated for each of them. The standardized type/token ratio is then the average of these individual ratios.
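
The two calculations described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the tokenizer is a deliberately simple assumption (lower-casing and splitting on anything that is not a letter or apostrophe), and the function names are made up for this example.

```python
import re

def tokenize(text):
    """Split text into lower-cased word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

def standardized_ttr(tokens, chunk_size):
    """Mean type/token ratio over consecutive chunks of chunk_size tokens.

    A trailing chunk shorter than chunk_size is ignored so that all
    chunks are equally sized.
    """
    ratios = [
        type_token_ratio(tokens[i:i + chunk_size])
        for i in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)

sentence = ("The red lantern and the white lantern illuminated the room "
            "and gave it a reddish tinge.")
tokens = tokenize(sentence)
print(len(tokens), len(set(tokens)))   # 16 12
print(type_token_ratio(tokens))        # 0.75
print(standardized_ttr(tokens, 8))     # 0.875
```

The chunk size of 8 is chosen only so that the short example sentence yields two chunks; for a real corpus the chunks would be far larger.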

In this context, it is useful to have some information about words (or word forms) that occur only once in a corpus. These are called hapax legomena (sg. hapax legomenon), a term of Greek origin that means ‘said only once’. Hapaxes can give some indication of new words (i.e. neologisms), but the two should not be confused: not every hapax is a neologism. Hapaxes can therefore be used to gauge the productivity of a word-formation element. A linguistic item is productive when many new words can be formed with it. For instance, the derivational suffix -ness is considered quite productive because many new words are formed with it, whereas the suffix -th can no longer be considered productive: aside from existing derivations such as width, depth, breadth, and a few others, no new words are formed with -th. The productivity of an affix can be measured by dividing the number of hapaxes with that affix by the number of all tokens with that affix. For the sake of illustration, let us consider an example from Plag, Dalton-Puffer, & Baayen (1999). They distinguished between two variants of the suffix -ful:

(2) the so-called ‘measure -ful’ (e.g. handful) and
(3) a ‘property -ful’ suffix (e.g. careful).

For (2), the overall number of tokens is 2,615 and the number of hapaxes is 60. For (3), the number of tokens is 77,316 and the number of hapaxes is 22. Using the calculation from above, we obtain a productivity P of 60 / 2,615 ≈ 0.023 for (2), and 22 / 77,316 ≈ 0.00028 for (3). Which of the two is more productive given these numbers? Again, corpus size is crucial in determining ‘real’ hapax legomena. If a corpus is too small, the results may be skewed because well-known and otherwise frequently used words may occur only once.
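
The productivity calculation can be reproduced with a short Python sketch. The function names are made up for this illustration, the token and hapax counts are the ones quoted above, and the small word list used to demonstrate hapax extraction is invented.

```python
from collections import Counter

def hapax_legomena(tokens):
    """Return the word forms that occur exactly once in the token list."""
    counts = Counter(tokens)
    return [word for word, n in counts.items() if n == 1]

def productivity(n_hapaxes, n_tokens):
    """P = hapaxes with an affix divided by all tokens with that affix."""
    return n_hapaxes / n_tokens

# Invented word list, only to show how hapaxes are picked out.
print(hapax_legomena(["handful", "careful", "careful", "spoonful"]))
# ['handful', 'spoonful']

# Counts quoted above for the two variants of -ful.
p_measure = productivity(60, 2615)     # 'measure -ful', (2)
p_property = productivity(22, 77316)   # 'property -ful', (3)
print(round(p_measure, 3))    # 0.023
print(round(p_property, 5))   # 0.00028
```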

1.2 Collocations

The term collocation, introduced by Firth (1957), describes “actual words in habitual company” (1957, p. 14). In other words, collocations are words that characteristically occur together with particular other words with a high frequency. For instance, the adjective blue and the noun sky occur in sequence very frequently; they are collocates. A similar co-occurrence of the adjective turquoise and sky would be much less common and would sound less natural. That adjective may be a collocate of another word, however (the noun water, for example). Collocates do not have to be directly adjacent to each other, but may occur in close vicinity. When searching a corpus, you can usually set the span of words within which a collocate of a given word may occur. Collocations can become more fixed through repeated use (i.e. they become lexicalised) and usually have a strong semantic connection. A special type of collocation is the so-called colligation, which is situated at the grammatical level. A colligation involves the relationship between particular word classes; for instance, a noun may tend to colligate with an adjective rather than with another part of speech.
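
The idea of a search span can be made concrete with a small Python sketch that counts every word occurring within a given number of tokens to the left or right of a node word. The function name and the mini-corpus are invented for this illustration, and raw counts like these are only a starting point: corpus tools usually rank collocates with an association measure rather than by frequency alone.

```python
from collections import Counter

def collocates(tokens, node, span=2):
    """Count words occurring within `span` tokens left or right of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = max(0, i - span)
        # Words before the node plus words after it, excluding the node itself.
        window = tokens[left:i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts

# Invented mini-corpus for illustration only.
text = "the blue sky was clear and the sky stayed blue over the turquoise water"
tokens = text.split()
print(collocates(tokens, "sky", span=1)["blue"])  # 1
print(collocates(tokens, "sky", span=2)["blue"])  # 2
```

Widening the span from 1 to 2 picks up the second, non-adjacent occurrence of blue.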

1.3 Co-occurrences

Co-occurrences are very similar to collocations in principle. The term describes linguistic units that occur together within a sentence or an entire text more frequently than would be expected by chance. For example, you will probably find words like coffee and espresso bean co-occurring frequently in a text about coffee shops, but much less so in an article about language acquisition. Terms are assumed to be interdependent if they co-occur in a text with an above-chance frequency. Corpus interfaces, however, very often offer only the option of searching for collocations.
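
What "above chance" means can be illustrated by comparing the observed number of sentences that contain both words with the number expected if the two words were distributed independently of each other. The Python sketch below is one simple way to do this, not the only one; the function name and the four sentences are invented, and deciding whether a difference is statistically significant would require a test that is not shown here.

```python
def cooccurrence(sentences, word_a, word_b):
    """Compare observed and expected sentence-level co-occurrence.

    Returns (observed, expected), where expected is the number of sentences
    that would contain both words if they were distributed independently.
    """
    n = len(sentences)
    with_a = sum(word_a in s for s in sentences)
    with_b = sum(word_b in s for s in sentences)
    observed = sum(word_a in s and word_b in s for s in sentences)
    expected = with_a * with_b / n
    return observed, expected

# Invented sentences, each represented as a set of word forms.
sentences = [
    {"the", "coffee", "and", "espresso", "were", "fresh"},
    {"she", "ordered", "coffee", "with", "espresso"},
    {"children", "acquire", "language", "early"},
    {"language", "acquisition", "is", "gradual"},
]
observed, expected = cooccurrence(sentences, "coffee", "espresso")
print(observed, expected)  # 2 1.0
```

Here the two words share two sentences although only one shared sentence would be expected under independence.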

p value	Level of significance	Abbreviation
p ≥ 0.05	Not significant	n.s.
0.05 > p ≥ 0.01	Significant	*
0.01 > p ≥ 0.001	Highly significant	**
p < 0.001	Extremely significant	***
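
The thresholds in the table above can be written as a small lookup function. This Python sketch is only an illustration; the function name is made up, and the two p values passed to it are example inputs.

```python
def significance(p):
    """Map a p value to the (level, abbreviation) pair from the table above."""
    if not 0 <= p <= 1:
        raise ValueError("a p value must lie between 0 and 1")
    if p >= 0.05:
        return "Not significant", "n.s."
    if p >= 0.01:
        return "Significant", "*"
    if p >= 0.001:
        return "Highly significant", "**"
    return "Extremely significant", "***"

print(significance(0.00151))  # ('Highly significant', '**')
print(significance(0.01797))  # ('Significant', '*')
```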

	Categorical variable	Numerical variable
Categorical variable	χ² test, Fisher test	-
Numerical variable	t-test, Wilcoxon test	Correlation test

Token n°	word_order	speaker_age	rank
1	SVO	28	62.0
2	SVO	28	62.0
3	SVO	28	62.0
4	SVO	23	70.0
5	SVO	23	70.0
6	SVO	61	8.5
7	VSO	61	8.5
8	VSO	55	15.0
10	VSO	34	39.5
11	VSO	34	39.5
15	VSO	42	31.5
17	VSO	34	39.5

Effect size	Cohen’s d	Reference
Very small	0.1	Sawilowsky (2009)
Small	0.2	Cohen (1988)
Medium	0.5	Cohen (1988)
Large	0.8	Cohen (1988)
Very large	1.2	Sawilowsky (2009)
Huge	2.0	Sawilowsky (2009)
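
As a rough illustration of how the table above might be used, the following Python sketch computes Cohen's d for two groups with the usual pooled-standard-deviation formula and then looks up a label. Two things are assumptions of this example rather than part of the table: the values in the table are read as lower bounds (a d of 0.3 counts as 'small'), and the two groups of measurements are invented.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Difference between two group means in units of the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) +
                  (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Rule-of-thumb values from the table above, largest first,
# read here as lower bounds (an assumption of this sketch).
LABELS = [(2.0, "Huge"), (1.2, "Very large"), (0.8, "Large"),
          (0.5, "Medium"), (0.2, "Small"), (0.1, "Very small")]

def effect_size_label(d):
    """Return the label of the largest threshold that |d| reaches."""
    for threshold, label in LABELS:
        if abs(d) >= threshold:
            return label
    return "Below 'very small'"

# Invented measurements for illustration only.
group_a = [2, 4, 6, 8]
group_b = [1, 3, 5, 7]
d = cohens_d(group_a, group_b)
print(round(d, 3), effect_size_label(d))  # 0.387 Small
```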

Term	Estimate	Std. Error	t value	p value
(Intercept)	218.643	6.793	32.186	< 2e-16 ***
FrequencyClass: low	80.251	9.607	8.353	2.96e-16 ***
WordClass: content	30.619	9.619	3.183	0.00151 **
FrequencyClass: low * WordClass: content	-32.238	13.595	-2.371	0.01797 *
