Glossary

This glossary has been created together with the students of the HWS 2018 seminar Corpus Linguistics.

This glossary contains relevant terms for corpus linguistic research at our department.

Absolute frequency

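Also called raw frequency. The absolute frequency of an item is the raw number of its occurrences. Absolute frequencies can only be compared directly across texts or corpora of equal size; for samples of different sizes, → relative frequencies are needed.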

Annotation

The process of adding linguistic information to the forms in a → corpus. Annotations are frequently found on the morphosyntactic level (→ POS-tagging) or on the syntactic level (→ parsing). Less common are annotations on the semantic or pragmatic level. The three primary methods of annotating a corpus are 1) automatic, 2) semi-automatic, and 3) manual annotation.

Balance

The range of text categories within a → corpus. A corpus is considered balanced if it spans a wide range of text categories which are → representative of the language or language variety under consideration.

Collocation

A collocation is a combination of words that systematically occur next to each other with above-chance frequency; each word in the combination is a collocate of the other. An example is strong coffee. A given combination of collocates tends to be highly lexicalized.
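
"Above-chance frequency" can be made concrete by comparing how often a word pair is observed with how often it would be expected if the two words were distributed independently. The following is a minimal sketch of that comparison; the toy text and the function name observed_expected are invented for illustration, and real studies typically use a full corpus and an established association measure.

```python
from collections import Counter

# Toy text invented for illustration; a real study would use a full corpus.
text = (
    "she drinks strong coffee every morning and he drinks strong tea "
    "but strong coffee keeps her awake so she avoids coffee at night"
)
tokens = text.split()

word_counts = Counter(tokens)
# Adjacent word pairs (bigrams) and their absolute frequencies.
bigram_counts = Counter(zip(tokens, tokens[1:]))


def observed_expected(first, second):
    """Return (observed, expected) counts for the bigram 'first second'."""
    n_bigrams = len(tokens) - 1
    observed = bigram_counts[(first, second)]
    # Expected count if the two words occurred independently of each other.
    expected = word_counts[first] * word_counts[second] / n_bigrams
    return observed, expected


observed, expected = observed_expected("strong", "coffee")
print(observed, round(expected, 2))
```

If the observed count clearly exceeds the expected count, the pair is a candidate collocation; with a text this small the numbers only illustrate the principle.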

Colligation

A special type of → collocation which is situated on the grammatical level. A colligation focusses on the relationship of particular word classes with each other. Adjectives, for example, typically colligate with nouns, as in sour apple.

Co-occurrence

A co-occurrence denotes the joint occurrence of linguistic units in a given sentence or text document. The units are assumed to be associated if they occur together more often than would be expected by chance. An example is the co-occurrence of the terms stir and whisked eggs in the same document, e.g. a recipe.

Corpus

A large collection of written or spoken language data which represents a language or language variety. The data collected in a corpus are digitized, i.e. they are stored on computers and are machine-readable. A corpus consists of the raw data (i.e. the texts themselves) and possibly of further information such as → metadata or → linguistic annotations assigned to the data.

Corpus-based

Approach used in → corpus linguistics. Corpus data are used in order to explore a theory or hypothesis and to test its validity.

Corpus-driven

Approach used in → corpus linguistics. A → corpus is explored with minimal theoretical presuppositions. In this approach, corpus data usually only consist of raw data, i.e. the corpus is left unannotated.

Corpus linguistics

A methodology for studying the structures and units of naturally occurring language. As such, it is not considered a branch of linguistics in the same way as phonology, morphology or syntax are, but rather a scientific practice for linguistic analysis.

Dynamic corpus

Also called → monitor corpus. Such a → corpus is able to track rapid language change (e.g. → neologisms) as it is constantly updated. Dynamic corpora are often rather large. An example of a dynamic corpus is the NOW corpus. A dynamic corpus is the opposite of a → static corpus.

Generalized corpus

Also called → reference corpus. A generalized corpus aims at describing a language or language variety in its entirety and includes a broad range of text categories. As such, generalized corpora strive for → representativeness and are typically rather large. A generalized corpus is the opposite of a → specialized corpus.

Hapax legomenon

Greek for ‘said only once’. Hapax legomena are words with the lowest possible frequency, i.e. words that occur only once (in a given corpus). They are employed in studies of productivity and correlate with the number of → neologisms.
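
As a small illustration, hapax legomena can be extracted from a frequency list by keeping only the words that were counted once. The sentence below is invented for this sketch, and splitting on whitespace is a simplifying assumption about tokenization.

```python
from collections import Counter

# Toy example text; tokenization by whitespace is a simplification.
tokens = "the cat saw the dog and the dog saw a bird".split()

# Absolute frequency of every word form.
counts = Counter(tokens)

# Hapax legomena are the word forms with a frequency of exactly 1.
hapaxes = sorted(word for word, count in counts.items() if count == 1)
print(hapaxes)
```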

Lemmatization

A process which maps all the inflectional variants of a word (e.g. sings, singing) to its lemma or citation form (SING). This approach is also used in dictionaries.
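
In its simplest form, lemmatization can be pictured as a lookup from word forms to lemmas. The sketch below uses a hand-written mini-lexicon (LEMMA_LOOKUP, a hypothetical name) that covers only one verb; actual lemmatizers rely on large lexicons or morphological analysis, so this only illustrates the mapping itself.

```python
# Hypothetical mini-lexicon: inflected forms mapped to their lemma.
LEMMA_LOOKUP = {
    "sings": "SING",
    "singing": "SING",
    "sang": "SING",
    "sung": "SING",
}


def lemmatize(token):
    """Return the lemma for a token, written in capitals as in this glossary.

    Forms missing from the toy lexicon are assumed to be their own lemma.
    """
    return LEMMA_LOOKUP.get(token.lower(), token.upper())


lemmas = [lemmatize(t) for t in ["She", "sings", "while", "singing"]]
print(lemmas)
```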

Metacharacter

A metacharacter is used in → regular expressions. Metacharacters are special characters that have a particular meaning in a search string and are not interpreted literally in the search expression (e.g. . = any single character; ? = zero or one occurrence of the preceding element).
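
The difference between the special and the literal interpretation of a character can be tried out with Python's re module, where ? makes the preceding element optional and a backslash switches the special meaning off. The word lists are invented for this sketch; other regex flavours and corpus query tools may differ in detail.

```python
import re

# '?' as a metacharacter: the preceding 'u' may occur zero times or once.
pattern = re.compile(r"colou?r")
candidates = ["color", "colour", "colr", "colouur"]
matches = [word for word in candidates if pattern.fullmatch(word)]
print(matches)

# Escaped with a backslash, '?' loses its special meaning
# and only matches a literal question mark.
literal = re.compile(r"really\?")
has_question = bool(literal.search("really?"))
no_question = bool(literal.search("really"))
print(has_question, no_question)
```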

Metadata

Data that provide information about the raw data in a corpus. They usually include descriptions about the content, the authors/speakers or the time of creation of the raw data.

Neologism

A newly coined word at a given time.

Normalization

A process that maps spelling variants to their standard form. Normalization is often used, for example, for spoken language data, where pronunciations (and thus the forms of the transcribed words) vary.

Parsing

A type of linguistic → annotation. Parsing goes beyond the word level and is a form of annotation of higher-level syntactic relationships.

POS-tagging

A type of linguistic → annotation. POS (short for part-of-speech) tags are used to give information about the word class of the items they are attached to.

Productivity

A term used in word-formation. It refers to the ability of an affix to coin new complex words. It can be measured e.g. by looking at → hapax legomena.

Regular expression

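Also called regex. (grep is not a synonym but a command-line tool that searches text with regular expressions.) A regular expression is a systematic formal notation for searching and analysing text files. It describes a search pattern in an underspecified way and uses → metacharacters.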
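
To show what an underspecified search pattern looks like in practice, the sketch below uses a single regular expression to find several inflected forms of sing in one pass. It is written for Python's re module; the example sentence is invented, and the pattern is only one of several possible ways to express this search.

```python
import re

# Invented example sentence.
text = "She sings while he is singing. They sing a song, and a singer listens."

# \b marks a word boundary; (?:s|ing)? allows an optional ending -s or -ing.
pattern = re.compile(r"\bsing(?:s|ing)?\b")

forms = pattern.findall(text)
print(forms)
```

The word boundaries keep the pattern from matching inside longer words such as singer.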

Relative frequency

Also called normalized frequency. Relative frequencies indicate how often a linguistic item appears per x words of running text, which makes counts from corpora of different sizes comparable. Relative frequencies assume a ‘base of normalization’, which is normally one million words.
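
The calculation behind a relative frequency is the absolute frequency divided by the corpus size, multiplied by the base of normalization. The sketch below assumes a base of one million words; the function name and the counts for the two corpora are hypothetical and only illustrate the arithmetic.

```python
def relative_frequency(absolute_frequency, corpus_size, base=1_000_000):
    """Return the frequency of an item per `base` words of running text."""
    return absolute_frequency * base / corpus_size


# Hypothetical counts: the same word in two corpora of different sizes.
freq_a = relative_frequency(150, 2_000_000)  # 150 hits in 2 million words
freq_b = relative_frequency(90, 500_000)     # 90 hits in 0.5 million words
print(freq_a, freq_b)
```

Although the first corpus contains more hits in absolute terms, the word is more frequent in the second corpus once both counts are normalized to the same base.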

Representativeness

A → corpus can be called representative if findings based on it can be generalized to the larger population (i.e. language or language variety) it is supposed to represent. As such, representativeness is connected to how → balanced a corpus is.

Sampling

A method of selecting texts, often in equal amounts per text category, for inclusion in a → corpus. A corpus is a sample of a much larger population (i.e. language or language variety). Sampling is a way to ensure → representativeness.

Specialized corpus

A type of → corpus that aims at representing a certain variety of language and focusses on domain- or genre-specific language data. An example of a specialized corpus is the TIMES corpus. A specialized corpus is the opposite of a → generalized corpus.

Static corpus

Also called → sample corpus. A static corpus collects linguistic data over a given time frame. After completion the → corpus is not updated any further. An example of a static corpus is the BNC. The opposite of a static corpus is a → dynamic corpus (monitor corpus).

Tagset

A tagset is a list of all the word class labels (i.e. → POS tags) that were used in → annotating a → corpus.

Tokens

The overall number of linguistic items that occur in a → corpus.

Types

Number of unique linguistic items in a → corpus.

Type/Token ratio

The ratio of → types to → tokens in a given text, i.e. the number of types divided by the number of tokens. The ratio is an indicator of how varied a vocabulary is. The higher the ratio, the more varied the vocabulary.
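
A short sketch of the calculation: tokens are counted as all items in the text, types as the distinct items among them. The two example sentences are invented, and lower-casing plus whitespace splitting is a simplifying assumption about what counts as a linguistic item.

```python
def type_token_ratio(text):
    """Return the number of types divided by the number of tokens."""
    tokens = text.lower().split()  # all running words
    types = set(tokens)            # unique word forms
    return len(types) / len(tokens)


# Invented examples: one repetitive sentence and one without repetition.
repetitive = type_token_ratio("the cat saw the cat and the cat saw the cat")
varied = type_token_ratio("a quick brown fox jumps over the lazy dog")
print(round(repetitive, 2), round(varied, 2))
```

Because longer texts repeat words more often, the ratio tends to drop as text length grows, so comparisons are most meaningful between texts of similar length.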

Treebank

A representation format in → parsing. The term was coined by Geoffrey Leech, as this format employs tree structures to represent grammatical analyses.