CorporaWritten by and
This page describes four corpus collections useful for English studies (both modern and historical) and a large German corpus, which is often used for comparative reasons.
Table of contents
1 Basic Information
For a brief introduction to the field of digital humanities in general and the use of corpora, please see the following document:
We commonly distinguish generalized from specialized corpora and dynamic from static corpora.
A generalized corpus will try to balance the types of genres and domains it represents. It will contain texts from various genres (e.g. fiction, newspapers, academic texts, etc.) that at best have a word count as equal as possible. A specialized corpus, on the other hand, will focus on particular domains or genres and will represent these predominantly. An example for a generalized corpus is the British National Corpus (“The British National Corpus, version 3 (BNC XML Edition),” 2007) to be described below, which includes both spoken and written data from different genres. The TIMES corpus could be considered a specialized corpus as it only contains the genre ‘News’. However, it will be balanced concerning different subtypes of articles (e.g. sports, politics, or feuilleton). Thus a specialized corpus is always specialized relative to a generalized corpus (cf. McEnery, Xiao, & Tono, 2006).
The second pair of corpus types determines whether new data will be included in the course of time or if it will stay stable. Whenever we speak of a dynamic corpus we refer to the first type, the latter is a characteristic of static corpora. An example for the former (i.e. dynamic corpus) is the Corpus of Contemporary American English (Davies, 2008), which has been fed with new data every two years.1 The British National Corpus (“The British National Corpus, version 3 (BNC XML Edition),” 2007) is an example of a static corpus: Its latest entries date from 1993.
The sections below describe the corpora that are most relevant to the research undertaken at the chair for diachronic English linguistics. However, many more corpora suited for the study of English exist. A list of such corpora can be found at the Corpus Resource Database (CoRD), curated by the University of Helsinki.
2 Penn Parsed Corpora of Historical English (Penn corpora)
(NB: accessible to students and staff of Anglistik Ⅳ at Mannheim University only)
The Penn Parsed Corpora of Historical English (Penn corpora in short) have been created at the University of Pennsylvania and span the linguistic history of British English, starting with Middle English (ME) documents. The corpora and their respective sizes are given below: - Penn-Helsinki Parsed Corpus of Middle English, 2nd edition (PPCME-2): 1.2 million words (Kroch & Taylor, 2000) - Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME): 1.7 million words (Kroch, Santorini, & Delfs, 2004) - Penn Parsed Corpus of Modern British English (PPCMBE): 1 million words (Kroch, Santorini, & Diertani, 2016)
The corpora represent the different linguistic stages and are each subdivided into several periods for easy reference. Each corpus is part-of-speech annotated (i.e. tagged) as well as syntactically annotated (i.e. parsed), making inquiries into historical texts available both at the word-level and at the syntactic level. As the language has changed, you will find slightly different annotation schemes for each of the corpora, but they all follow the same principle and are searchable via the search tool CorpusSearch (Randall, 2010), which is an open source program available at http://corpussearch.sourceforge.net/.2
In addition, the verbs of PPCME2 have also been lemmatized as part of the research project BASICS (short for Borrowing of Argument Structure in Contact Situations), which has been carried out by researchers of the Universities of Mannheim and Stuttgart (see Percillier, 2016, 2018 and the BASICS Toolkit website for details). To query verbs by lemma in the PPCME2, one can specify the lemma entry as listed in the Middle English Dictionary (MED, McSparran et al., 2001) with an
l attribute, or with the MED entry number with an
m attribute to avoid matching homonyms. Every such insertion is demarcated by
@ characters. For example, searching for all forms of yeven, the Middle English ancestor to give, can be done with
*@m=53944@*. Furthermore, a distinction is made regarding verb etymology: verbs of French origin can be queried with
*@e=french@*, and other verbs with
*@e=nonfrench@*. An additional version of the lemmatized PPCME2 contains lemma and animacy information for lexical noun forms within prepositional phrases headed by to that occur in the vicinity of verbs. This was done to help determine the status of prepositional phrases as prepositional objects or adverbials. Options for the animacy status are "animate", "inanimate", "ambiguous" for homonyms that could not be distinguished, and "panimate" for lexemes that are animate in their primary sense but have a high propensity to refer to animate entities via metonymy or metaphor. The animacy status can be queried as such:
*@a=animate@* for animate nouns,
*@a=animate@*|*@a=panimate@* to also include cases of potential metonymy or metaphor, or
*@a=inanimate@* for inanimate nouns.
Due to a focus on prose texts, the PPCME2 has large data gap for the period from 1250–1350, often referred to as M2. Two additional corpora were compiled using the same annotation scheme as the PPCME2 to fill this gap: the Parsed Corpus of Middle Poetry (PCMEP, Zimmermann, 2018) and the Parsed Linguistic Atlas of Early Middle English (PLAEME, Truswell, Alcorn, Donaldson & Wallenberg, 2018). The lemmatization of lexical verbs and addition of animacy information described for the PPCME2 above has been extended to the PCMEP and PLAEME (see Percillier & Trips, 2020 for details). A combination of the three corpora PPCME2, PCMEP, and PLAEME provides the best coverage for the Middle English period.
A number of other corpora make use of the annotation scheme used for the Penn corpora as well. Among them is the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE in short, Taylor, Warner, Pintzuk, & Beths, 2003), which covers the Old English period and contains 1.5 million words. The sister corpus to the Penn corpora contains written texts of various genres, including law and medical texts, fiction, and many religious texts. Since Old English was an inflected language, expect POS-tags to be slightly differing from the later corpora (e.g. extended tags for case markings on words and phrases). It can also be accessed via the search engine CorpusSearch mentioned above.
Information and documentation about the corpora can be found at http://www-users.york.ac.uk/~lang22/YCOE/YcoeHome.htm for the Old English YCOE corpus and for the various Penn corpora of the later stages of English at https://www.ling.upenn.edu/hist-corpora/.
Notable studies carried out with the above-mentioned corpora include Pintzuk & Taylor (2008) and Taylor (2008) for Old English, Ecay & Tamminga (2013), Trips (2002) and Walkden & Morrison (2017) for Middle English, and Kytö & Romaine (1997) for Early Modern English.
The corpora at English-Corpora.org, previously known as the Brigham Young University corpora (BYU in short), are a collection of several corpora developed by Mark Davies, a Professor of Linguistics at BYU. The collection includes a variety of annotated corpora which range from smaller static corpora (e.g. the Corpus of American Soap Operas) to huge dynamic corpora (e.g. the News on the Web corpus) suitable both for synchronic and diachronic research: - News on the Web (NOW): 6 billion+ words (Davies, 2013b) - Global Web-Based English (GloWbE): 1.9 billion words (Davies, 2013a) - Corpus of Contemporary American English (COCA): 560 million words (Davies, 2008) - Corpus of Historical American English (COHA): 400 million words (Davies, 2010) - TIME Magazine Corpus: 100 million words (Davies, 2007) - etc.
The corpora make use of the same user interface, which is available after registration. Each corpus has its own documentation about which texts have been included and from which time span the texts originate. Further, each corpus gives various query examples and a brief tutorial to ease into work with these corpora. Search options for collocations and corpus comparison make it easy to match results both within corpora and across them.
4 British National Corpus (BNC)
The British National Corpus (“The British National Corpus, version 3 (BNC XML Edition),” 2007) is a 100 million word annotated text corpus developed by Oxford University press. It contains 90 per cent written and 10 per cent spoken material collected between 1980 and the early 1990s. The written texts span a wide range of genres, including fiction, newspapers, (non-)academic prose, among others. Spoken texts contain both conversations and monologues with the possibility to listen in to some of them. The time span of the texts in the BNC ranges from the 1960s to 1993.
The BNC may be explored via different interfaces, for instance via BNCWeb (Lehmann, Schneider, & Hoffmann, 2000, developed at Lancaster University) or through the BYU website. The BNCWeb is a very accessible and intuitive interface, giving information on a wide range of topics, including the tagset used, or how to build your own subcorpus. Access may be gained at http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php.
The main page of the BNCWeb allows you to specify your search query in a number of ways, using either a simple syntax or a more elaborate query syntax (the so-called Corpus Query Processor-syntax, or CQP in short) to find more complex examples. On the result page, a number of possibilities emerge that involve the display of the output (e.g. the Key Word in Context display, or KWIC in short) or the presentation of the results (e.g. information on frequency or sociolinguistic data). The original BNC is both a balanced and representative corpus, but a static one, i.e. no new material will be created for this corpus. As of 2014 however, a new version of the corpus has been made available, the British National Corpus 2014 (Love, Dembry, Hardie, Brezina, & McEnery, 2017). Access to the BNC2014 is available via http://corpora.lancs.ac.uk/bnc2014/signup.php.
5 COSMAS II
COSMAS II (“COSMAS II (Corpus Search, Management and Analysis System),” 1991) is an application developed at the Institute for the German Language (Institut für Deutsche Sprache, IDS) in Mannheim. With the application you are able to access several German text corpora within DeReKo (Deutsches Referenzkorpus, Kupietz, Belica, Lüngen, & Perkuhn, 2018) that contains about 42 billion words of written text as of February 2018. The corpus contains a variety of newspapers and magazines in the German language, among them Mannheimer Morgen, Die Zeit or the Swiss-based St. Galler Tagblatt. Recent additions to DeReKo include the magazines Stern and HÖRZU as well as a Soccer Liveticker corpus.
The individual subcorpora are annotated, either for POS tags only, or additionally for syntactic information.
An extensive documentation (including a brief tour and search query examples) can be found at the project’s website: https://www.ids-mannheim.de/cosmas2/projekt/. You can gain access to the web version via http://www.ids-mannheim.de/cosmas2-web/.