Corpus Design

This page describes the steps in creating a corpus, from creating a digital text from a printed page, to the addition of annotation.

1 Optical Character Recognition (OCR)

Written by Michael Percillier

When you want to create your own searchable corpus, you first need a machine-readable text. OCR, which stands for Optical Character Recognition, is a way to obtain such a text without having to type it by hand. OCR refers to software that converts a scanned text into a digital one. When you scan a page of text, you store a digital image of that page, and you, as a human, may be able to read the text on a computer screen. However, it is not a text as far as the computer is concerned, but merely a picture of a text, as you cannot perform searches in the text, copy from the text, modify the text, or perform any operations that you can normally do with a text file on a computer. As the ability to perform searches is crucial for a corpus, OCR is a necessary step when creating a corpus from printed sources.

What OCR does is to analyze the picture of the text and convert it to a text file. Its rate of success depends on a number of factors, such as picture quality, which is why a scanning resolution of 300 PPI (pixels per inch) or higher is recommended, the font of the text, the language of the text, etc.

1.1 OCR software

There are many software titles capable of OCR (see Table 1), ranging from commercial to free. We will be using Tesseract for a number of reasons: (1) it is free (free as in free beer, but also free as in free speech), and (2) it provides support for recognizing Middle English texts.

Table 1. Selection of OCR software titles
Title	Reference	Licence
Acrobat	Adobe (2018)	Commercial
FineReader	ABBYY (2018)	Commercial
OneNote	Microsoft (2018)	Proprietary freeware
ReadIris	I.R.I.S. (2018)	Commercial
Tesseract	Smith (2018)	Open source (Apache version 2.0)
VueScan Professional Edition	Hamrick & Hamrick (2018)	Commercial

To install Tesseract, visit the Tesseract wiki and follow installation instructions for your operating system. In addition, you should also install the language data for English (language code eng) and Middle English (Language code enm), which, depending on your system and chosen installation method for Tesseract, may have to be installed via the command line or by downloading the language data files and placing them in a specific folder on your computer. Therefore, you should follow the installation instructions carefully.

1.2 Working example

As a working example, we will use a screenshot of the beginning of Geoffrey Chaucer’s The Canterbury Tales, as shown in Figure 1.

Download the picture file from the link below in order to test the use of Tesseract, then place the picture in a folder such as “Documents”.

Screenshot of opening lines from Chaucer’s The Canterbury Tales

Tesseract functions via a Command Line Interface (abbreviated as CLI). You should therefore open a program called “Terminal” on Linux and macOS. On Windows, this program may be called “Command Prompt” or “PowerShell” depending on your version of Windows. Once we are in the CLI, we change our working directory (i.e. the folder in which we want to run our commands) to the folder containing our screenshot file, in this case “Documents”. We do this with the cd command, followed by the path to the folder. This path will look differently in Linux/macOS (line 1) than it will in Windows (line 2). If your folder is located on another drive, you may have to use the /d switch when using “Command Prompt” (line 3). To analyze our screenshot with Tesseract, we run the tesseract command as shown on line 4, which is followed by the name of our screenshot file, then the desired name of our output file, and finally the -l flag with which we specify the language as Middle English.

cd ~/Documents
cd C:\Documents
cd /d D:\Documents
tesseract canterbury-tales-prologue-72PPI-sample.png ocr-test -l enm

The resulting file ocr-test.txt, whose contents are shown below, recognized most of the text correctly, but still made a few errors, which are highlighted.

Here bygyuneth the Book of the Tales of Caunterbury
Whan that Aplill, with his shoures soote
The droghte ofMarch hath perced to the mote
And bathed every veyne in swich licour,
Of which vertu engendred is the ﬂour;
Whan Zephirus eek with his sweete breeth
Inspired hath in every holt and heeth
The tendre croppes, and the yonge sonne
Hath in the Ram his halfe cours yronne,
And smale foweles maken melodye,
That slepen al the nyght with open ye
(So pliketh hem Nature in hiſ corages);
Thanne longen folk to goon on pilgrimages
And palmeres for to seken straunge strandes
To ferne halwes, knwthe in sondly londes;
And specially from every shires ende
Of Engelond, to Caunterbury they wende,
The hooly blisful martir for to seke
That hem hath holpen, whan that they were seeke.

To improve accuracy, we should rescale the image resolution to at least 300 PPI. This step is necessary whenever we are treating a screenshot as in this case, or a picture scanned by someone else with a resolution below 300 PPI.¹ The exact instructions to rescale the image depend on your operating system and the image viewing/editing software available. Standard image viewers such as Preview on macOS or IrfanView on Windows provide functions to rescale images. A version of the screenshot rescaled to 300 PPI is available from the link below.

Screenshot of opening lines from Chaucer’s The Canterbury Tales, rescaled to 300 PPI

We can now re-run Tesseract on the rescaled image with the command below.

tesseract canterbury-tales-prologue-300PPI-sample.png ocr-test-300PPI -l enm

When looking at the content of the output file ocr-test-300PPI.txt, we can see that Tesseract made fewer errors.

Here bygynneth the Book of the Tales of Caunterbury
Whan that Aprill, with his shoures soote
The droghte of March hath perced to the ſoote
And bathed every veyne in swich licour,
Of which vertu engendred is the ﬂour;
Whan Zephirus eek with his sweete breeth
Inspired hath in every holt and heeth
The tendre croppes, and the yonge sonne
Hath in the Ram his halfe cours yronne,
And smale foweles maken melodye,
That slepen al the nyght with open ye
(So priketh hem Nature in hiſ corages);
Thanne longen folk to goon on pilgrimages
And palmeres for to seken straunge strondes
To ferne halwes, kowthe in sondry londes;
And specially from every shires ende
Of Engelond, to Caunterbury they wende,
The hooly blisful martir for to seke
That hem hath holpen, whan that they were seeke.

1.3 Proofreading

Still, the results are not perfect, which is why it is always a good idea to proofread any OCR output. If this is not feasible due to huge amounts of scanned text, one should at least proofread a sample portion to identify re-occurring errors in order to rectify these automatically.

When proofreading, it is advisable to switch to a font that uses clear contrasts between characters or character sequences that may be difficult to distinguish for OCR software or even for human readers. As shown in Table 2, roman/serif typefaces (e.g. Times), often used for book print, are the worst choice, whereas monospaced fonts (e.g. Consolas), show distinguishable characters for zero and upper-case “O”, or one, lower-case “L”, and upper-case “I”. The fact that each character occupies the same amount of horizontal space (hence the name monospaced) avoids character sequences such as <rn> (<r> followed by <n>) from being mistaken for the single character <m>. The same applies to the character sequence <cl> (<c> followed by <l>) and the single character <d>. An even better choice would be to select a typeface designed specifically for proofreading, such as DP Sans Mono, which, while being far from aesthetically pleasing, provides very clear contrasts between otherwise similar-looking characters.

Table 2. Comparison of typeface families
Character preview	Typeface family
rnm cld 0O 1lI	Roman/serif
rnm cld 0O 1lI	Sans serif
rnm cld 0O 1lI	Monospaced

Return to Table of contents

2 Annotation

Written by Mareike Keller

Once you have a machine-readable text, you can start your corpus search by using search tools like AntConc (see the corresponding tutorial), or regex (see the corresponding tutorial). Or you can go one step further and add linguistic annotation to your corpus. This is a time-consuming process, and before you decide to do this, think about what you need it for. If you only need to find specific words in a text, AntConc will be more than enough. If you want to find all word forms of a lemma, you can get by with regex or long queries. But if you have a large number of lemmas you want to search for, you might be better off if you lemmatize your corpus (see section 2.2). If you are looking for specific parts of speech (e.g. adjectives) or combinations of parts of speech (e.g. adjectives followed by nouns), you need to add information on the part of speech of each word in your text. This is done with so-called part-of-speech tags (POS tags). If you want to find specific syntactic structures (e.g. verbs with a clause as their argument), you also need to mark all the sentences in your text with labels for the individual syntactic constituents, i.e. to parse your text syntactically.

With all the annotation programs on the market today, there is one thing you always have to keep in mind: A computer cannot think. No matter how quickly AI (artificial intelligence) is progressing, the programs we use for corpus creation do not understand a single sentence of your text. If you have a huge database, like all the documents Google can find, you can “teach” a computer program to also look for something like meaning. But for smaller text corpora, this approach does not work. For our purposes a computer is an advanced type of calculator. It can match items to pre-defined categories and calculate probabilities. Both the categories and the probabilities are defined by you (or by the person who writes programs). The result the computer returns to you is as good as the logical thinking you put into preparing the corpus and the annotations.

2.1 Normalizing

When you have corrected the OCR output (see section 1), you could, in principle, start searching your corpus with a search tool (see corresponding tutorials). However, if your text is not written in a standardized form of a language, you might first want to adapt some weird spellings or other oddities to a standard form. This process is called normalization. It is part of pre-processing, which by definition involves quite a bit of manual work – and time.

For example, when you have recordings of speech and you convert them into written text, you might encounter several pronunciations of the same word. Maybe you want to keep those in one layer of your transcription. But for automatic processing this will be an obstacle. So what you do is you have one layer with the special pronunciation features that you want to be able to look up and investigate later, and another layer where you normalize the pronunciation to e.g. Standard German, or Standard American.

Another example of items that could be normalized to get better results from automatic processing would be spelling variants, e.g. Middle English texts. In Middle English times there was no standard way of spelling a word. So you can find 20 different spellings of the same word. If you normalize these to one standard (or canonical) form, automatic texts searches will be much more successful.

2.2 Lemmatizing

When you have a normalized text, you annotate it with the information you want to extract automatically. A helpful annotation step is the so-called lemmatization. Lemmatization means that each word form in the texts is matched to its lemma, i.e. the form of a word that you use to look up this word in a dictionary.

Example: Twotwo funnyfunny littlelittle dogsdog arebe chasingchase aa ballball

For English this is fairly easy because English does not have as many different word forms as some other languages do.

When you are working with a language that has been researched a lot with computational tools, you might be able to find a program on the internet which does most of the lemmatizing for you. This is the case for languages like Modern English or Modern German. If you are working with older stages of a language, like Old English, or Middle English, lemmatizers might not yet exist, and you will have to do most of the work by hand. This takes a long time but has the advantage that you know everything is marked the way you need it.

2.3 POS-Tagging

The abbreviation POS stands for parts of speech. POS-tagging means adding a word class to each word in your text. There are ready-made POS-taggers on the internet. What you need to decide before you use a POS-tagger or add POS tags by hand is: how detailed should the POS information be? Is it enough for your project to label all verb forms as V, or do you need to distinguish between finite and non-finite verbs, full verbs and auxiliaries, active and passive forms? Can all nouns be labelled as N, or do you want to differentiate between proper nouns and names?

Lemmatizing is not a necessary prerequisite for POS-tagging, but if your text is lemmatized, the lemmatizer can use a digital dictionary to match the words in your text against the lemmas (or lexicon entries) listed in the dictionary. This makes automatic POS-tagging a lot more successful.

There are many different tag sets. The smallest one is the so-called universal tag set (Petrov, Das and McDonald 2012), which contains 12 different tags and covers word classes that are most common in the languages of the world. The tagset on the Penn corpora website list 92 POS tags, the Stuttgart-Tübingen-TagSet (STTS) for Standard German contains 54 tags. Choosing a tag set might seem trivial, but it is not a trivial task at all. More tags give you more opportunities for a fine-grained search, but a larger number of tags also makes the computational task way more complex.

Example: He was embarrassed.

Do you want embarrassed to be labelled as an adjective or a verb or a past participle? Or something else altogether? How will the choice of tag influence your search results later on?

Many modern taggers are programmed to solve a problem like this on the basis of probabilities: Looking at the tags right before and after the word in question, how likely is it that this item is an adjective? How likely is it that it is a verb form? So programming a POS-tagger is not only about words. It also involves syntax. A widely used tagger which can be used for several languages is the TreeTagger (Schmid, n.d.). It will tag your text automatically - but to be sure that the tags are correct, you will still have to do some manual post-correction.

If you have a POS-tagged text, you can automatically extract words of a specific word class, or even sequences of words. Imagine you are working on a project about the position of adjectives in Middle English. You want to find all noun phrases consisting of one determiner, one adjective and one noun from your text. You can tell your search program (see corresponding tutorials) to give you all sequences of words that fulfill the following requirement: D+A+N, in this order. It will then return to you a list of noun phrases like these

The silly dog
A cute kitty
These pretty shoes

From the computational perspective there are several approaches to creating a POS-tagger. If you want to learn how POS-taggers are created and how they work, the lectures by Stanford professors Jurafsky & Manning (2012) are a good starting point.

2.4 Parsing

The top layer of morphosyntactic annotation is called parse or parsing layer. Parsing means splitting a sentence into its syntactic constituents. This can be done in several ways, not only from a practical/computational perspective but also from a theoretical point of view. The computer programs we have so far are not very good at recognizing “empty” slots in a hierarchically structured syntactic tree. So if you want a parse based on generative grammar (i.e. syntactic trees), you will have to do a lot of the parsing by hand. However, computers are fairly good at calculating direct dependencies like the ones that are used in dependency grammar. For an introduction to this type of annotation see Petrov, Das, & McDonald (2012); Nivre et al. (2018).

If you are okay with dependencies and do not need hierarchical parsing, you can get a lot of the parsing done semi-automatically. You feed your POS-tagged text into a parser (e.g. the MaltParser by Hall, Nilsson, & Nivre (2018)), and it provides you with suggestions for dependency-parsed sentences.

The parser output will probably still contain a number of incorrect parses. You can correct them manually, for example with the DG Annotator (Attardi & Simi, n.d.). This visualization tool allows you to view the parser output as dependency structures, and in contrast to other search and visualization tools it lets you make corrections right in the viewer. This means that if you find a faulty annotation, you can use the viewer not just to look at it but also to fix it.

Return to Table of contents

References

ABBYY. (2018). FineReader. Moscow. Retrieved from https://www.abbyy.com/en-eu/finereader/

Adobe. (2018). Acrobat DC. San Jose, CA. Retrieved from https://acrobat.adobe.com/de/de/acrobat/how-to/ocr-software-convert-pdf-to-text.html

Attardi, G., & Simi, M. (n.d.). DgAnnotator. Retrieved from http://medialab.di.unipi.it/Project/QA/Parser/DgAnnotator/Doc/

Hall, J., Nilsson, J., & Nivre, J. (2018). MaltParser. Retrieved from http://www.maltparser.org

Hamrick, E., & Hamrick, D. (2018). VueScan Professional Edition. Hamrick Software. Retrieved from https://www.hamrick.com

I.R.I.S. (2018). Readiris. Louvain-La-Neuve. Retrieved from http://www.irislink.com

Jurafsky, D., & Manning, C. (2012). Natural Language Processing: Lecture Slides from the Stanford Coursera course. Retrieved from https://web.stanford.edu/~jurafsky/NLPCourseraSlides.html

Microsoft. (2018). OneNote. Redmond, WA. Retrieved from https://www.onenote.com

Nivre, J., Abrams, M., Agić, Ž., Ahrenberg, L., Antonsen, L., Aplonova, K., … Zhu, H. (2018). Universal Dependencies 2.3. Retrieved from http://hdl.handle.net/11234/1-2895

Petrov, S., Das, D., & McDonald, R. (2012). A Universal Part-of-Speech Tagset. In N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, … S. Piperidis (Eds.), Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12). Istanbul, Turkey: European Language Resources Association (ELRA). Retrieved from http://lrec-conf.org/proceedings/lrec2012/pdf/274_Paper.pdf

Schmid, H. (n.d.). TreeTagger. Retrieved from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

Smith, R. (2018). Tesseract. Retrieved from https://github.com/tesseract-ocr/tesseract

It is preferable to re-scan the document at 300 PPI or higher. Rescaling the image remains a “plan B” in case this is not possible.↩