Written by Michael Percillier

This page introduces and describes some tools and methods for performing searches in corpora.

1 Regular expressions

Regular expressions (abbreviated as regex, and sometimes loosely called grep after the Unix search tool of that name) are sequences of characters, some of which carry a special meaning, that allow the formulation of comprehensive search queries in a text or a collection of texts. Regular expressions are a widely shared syntax that can be used in many text editors (see Figure 1), search tools (see Figure 2), and programming languages, although details differ slightly between implementations. As such, regular expressions are not a search tool per se, but rather a query syntax employed by numerous search tools.
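To illustrate the point that a regular expression is a query rather than a tool, here is a minimal sketch using Python's built-in `re` module. The sample sentence is invented for illustration; the same query string `n.t` could be typed into a text editor or a concordancer that supports regular expressions.

```python
import re

# Invented sample text; any text or corpus file content would do.
text = "A nut is not a net."

# The query itself is just a string; Python is only one of many tools that accept it.
query = r"n.t"

# Collect every non-overlapping match of the query in the text.
matches = re.findall(query, text)
print(matches)
```

The raw-string prefix `r"..."` is a Python convention that keeps backslashes in the query intact; it is not part of the regular expression itself.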

Wild card	Coverage	Example query	Example matches
`.`	Any character except line breaks	`n.t`	`nut`, `n0t`, `n t`, `n't`
`\w`	Word character (letters, digits, underscore)	`n\wt`	`nut`, `n0t`, `n_t`
`\W`	NOT a word character	`n\Wt`	`n t`, `n't`
`\d`	Digit (0-9)	`n\dt`	`n0t`, `n9t`
`\D`	NOT a digit	`n\Dt`	`nut`, `n t`, `n_t`, `n't`
`\s`	Whitespace (space, tab, line break)	`n\st`	`n t`
`\S`	NOT whitespace	`n\St`	`nut`, `n0t`, `n_t`, `n't`
`\b`	Word boundary	`\bone`	`one`, `oneself`
`\B`	NOT a word boundary	`\Bone`	`bone`, `none`
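The wild cards above can be tried out with a short Python sketch. It assumes Python's `re` module and uses a hypothetical helper, `whole_matches`, that keeps only the candidate strings matched from start to end.

```python
import re

# Candidate strings taken from the "Example matches" column above.
candidates = ["nut", "n0t", "n t", "n't", "n_t"]


def whole_matches(query, strings):
    """Return the strings that the query matches from start to end."""
    return [s for s in strings if re.fullmatch(query, s)]


for query in [r"n.t", r"n\wt", r"n\Wt", r"n\dt", r"n\Dt", r"n\st", r"n\St"]:
    print(query, whole_matches(query, candidates))

# Word boundaries are positions rather than characters,
# so they are tested with re.search inside longer words.
starts_word = bool(re.search(r"\bone", "oneself"))   # "one" begins the word
inside_word = bool(re.search(r"\bone", "bone"))      # "one" is word-internal
non_boundary = bool(re.search(r"\Bone", "bone"))     # \B requires a word-internal position
print(starts_word, inside_word, non_boundary)
```

In Python, `\w` and `\d` also accept letters and digits from other scripts by default; passing `re.ASCII` as a flag restricts them to the ranges listed in the table.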

Quantifier	Repetitions	Example query	Example matches
`+`	Once or more	`fe+d`	`fed`, `feed`
`*`	Zero or more	`t.*o`	`to`, `two`, `trio`
`?`	Zero or one	`fiancee?`	`fiance`, `fiancee`
`{n}`	Exactly n times	`b.{2}t`	`boot`, `boat`, `beat`, `bent`
`{m,n}`	Between m and n times	`t.{2,4}t`	`that`, `treat`, `threat`
`{n,}`	n times or more	`t.{3,}t`	`treat`, `threat`, `the very best`
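The quantifiers can be checked in the same way. The following sketch assumes Python's `re` module; the non-matching candidates (`fd`, `fianceee`, `bat`, `tat`) are added here for contrast and do not come from the table.

```python
import re

# Each query is paired with candidates, some of which should be rejected.
examples = {
    r"fe+d": ["fd", "fed", "feed"],
    r"t.*o": ["to", "two", "trio"],
    r"fiancee?": ["fiance", "fiancee", "fianceee"],
    r"b.{2}t": ["bat", "boot", "bent"],
    r"t.{2,4}t": ["tat", "that", "threat", "the very best"],
    r"t.{3,}t": ["that", "treat", "the very best"],
}


def whole_matches(query, strings):
    """Return the strings that the query matches from start to end."""
    return [s for s in strings if re.fullmatch(query, s)]


results = {query: whole_matches(query, strings) for query, strings in examples.items()}
for query, matched in results.items():
    print(query, matched)

# Quantifiers are greedy: in running text, .* extends as far as it can.
greedy = re.search(r"t.*o", "to and fro").group()
print(greedy)
```

The last line shows why `.*` needs care in corpus queries: the match runs from the first `t` to the last `o` on the line rather than stopping at the end of the first word.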

Character class	Range	Example query	Example matches
`[a]`	One of the characters in the brackets	`b[aeiou]t`	`bat`, `bet`, `bit`, `bot`, `but`
`[a-b]`	One of the characters in the range from `a` to `b`	`[A-Za-z]at`	`Bat`, `bat`, `Cat`, `cat`, `Oat`, `oat`, `Rat`, `rat`
`[^a]`	A character that is NOT `a`	`[^c]at`	`Bat`, `bat`, `Cat`, `Oat`, `oat`, `Rat`, `rat`
`[^a-b]`	A character that is NOT in the range from `a` to `b`	`[^A-Ca-c]at`	`Oat`, `oat`, `Rat`, `rat`
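A brief Python sketch of the character classes above, again assuming the `re` module and a hypothetical `whole_matches` helper. The candidates `byt` and `7at` are invented to show what a class rejects or unexpectedly accepts.

```python
import re

words = ["Bat", "bat", "Cat", "cat", "Oat", "oat", "Rat", "rat"]


def whole_matches(query, strings):
    """Return the strings that the query matches from start to end."""
    return [s for s in strings if re.fullmatch(query, s)]


vowel = whole_matches(r"b[aeiou]t", ["bat", "bet", "bit", "bot", "but", "byt"])
any_letter = whole_matches(r"[A-Za-z]at", words)
not_c = whole_matches(r"[^c]at", words)
not_a_to_c = whole_matches(r"[^A-Ca-c]at", words)
print(vowel, any_letter, not_c, not_a_to_c, sep="\n")

# A negated class still consumes exactly one character, and that character
# may be a digit, a space, or punctuation, not only a letter.
accepts_digit = re.fullmatch(r"[^c]at", "7at") is not None
needs_a_character = re.fullmatch(r"[^c]at", "at") is None
print(accepts_digit, needs_a_character)
```

Character classes are case-sensitive unless the search tool is told otherwise, which is why `[^c]at` accepts `Cat` but not `cat`.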

Pattern	Function	Example query	Example matches
`a|b|c`	Any one of the character sequences separated by `|`	`this|that|these|those`	`this`, `that`, `these`, `those`
`(abc)`	Character sequence enclosed in `( )` is treated as a group	`historic(al)?`	`historic`, `historical`
`(ab)\1`	Backreference: `\1` matches the same text again that group number 1 matched (groups are numbered from left to right)	`(\w)o\1`	`mom`, `pop`, `wow`
`a(b|c)d`	Alternation applies within the group only	`Ind(i|ian|onesi)a`	`India`, `Indiana`, `Indonesia`
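Grouping, backreferences, and alternation can be combined as in the following Python sketch. It assumes the `re` module; the candidates `historically` and `mop` are invented non-matches added for contrast.

```python
import re


def whole_matches(query, strings):
    """Return the strings that the query matches from start to end."""
    return [s for s in strings if re.fullmatch(query, s)]


# An optional group: the whole sequence "al" may be present or absent.
optional_group = whole_matches(r"historic(al)?", ["historic", "historical", "historically"])

# A backreference: \1 must repeat whatever character the first group matched.
same_letter = whole_matches(r"(\w)o\1", ["mom", "pop", "wow", "mop"])

# Alternation confined to a group: "Ind" and the final "a" are shared.
grouped = whole_matches(r"Ind(i|ian|onesi)a", ["India", "Indiana", "Indonesia"])

# Without the group, the alternatives are "Indi", "ian", and "onesia".
ungrouped = whole_matches(r"Indi|ian|onesia", ["India", "Indiana", "Indonesia"])

print(optional_group, same_letter, grouped, ungrouped, sep="\n")

# When searching inside a text, alternatives are tried from left to right,
# so the shorter alternative "i" wins here.
first_hit = re.search(r"Ind(i|ian|onesi)a", "Indiana").group()
print(first_hit)
```

The last result suggests adding a word boundary (`\b`) at the end of such a query, or listing longer alternatives first, when whole words are wanted in running text.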

Search function (incl. variants)	Meaning	Example queries generating matches in example sentence
`Dominates`, `dominates`, `Doms`, `doms`	dominates to any generation	`(N Dominates child)`, `(NP-SBJ dominates N)`, `(NP-SBJ Doms child)`, `(IP-MAT doms child)`
`iDominates`, `idominates`, `iDoms`, `idoms`	immediately dominates	`(N iDominates child)`, `(D iDoms the)`, `(NP-SBJ iDoms N)`, `(ADVP-TMP idoms ADV)`, `(ADJP idoms CONJ)`
`Precedes`, `precedes`, `Pres`, `pres`	precedes	`(D Precedes N)`, `(ADJ precedes ADJ)`, `(NP-SBJ Pres VBD)`, `(NP-SBJ pres ADJP)`, `(the Precedes child)`
`iPrecedes`, `iprecedes`, `iPres`, `ipres`	immediately precedes	`(D iPrecedes N)`, `(the iprecedes child)`, `(NP-SBJ iPres VBD)`, `(VBD ipres ADJP)`, `(VBD iPrecedes ADJR)`
`HasSister`, `hasSister`, `hassister`	is a sister of (shares the same mother node)	`(D HasSister N)`, `(ADJR hasSister CONJ)`, `(VBD HasSister NP-SBJ)`
`Exists`, `exists`	exists in the sentence	`(ADJR Exists)`, `(then exists)`, `(NP-SBJ Exists)`
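The relations behind these search functions can be sketched in Python. This is not CorpusSearch itself and not the page's example sentence: the fragment `(NP-SBJ (D the) (N child))` is a hypothetical tree chosen to be consistent with some of the queries above, and all function names are illustrative assumptions.

```python
# Hypothetical fragment (NP-SBJ (D the) (N child)).
# Each node is a tuple (label, *children); a word is a node without children.
TREE = ("NP-SBJ", ("D", ("the",)), ("N", ("child",)))


def nodes(node):
    """Yield the node itself and every node below it."""
    yield node
    for child in node[1:]:
        yield from nodes(child)


def i_dominates(tree, x, y):
    """x is the mother of y."""
    return any(n[0] == x and c[0] == y for n in nodes(tree) for c in n[1:])


def dominates(tree, x, y):
    """x is an ancestor of y, at any depth."""
    return any(
        n[0] == x and d[0] == y
        for n in nodes(tree) for c in n[1:] for d in nodes(c)
    )


def has_sister(tree, x, y):
    """x and y are different daughters of the same mother."""
    return any(
        i != j and a[0] == x and b[0] == y
        for n in nodes(tree)
        for i, a in enumerate(n[1:])
        for j, b in enumerate(n[1:])
    )


def spans(node, start=0, out=None):
    """Record (label, first_word, end_word) per node; return (records, end)."""
    out = [] if out is None else out
    end = start + 1 if len(node) == 1 else start
    for child in node[1:]:
        _, end = spans(child, end, out)
    out.append((node[0], start, end))
    return out, end


def precedes(tree, x, y, immediate=False):
    """x ends before y starts (immediate: exactly where y starts)."""
    records, _ = spans(tree)
    return any(
        lx == x and ly == y and (ex == sy if immediate else ex <= sy)
        for lx, _sx, ex in records
        for ly, sy, _ey in records
    )


print(i_dominates(TREE, "N", "child"), dominates(TREE, "NP-SBJ", "child"))
print(has_sister(TREE, "D", "N"), precedes(TREE, "the", "child", immediate=True))
```

Under these assumptions, `(NP-SBJ Doms child)` holds but `(NP-SBJ iDoms child)` does not, because `child` is two levels below `NP-SBJ`.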

Search Tools

Table of contents

1 Regular expressions

1.1 Wild cards

1.2 Quantifiers

1.3 Character classes

1.4 Grouping

1.5 Logical operators

1.6 Combining groups and logical operators

1.7 Exercises

1.8 Further reading

2 AntConc

2.1 Concordance

2.2 Concordance plot

2.3 Clusters/N-Grams

2.4 Word List

2.5 Further reading

3 CorpusSearch

3.1 Installation

3.2 Basics of CorpusSearch

3.3 CorpusSearch query syntax

3.3.1 Search Functions

3.3.2 Unique referents

3.3.3 Wild cards

3.3.4 Logical operators

3.4 Further reading

References