Introduction to Corpus Linguistics
Lecture notes for the JSI postgraduate school
1. Overview
1.1. What is a corpus?
- Guidelines of the Expert Advisory Group on Language Engineering Standards (EAGLES):
  Corpus: a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.
  Computer corpus: a corpus which is encoded in a standardised and homogeneous way for open-ended retrieval tasks. Its constituent pieces of language are documented as to their origins and provenance.
1.2. Using corpora
Research on actual language: descriptive approach, study of performance, empirical linguistics.
- Applied linguistics:
  - Lexicography: monolingual, bilingual, and terminological dictionaries
  - Language studies: hypothesis verification, knowledge discovery
    (lexis, morphology, syntax, …)
  - Translation studies: a source of translation equivalents and their contexts;
    translation memories, machine-aided translation
  - Language learning: real-life examples;
    “idiomatic teaching”, curriculum development
- Language technology:
  - testing set for developed methods
  - training set for inductive learning
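The training/testing use of a corpus can be illustrated with a small sketch. It assumes the corpus is already available as a list of sentences; the function name `split_corpus`, the 10% test share, and the fixed random seed are illustrative choices, not something prescribed by these notes.

```python
import random


def split_corpus(sentences, test_fraction=0.1, seed=0):
    """Shuffle sentences reproducibly and split them into (train, test).

    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(sentences)              # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)   # fixed seed gives a repeatable split
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for a real corpus of sentences.
corpus = [f"sentence {i}" for i in range(100)]
train, test = split_corpus(corpus, test_fraction=0.1)
print(len(train), len(test))
```

Keeping the two sets disjoint matters: a method evaluated on material it was trained on will look better than it is.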
1.3. Characteristics of a corpus
- Quantity: the bigger, the better
- Quality: the texts are authentic; the markup is validated
- Simplicity: the computer representation is understandable, with the markup easily separated from the text
- Documented: the corpus contains bibliographic and other metadata
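As a rough illustration of markup that is easily separated from the text, the sketch below parses a tiny XML-encoded sample with Python's standard `xml.etree.ElementTree` module. The element names (`text`, `header`, `body`, `s`, `w`, `c`), the `pos` attribute, and the sample content are all invented for this example and do not follow any particular corpus encoding standard.

```python
import xml.etree.ElementTree as ET

# Invented sample: a header with metadata and a body with one annotated sentence.
SAMPLE = """<text id="sample01">
  <header><title>Sample text</title><year>1998</year></header>
  <body>
    <s><w pos="DT">The</w> <w pos="NN">corpus</w> <w pos="VBZ">grows</w><c>.</c></s>
  </body>
</text>"""


def unpack(xml_string):
    """Return (metadata, plain_text, annotated_tokens) for the sample format."""
    root = ET.fromstring(xml_string)
    # Documentation: bibliographic and other metadata live in the header.
    metadata = {child.tag: child.text for child in root.find("header")}
    # Text: concatenating all text nodes of the body discards the markup.
    plain_text = "".join(root.find("body").itertext()).strip()
    # Annotation: the same words together with their attribute values.
    tokens = [(w.text, w.get("pos")) for w in root.iter("w")]
    return metadata, plain_text, tokens


metadata, plain_text, tokens = unpack(SAMPLE)
print(metadata)
print(plain_text)
print(tokens)
```

Because the annotation lives only in tags and attributes, the plain text, the annotation, and the metadata can each be recovered without touching the others.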
1.4. Typology of corpora
- Corpora of written language, spoken and speech corpora (authenticity/price)
  e.g. the ELRA catalog
- Reference corpora (representative) and sub-language corpora (specialised)
  e.g. BNC, ICE, COLT
- Corpora with integral texts or of text samples (historical and legal reasons)
  e.g. Brown
- Static and monitor corpora (language change)
- Monolingual and multilingual parallel and comparable corpora
  e.g. Hansard, Europarl
- Plain text and annotated corpora
1.5. History
(Computational) linguistic paradigms:
- 1950–1960: empiricism
  weak computers: frequency lists
- 1970–1980: cognitive modeling (generative approaches, artificial intelligence)
  deep analysis / “basic science”: computational linguistics
- 1990–…: empiricist revival, also combined approaches
  quantity / usefulness: language technologies
- 2000–…: The Web
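The frequency lists mentioned for the early empiricist period are simple to compute. A minimal sketch follows, assuming a naive tokenisation (lower-casing and splitting on non-word characters); the function name `frequency_list` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def frequency_list(text, top=None):
    """Return (word, count) pairs sorted by descending count.

    Tokenisation is deliberately naive: lower-case, then take runs of
    word characters. Real corpora need language-specific tokenisers.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common(top)


sample = "The cat saw the dog. The dog saw the cat and the bird."
for word, count in frequency_list(sample, top=3):
    print(word, count)
```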
The history of computer corpora:
- First milestones: Brown (1 million words) 1964; LOB (also 1M) 1974
- The spread of reference corpora: Cobuild Bank of English (monitor, 100..200..M) 1980; BNC (100M) 1995; Czech CNC (100M) 1998; Croatian HNK (100M) 1999…
- Slovene language reference corpora: FIDA (100M), Nova Beseda (100M…) 1998; FIDA+ (600M) 2006.
- EU corpus-oriented projects in the 1990s: NERC, MULTEXT-East,…
- Language resources brokers: LDC 1992, ELRA 1995