Introduction to Corpus Linguistics

June 23, 2008 at 5:18 pm (Joseba Abaitua, Language Resources 07/08)

Lecture notes for the JSI postgraduate school

1. Overview

1.1. What is a corpus?

  • Guidelines of the Expert Advisory Group on Language Engineering Standards, EAGLES:
    Corpus: a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.
    Computer corpus: a corpus which is encoded in a standardised and homogeneous way for open-ended retrieval tasks. Its constituent pieces of language are documented as to their origins and provenance.
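To make "open-ended retrieval" concrete, here is a minimal sketch of a keyword-in-context (KWIC) concordance, one of the most common queries run against a computer corpus. The function name `kwic`, the `width` parameter, the simple regular-expression tokeniser and the sample sentence are all illustrative assumptions, not part of the EAGLES guidelines.

```python
import re

def kwic(text, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens on either side of each hit."""
    # Assumption: a crude tokeniser (lowercased runs of word characters) is good enough here.
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical two-sentence "corpus" used only for illustration.
sample = "The corpus is a sample of the language. A computer corpus is encoded in a standard way."
for line in kwic(sample, "corpus"):
    print(line)
```

Real concordancers work over indexed corpora of millions of words, but the query has the same shape: a search term in, every occurrence with its context out.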

1.2. Using corpora

Research on actual language: descriptive approach, study of performance, empirical linguistics.

  • Applied linguistics:
    • Lexicography: monolingual, bilingual and terminological dictionaries
    • Language studies: hypothesis verification, knowledge discovery
      (lexis, morphology, syntax, …)
    • Translation studies: a source of translation equivalents and their contexts
      translation memories, machine-aided translation
    • Language learning: real-life examples
      “idiomatic teaching”, curriculum development
  • Language technology:
    • testing set for developed methods;
    • training set for inductive learning
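The training/testing distinction in language technology can be sketched as a held-out split. This is a minimal illustration, assuming the corpus is already segmented into sentences. The function `split_corpus`, the 20% test fraction and the placeholder sentences are hypothetical choices made for the example.

```python
import random

def split_corpus(sentences, test_fraction=0.1, seed=0):
    """Shuffle the sentences and hold out a fraction of them as a test set."""
    shuffled = list(sentences)              # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder data standing in for real corpus sentences.
sentences = [f"sentence {i}" for i in range(20)]
train, test = split_corpus(sentences, test_fraction=0.2)
print(len(train), len(test))
```

The key property is that the two sets do not overlap: a method induced from the training set is evaluated on material it has not seen.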

1.3. Characteristics of a corpus

  1. Quantity:
    the bigger, the better
  2. Quality:
    the texts are authentic; the mark-up is validated
  3. Simplicity:
    the computer representation is understandable, with the markup easily separated from the text
  4. Documented:
    the corpus contains bibliographic and other meta-data
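The "simplicity" and "documented" criteria can be illustrated with a small XML-encoded text, where metadata, annotation and running text are each recoverable on their own. This is only a sketch: the element names (`text`, `header`, `body`, `s`, `w`) and the `pos` attribute are invented for the example and do not follow any particular encoding standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-document: a header with metadata and a body with annotated words.
doc = """<text>
  <header><title>Sample text</title><year>2008</year></header>
  <body><s><w pos="DET">The</w> <w pos="NOUN">corpus</w> <w pos="VERB">grows</w>.</s></body>
</text>"""

root = ET.fromstring(doc)

# Documentation: bibliographic metadata lives in the header.
metadata = {child.tag: child.text for child in root.find("header")}

# Simplicity: dropping the tags recovers the plain running text.
plain_text = "".join(root.find("body").itertext())

# The annotation layer can be read separately from the text.
annotations = [(w.text, w.get("pos")) for w in root.iter("w")]

print(metadata)
print(plain_text)
print(annotations)
```

Because the markup is well-formed, it can also be checked mechanically, which is one way of reading the "validated mark-up" requirement under quality.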

1.4. Typology of corpora

  • Corpora of written language, spoken and speech corpora (authenticity/price)
    e.g. the ELRA catalogue
  • Reference corpora (representative) and sub-language corpora (specialised)
    e.g. BNC, ICE, COLT
  • Corpora with integral texts or of text samples (historical and legal reasons)
    e.g. Brown
  • Static and monitor corpora (language change)
  • Monolingual and multilingual (parallel and comparable) corpora
    e.g. Hansard, Europarl
  • Plain text and annotated corpora

1.5. History

(Computational) linguistic paradigms:

  • 1950–1960: empiricism
    weak computers: frequency lists
  • 1970–1980: cognitive modeling (generative approaches, artificial intelligence)
    deep analysis / “basic science”: computational linguistics
  • 1990–…: empiricist revival, also combined approaches
    quantity / usefulness: language technologies
  • 2000–…: The Web

The history of computer corpora:

  • First milestones: Brown (1 million words) 1964; LOB (also 1M) 1974
  • The spread of reference corpora: Cobuild Bank of English (monitor, 100..200..M) 1980; BNC (100M) 1995; Czech CNC (100M) 1998; Croatian HNK (100M) 1999…
  • Slovene language reference corpora: FIDA (100M), Nova Beseda (100M…) 1998; FIDA+ (600M) 2006.
  • EU corpus-oriented projects in the 1990s: NERC, MULTEXT-East, …
  • Language resources brokers: LDC 1992, ELRA 1995
