Introduction to Corpus Linguistics

June 23, 2008 at 5:18 pm (Joseba Abaitua, Language Resources 07/08)

Lecture notes for the JSI postgraduate school

1. Overview

1.1. What is a corpus?

  • Guidelines of the Expert Advisory Group on Language Engineering Standards, EAGLES:
    Corpus: a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.
    Computer corpus: a corpus which is encoded in a standardised and homogeneous way for open-ended retrieval tasks. Its constituent pieces of language are documented as to their origins and provenance.

1.2. Using corpora

Research on actual language: descriptive approach, study of performance, empirical linguistics.

  • Applied linguistics:
    • Lexicography: monolingual, terminological and bilingual dictionaries
    • Language studies: hypothesis verification, knowledge discovery
      (lexis, morphology, syntax, …)
    • Translation studies: a source of translation equivalents and their contexts
      translation memories, machine-aided translation
    • Language learning: real-life examples
      “idiomatic teaching”, curriculum development
  • Language technology:
    • testing set for developed methods;
    • training set for inductive learning

1.3. Characteristics of a corpus

  1. Quantity:
    the bigger, the better
  2. Quality:
    the texts are authentic; the mark-up is validated
  3. Simplicity:
    the computer representation is understandable, with the markup easily separated from the text
  4. Documented:
    the corpus contains bibliographic and other meta-data

1.4. Typology of corpora

  • Corpora of written language, spoken and speech corpora (authenticity/price)
    e.g. the ELRA catalogue
  • Reference corpora (representative) and sub-language corpora (specialised)
    e.g. BNC, ICE, COLT
  • Corpora with integral texts or of text samples (historical and legal reasons)
    e.g. Brown
  • Static and monitor corpora (language change)
  • Monolingual and multilingual parallel and comparable corpora
    e.g. Hansard, Europarl
  • Plain text and annotated corpora

1.5. History

(Computational) linguistic paradigms:

  • 1950-1960: empiricism
    weak computers: frequency lists
  • 1970-1980: cognitive modeling (generative approaches, artificial intelligence)
    deep analysis / “basic science”: computational linguistics
  • 1990-…: empiricist revival, also combined approaches
    quantity / usefulness: language technologies
  • 2000-…: The Web

The history of computer corpora:

  • First milestones: Brown (1 million words) 1964; LOB (also 1M) 1974
  • The spread of reference corpora: Cobuild Bank of English (monitor, 100..200..M) 1980; BNC (100M) 1995; Czech CNC (100M) 1998; Croatian HNK (100M) 1999…
  • Slovene language reference corpora: FIDA (100M), Nova Beseda (100M…) 1998; FIDA+ (600M) 2006.
  • EU corpus-oriented projects in the 1990s: NERC, MULTEXT-East,…
  • Language resources brokers: LDC 1992, ELRA 1995

Compilation of corpora, Examples of use and The future of corpus and data-driven linguistics

June 23, 2008 at 5:08 pm (Joseba Abaitua, Language Resources 07/08)

Compilation of corpora

1.1. Steps in the preparation of a corpus

  1. Choosing the component texts:
    linguistic and non-linguistic criteria; availability; simplicity; size
  2. Copyright
    sensitivity of source (financial and privacy considerations); agreement with providers; usage, publication
  3. Acquiring digital originals
    Web transfer; visit; OCR
  4. Up-translation
    conversion to standard format; consistency; character set encodings
  5. Linguistic annotation
    language dependent methods; errors
  6. Documentation
    TEI header; Open Archives etc.
  7. Use / Download
    • (Web-based) concordancers for linguists
    • download needed for HLT use
    • licences for use
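The up-translation step (item 4 above) can be illustrated with a minimal Python sketch. It assumes a plain-text original in a legacy encoding (ISO 8859-2 is only a guess suited to Slovene text) with blank lines between paragraphs; the function name `up_translate` and the sample text are hypothetical, and a real conversion would target a full TEI structure rather than bare <p> elements.

```python
import unicodedata
from xml.sax.saxutils import escape


def up_translate(raw: bytes, source_encoding: str = "iso8859_2") -> str:
    """Decode legacy bytes, normalise them, and wrap paragraphs in <p> tags."""
    # The source encoding is an assumption; it must be known or detected beforehand.
    text = raw.decode(source_encoding)
    # NFC gives every character one canonical code point sequence.
    text = unicodedata.normalize("NFC", text)
    # Blank lines are assumed to separate paragraphs; inner whitespace is collapsed.
    paragraphs = [" ".join(p.split()) for p in text.split("\n\n") if p.strip()]
    # escape() protects &, < and > so the result stays well-formed XML.
    return "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)


legacy = "Bil je jasen, mrzel aprilski dan.\n\nTrije možje se niso niti ganili.".encode("iso8859_2")
print(up_translate(legacy))
```

Decoding once at the start and working in Unicode afterwards keeps the character set question separate from the markup question, which is the consistency concern the step names.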

1.2. What annotation can be added to the text of the corpus?

Annotation = interpretation

  • Documentation about the corpus

  • Document structure

  • Basic linguistic markup: sentences, words, punctuation, abbreviations

  • Lemmas and morphosyntactic descriptions

  • Syntax

  • Alignment

  • Terms, semantics, anaphora, pragmatics, intonation,…

1.3. Markup Methods

  • hand annotation: documentation, first steps

    generic (XML, spreadsheet) editors or specialised editors

  • semi-automatic: morphosyntactic and other linguistic annotation

    cyclic approach: machine, hand, validate, correct, machine, …

  • machine, with hand-written rules: tokenisation

    regular expressions

  • machine, with inductively built models from annotated data:

    “supervised learning”; HMMs, decision trees, inductive logic programming,…

  • machine, with inductively built models from un-annotated data:

    “unsupervised learning”; clustering techniques

  • overview of the field
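Tokenisation with hand-written rules, as listed above, can be sketched with a single regular expression in Python. This is only an illustration: the abbreviation list and the rule order are assumptions, and a production tokeniser would need language-specific rules.

```python
import re

# Hypothetical abbreviation list; a real tokeniser would use a language-specific one.
ABBREVIATIONS = ("e.g.", "i.e.", "etc.", "Dr.")

# Alternatives are tried left to right, so the more specific rules come first.
TOKEN_RE = re.compile(
    "|".join(re.escape(a) for a in ABBREVIATIONS)  # abbreviations keep their dots
    + r"|\d+(?:[.,]\d+)+"                           # numbers with separators, e.g. 1,5
    + r"|\w+(?:[-']\w+)*"                           # words, incl. hyphen or apostrophe
    + r"|[^\w\s]"                                   # any other single symbol
)


def tokenise(text):
    """Return the list of tokens; whitespace is dropped."""
    return TOKEN_RE.findall(text)


print(tokenise("Dr. Smith saw corpora, e.g. the BNC (100M words)."))
```

The abbreviations stay whole while the sentence-final full stop and the brackets become separate tokens, which is the distinction the basic linguistic markup (words, punctuation, abbreviations) has to capture.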

1.4. Computer coding of corpora

A good encoding must ensure durability and enable interchange between computer platforms and applications.

  • The basic standard used is the Extensible Markup Language, XML

  • There are a number of companion standards and technologies: XML transformations (XSLT), data definition (DTD, XML Schema, ISO Relax NG), addressing and queries (XPath, XQuery), …

  • The vocabulary of annotations for corpora and other language resources is defined by the Text Encoding Initiative, TEI

XML/TEI is used much more widely than just for corpora:

  • documentation: these slides

  • annotation of dictionaries: English-Slovene, Japanese-Slovene (from jaSlo)

  • for annotating text-critical editions

1.5. Examples of TEI encoding in corpora: meta-data

<teiHeader id="ecmr.H" type="text" lang="sl-en" creator="ET"
status="update" date.created="1999-04-13" date.updated="1999-06-22">
<fileDesc>
<titleStmt>
<title lang="sl">Ekonomsko ogledalo; 13 številk 98/99</title>
<title lang="en">Slovenian Economic Mirror; 13 issues, 98/99</title>
<respStmt>
<name>Andrej Skubic, FF</name>
<resp lang="sl">Zagotovitev digitalnega originala, poravnava</resp>
<resp lang="en">Provision of digital original, alignment</resp>
<name>Toma&zcaron; Erjavec, IJS</name>
<resp lang="sl">Tokenizacija, pretvorba v TEI</resp>
<resp lang="en">Tokenisation, conversion to TEI</resp>
</respStmt>
</titleStmt>

1.6. Examples of TEI encoding in corpora: Structure of the text

<quote id="Osl.1.8.18" rend="center;it">
<lg id="Osl.1.8.18.1">
<l id="Osl.1.8.18.1.1">Tam pod kostanjevim drevesom</l>
<l id="Osl.1.8.18.1.2">izdala si me,</l>
<l id="Osl.1.8.18.1.3">izdal sem te,</l>
<l id="Osl.1.8.18.1.4">ne da bi trenila z očesom.</l>
</lg>
</quote>
<p id="Osl.1.8.19">
<s id="Osl.1.8.19.1">Trije možje se niso niti ganili.</s>
<s id="Osl.1.8.19.2">Toda ko je <name>Winston</name>
znova pogledal v Rutherfordov propadli obraz, je opazil,
da so njegove oči polne solz.</s>

1.7. Examples of TEI encoding in corpora: Morphosyntactic descriptions

<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w><c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w lemma="aprilski" ana="Aopmsn">aprilski</w>
<w lemma="dan" ana="Ncmsn">dan</w>
<w lemma="in" ana="Ccs">in</w>
<w lemma="ura" ana="Ncfpn">ure</w>
<w lemma="biti" ana="Vcip3p--n">so</w>
<w lemma="biti" ana="Vmps-pfa">bile</w>
<w lemma="trinajst" ana="Mcnpnl">trinajst</w><c>.</c>
</s>

<fs id="Vcps-sma" select="sl" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a"/>
<fs id="Vcps-sman----n" select="cs" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a V8.n V13.n"/>
<fs id="Vcps-smay----n" select="cs" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a V8.y V13.n"/>
<fs id="Vcps-sna" select="sl" feats="V0. V1.c V2.p V3.s V5.s V6.n V7.a"/>
<fs id="Vcps-snan----n" select="cs" feats="V0. V1.c V2.p V3.s V5.s V6.n V7.a V8.n V13.n"/>

<fLib type="Verb">
<f id="V0." select="en ro sl cs bg et hu hr sr sl-rozaj" name="PoS"><sym value="Verb"/></f>
<f id="V1.m" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type"><sym value="main"/></f>
<f id="V1.a" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type"><sym value="auxiliary"/></f>
<f id="V1.o" select="en ro sl cs et hr sr sl-rozaj" name="Type"><sym value="modal"/></f>
<f id="V1.c" select="ro sl cs hr sr sl-rozaj" name="Type"><sym value="copula"/></f>
<f id="V1.b" select="en" name="Type"><sym value="base"/></f>
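The annotation above can be read back with a few lines of Python. The sketch below uses the standard xml.etree.ElementTree module on a shortened copy of the sentence, written with plain hyphens and straight quotes. The function msd_to_feats is hypothetical: it encodes a pattern inferred from the excerpt alone (the id Vcps-sma is paired with the feats "V0. V1.c V2.p V3.s V5.s V6.m V7.a"), not a documented algorithm.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example sentence.
SENTENCE = """
<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w><c>,</c>
<w lemma="dan" ana="Ncmsn">dan</w><c>.</c>
</s>
"""


def read_tokens(xml_text):
    """Return (form, lemma, msd) per token; punctuation <c> has no lemma or MSD."""
    root = ET.fromstring(xml_text.strip())
    return [(el.text, el.get("lemma"), el.get("ana"))
            for el in root.iter() if el.tag in ("w", "c")]


def msd_to_feats(msd):
    """Expand a positional MSD into position.value codes, skipping '-' positions."""
    category = msd[0]
    codes = [f"{category}0."]  # position 0 only names the part of speech
    codes += [f"{category}{i}.{ch}" for i, ch in enumerate(msd) if i > 0 and ch != "-"]
    return " ".join(codes)


for form, lemma, msd in read_tokens(SENTENCE):
    print(form, lemma, msd_to_feats(msd) if msd else "")
```

Each code produced this way (for example V1.c) matches the id of an <f> element in the feature library, so a compact tag on a word can be resolved to named features such as Type = copula.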

1.8. Examples of TEI encoding in corpora: Alignment

<linkGrp id="Oslen.1" type="body" targtype="s" domains="Oen Osl">
<link xtargets="Osl.1.2.2.1 ; Oen.1.1.1.1">
<link xtargets="Osl.1.2.2.2 ; Oen.1.1.1.2">
<link xtargets="Osl.1.2.3.1 ; Oen.1.1.2.1">
<link xtargets="Osl.1.2.3.2 ; Oen.1.1.2.2">
… <link xtargets="Osl.1.2.6.5 ; Oen.1.1.5.5">
<link xtargets="Osl.1.2.6.6 ; Oen.1.1.5.6 Oen.1.1.5.7">
<link xtargets="Osl.1.2.6.7 ; Oen.1.1.5.8">
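A minimal Python sketch of how such links might be consumed: each xtargets value is split at the semicolon into source and target sentence ids, which also gives the alignment type (1-1, 1-2, …). The function names are hypothetical and the three values are copied from the excerpt above.

```python
from collections import Counter

# Three xtargets values taken from the alignment excerpt.
LINKS = [
    "Osl.1.2.2.1 ; Oen.1.1.1.1",
    "Osl.1.2.6.6 ; Oen.1.1.5.6 Oen.1.1.5.7",
    "Osl.1.2.6.7 ; Oen.1.1.5.8",
]


def parse_link(xtargets):
    """Split 'source ids ; target ids' into two lists of sentence ids."""
    source, target = xtargets.split(";")
    return source.split(), target.split()


def alignment_type(xtargets):
    """Describe a link as 'n-m': n source sentences aligned to m target sentences."""
    source_ids, target_ids = parse_link(xtargets)
    return f"{len(source_ids)}-{len(target_ids)}"


type_counts = Counter(alignment_type(link) for link in LINKS)
print(type_counts)
```

The second link is a 1-2 alignment: one Slovene sentence corresponds to two English sentences.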

2. Examples of use

2.1. Lexicology

  • Concordances and collocations

    “You shall know a word by the company it keeps.” (Firth, 1957)

  • Induction of multilingual lexica:

    • Dan Tufiş, Ana-Maria Barbu: Revealing translators' knowledge: statistical methods in constructing practical translation lexicons for language and speech processing, in International Journal of Speech Technology, Vol. 5, No. 3, 2002, Kluwer Academic Publishers.

    • Nancy Ide, Tomaž Erjavec and Dan Tufiş: Sense Discrimination with Parallel Corpora, in Proceedings of the SIGLEX Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, ACL 2002, Philadelphia, July 2002, pp. 56-60.

    An automatically built 7-language dictionary from the '1984' corpus of the EU project MULTEXT-East:

    first 100 entries
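The concordances and collocations mentioned at the start of this section can be sketched in Python as a keyword-in-context (KWIC) listing plus a count of the words found in the context windows. The sample text, the function names and the window width are all hypothetical; real collocation extraction would apply an association measure instead of raw counts.

```python
import re
from collections import Counter


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def collocates(text, keyword, width=3):
    """Count the words that occur within `width` tokens of the keyword."""
    counts = Counter()
    for left, _, right in kwic(text, keyword, width):
        counts.update(left.split())
        counts.update(right.split())
    return counts


# Hypothetical toy text, not taken from any corpus.
TEXT = "Big Brother is watching you. The big brother saw the big dog."
for left, key, right in kwic(TEXT, "brother", width=2):
    print(f"{left:>10} [{key}] {right}")
print(collocates(TEXT, "brother", width=2))
```

Lining the hits up around the keyword is what lets a lexicographer see "the company a word keeps" at a glance.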

2.2. Automatic translation

  • VIČIČ, Jernej, ERJAVEC, Tomaž. Statistično strojno prevajanje na osnovi vzporednih korpusov [Statistical machine translation based on parallel corpora]. ERK 2002, 23.-25. 2002.

The Menola translator

Slovene sentence: evropi vlada veliki brat
ELAN model: europe government big brother
Bible model: evropi brother chief upright .
Czech translation: evropi vláda velké bratr .

3. The future of corpus and data-driven linguistics

3.1. The future of corpus and data-driven linguistics

Size:

  • Larger quantities of readily accessible data (Web as corpus)

  • Larger storage and processing power (Moore's law)

Complexity:

  • Deeper analysis:

    syntax, deixis, semantic roles, dialogue acts, …

  • Multimodal corpora:

    speech, film, transcriptions,…

  • Annotation levels and linking:

    co-existence and linking of varied types of annotations; ambiguity

  • Development of tools and platforms:

    precision, robustness, unsupervised learning, meta-learning

3.2. Development of corpus linguistics for smaller languages

  • varied, high-quality and accessible corpora
  • technology of morphosyntactic annotation / lemmatisation
  • syntactically annotated corpora (treebanks)
  • application of developed methods
  • development of curricula…


European Language Resources Association (ELRA)

June 23, 2008 at 6:14 am (Joseba Abaitua, Language Resources 07/08)

A not-for-profit organisation, the European Language Resources Association (ELRA) is established under the law of the Grand Duchy of Luxembourg. Its seat is in Luxembourg and its headquarters are in Paris (France).

Activities

Since its foundation in 1995, the European Language Resources Association (ELRA) has been a conduit for the distribution of speech, written and terminology Language Resources (LRs) for Human Language Technology (HLT), a key component of IST. To do so, it has had to address a number of technical and logistic, commercial (prices, fees, royalties), legal (licensing, Intellectual Property Rights, Management) and information dissemination issues. Since then, ELRA’s mission has broadened slightly, extending its objectives and responsibilities towards the HLT community. ELRA is now involved in the production, or commissioning of the production, of language resources through a number of initiatives, and is also actively committed to the evaluation of language engineering tools and to the identification of new resources. Finally, every other year ELRA organises a major conference on language resources and evaluation, LREC; the latest edition took place in May 2006 in Genoa, Italy.

Mission

The mission of the Association is to promote language resources and evaluation for the Human Language Technology sector in all their forms and uses, in a European context. Consequently, its goals are to coordinate and carry out the identification, production, validation, distribution and standardisation of language resources, and to support the evaluation of systems, products, tools, etc.

LANGUAGE RESOURCES (LRs)

DEFINITION

The term language resources refers to a set of speech or language data and descriptions in machine readable form, used e.g. for building, improving or evaluating natural language and speech algorithms or systems, or, as core resources for the software localisation and language services industries, for language studies, electronic publishing, international transactions, subject-area specialists and end users.

Examples of language resources are written and spoken corpora, computational lexicons, terminology databases, speech collection and processing, etc. Basic software tools are also important for the acquisition, preparation, collection, management, customisation and use of these language resources and other resources.

APPLICATIONS

BIBLIOGRAPHY:

ELRA

Catalogue of Language Resources
