1. Compilation of corpora
1.1. Steps in the preparation of a corpus
-
Choosing the component texts:
linguistic and non-linguistic criteria; availability; simplicity; size
-
Copyright
sensitivity of source (financial and privacy considerations); agreement with providers; usage, publication
-
Acquiring digital originals
Web transfer; visit; OCR
-
Up-translation
conversion to standard format; consistency; character set encodings
-
Linguistic annotation
language-dependent methods; errors
-
Documentation
TEI header; Open Archives etc.
-
Use / Download
-
(Web-based) concordancers for linguists
-
download needed for HLT use
-
licences for use
1.2. What annotation can be added to the text of the corpus?
Annotation = interpretation
-
Documentation about the corpus
-
Document structure
-
Basic linguistic markup: sentences, words, punctuation, abbreviations
-
Lemmas and morphosyntactic descriptions
-
Syntax
-
Alignment
-
Terms, semantics, anaphora, pragmatics, intonation, …
1.3. Markup Methods
-
hand annotation: documentation, first steps
generic (XML, spreadsheet) editors or specialised editors
-
semi-automatic: morphosyntactic and other linguistic annotation
cyclic approach: machine, hand, validate, correct, machine, …
-
machine, with hand-written rules: tokenisation
regular expression
-
machine, with inductively built models from annotated data:
“supervised learning”; HMMs, decision trees, inductive logic programming, …
-
machine, with inductively built models from un-annotated data:
“unsupervised learning”; clustering techniques
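Tokenisation with hand-written rules, as listed above, can be sketched with one regular expression. The pattern below is a hypothetical rule set written for illustration only; it is not taken from any particular tokeniser, and a real one would need language-specific abbreviation lists and more cases.

```python
import re

# Hypothetical token rules (an assumption for illustration, not a real tokeniser's rule set).
# Alternatives are tried left to right, so more specific rules come first.
TOKEN_RE = re.compile(
    r"""
    (?P<abbr>\b(?:[A-Za-z]\.){2,})     # dotted abbreviations such as e.g. or U.S.
  | (?P<num>\d+(?:[.,]\d+)*)           # numbers, allowing an internal dot or comma
  | (?P<word>\w+(?:[-']\w+)*)          # words, allowing an internal hyphen or apostrophe
  | (?P<punct>[^\w\s])                 # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Return a list of (token_type, token) pairs for the given text."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for kind, token in tokenise("Bil je jasen, mrzel aprilski dan."):
        print(kind, token)
```

The named groups give each token a coarse type (word, punctuation, number, abbreviation), which is the kind of basic markup that later annotation steps can build on.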
1.4. Computer coding of corpora
A good encoding must ensure durability and enable interchange between computer platforms and applications
-
The basic standard used is Extensible Markup Language, XML
-
There are a number of companion standards and technologies: XML transformations (XSLT), data definition (DTD, XML Schema, ISO Relax NG), addressing and queries (XPath, XQuery), …
-
The vocabulary of annotations for corpora and other language resources is defined by the Text Encoding Initiative, TEI
XML/TEI is used much more widely than just for corpora:
-
documentation: these slides
-
annotation of dictionaries: English-Slovene, Japanese-Slovene (from jaSlo)
-
for annotating text-critical editions
1.5. Examples of TEI encoding in corpora: meta-data
<teiHeader id="ecmr.H" type="text" lang="sl-en" creator="ET"
status="update" date.created="1999-04-13" date.updated="1999-06-22">
<fileDesc>
<titleStmt>
<title lang="sl">Ekonomsko ogledalo; 13 številk 98/99</title>
<title lang="en">Slovenian Economic Mirror; 13 issues, 98/99</title>
<respStmt>
<name>Andrej Skubic, FF</name>
<resp lang="sl">Zagotovitev digitalnega originala, poravnava</resp>
<resp lang="en">Provision of digital original, alignment</resp>
<name>Tomaž Erjavec, IJS</name>
<resp lang="sl">Tokenizacija, pretvorba v TEI</resp>
<resp lang="en">Tokenisation, conversion to TEI</resp>
</respStmt>
</titleStmt>
…
1.6. Examples of TEI encoding in corpora: Structure of the text
<quote id="Osl.1.8.18" rend="center;it">
<lg id="Osl.1.8.18.1">
<l id="Osl.1.8.18.1.1">Tam pod kostanjevim drevesom</l>
<l id="Osl.1.8.18.1.2">izdala si me,</l>
<l id="Osl.1.8.18.1.3">izdal sem te,</l>
<l id="Osl.1.8.18.1.4">ne da bi trenila z očesom.</l>
</lg>
</quote>
<p id="Osl.1.8.19">
<s id="Osl.1.8.19.1">Trije možje se niso niti ganili.</s>
<s id="Osl.1.8.19.2">Toda ko je <name>Winston</name>
znova pogledal v Rutherfordov propadli obraz, je opazil,
da so njegove oči polne solz.</s>
…
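Structural markup of this kind can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a small fragment modelled on the paragraph above; the closing `</p>` is added here as an assumption so that the fragment is well-formed, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Fragment modelled on the example paragraph; the closing </p> is an
# assumption added so that the fragment parses as well-formed XML.
TEI_FRAGMENT = """
<p id="Osl.1.8.19">
<s id="Osl.1.8.19.1">Trije možje se niso niti ganili.</s>
<s id="Osl.1.8.19.2">Toda ko je <name>Winston</name>
znova pogledal v Rutherfordov propadli obraz, je opazil,
da so njegove oči polne solz.</s>
</p>
"""


def sentences(xml_text):
    """Map each sentence id to its plain text, with whitespace normalised."""
    root = ET.fromstring(xml_text)
    result = {}
    for s in root.iter("s"):
        # itertext() also yields text inside nested elements such as <name>
        result[s.get("id")] = " ".join("".join(s.itertext()).split())
    return result


def names(xml_text):
    """Return the text of every <name> element in document order."""
    root = ET.fromstring(xml_text)
    return [n.text for n in root.iter("name")]


if __name__ == "__main__":
    for sid, text in sentences(TEI_FRAGMENT).items():
        print(sid, text)
    print(names(TEI_FRAGMENT))
```

Because every element carries an id, the extracted sentences stay addressable, which is what the alignment links later rely on.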
1.7. Examples of TEI encoding in corpora: Morphosyntactic descriptions
<s id="Osl.1.2.2.1">
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w><c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w lemma="aprilski" ana="Aopmsn">aprilski</w>
<w lemma="dan" ana="Ncmsn">dan</w>
<w lemma="in" ana="Ccs">in</w>
<w lemma="ura" ana="Ncfpn">ure</w>
<w lemma="biti" ana="Vcip3p--n">so</w>
<w lemma="biti" ana="Vmps-pfa">bile</w>
<w lemma="trinajst" ana="Mcnpnl">trinajst</w><c>.</c>
</s>
<fs id="Vcps-sma" select="sl" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a"/>
<fs id="Vcps-sman----n" select="cs" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a V8.n V13.n"/>
<fs id="Vcps-smay----n" select="cs" feats="V0. V1.c V2.p V3.s V5.s V6.m V7.a V8.y V13.n"/>
<fs id="Vcps-sna" select="sl" feats="V0. V1.c V2.p V3.s V5.s V6.n V7.a"/>
<fs id="Vcps-snan----n" select="cs" feats="V0. V1.c V2.p V3.s V5.s V6.n V7.a V8.n V13.n"/>
<fLib type="Verb">
<f id="V0." select="en ro sl cs bg et hu hr sr sl-rozaj" name="PoS"><sym value="Verb"/></f>
<f id="V1.m" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type"><sym value="main"/></f>
<f id="V1.a" select="en ro sl cs bg et hu hr sr sl-rozaj" name="Type"><sym value="auxiliary"/></f>
<f id="V1.o" select="en ro sl cs et hr sr sl-rozaj" name="Type"><sym value="modal"/></f>
<f id="V1.c" select="ro sl cs hr sr sl-rozaj" name="Type"><sym value="copula"/></f>
<f id="V1.b" select="en" name="Type"><sym value="base"/></f>
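The link between a compact morphosyntactic description in `ana` and the feature identifiers in `feats` appears to be positional: the first character gives the category, and each later character that is not a hyphen gives the value at that position. The sketch below implements this reading. The rule is inferred from the `fs` entries shown above and is not taken from the tagset specification; the function name is hypothetical, and plain hyphens are assumed for non-applicable positions.

```python
def msd_to_feats(msd):
    """Expand a positional MSD string into a space-separated list of feature ids.

    Rule inferred from the <fs> examples (an assumption, not the official
    specification): the category letter plus "0." names the part of speech,
    and each non-hyphen character at position n yields category + n + "." + value.
    """
    category = msd[0]
    feats = [f"{category}0."]
    for position, value in enumerate(msd[1:], start=1):
        if value != "-":  # a hyphen marks an attribute that does not apply
            feats.append(f"{category}{position}.{value}")
    return " ".join(feats)


if __name__ == "__main__":
    print(msd_to_feats("Vcps-sma"))        # V0. V1.c V2.p V3.s V5.s V6.m V7.a
    print(msd_to_feats("Vcps-sman----n"))  # V0. V1.c V2.p V3.s V5.s V6.m V7.a V8.n V13.n
```

Both outputs agree with the `feats` values of the corresponding `fs` elements, so each feature id can then be looked up in the feature library to get its name and value.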
1.8. Examples of TEI encoding in corpora: Alignment
<linkGrp id="Oslen.1" type="body" targtype="s" domains="Oen Osl">
<link xtargets="Osl.1.2.2.1 ; Oen.1.1.1.1"/>
<link xtargets="Osl.1.2.2.2 ; Oen.1.1.1.2"/>
<link xtargets="Osl.1.2.3.1 ; Oen.1.1.2.1"/>
<link xtargets="Osl.1.2.3.2 ; Oen.1.1.2.2"/>
…
<link xtargets="Osl.1.2.6.5 ; Oen.1.1.5.5"/>
<link xtargets="Osl.1.2.6.6 ; Oen.1.1.5.6 Oen.1.1.5.7"/>
<link xtargets="Osl.1.2.6.7 ; Oen.1.1.5.8"/>
…
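Each `xtargets` value lists the sentence ids of one language, a semicolon, and the ids of the other language, so one link can cover 1:1 as well as 1:2 correspondences. A minimal sketch of reading such links follows. It assumes a well-formed sample with self-closed `link` elements and an added closing `</linkGrp>`; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample modelled on the alignment example; self-closed <link/> elements and
# the closing </linkGrp> are assumptions made so that the sample parses.
SAMPLE = """
<linkGrp id="Oslen.1" type="body" targtype="s" domains="Oen Osl">
<link xtargets="Osl.1.2.6.5 ; Oen.1.1.5.5"/>
<link xtargets="Osl.1.2.6.6 ; Oen.1.1.5.6 Oen.1.1.5.7"/>
<link xtargets="Osl.1.2.6.7 ; Oen.1.1.5.8"/>
</linkGrp>
"""


def read_links(xml_text):
    """Return a list of (ids_before_semicolon, ids_after_semicolon) pairs."""
    root = ET.fromstring(xml_text)
    links = []
    for link in root.iter("link"):
        first_side, second_side = link.get("xtargets").split(";")
        links.append((first_side.split(), second_side.split()))
    return links


def link_type(pair):
    """Describe a link as 'n:m', counting the sentence ids on each side."""
    return f"{len(pair[0])}:{len(pair[1])}"


if __name__ == "__main__":
    for pair in read_links(SAMPLE):
        print(link_type(pair), pair)
```

Counting link types in this way gives a quick check on an alignment: a high share of links that are not 1:1 may point to segmentation or alignment errors worth validating by hand.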
2. Examples of use
2.1. Lexicology
-
Concordances and collocations
“You shall know a word by the company it keeps.” (Firth, 1957)
-
Induction of multilingual lexica:
-
Dan Tufiş, Ana-Maria Barbu: Revealing translators' knowledge: statistical methods in constructing practical translation lexicons for language and speech processing, in International Journal of Speech Technology, Vol. 5, No. 3, 2002, Kluwer Academic Publishers.
-
Nancy Ide, Tomaž Erjavec and Dan Tufiş: Sense Discrimination with Parallel Corpora, in Proceedings of the SIGLEX Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, ACL 2002, Philadelphia, July 2002, pp. 56-60.
Automatically built 7-language dictionary from the '1984' corpus of the EU project MULTEXT-East.
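Returning to concordances and collocations: a keyword-in-context (KWIC) view and a simple count of neighbouring words can be sketched in a few lines. This is an illustrative sketch only; the function names and the window size are assumptions, and real concordancers add indexing, lemma search and statistical association measures.

```python
from collections import Counter


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, keyword, width=3):
    """Count the words occurring within `width` tokens of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            window = tokens[max(0, i - width):i] + tokens[i + 1:i + 1 + width]
            counts.update(w.lower() for w in window)
    return counts


# Tokens of the example sentence used earlier, punctuation left out.
TOKENS = "Bil je jasen mrzel aprilski dan in ure so bile trinajst".split()

if __name__ == "__main__":
    print(kwic(TOKENS, "dan", width=2))  # [('mrzel aprilski', 'dan', 'in ure')]
    print(collocates(TOKENS, "dan", width=1))
```

On a real corpus the raw neighbour counts would normally be replaced by an association score, since frequent function words otherwise dominate the list.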
2.2. Automatic translation
-
VIČIČ, Jernej, ERJAVEC, Tomaž. Statistično strojno prevajanje na osnovi vzporednih korpusov [Statistical machine translation based on parallel corpora]. ERK 2002, 23.-25. 2002.
The Menola translator
Slovene sentence: evropi vlada veliki brat
ELAN model: europe government big brother
Bible model: evropi brother chief upright .
Czech translation: evropi vláda velké bratr .
3. The future of corpus and data-driven linguistics
3.1. The future of corpus and data-driven linguistics
Size:
-
Larger quantities of readily accessible data (Web as corpus)
-
Larger storage and processing power (Moore's law)
Complexity:
-
Deeper analysis:
syntax, deixis, semantic roles, dialogue acts, …
-
Multimodal corpora:
speech, film, transcriptions, …
-
Annotation levels and linking:
co-existence and linking of varied types of annotations; ambiguity
-
Development of tools and platforms:
precision, robustness, unsupervised learning, meta-learning
3.2. Development of corpus linguistics for smaller languages
-
varied, high-quality and accessible corpora
-
technology of morphosyntactic annotation / lemmatisation
-
syntactically annotated corpora (treebanks)
-
application of developed methods
-
development of curricula…
BIBLIOGRAPHY: