Ruhr-Universität Bochum
 
    Project "Reference Corpus Middle High German (1050--1350)"

Funded by the German Research Foundation (DFG)

Project leaders: Klaus-Peter Wegera, Bochum (PI); Thomas Klein, Bonn (Co-PI); Stefanie Dipper, Bochum (Co-PI)
Staff: Lars Eschke, Birgit Herbers, Sarah Kwekkeboom, Frauke Thielert, Elke Weber
Duration: 2009--2012

The project "Reference Corpus MHG" aims to create a reference corpus of Middle High German annotated with morpho-syntactic information. The corpus will be made available to the research community through the web-based corpus search tool ANNIS.

The project is part of a larger initiative to build a diachronic corpus of German that can be searched across texts from different regions and centuries. The partner project is the DFG-funded project "Reference Corpus Old German" (project leaders: Karin Donhauser (HU Berlin), Jost Gippert (Frankfurt am Main), Rosemarie Lühr (Jena)).

   The Corpus
The corpus comprises High German texts (1050--1200: all available texts; 1200--1350: a balanced selection). The texts are processed so that they can serve both historical linguistics and medieval studies. This is achieved by (i) diplomatic transcription of the texts, (ii) (semi-)automatic annotation with several layers of information (see below), and (iii) making the annotated texts available through the linguistic database ANNIS, which provides a user-friendly interface for querying and visualizing the data and its annotations.

Annotations:

  • word boundaries, sentence boundaries (manual)
  • lemma (semi-automatic)
  • part of speech (POS), according to the STTS tagset (semi-automatic)
  • morphological features + inflection class (semi-automatic)
  • normalized wordform (semi-automatic)
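
As a rough illustration of how these annotation layers might sit together on a single token, the following Python sketch defines a hypothetical record type. The class names, field names, example wordforms, and tag values are assumptions made for illustration only; they do not reflect the project's actual data format or the internal data model of ANNIS.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Token:
    """One transcribed word with its annotation layers (hypothetical layout)."""
    diplomatic: str                  # wordform as transcribed from the manuscript
    normalized: str                  # normalized wordform
    lemma: str                       # lemma
    pos: str                         # part-of-speech tag (STTS-style label)
    morph: Dict[str, str] = field(default_factory=dict)  # morphological features
    inflection_class: str = ""       # inflection class, if annotated


@dataclass
class Sentence:
    """A sentence; its boundaries are implied by the grouping of tokens."""
    tokens: List[Token]


def find_by_pos(sentences: List[Sentence], pos: str) -> List[Token]:
    """Return all tokens in the given sentences that carry the given POS tag."""
    return [tok for sent in sentences for tok in sent.tokens if tok.pos == pos]


# Invented example data, not taken from the corpus.
example = Sentence(tokens=[
    Token(diplomatic="vnde", normalized="unde", lemma="unde", pos="KON"),
    Token(diplomatic="kvnic", normalized="künic", lemma="künic", pos="NN",
          morph={"case": "nom", "number": "sg"}),
])

nouns = find_by_pos([example], "NN")
print([tok.lemma for tok in nouns])
```

Keeping the diplomatic and the normalized wordform side by side on one record is one plausible way to let a query match either the manuscript spelling or the normalized form.
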
   Corpus Tools
For the above-mentioned tasks, we are developing (semi-)automatic tools:
  • OTTO, a tool designed for the diplomatic transcription of historical language data (Dipper and Schnurrenberger 2009, 2011).
  • STAN, a tool for normalizing wordform variants (Schnurrenberger 2010).

   Publications
On the project:

  • Nina Bartsch, Stefanie Dipper, Birgit Herbers, Sarah Kwekkeboom, Klaus-Peter Wegera (Bochum), Lars Eschke, Thomas Klein, Elke Weber (Bonn) (2011). Annotiertes Referenzkorpus Mittelhochdeutsch (1050-1350). Poster session at the 33rd annual meeting of the German Linguistic Society (DGfS-2011).
On automatic analysis and tools:
  • Stefanie Dipper (to appear). Morphological and Part-of-Speech Tagging of Historical Language Data: A Comparison. In Journal for Language Technology and Computational Linguistics, Special Issue (= Proceedings of the TLT-Workshop on Annotation of Corpora for Research in the Humanities).
  • Stefanie Dipper and Martin Schnurrenberger (2011). OTTO: A Tool for Diplomatic Transcription of Historical Texts. In Zygmunt Vetulani (ed.): Human Language Technology: Challenges for Computer Science and Linguistics. 4th Language and Technology Conference, LTC 2009. Revised Selected Papers, pp. 456-467. Springer. (Revised version of Dipper and Schnurrenberger (2009).)
  • Stefanie Dipper (2010). POS-Tagging of Historical Language Data: First Experiments. In Proceedings of the 10th Conference on Natural Language Processing (KONVENS-10), Saarbrücken.
  • Stefanie Dipper, Lara Kresse, Martin Schnurrenberger, and Seong-Eun Cho (2010). OTTO: A Transcription and Management Tool for Historical Texts. In Proceedings of the ACL Linguistic Annotation Workshop (LAW IV), Uppsala.
  • Martin Schnurrenberger (2010). Methods for Graphemic Normalization of Unstandardized Written Language from Middle High German Corpora. Master's thesis, Ruhr University Bochum.
  • Stefanie Dipper and Martin Schnurrenberger (2009). OTTO: A Tool for Diplomatic Transcription of Historical Texts. In Proceedings of the 4th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 516-520. Poznan, Poland.
  • Stephanie Schmitz (2009). Implementierung und Evaluation eines Part-of-speech-Taggers für mittelhochdeutsche Korpora. Bachelor's thesis, Ruhr University Bochum.

 
 
Last modified: Thursday, 08-Dec-2011 23:36:49 CET | Created by: Stefanie Dipper