The Anselm Project - Overview | Computational Historical Linguistics

Overview

Note: This page focuses on aspects related to the computational linguistics side of the project. More information is available on the main website of the Anselm project (currently only in German).

St. Anselm of Canterbury in a 19th-century English stained-glass window

“Interrogatio Sancti Anselmi de Passione Domini” (‘Questions by Saint Anselm about the Lord’s Passion’) is a medieval religious treatise that survives in an exceptionally large number of written records. In total, around 70 German manuscripts and prints are known, produced between the 14th and 16th centuries.

In the texts, Anselm of Canterbury puts questions to the Virgin Mary concerning the Passion of Jesus Christ, and she answers him in longer monologues. While the texts are comparable in content and logical structure, and even show (semi-)parallelism in sentence structure and wording, they do not follow fixed spelling conventions and show dialectal variation in graphematics, phonology, morphology, and syntax. This makes the corpus a highly interesting resource for comparative investigations in areas such as linguistics, history, and theology.

The project is funded by the Deutsche Forschungsgemeinschaft (DFG) (1st phase: 2010-2012, 2nd phase: since 2013). In a parallel project run by Dr. Simone Schultz-Balluff and Prof. Dr. Klaus-Peter Wegera (German Department, Ruhr University Bochum), diplomatic transcriptions of all extant German manuscripts and prints have been created.

Linguistic annotation

The corpus is annotated on various layers:

On the normalization layer, each wordform is mapped to a corresponding wordform of Modern German according to specific guidelines.
For morphology and part of speech, we used a slightly modified version of the STTS (Stuttgart-Tübingen Tagset).
For lemma annotation, each wordform is assigned the lemma given in the Duden dictionary or, for extinct wordforms, in Lexer's Middle High German dictionary.

Furthermore, we aligned corresponding words and phrases between different manuscripts to support cross-textual queries. Because the manuscripts in the Anselm corpus contain no punctuation marks that signal clause or sentence boundaries, we also developed guidelines for annotating sentence boundaries manually. Our partner project annotates selected cue words that refer to prominent concepts (e.g. the Last Supper) and protagonists (e.g. Mary).
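
The annotation layers described above can be pictured as one record per token. The following Python sketch is only an illustration: the class name, the field names, and the two example tokens are assumptions made for this page and are not taken from the corpus or from the CorA data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedToken:
    """One token with its annotation layers (hypothetical structure)."""
    wordform: str    # historical wordform as transcribed
    normalized: str  # corresponding Modern German wordform
    pos: str         # STTS-style part-of-speech tag
    lemma: str       # dictionary lemma


# Invented example tokens, not corpus data.
tokens = [
    AnnotatedToken("vnd", "und", "KON", "und"),
    AnnotatedToken("sprach", "sprach", "VVFIN", "sprechen"),
]


def wordforms_for_lemma(tokens, lemma):
    """Return all historical wordforms annotated with the given lemma."""
    return [t.wordform for t in tokens if t.lemma == lemma]
```

Keeping the layers side by side in this way is what lets a query on one layer (for example the lemma) retrieve the spelling variants recorded on another.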

Methods and tools

We developed the web-based annotation tool CorA, which supports annotation on multiple levels, editing of the primary data, and modification of token boundaries during the annotation process. The manually normalized data was used to develop the Norma tool, which performs automatic normalization using a flexible combination of methods (such as rule-based methods and weighted Levenshtein distance). Furthermore, we trained RFTagger on our data in different settings. We are currently developing methods for segmenting texts that lack (consistently used) punctuation marks.
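
To illustrate the idea behind normalization with a weighted Levenshtein distance, here is a minimal Python sketch. It is not Norma's implementation: the function names, the substitution weights, and the small target lexicon are assumptions chosen for illustration. The point is that character replacements typical of historical spelling (such as "v" for "u") can be made cheaper than arbitrary replacements, so that the intended modern wordform becomes the closest candidate.

```python
def weighted_levenshtein(source, target, sub_cost=None,
                         ins_cost=1.0, del_cost=1.0):
    """Edit distance in which selected substitutions cost less than 1.0.

    sub_cost maps (source_char, target_char) pairs to a cost; any pair
    not listed costs 1.0, and identical characters cost 0.0.
    """
    sub_cost = sub_cost or {}
    # prev[j] = cost of turning the empty prefix of source into target[:j]
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [i * del_cost]
        for j, t in enumerate(target, start=1):
            sub = 0.0 if s == t else sub_cost.get((s, t), 1.0)
            curr.append(min(prev[j] + del_cost,      # delete s
                            curr[j - 1] + ins_cost,  # insert t
                            prev[j - 1] + sub))      # substitute s -> t
        prev = curr
    return prev[-1]


def normalize(wordform, lexicon, sub_cost=None):
    """Pick the lexicon entry with the smallest weighted distance."""
    return min(lexicon,
               key=lambda cand: weighted_levenshtein(wordform, cand, sub_cost))


# Hypothetical weights and lexicon, for illustration only.
weights = {("v", "u"): 0.1, ("y", "i"): 0.1}
lexicon = ["und", "uns", "ende"]
best = normalize("vnd", lexicon, weights)  # selects "und"
```

In a real setting the weights would be learned from manually normalized data rather than set by hand, and the candidate lexicon would be far larger.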

Searching the corpus / Availability

Currently, the transcriptions are available only in the form of PDF documents. In the near future, the data will be made available via ANNIS and for download in TEI and CorA-XML formats.

The research reported on this website was supported by: