MHist

Multilinguality in historical documents – challenges and solutions for digital humanities (MHist)

Full-day Workshop organised in conjunction with the Digital Humanities 2014 Conference

7 July 2014 Lausanne, Switzerland

Workshop endorsed by the ACL Special Interest Group on Language Technologies for the Socio-Economic Sciences and Humanities, SIGHUM

Recently, collaboration between the Language Technology community and specialists in various areas of the Humanities has become more efficient and fruitful, driven by the common aim of exploring and preserving cultural heritage data. Notable in this respect are the digitisation campaigns of recent years and a series of initiatives in the Digital Humanities, especially those making old manuscripts and prints available in the form of Digital Libraries.
The availability of old texts online has produced a revolutionary shift in the way such objects are analysed. They are no longer accessible only to a small number of specialists who know the language of the document, but also to broader groups with various requirements:

non-expert users who would like to know what the document is about, understand the main topics, and locate places and persons. These users have little or no knowledge of old languages, and are usually less familiar with toponyms (especially when these belong to geographical spaces unknown to the user);
researchers from neighbouring fields, who often have only minimal knowledge of the language but considerable knowledge of the historical context and might be familiar with historical toponyms and proper names;
students and researchers specialising in historical data, who have the required language skills but can still profit from additional information accompanying the texts.

These considerations imply that the storage and visualisation of old texts should be accompanied by a collection of tools that enrich the text with suitable information and make it understandable for different user groups. Such tools usually involve automatic language processing methods. In contrast to the processing of modern texts, for which language technology has made great progress in recent years, automatic processing of old texts is still problematic, mainly because:

Historical language data is sparse. First, compared to the wealth of documents written in modern languages, there are only a few documents available for historical languages. Second, transcribing old manuscripts often requires expert knowledge. Third, due to the absence of a standard language, historical language variants differ from each other in spelling, morphology, syntax, and lexical semantics.
Texts are often multilingual, consisting of mixtures of different languages, such as single words, phrases, or entire sentences written in Latin that are intermixed with passages written in the actual language of the text. In the case of texts from areas with rich cultural mixtures (e.g. the Balkans), one can additionally find paragraphs in “exotic” local languages.

The focus of this workshop is on the second aspect. We think that the challenges posed by multilinguality should be tackled by adapting existing multilingual language resources and tools, and, where necessary, by providing training data in the form of corpora or lexicons for a certain period of time in history.

We are looking for original, unpublished work on topics including, but not limited to, the following:

character-level MT for normalisation
historical and modern data as comparable corpora
historical texts in different languages as parallel or comparable corpora
MT for translation between language versions
OCR for multilingual documents
word- and/or paragraph-level language identification
crosslingual retrieval in historical documents
ontologies as language-independent interfaces between collections of historical texts
particularities of multilingual historical texts and challenges for IT
information extraction and retrieval for multilingual historical documents

Authors interested in submitting an abstract are required to send an expression of interest via email to Cristina Vertan, containing the title, the authors and a 10-line abstract, no later than 15th May.

Submissions, in the form of an abstract of about 1500 words, are due on 10th June at the same address.