CMC-NormSoMe Corpus

In a pilot study, we annotated data from different CMC genres: blog comments, chat (professional and social), Twitter, WhatsApp conversations and Wikipedia discussion pages. The data comes from the "EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication/social media" and is available for download here. We achieved high inter-annotator agreement (Cohen's κ > .8). For more details, see:

Ronja Laarmann-Quante and Stefanie Dipper (2016). An annotation scheme for the comparison of different genres of social media with a focus on normalization. In: Proceedings of the LREC Workshop on Normalisation and Analysis of Social Media Texts (NormSoMe). Portorož, Slovenia.
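
For readers unfamiliar with the agreement measure mentioned above, the following is a minimal sketch of how Cohen's κ can be computed for two annotators. The function name, the label names ("norm", "ok") and the toy data are hypothetical and chosen only for illustration; they are not taken from the corpus or from the annotation scheme in the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal proportions, summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations (not real corpus data).
annotator_1 = ["norm", "ok", "ok", "norm", "ok", "ok", "norm", "ok", "ok", "ok"]
annotator_2 = ["norm", "ok", "ok", "norm", "ok", "ok", "ok", "ok", "ok", "ok"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

In this toy example the annotators agree on 9 of 10 items, but because chance agreement is already 0.62, κ comes out at about 0.74, which is below the raw agreement rate.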