Learner corpora, that is, collections of language produced by second language learners, have been systematically collected since the 1990s, and computer-based language learning environments are substantially increasing the opportunities for collecting data about the process and product of learning. This offers a growing empirical basis to inform theories of second language acquisition and interlanguage systems (cf. Meurers & Dickinson, 2017).
Yet as soon as research questions go beyond the acquisition of vocabulary and constructions with unambiguous surface indicators, corpora must be enhanced with linguistic annotation to support efficient retrieval of the data relevant to those questions. In contrast to the various linguistic annotation schemes that have been developed for native language corpora, the discussion of which linguistic analyses and annotations are meaningful and appropriate for learner language is only beginning.
When formulating linguistic generalizations, one generally relies on a long tradition of linguistic analysis that has established an inventory of categories and properties to abstract away from the surface strings. In this talk, we will see that traditional linguistic categories are not necessarily an appropriate index into the space of interlanguage realizations and their systematicity, which research into second language acquisition aims to capture.
Complementing the language explicitly given in the corpus, we also consider the need for information about the tasks (i.e., the functional context) that gave rise to the texts collected in the corpus, in order to annotate the data and support valid interpretations of it.