
Developing an NLP pipeline for automatic document enrichment and exercise generation

Thanks to the internet, foreign language (FL) learners now have ubiquitous access to authentic text materials. However, looking up extra information on unknown words often remains cumbersome. To facilitate access to such information, we have developed, within the iRead+ project, an enrichment pipeline that embeds enrichment markup in reading texts. These enriched documents can be used by developers of CALL applications. The aim of the project was to enable automatic enrichment of texts with word-specific and contextual information, in order to create an enhanced reading experience on tablet PCs and to support automatic generation of grammatical exercises. In this paper, we present the architecture of the enrichment pipeline developed in the iRead+ project and describe the proof of concept developed for FL learners, which covers two applications of enriched documents: an enriched reading environment and an exercise generator.

Within the iRead+ project, we developed an NLP pipeline which automatically creates an enriched version of a reading text. The result is an enriched XML document containing both linguistic annotations and semantic annotations, the latter based on the recognition and disambiguation of named entity references (NER). Application developers can use the enriched document to create, for example, language learning exercises, or to provide extra information in documents presented to the reader on tablet computers or other devices. One of the challenges in the project was to improve access to the databases used (e.g. DBpedia, Wiktionary, Cornetto), since each database has its own ontology structure.
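To make the idea of an enriched XML document concrete, the following minimal Python sketch reads a small annotated sentence and collects its linguistic and entity annotations. The element name `w`, the attributes `lemma`, `pos` and `entity`, and the sample sentence are illustrative assumptions; the actual iRead+ markup is not specified here and may look quite different.

```python
import xml.etree.ElementTree as ET

# Hypothetical enriched markup. Element and attribute names are assumptions
# made for illustration only; they are not the iRead+ schema.
SAMPLE = """<text lang="nl">
  <s>
    <w lemma="Leuven" pos="PROPN" entity="http://dbpedia.org/resource/Leuven">Leuven</w>
    <w lemma="liggen" pos="VERB">ligt</w>
    <w lemma="in" pos="ADP">in</w>
    <w lemma="Vlaanderen" pos="PROPN" entity="http://dbpedia.org/resource/Flanders">Vlaanderen</w>
  </s>
</text>"""


def read_tokens(xml_text):
    """Return one dict per <w> element, in document order."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter("w"):
        tokens.append({
            "form": w.text,
            "lemma": w.get("lemma"),
            "pos": w.get("pos"),
            "entity": w.get("entity"),  # None when the token is not a linked entity
        })
    return tokens


def entity_links(tokens):
    """Map each surface form that carries an entity annotation to its link."""
    return {t["form"]: t["entity"] for t in tokens if t["entity"]}


tokens = read_tokens(SAMPLE)
links = entity_links(tokens)
```

An application could use `links` to attach background information to named entities in the reading view, and the `lemma` and `pos` values as input for exercise generation.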

Three proofs of concept have been developed. One demonstration tool concerns a language learner application, for which an enriched reading environment and an exercise generator have been developed. The enriched reading environment is an integrated resource which spares the FL learner the burden of consulting external resources. The main advantage of this "enriched reading experience" is that the annotations and enrichments are added automatically. The data enrichments make the text easier to understand and therefore increase the chances of language acquisition. The exercise generator automatically creates morphological and morphosyntactic exercises on the basis of the linguistic annotation found in the enriched documents. These include exercises on verb conjugation, the use of prepositions, gender and number of nouns and adjectives, etc.
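As an illustration of how linguistic annotation can drive exercise generation, the sketch below builds a simple gap-fill item from an annotated sentence. It is a minimal example under stated assumptions, not the generator developed in the project: the token dictionaries, the part-of-speech labels and the function `make_gap_fill` are hypothetical.

```python
def make_gap_fill(tokens, target_pos, show_lemma=True):
    """Blank out the first token tagged `target_pos`.

    Returns (prompt, answer), or None if no token has that tag.
    With show_lemma=True the lemma is given as a hint, which suits
    conjugation exercises; for prepositions the hint would reveal the answer.
    """
    for i, tok in enumerate(tokens):
        if tok["pos"] == target_pos:
            words = [t["form"] for t in tokens]
            gap = "_____"
            if show_lemma:
                gap += " (" + tok["lemma"] + ")"
            words[i] = gap
            return " ".join(words), tok["form"]
    return None


# Hypothetical annotated sentence (Dutch: "Leuven ligt in Vlaanderen").
sentence = [
    {"form": "Leuven", "lemma": "Leuven", "pos": "PROPN"},
    {"form": "ligt", "lemma": "liggen", "pos": "VERB"},
    {"form": "in", "lemma": "in", "pos": "ADP"},
    {"form": "Vlaanderen", "lemma": "Vlaanderen", "pos": "PROPN"},
]

verb_item = make_gap_fill(sentence, "VERB")
prep_item = make_gap_fill(sentence, "ADP", show_lemma=False)
```

Here `verb_item` asks the learner to conjugate the verb given its lemma, while `prep_item` asks for the missing preposition without a hint.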

A new type of personal dictionary has been proposed that can be used to select texts matching the language skills of the learner. This data structure, which contains vocabulary that characterizes the language level of the user, is updated in two ways: proactively by the user, whenever they come across an interesting or difficult word, and by the application, when the learner makes a mistake in an exercise. The data structure evolves along with the skills of the learner, gaining new words and losing old ones once the learner is able to use them correctly in exercises. The algorithm that classifies and selects new texts for the user ranks the pool of texts based on the number and relevance of co-occurrences with the personal dictionary.
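The behaviour of the personal dictionary and the text ranking can be sketched as follows. This is a simplified illustration, not the project's algorithm: the class `PersonalDictionary`, the `mastery_threshold` parameter, and the scoring (a plain count of tokens whose lemma is in the dictionary, with no relevance weighting, highest count first) are all assumptions.

```python
class PersonalDictionary:
    """Vocabulary the learner is still working on (hypothetical sketch)."""

    def __init__(self, mastery_threshold=3):
        # lemma -> number of correct answers since the word was (re)added
        self.words = {}
        # Assumption: a word is dropped after this many correct answers.
        self.mastery_threshold = mastery_threshold

    def add(self, lemma):
        """Called when the user flags a word or makes a mistake on it."""
        self.words[lemma] = 0

    def record_answer(self, lemma, correct):
        """Update the dictionary after an exercise answer."""
        if not correct:
            self.add(lemma)
        elif lemma in self.words:
            self.words[lemma] += 1
            if self.words[lemma] >= self.mastery_threshold:
                del self.words[lemma]


def rank_texts(texts, dictionary):
    """Rank text titles by how many tokens match the personal dictionary.

    `texts` maps a title to its list of lemmas. Ties are broken by title
    so that the ordering is deterministic.
    """
    scores = {
        title: sum(1 for lemma in lemmas if lemma in dictionary.words)
        for title, lemmas in texts.items()
    }
    return sorted(scores, key=lambda title: (-scores[title], title))


learner = PersonalDictionary(mastery_threshold=2)
learner.add("liggen")                   # flagged by the user while reading
learner.record_answer("fiets", False)   # added after a mistake in an exercise

pool = {
    "a": ["de", "fiets", "liggen"],
    "b": ["de", "fiets"],
    "c": ["de", "stad"],
}
ranking = rank_texts(pool, learner)
```

Whether texts with many or few dictionary words should be preferred depends on the pedagogical goal; the descending order used here is only one possible choice.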


Hans Paulussen    
KU Leuven

Hans Paulussen is a senior researcher at the University of Leuven (KU Leuven KULAK). He is involved in computational linguistic projects on CALL, corpus compilation and tagging.

