What can we get from 1000 tokens? A case study of multilingual POS tagging for resource-poor languages

Long Duong, Trevor Cohn, Karin Verspoor, Steven Bird, Paul Cook

Research output: Chapter in Book/Report/Conference proceeding > Conference Paper published in Proceedings

Abstract

In this paper we address the problem of multilingual part-of-speech tagging for resource-poor languages. We use parallel data to transfer part-of-speech information from resource-rich to resource-poor languages. Additionally, we use a small amount of annotated data to learn to "correct" errors from the projection approach, such as tagset mismatch between languages, achieving state-of-the-art performance (91.3%) across 8 languages. Our approach has modest data requirements and uses minimum divergence classification. For situations where no universal tagset mapping is available, we propose an alternate method, resulting in state-of-the-art 85.6% accuracy on the resource-poor language Malagasy.

Original language: English
Title of host publication: EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Place of Publication: Doha, Qatar
Publisher: Association for Computational Linguistics (ACL)
Pages: 886-897
Number of pages: 12
ISBN (Electronic): 9781937284961
Publication status: Published - 1 Jan 2014
Externally published: Yes
Event: 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014 - Doha, Qatar
Duration: 25 Oct 2014 → 29 Oct 2014

Conference

Conference: 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014
Country: Qatar
City: Doha
Period: 25/10/14 → 29/10/14

Cite this

Duong, L., Cohn, T., Verspoor, K., Bird, S., & Cook, P. (2014). What can we get from 1000 tokens? A case study of multilingual POS tagging for resource-poor languages. In EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 886-897). Association for Computational Linguistics (ACL).