Large-scale text collection for unwritten languages

Florian Hanke, Steven Bird

Research output: Chapter in Book/Report/Conference proceeding; Conference Paper published in Proceedings; peer-reviewed


Existing methods for collecting texts from endangered languages are not creating the quantity of data that is needed for corpus studies and natural language processing tasks. This is because the process of transcribing and translating from audio recordings is too onerous. A more effective method, we argue, is to involve local speakers in the field location, using an audio-only translation interface that is portable and easy to use. We present encouraging early results of an experimental investigation of the efficiency of creating translations using this method, and report on the quality of the resulting content.
Original language: English
Title of host publication: Proceedings of the 6th International Joint Conference on Natural Language Processing
Publisher: Asian Federation of Natural Language Processing
Number of pages: 5
Publication status: Published - 2013
Externally published: Yes
Event: International Joint Conference on Natural Language Processing - Nagoya, Japan
Duration: 14 Oct 2013 - 18 Oct 2013
Conference number: 6th


Conference: International Joint Conference on Natural Language Processing

