Collecting bilingual audio in remote Indigenous communities

Steven Bird, Lauren Gawne, Katie Gelbart, Isaac Mcalister

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper published in Proceedings › peer-review

19 Citations (Scopus)


Most of the world's languages are under-resourced, and most under-resourced languages lack a writing system and literary tradition. As these languages fall out of use, we lose important sources of data that contribute to our understanding of human language. The first, urgent step is to collect and orally translate a large quantity of spoken language. This can be digitally archived and later transcribed, annotated, and subjected to the full range of speech and language processing tasks, at any time in the future. We have been investigating a mobile application for recording and translating unwritten languages. We visited Indigenous communities in Brazil and Nepal and taught people to use smartphones for recording spoken language and for orally interpreting it into the national language, and collected bilingual phrase-aligned speech recordings. In spite of several technical and social issues, we found that the technology enabled an effective workflow for speech data collection. Based on this experience, we argue that the use of special-purpose software on smartphones is an effective and scalable method for large-scale collection of bilingual audio, and ultimately bilingual text, for languages spoken in remote Indigenous communities.

Original language: English
Title of host publication: COLING 2014 - 25th International Conference on Computational Linguistics, Proceedings of COLING 2014
Subtitle of host publication: Technical Papers
Place of Publication: Dublin, Ireland
Publisher: Association for Computational Linguistics, ACL Anthology
Number of pages: 10
ISBN (Electronic): 9781941643266
Publication status: Published - 1 Jan 2014
Externally published: Yes
Event: 25th International Conference on Computational Linguistics, COLING 2014 - Dublin, Ireland
Duration: 23 Aug 2014 → 29 Aug 2014


Conference: 25th International Conference on Computational Linguistics, COLING 2014

