The human language project: Building a universal corpus of the world's languages

Steven Abney, Steven Bird

Research output: Chapter in Book/Report/Conference proceeding; Conference Paper published in Proceedings; peer-review

27 Citations (Scopus)

Abstract

We present a grand challenge to build a corpus that will include all of the world's languages, in a consistent structure that permits large-scale cross-linguistic processing, enabling the study of universal linguistics. The focal data types, bilingual texts and lexicons, relate each language to one of a set of reference languages. We propose that the ability to train systems to translate into and out of a given language be the yardstick for determining when we have successfully captured a language. We call on the computational linguistics community to begin work on this Universal Corpus, pursuing the many strands of activity described here, as their contribution to the global effort to document the world's linguistic heritage before more languages fall silent.

Original language: English
Title of host publication: ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
Pages: 88-97
Number of pages: 10
Publication status: Published - 1 Dec 2010
Externally published: Yes
Event: 48th Annual Meeting of the Association for Computational Linguistics, ACL 2010 - Uppsala, Sweden
Duration: 11 Jul 2010 to 16 Jul 2010

Publication series

Name: ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference

Conference

Conference: 48th Annual Meeting of the Association for Computational Linguistics, ACL 2010
Country/Territory: Sweden
City: Uppsala
Period: 11/07/10 to 16/07/10
