Towards a data model for the universal corpus

Steven Abney, Steven Bird

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper published in Proceedings

Abstract

We describe the design of a comparable corpus that spans all of the world’s languages and facilitates large-scale cross-linguistic processing. This Universal Corpus consists of text collections aligned at the document and sentence level, multilingual wordlists, and a small set of morphological, lexical, and syntactic annotations. The design encompasses submission, storage, and access. Submission preserves the integrity of the work, allows asynchronous updates, and facilitates scholarly citation. Storage employs a cloud-hosted filestore containing normalized source data together with a database of texts and annotations. Access is permitted to the filestore, the database, and an application programming interface. All aspects of the Universal Corpus are open, and we invite community participation in its design and implementation, and in supplying and using its data.
Original language: English
Title of host publication: Proceedings of the 4th workshop on building and using comparable corpora
Subtitle of host publication: Comparable corpora and the web
Pages: 120-127
Number of pages: 8
Publication status: Published - 2011
Externally published: Yes
Event: 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011 - Portland, OR, United States
Duration: 19 Jun 2011 to 24 Jun 2011

Conference

Conference: 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011
Country: United States
City: Portland, OR
Period: 19/06/11 to 24/06/11

  • Cite this

    Abney, S., & Bird, S. (2011). Towards a data model for the universal corpus. In Proceedings of the 4th workshop on building and using comparable corpora: Comparable corpora and the web (pp. 120-127)