Abstract
We describe the design of a comparable corpus that spans all of the world’s languages and facilitates large-scale cross-linguistic processing. This Universal Corpus consists of text collections aligned at the document and sentence level, multilingual wordlists, and a small set of morphological, lexical, and syntactic annotations. The design encompasses submission, storage, and access. Submission preserves the integrity of the work, allows asynchronous updates, and facilitates scholarly citation. Storage employs a cloud-hosted filestore containing normalized source data together with a database of texts and annotations. Access is permitted to the filestore, the database, and an application programming interface. All aspects of the Universal Corpus are open, and we invite community participation in its design and implementation, and in supplying and using its data.
Original language | English |
---|---|
Title of host publication | Proceedings of the 4th workshop on building and using comparable corpora |
Subtitle of host publication | Comparable corpora and the web |
Pages | 120-127 |
Number of pages | 8 |
Publication status | Published - 2011 |
Externally published | Yes |
Event | 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011 - Portland, OR, United States. Duration: 19 Jun 2011 → 24 Jun 2011 |
Conference
Conference | 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011 |
---|---|
Country/Territory | United States |
City | Portland, OR |
Period | 19/06/11 → 24/06/11 |