Fast query for large treebanks

Sumukh Ghodke, Steven Bird

Research output: Chapter in Book/Report/Conference proceeding; Conference Paper published in Proceedings

Abstract

A variety of query systems have been developed for interrogating parsed corpora, or treebanks. With the arrival of efficient, wide-coverage parsers, it is feasible to create very large databases of trees. However, existing approaches that use in-memory search, or relational or XML database technologies, do not scale up. We describe a method for storage, indexing, and query of treebanks that uses an information retrieval engine. Several experiments with a large treebank demonstrate excellent scaling characteristics for a wide range of query types. This work facilitates the curation of much larger treebanks, and enables them to be used effectively in a variety of scientific and engineering tasks.

Original language: English
Title of host publication: NAACL HLT 2010 - Human Language Technologies
Subtitle of host publication: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference
Pages: 267-275
Number of pages: 9
Publication status: Published - 1 Dec 2010
Event: 2010 Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT 2010 - Los Angeles, CA, United States
Duration: 2 Jun 2010 to 4 Jun 2010

Publication series

Name: NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference

Conference

Conference: 2010 Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT 2010
Country: United States
City: Los Angeles, CA
Period: 2/06/10 to 4/06/10

Cite this

Ghodke, S., & Bird, S. (2010). Fast query for large treebanks. In NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference (pp. 267-275). (NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference).