Fast query for large treebanks

Sumukh Ghodke, Steven Bird

    Research output: Chapter in Book/Report/Conference proceeding; Conference Paper published in Proceedings; peer-review

    Abstract

    A variety of query systems have been developed for interrogating parsed corpora, or treebanks. With the arrival of efficient, wide-coverage parsers, it is feasible to create very large databases of trees. However, existing approaches that use in-memory search, or relational or XML database technologies, do not scale up. We describe a method for storage, indexing, and query of treebanks that uses an information retrieval engine. Several experiments with a large treebank demonstrate excellent scaling characteristics for a wide range of query types. This work facilitates the curation of much larger treebanks, and enables them to be used effectively in a variety of scientific and engineering tasks.

    Original language: English
    Title of host publication: NAACL HLT 2010 - Human Language Technologies
    Subtitle of host publication: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference
    Pages: 267-275
    Number of pages: 9
    Publication status: Published - 1 Dec 2010
    Event: 2010 Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT 2010 - Los Angeles, CA, United States
    Duration: 2 Jun 2010 to 4 Jun 2010

    Publication series

    Name: NAACL HLT 2010 - Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference

    Conference

    Conference: 2010 Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT 2010
    Country: United States
    City: Los Angeles, CA
    Period: 2/06/10 to 4/06/10
