Experiments with data-intensive NLP on a computational grid

Baden Hughes, Steven Bird, Kim Haejoong, Ewan Klein

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper published in Proceedings › peer-review


Large databases of annotated text and speech are widely used for developing and testing language technologies. However, the size of these corpora and their associated language models is outpacing the growth of processing power and network bandwidth available to most researchers. The solution, we believe, is to exploit four characteristics of language technology research: many large corpora are already held at most sites where the research is conducted; most data-intensive processing takes place in the development phase and not at run-time; most processing tasks can be construed as adding layers of annotation to immutable corpora; and many classes of language models can be approximated as the sum of smaller models. We report on a series of experiments with data-intensive language processing on a computational grid. Key features of the approach are its use of a scripting language for easy dissemination of control code to processing nodes, the use of a grid broker to manage the execution of tasks on remote nodes and collate their output, the use of a data-decomposition approach by which parametric and parallel processing of individual language processing components occurs on segmented data, and the use of researchers’ underutilized commodity machines.
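The data-decomposition idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy corpus, and the use of a local thread pool in place of grid nodes and a grid broker are all assumptions made for illustration. The corpus is split into segments, a small model (here, plain unigram counts) is built on each segment independently, and the partial models are summed into one.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def segment(tokens, n_segments):
    """Split a token list into roughly equal contiguous segments."""
    size = max(1, -(-len(tokens) // n_segments))  # ceiling division
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def count_unigrams(tokens):
    """Per-segment 'model': raw unigram counts (hypothetical stand-in task)."""
    return Counter(tokens)


def merge(partials):
    """Collate per-segment results by summing their counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def distributed_counts(tokens, n_segments=4):
    # Threads stand in for remote processing nodes; a real grid broker
    # would dispatch each segment to a different machine (assumption).
    with ThreadPoolExecutor(max_workers=n_segments) as pool:
        partials = list(pool.map(count_unigrams, segment(tokens, n_segments)))
    return merge(partials)


corpus = "the cat sat on the mat and the dog sat too".split()
combined = distributed_counts(corpus, n_segments=3)
```

For unigram counts the sum of the segment models equals the model built on the whole corpus. For models that look across token boundaries, such as bigram counts, pairs that straddle a segment boundary would be missed, which is one sense in which a summed model is only an approximation.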
Original language: English
Title of host publication: Proceedings of the International Workshop on Human Language Technology. http://eprints.unimelb.edu.au/archive/00000503
Number of pages: 8
Publication status: Published - 2004
Externally published: Yes

