Abstract
Large databases of annotated text and speech are widely used for developing and testing language technologies. However, the size of these corpora and their associated language models is outpacing the growth of processing power and network bandwidth available to most researchers. The solution, we believe, is to exploit four characteristics of language technology research: many large corpora are already held at most sites where the research is conducted; most data-intensive processing takes place in the development phase and not at run-time; most processing tasks can be construed as adding layers of annotation to immutable corpora; and many classes of language models can be approximated as the sum of smaller models. We report on a series of experiments with data-intensive language processing on a computational grid. Key features of the approach are its use of a scripting language for easy dissemination of control code to processing nodes, the use of a grid broker to manage the execution of tasks on remote nodes and collate their output, the use of a data-decomposition approach by which parametric and parallel processing of individual language processing components occurs on segmented data, and the use of researchers’ underutilized commodity machines.
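As a rough illustration of the data-decomposition idea described in the abstract, the following minimal sketch segments a corpus, builds a small model on each segment in parallel, and sums the partial models. It is not the authors' implementation. The function names (`segment`, `count_unigrams`, `build_model`) are hypothetical, a local thread pool stands in for the remote grid nodes and grid broker, and a unigram count model is chosen only because its partial results sum exactly, whereas the abstract speaks more generally of models that can be approximated as a sum of smaller models.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def segment(sentences, n_segments):
    """Split a corpus (a list of sentences) into contiguous segments."""
    size = max(1, -(-len(sentences) // n_segments))  # ceiling division
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def count_unigrams(sentences):
    """Build a small unigram count model from one segment."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts


def build_model(sentences, n_segments=4):
    """Process each segment independently, then sum the partial models."""
    segments = segment(sentences, n_segments)
    # Assumption: a local thread pool stands in for remote grid nodes;
    # pool.map plays the role of a broker dispatching tasks and
    # collating their output in order.
    with ThreadPoolExecutor(max_workers=n_segments) as pool:
        partial_models = list(pool.map(count_unigrams, segments))
    total = Counter()
    for partial in partial_models:
        total.update(partial)  # the full model is the sum of the smaller ones
    return total
```

Because the corpus is treated as immutable and each segment is processed independently, the same pattern would apply to any per-segment annotation step whose outputs can be merged afterwards.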
Original language | English |
---|---|
Title of host publication | Proceedings of the International Workshop on Human Language Technology. http://eprints.unimelb.edu.au/archive/00000503 |
Number of pages | 8 |
Publication status | Published - 2004 |
Externally published | Yes |