Collecting Low-Density Language Materials on the Web

Timothy Baldwin, Steven Bird, Baden Hughes

Research output: Chapter in Book/Report/Conference proceeding, Conference Paper published in Proceedings, peer-review

Abstract

Most web content exists in a few dozen languages. Hundreds of other languages - the 'low-density languages' - are only represented in scarce quantities on the web. How can we locate, store and describe these low-density resources? In particular, how can we identify linguistically interesting resources, such as translation sets and multilingual documents? In this paper we describe ongoing research in which we integrate a number of discrete systems (language data crawler, automated metadata generation tools, language data repositories and federated search services) to address the identification, retrieval, description, storage and access issues for low-density language materials from the web.

Original language: English
Title of host publication: Proceedings of the 12th Australasian Web Conference
Publisher: Southern Cross University
Publication status: Published - Dec 2006
Externally published: Yes
Event: 12th Australasian World Wide Web Conference, AusWeb 2006 - Noosa, QLD, Australia
Duration: 1 Jul 2006 to 5 Jul 2006

Conference

Conference: 12th Australasian World Wide Web Conference, AusWeb 2006
Country/Territory: Australia
City: Noosa, QLD
Period: 1/07/06 to 5/07/06
