Many uses, many annotations for large speech corpora: Switchboard and TDT as case studies

David Graff, Steven Bird

    Research output: Chapter in Book/Report/Conference proceeding; Conference Paper published in Proceedings; peer-review

    Abstract

    This paper discusses the challenges that arise when large speech corpora receive an ever-broadening range of diverse and distinct annotations. Two case studies of this process are presented: the Switchboard Corpus of telephone conversations and the TDT2 corpus of broadcast news. Switchboard has undergone two independent transcriptions and various types of additional annotation, all carried out as separate projects that were dispersed both geographically and chronologically. The TDT2 corpus has also received a variety of annotations, but all directly created or managed by a core group. In both cases, issues arise involving the propagation of repairs, consistency of references, and the ability to integrate annotations having different formats and levels of detail. We describe a general framework whereby these issues can be addressed successfully.

    Original language: English
    Title of host publication: 2nd International Conference on Language Resources and Evaluation, LREC 2000
    Number of pages: 7
    Publication status: Published - 2000
    Event: 2nd International Conference on Language Resources and Evaluation, LREC 2000 - Athens, Greece
    Duration: 31 May 2000 to 2 Jun 2000

    Conference

    Conference: 2nd International Conference on Language Resources and Evaluation, LREC 2000
    Country: Greece
    City: Athens
    Period: 31/05/00 to 2/06/00
