Abstract
We investigate induction of a bilingual lexicon from a corpus of phonemic transcriptions that have been sentence-aligned with English translations. We evaluate existing models that have been used for this purpose and report on two additional models, which demonstrate performance improvements. The first performs monolingual segmentation followed by alignment, while the second performs both tasks jointly. We show that monolingual and bilingual lexical entries can be learnt with high precision from corpora having just 1k to 10k sentences. We explain how our results support the application of alignment algorithms to the task of documenting endangered languages.
Original language | English |
---|---|
Title of host publication | Proceedings of the International Workshop on Spoken Language Translation |
Number of pages | 8 |
Publication status | Published - 2015 |
Externally published | Yes |