Abstract
Polysynthetic languages have exceptionally large and sparse vocabularies, thanks to the number of morpheme slots and combinations in a word. This complexity, together with a general scarcity of written data, poses a challenge to the development of natural language technologies. To address this challenge, we offer linguistically-informed approaches for bootstrapping a neural morphological analyzer, and demonstrate its application to Kunwinjku, a polysynthetic Australian language. We generate data from a finite state transducer to train an encoder-decoder model. We improve the model by "hallucinating" missing linguistic structure into the training data, and by resampling from a Zipf distribution to simulate a more natural distribution of morphemes. The best model accounts for all instances of reduplication in the test set and achieves an accuracy of 94.7% overall, a 10 percentage point improvement over the FST baseline. This process demonstrates the feasibility of bootstrapping a neural morph analyzer from minimal resources.
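The Zipf resampling step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how ranks are assigned or which units are resampled, so everything here is an assumption for illustration: the function name `zipf_resample`, the `exponent` parameter, the use of list order as rank, and the placeholder (surface, analysis) pairs are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def zipf_resample(items, n_samples, exponent=1.0, seed=0):
    """Draw n_samples items with replacement, weighting rank r by 1 / r**exponent.

    Assumption: the position of an item in `items` is its rank (first = most
    frequent). The paper may assign ranks differently.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    weights = [1.0 / (rank ** exponent) for rank in range(1, len(items) + 1)]
    return rng.choices(items, weights=weights, k=n_samples)


# Hypothetical stand-ins for uniformly generated FST output pairs.
pairs = [(f"surface_{i}", f"analysis_{i}") for i in range(1, 51)]

sample = zipf_resample(pairs, n_samples=10_000)
counts = Counter(sample)

# Low-rank items now dominate the sample, while high-rank items form a long tail.
print(counts[pairs[0]], counts[pairs[-1]])
```

Sampling with replacement keeps rare items in the training set while letting a few items account for most of the tokens, which is the skew the abstract describes as a more natural distribution.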
Original language | English |
---|---|
Title of host publication | Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics |
Editors | Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault |
Place of Publication | Pennsylvania |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 6652-6661 |
Number of pages | 15 |
Volume | 1 |
ISBN (Electronic) | 978-1-952148-25-5 |
Publication status | Published - Jul 2020 |
Event | 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 - Online (Duration: 5 Jul 2020 → 10 Jul 2020) |
Conference
Conference | 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 |
---|---|
City | Online |
Period | 5/07/20 → 10/07/20 |