Old Catalan Morphosyntax: Developing an Annotated Corpus

Open Access | Dec 2021

Abstract

This paper presents a full procedure for the development of a Part-of-Speech (POS) tagged corpus of Old Catalan. As an extremely low-resource language with rich inflection and frequent homographs, Old Catalan poses non-trivial problems in the development of a searchable constituency-based treebank. We demonstrate, however, that a semi-supervised method of incrementally building training data using both neural and memory-based taggers, together with the Pyrrha annotation tool, is highly efficient and yields accurate results. We propose that this simple and effective method could easily be extended to other low-resource historical languages for which no NLP tools exist yet.

DOI: https://doi.org/10.5334/johd.54 | Journal eISSN: 2059-481X
Language: English
Published on: Dec 21, 2021
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2021 Marieke Meelen, Afra Pujol i Campeny, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.