Levels of Annotation in the Slovene Training Corpus ssj500k 2.2

Mija Bon; Polona Gantar

doi:10.2478/jazcas-2019-0068

Journal of Linguistics/Jazykovedný časopis

Volume 70 (2019): Issue 2 (December 2019)

Open Access | Dec 2019

Abstract

This paper presents the Slovene Training Corpus ssj500k 2.2, which is annotated at the levels of tokenization, sentence segmentation, part-of-speech tagging, lemmatization, syntactic dependencies, named entities, verbal multi-word expressions, and semantic role labeling. It describes the individual annotation layers and shows how the training corpus can be used in producing various lexicons, such as the lexicon of multi-word units and the valency lexicon of modern Slovene. It concludes by outlining future work, namely the annotation of multi-word expressions based on the Slovene Lexical Database.

DOI: https://doi.org/10.2478/jazcas-2019-0068 | Journal eISSN: 1338-4287 | Journal ISSN: 0021-5597

Language: English

Page range: 390–399

Published on: Dec 21, 2019

Published by: Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics

In partnership with: Paradigm Publishing Services

Publication frequency: 3 issues per year

Related subjects: Linguistics and semiotics; Theoretical frameworks and disciplines; Linguistics, other

© 2019 Mija Bon, Polona Gantar, published by Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
