Corpus of Slovak Legislative Documents

Radovan Garabík

doi:10.2478/jazcas-2023-0004



Journal of Linguistics/Jazykovedný časopis

Volume 73 (2022): Issue 2 (September 2022)


Open Access

Mar 2023

Abstract

The article describes the construction of a corpus of Slovak legislative documents. By analyzing several statistical values of the source metadata and documents, we efficiently improve corpus quality. We describe the methods used to clean up small variations in metadata and to discriminate between documents by length, and we examine the effectiveness of several deduplication strategies. The corpus is part of a comparable corpus of legislative documents in seven languages, created in the Multilingual Resources for CEF.AT in the Legal Domain (MARCELL) project.
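The abstract names length-based filtering and deduplication without giving details on this page. The following is a minimal sketch of how such steps are commonly implemented, not the author's pipeline: the function names, the token thresholds, and the use of word 3-gram Jaccard similarity for near-duplicates are all assumptions made for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def keep_by_length(docs, min_tokens=5, max_tokens=100_000):
    """Keep documents whose whitespace token count lies within the bounds.

    The default bounds are illustrative placeholders, not values from the article.
    """
    return [d for d in docs if min_tokens <= len(d.split()) <= max_tokens]


def drop_exact_duplicates(docs):
    """Keep only the first occurrence of each normalized text."""
    seen, unique = set(), []
    for d in docs:
        digest = hashlib.sha1(normalize(d).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(d)
    return unique


def shingles(text, n=3):
    """Return the set of word n-grams of the normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}


def jaccard(a, b):
    """Jaccard similarity of the word n-gram sets of two texts."""
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0
```

A near-duplicate strategy would then compare pairs of the remaining documents with `jaccard` and discard one member of each pair above a chosen threshold; the threshold has to be tuned on the actual data.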

References

BENKO, Vladimír (2013): Data Deduplication in Slovak Corpora. In: K. Gajdošová – A. Žáková (eds.): Slovko 2013: Natural Language Processing, Corpus Linguistics, E-learning. Lüdenscheid: RAM-Verlag, pp. 27–39.
BENKO, Vladimír (2014): Aranea: Yet Another Family of (Comparable) Web Corpora. In: P. Sojka – A. Horák – I. Kopeček – K. Pala (eds.): Text, Speech and Dialogue. 17th International Conference, TSD 2014, Brno, Czech Republic, September 8–12, 2014. Proceedings. LNCS 8655. Springer International Publishing Switzerland, pp. 257–264.
GARABÍK, Radovan – BOBEKOVÁ, Kristína (2021): Lematizácia, morfologická anotácia a dezambiguácia slovenského textu – webové rozhranie. In: Slovenská reč, Vol. 86, No. 1, pp. 104–109.
GARABÍK, Radovan – LEVICKÁ, Jana (2022): Naïve Terminological Annotation of Legal Texts in Slovak – Can it Be Useful? In: Rasprave: Časopis Instituta za hrvatski jezik i jezikoslovlje. Vol. 48, No. 1, pp. 27–44.
GARABÍK, Radovan – MITANA, Denis (2022): Accuracy of Slovak Language Lemmatization and MSD Tagging – MorphoDiTa and SpaCy. In: LLOD Approaches for Language Data Research and Management, Abstract Book, Mykolo Romerio universitetas, Vilnius, pp. 93–95.
GARABÍK, Radovan – ŠIMKOVÁ, Mária (2012): Slovak Morphosyntactic Tagset. In: Journal of Language Modelling, No. 1, pp. 41–63.
JOHNSON, Ian – MACPHAIL, Alastair (2000): IATE-Inter-Agency Terminology Exchange: development of a single central terminology database for the institutions and agencies of the European Union. Workshop on Terminology resources and computation.
Law No. 185/2015 col. (copyright law of the Slovak Republic)
POMIKÁLEK, Jan (2011): Removing Boilerplate and Duplicate Content from Web Corpora. Ph.D. Thesis, Faculty of Informatics, Masaryk University in Brno.
ŠIMKOVÁ, Mária – GARABÍK, Radovan (2006): Синтаксическая разметка в Словацком национальном корпусе. In: Труды международной конференции Корпусная лингвистика – 2006. Sankt-Petersburg: St. Petersburg University Press, pp. 389–394.
STRAKOVÁ, Jana – STRAKA, Milan – HAJIČ, Jan (2014): Open-source tools for morphology, lemmatization, pos tagging and named entity recognition. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Baltimore, Maryland, June 2014. Association for Computational Linguistics, pp. 13–18.
STRAKA, Milan – STRAKOVÁ, Jana (2017): Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Vancouver, Canada, August 2017.
VÁRADI, Tamás – KOEVA, Svetla – YALAMOV, Martin – TADIĆ, Marko – SASS, Bálint – NITOŃ, Bartłomiej – OGRODNICZUK, Maciej – PĘZIK, Piotr – BARBU MITITELU, Verginica – ION, Radu – IRIMIA, Elena – MITROFAN, Maria – PĂIȘ, Vasile – TUFIȘ, Dan – GARABÍK, Radovan – KREK, Simon – REPAR, Andraž – RIHTAR, Matjaž. (2020): The MARCELL Legislative Corpus. In: Proceedings of the Twelfth Language Resources and Evaluation Conference. Marseille, France. May 2020. European Language Resources Association, pp. 3761–3768.
ZEMAN, Daniel (2017): Slovak Dependency Treebank in Universal Dependencies. In: Jazykovedný časopis, Vol. 68, No. 2, pp. 385–395.


DOI: https://doi.org/10.2478/jazcas-2023-0004 | Journal eISSN: 1338-4287 | Journal ISSN: 0021-5597


Language: English

Page range: 175–189

Published on: Mar 27, 2023

Published by: Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics

In partnership with: Paradigm Publishing Services

Publication frequency: 3 issues per year

Keywords: corpus, Slovak language, body of law, legislation

Related subjects: Linguistics and semiotics, Theoretical frameworks and disciplines, Linguistics, other

© 2023 Radovan Garabík, published by Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
