From the National Corpus of Polish to the Polish Corpus Infrastructure

Maciej Ogrodniczuk; Rafał L. Górski; Marek Łaziński; Piotr Pęzik

doi:10.2478/jazcas-2019-0061

Journal of Linguistics/Jazykovedný casopis

Volume 70 (2019): Issue 2 (December 2019)

By: Maciej Ogrodniczuk, Rafał L. Górski, Marek Łaziński and Piotr Pęzik

Open Access

Abstract

The National Corpus of Polish emerged as the cumulative result of many years of work on large reference corpora by computer scientists and linguists in Poland. Although its impact on research in linguistics, the humanities and language technology is unquestionable and highly significant, construction of the national corpus was halted in 2011. In this paper we call on the research community and funding institutions to mobilise around the construction of a corpus infrastructure with the national corpus at its heart. We argue that, on the verge of an artificial intelligence revolution, the envisaged Polish Corpus Infrastructure would provide reliable language data, combine available resources and allow easy integration of new ones.

DOI: https://doi.org/10.2478/jazcas-2019-0061 | Journal eISSN: 1338-4287 | Journal ISSN: 0021-5597

Language: English

Page range: 315 - 323

Published on: Dec 21, 2019

Published by: Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics

In partnership with: Paradigm Publishing Services

Publication frequency: 3 issues per year

Keywords: corpus linguistics, corpus lexicography, dialect corpora

Related subjects: Linguistics and semiotics; Theoretical frameworks and disciplines; Linguistics, other

© 2019 Maciej Ogrodniczuk, Rafał L. Górski, Marek Łaziński, Piotr Pęzik, published by Slovak Academy of Sciences, Ľudovít Štúr Institute of Linguistics
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
