Machine Learning in Terminology Extraction from Czech and English Texts

Dominika Kováříková

doi:10.2478/lf-2021-0014


Linguistic Frontiers

Volume 4 (2021): Issue 2 (September 2021)


Open Access

Oct 2021

Abstract

The method of automatic term recognition based on machine learning focuses primarily on the most important quantitative attributes of terms. It distinguishes terms from non-terms with a success rate of more than 95% and identifies the characteristic features of a term as a terminological unit. A single-word term can be characterized as a low-frequency word that occurs considerably more often in specialized texts than in non-academic texts, occurs in a small number of disciplines, and is unevenly distributed in the corpus, with irregular distances between its successive instances. A multi-word term is a collocation that consists of low-frequency words and contains at least one single-word term. Because the method is based on quantitative features, its algorithms can be used in multiple disciplines and in cross-lingual applications (verified on Czech and English).
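To make the single-word characterization above more concrete, the following is a minimal Python sketch of how such quantitative features might be computed for one word. It is an illustration under stated assumptions, not the procedure used in the article: the function name `term_features`, the add-one smoothing of the frequency ratio, the use of the coefficient of variation of gaps as an unevenness measure, and the toy corpora are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def term_features(word, specialized, reference):
    """Compute illustrative quantitative features of a candidate term.

    specialized: dict mapping a discipline name to its list of tokens
                 (hypothetical stand-in for a specialized corpus).
    reference:   list of tokens from non-academic texts.
    """
    spec_tokens = [t for toks in specialized.values() for t in toks]
    spec_count = spec_tokens.count(word)
    ref_count = reference.count(word)

    # Ratio of relative frequencies (specialized vs. non-academic),
    # with add-one smoothing so unseen words do not divide by zero.
    spec_rel = (spec_count + 1) / (len(spec_tokens) + 1)
    ref_rel = (ref_count + 1) / (len(reference) + 1)
    ratio = spec_rel / ref_rel

    # Number of disciplines in which the word occurs at least once.
    n_disciplines = sum(1 for toks in specialized.values() if word in toks)

    # Unevenness of distances between successive occurrences, measured
    # here as the coefficient of variation of the gaps (an assumption).
    positions = [i for i, t in enumerate(spec_tokens) if t == word]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    gap_cv = pstdev(gaps) / mean(gaps) if len(gaps) >= 2 else 0.0

    return {
        "spec_count": spec_count,
        "freq_ratio": ratio,
        "n_disciplines": n_disciplines,
        "gap_cv": gap_cv,
    }


# Toy data, invented purely for illustration.
specialized = {
    "chemistry": "the enzyme binds the substrate and the enzyme folds".split(),
    "history": "the king ruled the land".split(),
}
reference = "the king and the dog saw the land".split()

enzyme = term_features("enzyme", specialized, reference)
the = term_features("the", specialized, reference)
print(enzyme)
print(the)
```

In this toy setting, "enzyme" is rarer than "the", is confined to one discipline, and is relatively more frequent in the specialized texts, which is the pattern the abstract associates with terms. Feature vectors of this kind could then be passed to any standard classifier to separate terms from non-terms.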



DOI: https://doi.org/10.2478/lf-2021-0014 | Journal eISSN: 2544-6339


Language: English

Page range: 23 - 30

Submitted on: Jul 1, 2020

Accepted on: Feb 1, 2021

Published on: Oct 14, 2021

Published by: Palacký University Olomouc

In partnership with: Paradigm Publishing Services

Publication frequency: 3 issues per year

Keywords: term extraction, automatic term recognition

Related subjects: Linguistics and semiotics, Applied linguistics, Quantitative, computational, and corpus linguistics, Semiotics, Biosemiotics, Theoretical frameworks and disciplines, General linguistics

© 2021 Dominika Kováříková, published by Palacký University Olomouc
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.