An Automatically Generated Annotated Corpus for Albanian Named Entity Recognition

Klesti Hoxha; Artur Baxhaku

doi:10.2478/cait-2018-0009


Cybernetics and Information Technologies

Volume 18 (2018): Issue 1 (March 2018)

By: Klesti Hoxha and Artur Baxhaku

Open Access | Mar 2018

Abstract

Named Entity Recognition (NER) is an important task in many NLP pipelines. It has become especially important for the knowledge bases that power many of today's information retrieval systems. To cope with the high demand for annotated training corpora for supervised NER systems, automatic generation approaches have been proposed. In this paper we report on the first automatically generated NE-annotated corpus for Albanian. News articles from Albanian news media were used as the document source. They were automatically tagged using a custom gazetteer generated from the Albanian Wikipedia. Our evaluation results show that this corpus can be used as a baseline for human-annotated corpora or as a training corpus where no other is available.
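
The abstract does not specify how gazetteer entries are matched against article text, so the following is only a minimal sketch of gazetteer-based tagging in general, not the authors' procedure. It assumes exact, case-sensitive, greedy longest-match lookup over whitespace tokens and BIO output labels; the `GAZETTEER` entries, the label set, and the function name `tag_tokens` are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy gazetteer: token sequences mapped to entity types.
# A real gazetteer would be derived from a resource such as Wikipedia.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("Tirana",): "LOC",
    ("Gjirokastër",): "LOC",
    ("Ismail", "Kadare"): "PER",
}

# Longest entry length, used to bound the lookup window.
MAX_LEN = max(len(entry) for entry in GAZETTEER)


def tag_tokens(tokens: List[str]) -> List[Tuple[str, str]]:
    """Label tokens with BIO tags using greedy longest-match lookup."""
    tagged: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        match_len = 0
        label: Optional[str] = None
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in GAZETTEER:
                match_len, label = n, GAZETTEER[candidate]
                break
        if label is None:
            # No gazetteer entry starts at this token.
            tagged.append((tokens[i], "O"))
            i += 1
        else:
            tagged.append((tokens[i], "B-" + label))
            for tok in tokens[i + 1:i + match_len]:
                tagged.append((tok, "I-" + label))
            i += match_len
    return tagged
```

Calling `tag_tokens("Ismail Kadare lindi në Gjirokastër".split())` labels the first two tokens as one person entity and the last token as a location. Exact matching of this kind misses inflected forms, which matters for a morphologically rich language like Albanian; a practical system would likely need normalization or approximate matching on top of it.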



DOI: https://doi.org/10.2478/cait-2018-0009 | Journal eISSN: 1314-4081 | Journal ISSN: 1311-9702


Language: English

Page range: 95 - 108

Submitted on: Jun 30, 2017

Accepted on: Dec 1, 2017

Published on: Mar 30, 2018

Published by: Bulgarian Academy of Sciences, Institute of Information and Communication Technologies

In partnership with: Paradigm Publishing Services

Keywords: Named entity recognition, natural language processing, language corpora, semi-automatic annotation, information extraction

Related subjects: Computer sciences, Information technology

© 2018 Klesti Hoxha, Artur Baxhaku, published by Bulgarian Academy of Sciences, Institute of Information and Communication Technologies
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
