Unstructured documents processing through their anonymization

Peter Kvasnica

doi:10.2478/jee-2025-0028

Journal of Electrical Engineering

Volume 76 (2025): Issue 3 (June 2025)

By: Peter Kvasnica

Open Access

Jun 2025

Abstract

In Slovakia, the courts publish a large number of decisions on the MSSR (Ministry of Justice of the Slovak Republic) website. Every published decision must be anonymized, that is, stripped of personal and other sensitive data that could identify individual parties to the proceedings or reveal their private information, to the extent prescribed by the decree. Court decisions are currently anonymized according to an instruction in the Court Management application, which supports only basic anonymization functions. This approach is time-consuming and labor-intensive, and it remains susceptible to leaks of sensitive data. The text processing presented here is based on machine learning methods and uses the principles of user-friendly visualization to accelerate anonymization and minimize the risk of leaking sensitive data. To achieve and improve results in practice, named entities are recognized by combining text document processing, dictionaries, rules, and a statistical model trained on manually annotated data. The article describes the basic anonymization techniques used in practice, namely machine learning and information removal. It focuses on text document processing, on the tools and principles for creating a data set of annotated data, and on evaluating the success of anonymization. The iteratively improved learning model is the dominant component because it enables the use of both knowledge and annotated data.
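The dictionary and rule components of named entity recognition, together with the evaluation against manually annotated data, can be illustrated with a minimal Python sketch. It is not the implementation described in the article: the dictionary contents, the two regular-expression rules, the labels, and the function names `find_entities`, `anonymize`, and `precision_recall` are all assumptions made for illustration, and the trained statistical model is left out entirely.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # index of the first character of the entity
    end: int     # index one past the last character
    label: str   # entity type, e.g. "PERSON"


# Hypothetical dictionary of given names; a real system would load a full gazetteer.
NAME_DICTIONARY = {"Jana", "Peter", "Martin"}

# Hypothetical rules: a birth-number-like pattern and a d.m.yyyy date.
RULES = {
    "BIRTH_NUMBER": re.compile(r"\b\d{6}/\d{3,4}\b"),
    "DATE": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),
}

WORD = re.compile(r"\w+")


def find_entities(text: str) -> list[Span]:
    """Collect spans matched by the rules and by dictionary lookup."""
    spans = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.append(Span(match.start(), match.end(), label))
    for match in WORD.finditer(text):
        if match.group() in NAME_DICTIONARY:
            spans.append(Span(match.start(), match.end(), "PERSON"))
    return sorted(spans)


def anonymize(text: str, spans: list[Span]) -> str:
    """Replace each span with its label; assumes the spans do not overlap."""
    # Work from the end of the text so earlier offsets stay valid.
    for span in sorted(spans, reverse=True):
        text = text[:span.start] + f"[{span.label}]" + text[span.end:]
    return text


def precision_recall(predicted: list[Span], gold: list[Span]) -> tuple[float, float]:
    """Exact-match precision and recall of predicted spans against annotations."""
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


sentence = "Jana Nováková, born 12.3.1980, birth number 800312/1234."
found = find_entities(sentence)
print(anonymize(sentence, found))
# prints: [PERSON] Nováková, born [DATE], birth number [BIRTH_NUMBER].

# Manual annotation also marks the surname, which the dictionary misses.
annotated = found + [Span(5, 13, "PERSON")]
print(precision_recall(found, annotated))
# prints: (1.0, 0.75)
```

In this example the surname is left in the output because it is absent from the dictionary, so recall drops to 0.75 while precision stays at 1.0. Gaps of this kind are what a statistical model trained on annotated data, and iteratively retrained, would be expected to close.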

DOI: https://doi.org/10.2478/jee-2025-0028 | Journal eISSN: 1339-309X | Journal ISSN: 1335-3632

Language: English

Page range: 275 - 283

Submitted on: Apr 4, 2025

Published on: Jun 19, 2025

Published by: Slovak University of Technology in Bratislava

In partnership with: Paradigm Publishing Services

Keywords: data anonymization, named entity recognition, dataset information extraction, artificial intelligence, machine learning

Related subjects: Engineering, Introductions and overviews, Engineering, other

© 2025 Peter Kvasnica, published by Slovak University of Technology in Bratislava
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
