Abstract
In Slovakia, courts publish a large number of decisions on the MSSR website. Every published decision must be anonymized, that is, stripped of personal and other sensitive data that could identify individual parties to the proceedings or reveal their private information, to the extent prescribed by the decree. Anonymization of court decisions is currently carried out according to an instruction in the Court Management application, which supports only basic anonymization functions. This approach is time-consuming and labor-intensive, and it is also prone to leaking sensitive data. The text processing presented here is based on machine learning methods combined with user-friendly visualization, with the aim of accelerating anonymization and minimizing the risk of leaking sensitive data. To achieve and improve results in practice, we use named entity recognition based on text document processing, dictionaries, rules, and a statistical model trained on manually annotated data. The article describes the basic anonymization techniques used in practice, namely machine learning and information removal. We focus on text document processing, on tools and principles for creating an annotated data set, and on evaluating the success of anonymization. The main strength of the iteratively improving learning model is that it can exploit knowledge as well as annotated data.
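To illustrate how dictionaries and rules can contribute to entity recognition and information removal, the following is a minimal sketch and not the system described in this article. It covers only the dictionary and rule components; the trained statistical model is omitted. The names `NAME_DICTIONARY`, `RULES`, `find_entities`, and `anonymize`, the regular expressions, and the placeholder format are all assumptions made for this example.

```python
import re

# Hypothetical dictionary of known personal names (assumption for illustration).
NAME_DICTIONARY = {"Ján Novák", "Mária Kováčová"}

# Hypothetical rules mapping an entity label to a regular expression.
RULES = {
    "DATE": re.compile(r"\b\d{1,2}\.\s?\d{1,2}\.\s?\d{4}\b"),
    "BIRTH_NUMBER": re.compile(r"\b\d{6}/\d{3,4}\b"),
}


def find_entities(text):
    """Return non-overlapping (start, end, label) spans from dictionary and rules."""
    spans = []
    # Dictionary lookup: exact matches of known names.
    for name in NAME_DICTIONARY:
        for match in re.finditer(re.escape(name), text):
            spans.append((match.start(), match.end(), "PERSON"))
    # Rule-based lookup: pattern matches for structured identifiers.
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    # Sort by start position, preferring longer spans, then drop overlaps.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    result, last_end = [], 0
    for start, end, label in spans:
        if start >= last_end:
            result.append((start, end, label))
            last_end = end
    return result


def anonymize(text):
    """Replace every detected entity with a placeholder such as [PERSON]."""
    parts, position = [], 0
    for start, end, label in find_entities(text):
        parts.append(text[position:start])
        parts.append(f"[{label}]")
        position = end
    parts.append(text[position:])
    return "".join(parts)


if __name__ == "__main__":
    sample = "Ján Novák, born 12. 3. 1980, birth number 800312/1234."
    print(anonymize(sample))
```

In a fuller pipeline, spans predicted by a statistical model trained on annotated data would be merged with these dictionary and rule spans before replacement.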