Figures & Tables

Figure 1.

Counts of example words “Może” (maybe), “Stażysta” (intern) and “Lekarz” (doctor) in each category

Figure 2.

SOM visualization using GloVe embeddings

Figure 3.

SOM visualization of decision borders, built using NER model

Figure 4.

All keywords displayed on SOM

Figure 5.

Strong keywords displayed on SOM

Figure 6.

SOM obtained from RoBERTa embeddings

Figure 7.

Strongest keyword per query from RoBERTa model

Comparison of accuracy between the NER model and Polish RoBERTa in each category of document

| Category | RoBERTa [%] | NER [%] |
|---|---|---|
| Civil law | 68.88 | 72.00 |
| Administrative law | 31.40 | 55.73 |
| Pharmaceutical law | 67.80 | 74.36 |
| Labor law | 62.88 | 59.15 |
| Medical law | 65.56 | 77.05 |
| Criminal law | 55.77 | 62.22 |
| International law | 0.00 | 35.29 |
| Tax law | 12.77 | 77.27 |
| Constitutional law | 13.33 | 50.00 |
Constitutional law13.3350.00

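Number of words in each category with different thresholds of acceptance for how unique a keyword must be to a category

| Category | Share of words >50% | Share of words >90% | Share of words 100% |
|---|---|---|---|
| Civil law | 587 | 470 | 466 |
| Administrative law | 209 | 192 | 191 |
| Pharmaceutical law | 253 | 210 | 210 |
| Labor law | 358 | 264 | 263 |
| Medical law | 834 | 599 | 595 |
| Criminal law | 105 | 90 | 89 |
| International law | 14 | 12 | 12 |
| Tax law | 42 | 34 | 34 |
| Constitutional law | 2 | 2 | 2 |
| Total | 2,404 | 1,873 | 1,862 |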
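The thresholds in the table above can be read as a cut-off on the share of a word's occurrences that fall into a single category. The following is a minimal sketch of that reading, not the authors' implementation: the function name `count_keywords_by_threshold`, the input layout, and the toy counts are all hypothetical, and it assumes strict comparison for the 50% and 90% thresholds and an exact match for 100%.

```python
from collections import defaultdict


def count_keywords_by_threshold(word_counts, thresholds=(0.5, 0.9, 1.0)):
    """Count, per category, the words whose dominant-category share passes each threshold.

    word_counts maps word -> {category: occurrence count} (hypothetical layout).
    """
    result = defaultdict(lambda: {t: 0 for t in thresholds})
    for word, per_category in word_counts.items():
        total = sum(per_category.values())
        if total == 0:
            continue  # word never occurs, nothing to assign
        # Dominant category and the share of occurrences it accounts for.
        category, top = max(per_category.items(), key=lambda kv: kv[1])
        share = top / total
        for t in thresholds:
            # Assumption: ">50%" and ">90%" are strict, "100%" means exclusive use.
            passes = share >= 1.0 if t >= 1.0 else share > t
            if passes:
                result[category][t] += 1
    return {category: dict(counts) for category, counts in result.items()}


# Toy counts, invented for illustration only (not taken from the paper).
toy_counts = {
    "lekarz": {"Medical law": 19, "Labor law": 1},
    "stażysta": {"Labor law": 6, "Medical law": 4},
    "może": {"Civil law": 3, "Medical law": 3, "Labor law": 4},
    "recepta": {"Pharmaceutical law": 5},
}

keyword_counts = count_keywords_by_threshold(toy_counts)
for category, counts in keyword_counts.items():
    print(category, counts)
```

Under this reading, a common word such as “może” that is spread evenly across categories never becomes a keyword, while a word confined to one category is counted at every threshold.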

Coherence (C) and descriptiveness (D) of classes using RoBERTa embeddings

| Category | | C | D |
|---|---|---|---|
| Civil law | 231 | 0.6 | 0.055 |
| Administrative law | 151 | 0.53 | 0.08 |
| Pharmaceutical law | 298 | 0.68 | 0.048 |
| Labor law | 583 | 0.6 | 0.023 |
| Medical law | 994 | 0.59 | 0.027 |
| Criminal law | 20 | 0.53 | 0.235 |
| International law | 137 | 0.61 | 0.076 |
| Tax law | 25 | 0.45 | 0.195 |
| Constitutional law | 280 | 0.64 | 0.056 |

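Number of words in each category during each step of data preparation

| Category | Before | Step 1 | Step 2 | Result |
|---|---|---|---|---|
| Civil law | 10787 | 2764 | 1023 | 470 |
| Administrative law | 5182 | 1786 | 562 | 192 |
| Pharmaceutical law | 4742 | 1489 | 558 | 210 |
| Labor law | 9707 | 2081 | 828 | 264 |
| Medical law | 17503 | 3447 | 1407 | 599 |
| Criminal law | 4579 | 1347 | 418 | 90 |
| International law | 374 | 199 | 40 | 12 |
| Tax law | 1197 | 468 | 145 | 34 |
| Constitutional law | 73 | 40 | 6 | 2 |
| Total | 54,144 | 13,621 | 4,987 | 1,873 |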
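To see how strongly each preparation step shrinks the vocabulary, the totals in the table above can be expressed as a share of the starting word count. The short sketch below does only that arithmetic; the variable and function names are illustrative, and the figures are the "Total" row copied from the table.

```python
# "Total" row of the data-preparation table, in processing order.
totals = {"Before": 54144, "Step 1": 13621, "Step 2": 4987, "Result": 1873}


def retention(stage_totals):
    """Return each stage's word count as a fraction of the first stage."""
    base = next(iter(stage_totals.values()))
    return {stage: count / base for stage, count in stage_totals.items()}


shares = retention(totals)
for stage, share in shares.items():
    print(f"{stage}: {share:.1%} of the original words")
```

By this calculation roughly a quarter of the words survive Step 1, and about 3.5% remain in the final keyword set.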

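Number, coherence and descriptiveness of documents before and after using exclusively strong keywords

| Category | Before changes: number of documents | Before changes: coherence score | Before changes: descriptiveness score | After changes: number of documents | After changes: coherence score | After changes: descriptiveness score |
|---|---|---|---|---|---|---|
| Civil law | 863 | 0.51 | 0.031 | 223 | 0.74 | 0.041 |
| Administrative law | 164 | 0.85 | 0.054 | 84 | 0.85 | 0.062 |
| Pharmaceutical law | 212 | 0.79 | 0.043 | 101 | 0.8 | 0.064 |
| Labor law | 327 | 0.79 | 0.043 | 154 | 0.84 | 0.055 |
| Medical law | 846 | 0.68 | 0.023 | 377 | 0.71 | 0.034 |
| Criminal law | 192 | 0.82 | 0.288 | 78 | 0.83 | 0.325 |
| International law | 7 | 0.94 | 0.09 | 3 | 0.95 | 0.123 |
| Tax law | 61 | 0.9 | 0.577 | 34 | 0.89 | 0.550 |
| Constitutional law | 2 | 1 | 0.038 | 2 | 1.00 | 0.066 |
| Sum | 4,402 | | | 1,056 | | |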
DOI: https://doi.org/10.14313/jamris-2025-004 | Journal eISSN: 2080-2145 | Journal ISSN: 1897-8649
Language: English
Page range: 33 - 41
Submitted on: Apr 27, 2024
Accepted on: Nov 1, 2024
Published on: Mar 31, 2025
Published by: Łukasiewicz Research Network – Industrial Research Institute for Automation and Measurements PIAP
In partnership with: Paradigm Publishing Services
Publication frequency: 4 times per year

© 2025 Paulina Puchalska, Kacper Krzemiński, Maksymilian Lis, Rafał Scherer, Paweł Drozda, Kajetan Komar-Komarowski, Konrad Szałapak, Andrzej Sobecki, Tomasz Zymkowski, Julian Szymański, published by Łukasiewicz Research Network – Industrial Research Institute for Automation and Measurements PIAP
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.