Figure 1.

Figure 2.

Figure 3.

Figure 4.

Figure 5.

Figure 6.

Figure 7.

Comparison of accuracy between Polish RoBERTa and the NER model for each document category

| Category | RoBERTa [%] | NER [%] |
|---|---|---|
| Civil law | 68.88 | 72.00 |
| Administrative law | 31.40 | 55.73 |
| Pharmaceutical law | 67.80 | 74.36 |
| Labor law | 62.88 | 59.15 |
| Medical law | 65.56 | 77.05 |
| Criminal law | 55.77 | 62.22 |
| International law | 0.00 | 35.29 |
| Tax law | 12.77 | 77.27 |
| Constitutional law | 13.33 | 50.00 |

Number of words in each category under different acceptance thresholds for how unique a keyword must be to a category

| Category | Share of words >50% | Share of words >90% | Share of words 100% |
|---|---|---|---|
| Civil law | 587 | 470 | 466 |
| Administrative law | 209 | 192 | 191 |
| Pharmaceutical law | 253 | 210 | 210 |
| Labor law | 358 | 264 | 263 |
| Medical law | 834 | 599 | 595 |
| Criminal law | 105 | 90 | 89 |
| International law | 14 | 12 | 12 |
| Tax law | 42 | 34 | 34 |
| Constitutional law | 2 | 2 | 2 |
| Total | 2,404 | 1,873 | 1,862 |

Coherence (C) and descriptiveness (D) of classes using RoBERTa embeddings

| Category | | C | D |
|---|---|---|---|
| Civil law | 231 | 0.6 | 0.055 |
| Administrative law | 151 | 0.53 | 0.08 |
| Pharmaceutical law | 298 | 0.68 | 0.048 |
| Labor law | 583 | 0.6 | 0.023 |
| Medical law | 994 | 0.59 | 0.027 |
| Criminal law | 20 | 0.53 | 0.235 |
| International law | 137 | 0.61 | 0.076 |
| Tax law | 25 | 0.45 | 0.195 |
| Constitutional law | 280 | 0.64 | 0.056 |

Number of words in each category during each step of data preparation

| Category | Before | Step 1 | Step 2 | Result |
|---|---|---|---|---|
| Civil law | 10787 | 2764 | 1023 | 470 |
| Administrative law | 5182 | 1786 | 562 | 192 |
| Pharmaceutical law | 4742 | 1489 | 558 | 210 |
| Labor law | 9707 | 2081 | 828 | 264 |
| Medical law | 17503 | 3447 | 1407 | 599 |
| Criminal law | 4579 | 1347 | 418 | 90 |
| International law | 374 | 199 | 40 | 12 |
| Tax law | 1197 | 468 | 145 | 34 |
| Constitutional law | 73 | 40 | 6 | 2 |
| Total | 54,144 | 13,621 | 4,987 | 1,873 |

Number, coherence, and descriptiveness of documents before and after using exclusively strong keywords

| Category | Before: number of documents | Before: coherence score | Before: descriptiveness score | After: number of documents | After: coherence score | After: descriptiveness score |
|---|---|---|---|---|---|---|
| Civil law | 863 | 0.51 | 0.031 | 223 | 0.74 | 0.041 |
| Administrative law | 164 | 0.85 | 0.054 | 84 | 0.85 | 0.062 |
| Pharmaceutical law | 212 | 0.79 | 0.043 | 101 | 0.8 | 0.064 |
| Labor law | 327 | 0.79 | 0.043 | 154 | 0.84 | 0.055 |
| Medical law | 846 | 0.68 | 0.023 | 377 | 0.71 | 0.034 |
| Criminal law | 192 | 0.82 | 0.288 | 78 | 0.83 | 0.325 |
| International law | 7 | 0.94 | 0.09 | 3 | 0.95 | 0.123 |
| Tax law | 61 | 0.9 | 0.577 | 34 | 0.89 | 0.550 |
| Constitutional law | 2 | 1 | 0.038 | 2 | 1.00 | 0.066 |
| Sum | 4,402 | | | 1,056 | | |
