
Figure 1
Example of overlapping PDF pages.
Table 1
Missing Issues by year.
| YEAR | MISSING ISSUES |
|---|---|
| 1938 | 151, 190, 221, 228, 255 |
| 1939 | 37, 50, 66, 114, 118, 119, 127, 129, 206, 226, 241, 271, 284, 296 |
| 1940 | 59, 65, 88, 111, 204, 282, 294 |
Table 2
Basic statistics related to Issues.
| YEAR | PAGES (NUMBER) | ISSUES (NUMBER) |
|---|---|---|
| 1938 | 1185 | 296 |
| 1939 | 1148 (2288 PDF half pages + 4 full pages) | 288 |
| 1940 | 1188 | 296 |
| total | 3521 | 880 |
Table 3
OCR word errors.
| ISSUE (NUMBER/YEAR) | NUMBER OF TOKENS (ORIGINAL TEXT) | NUMBER OF TOKENS (OCRED FILE) | COMMON TOKENS (NUMBER) | MISSPELLED OR MISSING TOKENS |
|---|---|---|---|---|
| 200/1938 (first page) | 4223 | 4217 | 3795 | 428 (10%) |
| 100/1939 (first half page) | 1864 | 1602 | 1352 | 512 (27%) |
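
The last column of Table 3 can be reproduced from the other columns. The sketch below assumes that "misspelled or missing tokens" is the number of original tokens minus the number of common tokens, and that the percentage is taken relative to the original text; the table does not state this, but the reported figures match that reading. The function name `ocr_token_loss` is hypothetical.

```python
def ocr_token_loss(original_tokens: int, common_tokens: int) -> tuple[int, float]:
    """Return (lost token count, lost share of the original tokens).

    Assumption: a token is "misspelled or missing" when it occurs in the
    original text but has no match among the tokens of the OCRed file.
    """
    lost = original_tokens - common_tokens
    return lost, lost / original_tokens


# (original tokens, common tokens) as reported in Table 3.
samples = {
    "200/1938 (first page)": (4223, 3795),
    "100/1939 (first half page)": (1864, 1352),
}

for label, (original, common) in samples.items():
    lost, share = ocr_token_loss(original, common)
    print(f"{label}: {lost} tokens ({share:.0%})")
```

Under this reading, the half-page scan from 1939 loses roughly one token in four, against one in ten for the full-page scan from 1938.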
Table 4
Preliminary results (lemmatization).
| LEMMATIZATION OUTCOME | ON CORRECT WORD FORM (NUMBER) | ON CORRECT WORD FORM (%) | ON INCORRECT WORD FORM (NUMBER) | ON INCORRECT WORD FORM (%) |
|---|---|---|---|---|
| Correct | 604 | 78.9% | 24 | 24.5% |
| Correct except for capitalization | 43 | 5.6% | | |
| Incorrect | 119 | 15.5% | 74 | 75.5% |
| Total | 766 | 100% | 98 | 100% |
Table 5
Preliminary results (POS tagging).
| POS TAGGING RESULT | ON CORRECT WORD FORM (NUMBER) | ON CORRECT WORD FORM (%) | ON INCORRECT WORD FORM (NUMBER) | ON INCORRECT WORD FORM (%) |
|---|---|---|---|---|
| Correct | 644 | 84.1% | 35 | 35.7% |
| Incorrect | 122 | 15.9% | 63 | 64.3% |
| Total | 766 | 100% | 98 | 100% |

Figure 2
The database schema (visualization generated by Arrows, https://arrows.app).
Table 6
Database summary.
| VERTICES (NODES) | NUMBER | EDGES (RELATIONSHIPS) | NUMBER |
|---|---|---|---|
| tokens | 14,297,480 | IS_IN_DOC | 14,297,480 |
| sentences | 1,404,085 | IS_IN_SENT | 14,297,480 |
| documents | 880 | IS_NEXT | 14,296,600 |
| | | IS_DEP | 12,893,395 |
| | | IS_ROOT | 1,404,085 |
| total | 15,702,445 | total | 57,189,040 |
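
The counts in Table 6 can be cross-checked against each other. The sketch below is only a consistency check on the reported numbers; the structural readings it tests (one IS_IN_DOC and one IS_IN_SENT edge per token, IS_NEXT linking consecutive tokens within a document, one IS_ROOT edge per sentence, and every token carrying either an IS_DEP or an IS_ROOT edge) are assumptions inferred from the figures and the relationship names, not statements taken from the schema. The names `NODES`, `EDGES`, and `check_counts` are hypothetical.

```python
# Node and edge counts as reported in Table 6.
NODES = {"tokens": 14_297_480, "sentences": 1_404_085, "documents": 880}
EDGES = {
    "IS_IN_DOC": 14_297_480,
    "IS_IN_SENT": 14_297_480,
    "IS_NEXT": 14_296_600,
    "IS_DEP": 12_893_395,
    "IS_ROOT": 1_404_085,
}


def check_counts(nodes: dict[str, int], edges: dict[str, int]) -> dict[str, bool]:
    """Test the reported totals and the assumed structural relations."""
    tokens = nodes["tokens"]
    return {
        "node total": sum(nodes.values()) == 15_702_445,
        "edge total": sum(edges.values()) == 57_189_040,
        # Assumed: each token belongs to exactly one document and one sentence.
        "one document per token": edges["IS_IN_DOC"] == tokens,
        "one sentence per token": edges["IS_IN_SENT"] == tokens,
        # Assumed: IS_NEXT chains tokens within a document, so each document
        # has one edge fewer than it has tokens.
        "next-token chain": edges["IS_NEXT"] == tokens - nodes["documents"],
        # Assumed: one root per sentence; every other token is a dependent.
        "one root per sentence": edges["IS_ROOT"] == nodes["sentences"],
        "dependents plus roots": edges["IS_DEP"] + edges["IS_ROOT"] == tokens,
    }


for name, ok in check_counts(NODES, EDGES).items():
    print(f"{name}: {'ok' if ok else 'MISMATCH'}")
```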

Figure 3
The monthly frequency of the selected collocates of германскі (visualized with Plotly).

Figure 4
The co-occurrence schema of германскі (visualized with Plotly).

Figure 5
The monthly frequency of германскі (visualized with Plotly).
