Creating a Historical Migration Dataset from Finnish Church Records, 1800–1920

Open Access | Aug 2025

Figures & Tables

Table 1

Layout statistics for the main categories. Other refers to all remaining categories, primarily including empty images or images that do not contain migration records (due to incorrect metadata or mixed data within a book).

| LAYOUT TYPE | IMAGES | % OF DATA |
|---|---|---|
| handdrawn | 94,477 | 47.07% |
| preprinted | 79,243 | 39.48% |
| half-table | 14,162 | 7.06% |
| free text | 9,457 | 4.71% |
| other | 3,395 | 1.69% |
Figure 1

Cumulative counts for different preprinted layout types.

Figure 2

Example of a handdrawn moving table (Huittinen 1878). FFHA’s digital archive.

Figure 3

Example of a preprinted moving table (Heinävesi 1909). FFHA’s digital archive.

Table 2

Typical elements of migration tables.

| DATA | DESCRIPTION |
|---|---|
| Reference number | An identifier for the record, which may represent a page reference, an order number within a specific year, or other context-dependent information. |
| Date | Date of the recording, not necessarily the actual moving date. |
| Occupation and name | Name and occupation of the person or, for a family, of its main person. |
| Number of persons | Number of people moving, with females and males counted separately. |
| Where to/Where from | Name of the new or old parish, depending on whether the record is a moving-out or a moving-in entry. |
| Reference to communion book | Reference to the page in the communion book where other details about the person are recorded. |
| Notes | Other related markings. |
Figure 4

Details of a typical moving table entry from Hankasalmi: Maria Sirkka, a servant (piika), moved to Rautalampi on January 9th. She is female (naisenpuoli), born on March 25, 1857, in Rautalampi. Her marital status is single, and her occupation is servant (palvelus). Additional information can be found in the communion book on page 296. No further remarks are recorded.
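
As an illustration of how such an entry could be held in structured form, the sketch below stores the Hankasalmi example in a Python dataclass. The class name, field names, and the `direction` flag are hypothetical and are not taken from the published dataset; the year is left empty because the caption does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MigrationRecord:
    """One row of a moving table. Field names are illustrative only."""
    name: str
    occupation: str
    direction: str                      # "out" = moving-out, "in" = moving-in
    parish: str                         # destination (out) or origin (in)
    day: int
    month: int
    year: Optional[int] = None          # not given in the Figure 4 caption
    sex: Optional[str] = None
    birth_date: Optional[str] = None    # ISO 8601 string
    birth_place: Optional[str] = None
    marital_status: Optional[str] = None
    communion_book_page: Optional[int] = None
    notes: str = ""                     # empty when no remarks are recorded


# The entry described in the caption, moving out of Hankasalmi.
record = MigrationRecord(
    name="Maria Sirkka",
    occupation="piika",
    direction="out",
    parish="Rautalampi",
    day=9,
    month=1,
    sex="female",
    birth_date="1857-03-25",
    birth_place="Rautalampi",
    marital_status="single",
    communion_book_page=296,
)
print(record)
```

Keeping the optional fields as `None` rather than empty strings makes it easy to tell "not recorded" apart from "recorded as blank" when rows from different table layouts are merged.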

Table 3

Summary of manually annotated data for different stages of the pipeline, divided into training, development, and test sets. Image and cell counts are shown separately.

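| ANNOTATION TYPE | TRAIN IMAGES | TRAIN CELLS | DEV IMAGES | DEV CELLS | TEST IMAGES | TEST CELLS | TOTAL IMAGES | TOTAL CELLS |
|---|---|---|---|---|---|---|---|---|
| De-skew key points | 900 | | 190 | | 200 | | 1,290 | |
| Table structure | 1,252 | | 188 | | 192 | | 1,632 | |
| Cell type | 230 | 47,000 | 47 | 14,000 | 46 | 16,000 | 323 | 77,000 |
| Text recognition | 41 | 1,947 | | | 39 | 2,277 | 80 | 4,224 |
| Year recognition | 1,026 | | 188 | | 192 | | 1,326 | |

Note: the source lists only two splits for Text recognition and does not indicate whether the second one (39 images, 2,277 cells) is the development or the test set; it is placed under TEST here as an assumption.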
Figure 5

Text recognition for tabular data.

Figure 6

Extreme example of page skew (left) and the output of the de-skew process (right). Red circles mark stage-I corner recognition, green dots mark stage-II corner recognition.

Figure 7

De-skew process. Two pages of the opening with the relevant six keypoints A-F and the image frame (dashed line).

Figure 8

Example of how clustering improves results. In some cases, the table cell detection model fails to detect all cells in a table (black circles on the left-hand side). By applying a clustering method, these gaps can be filled (black circles on the right-hand side).
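
The gap-filling idea in this caption can be sketched as follows. The caption does not say which clustering method is used, so this example substitutes a simple one-dimensional gap clustering of cell centres; the function names, the tolerance `tol`, and the toy coordinates are all assumptions made for illustration.

```python
def cluster_1d(values, tol):
    """Group sorted values whose neighbours are at most `tol` apart; return group means."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def find_missing_cells(centres, tol=10.0):
    """Return (column_x, row_y) grid positions with no detected cell centre nearby."""
    cols = cluster_1d([x for x, _ in centres], tol)
    rows = cluster_1d([y for _, y in centres], tol)
    missing = []
    for row_y in rows:
        for col_x in cols:
            hit = any(abs(x - col_x) <= tol and abs(y - row_y) <= tol
                      for x, y in centres)
            if not hit:
                missing.append((col_x, row_y))
    return missing


# Toy 3x3 table in which the detector missed the centre cell.
detected = [(100, 50), (201, 51), (300, 49),
            (101, 100),            (299, 101),
            (99, 150), (200, 149), (301, 150)]
print(find_missing_cells(detected))
```

Clustering the x and y coordinates independently recovers the column and row positions from the cells that were detected, and every grid intersection without a nearby detection becomes a candidate for a filled-in cell.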

Figure 9

Example of an opening with several year mentions outside of the header area.

Table 4

Skew angle of the left, middle, and right borders, in degrees of deviation from vertical, reported on the test set. The angles in the original image (Base) are calculated from the manual annotations of the test set images; Stage I and Stage II are the two stages of the de-skew algorithm.

| | LEFT | MIDDLE | RIGHT |
|---|---|---|---|
| Base | 0.33° ± 0.78 | −0.06° ± 0.53 | −0.28° ± 0.77 |
| Stage I | 0.17° ± 0.62 | −0.15° ± 0.43 | −0.27° ± 0.64 |
| Stage II | 0.08° ± 0.59 | 0.06° ± 0.82 | −0.005° ± 0.69 |
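
For readers who want to reproduce this kind of measurement, the following minimal sketch computes a border's signed deviation from vertical from two points on it and summarises a set of angles as mean ± standard deviation. The sign convention, the use of the sample standard deviation, and the toy coordinates are assumptions; the source does not specify them.

```python
import math
from statistics import mean, stdev


def skew_from_vertical(top, bottom):
    """Signed angle in degrees between the line top->bottom and the vertical axis.

    Points are (x, y) in image coordinates, with y growing downwards.
    A perfectly vertical border gives 0; the sign convention is an assumption.
    """
    dx = bottom[0] - top[0]
    dy = bottom[1] - top[1]
    return math.degrees(math.atan2(dx, dy))


def summarise(angles):
    """Return (mean, sample standard deviation) of a list of angles."""
    return mean(angles), stdev(angles)


# Toy left-border endpoints for three pages (hypothetical pixel coordinates).
borders = [((100, 0), (106, 1000)), ((98, 0), (95, 1000)), ((102, 0), (110, 1000))]
angles = [skew_from_vertical(top, bottom) for top, bottom in borders]
avg, sd = summarise(angles)
print(f"{avg:.2f}° ± {sd:.2f}")
```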
Table 5

Table detection.

| TABLE TYPE | ACCURACY | RECALL | PRECISION | F1-SCORE |
|---|---|---|---|---|
| Preprinted | 93.2 | 93.2 | 100.0 | 96.5 |
| Handdrawn | 95.4 | 95.4 | 100.0 | 97.6 |
| All | 94.2 | 94.2 | 100.0 | 97.0 |
Table 6

Row detection.

| TABLE TYPE | ACCURACY | RECALL | PRECISION | F1-SCORE |
|---|---|---|---|---|
| Preprinted | 95.1 | 96.4 | 98.7 | 97.5 |
| Handdrawn | 87.9 | 93.7 | 93.4 | 93.6 |
| All | 91.4 | 95.1 | 96.0 | 95.5 |
Table 7

Column detection.

| TABLE TYPE | ACCURACY | RECALL | PRECISION | F1-SCORE |
|---|---|---|---|---|
| Preprinted | 96.1 | 99.1 | 96.9 | 98.0 |
| Handdrawn | 92.4 | 98.3 | 93.9 | 96.1 |
| All | 94.4 | 98.7 | 95.6 | 97.1 |
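
The relationship between the columns of Tables 5–7 can be checked with a short calculation. F1 is the harmonic mean of precision and recall. The captions do not define accuracy for the detection tasks; one reading that is numerically consistent with the reported values, to within about 0.1 point of input rounding, is TP / (TP + FP + FN). That reading is an assumption, and both function names below are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in and out)."""
    return 2 * precision * recall / (precision + recall)


def overlap_accuracy(precision, recall):
    """TP / (TP + FP + FN), rewritten in terms of precision and recall as fractions.

    Uses 1/P = (TP + FP) / TP and 1/R = (TP + FN) / TP.
    """
    return 1 / (1 / precision + 1 / recall - 1)


# (precision, recall) in percent, copied from the "All" rows of Tables 5-7.
reported = {"table": (100.0, 94.2), "row": (96.0, 95.1), "column": (95.6, 98.7)}
for name, (p, r) in reported.items():
    f1 = f1_score(p, r)
    acc = 100 * overlap_accuracy(p / 100, r / 100)
    print(f"{name}: F1 = {f1:.1f}, accuracy under the assumed definition = {acc:.1f}")
```

Under this reading, accuracy equals recall whenever precision is 100%, which matches the pattern seen in the table detection results.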
Table 8

Cell type classification performance, with Precision, Recall, and F1-score reported separately for each class.

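| CELL TYPE | PRECISION | RECALL | F1-SCORE | SUPPORT |
|---|---|---|---|---|
| single-line | 96.3 | 87.3 | 91.6 | 9829 |
| empty | 81.2 | 96.7 | 88.3 | 3692 |
| repetition | 79.4 | 87.1 | 83.1 | 2020 |
| multi-line | 67.9 | 69.6 | 68.7 | 744 |
| accuracy | | | 88.6 | 16285 |
| macro avg | 81.2 | 85.2 | 82.9 | 16285 |
| weighted avg | 89.5 | 88.6 | 88.8 | 16285 |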
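
The two averaged rows can be recomputed from the per-class rows, which also shows how they differ: the macro average treats every class equally, while the weighted average weights each class by its support. The sketch below uses only the rounded values printed in Table 8, so the results can differ from the reported ones in the last digit; the helper names are hypothetical.

```python
# (precision, recall, F1, support) per class, copied from Table 8.
per_class = {
    "single-line": (96.3, 87.3, 91.6, 9829),
    "empty": (81.2, 96.7, 88.3, 3692),
    "repetition": (79.4, 87.1, 83.1, 2020),
    "multi-line": (67.9, 69.6, 68.7, 744),
}


def macro_average(rows, idx):
    """Unweighted mean of one metric over all classes."""
    return sum(r[idx] for r in rows) / len(rows)


def weighted_average(rows, idx):
    """Mean of one metric over all classes, weighted by class support."""
    total = sum(r[3] for r in rows)
    return sum(r[idx] * r[3] for r in rows) / total


rows = list(per_class.values())
for label, idx in (("precision", 0), ("recall", 1), ("F1", 2)):
    print(f"{label}: macro = {macro_average(rows, idx):.2f}, "
          f"weighted = {weighted_average(rows, idx):.2f}")
```

Because the small multi-line class has the lowest scores, it pulls the macro average down more than the weighted one.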
Table 9

Comparison of text recognition evaluation for numeric and textual lines.

| | EM | CER | AVG. LENGTH | SUPPORT |
|---|---|---|---|---|
| textual | 28.2% | 0.19 | 12.2 chars | 897 |
| numeric | 65.8% | 0.18 | 3.2 chars | 1,232 |
| All | 49.9% | 0.19 | 7.0 chars | 2,129 |
Table 10

Precision, Recall, and F1-score of per-page year mention extraction.

| YEAR EXTRACTION METHOD | PRECISION | RECALL | F1-SCORE |
|---|---|---|---|
| with LLM correction | 91.6 | 83.1 | 87.2 |
| without LLM correction | 89.2 | 80.0 | 84.4 |
Table 11

Proportion of extracted parish names in the Elimäki books for which a known parish name can be found at edit distance of at most d.

| d = 0 | d ≤ 1 | d ≤ 2 | d ≤ 3 | d ≤ 4 |
|---|---|---|---|---|
| 8% | 23% | 41% | 60% | 72% |
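
A minimal sketch of how such proportions could be computed is shown below, assuming the standard Levenshtein distance and case-insensitive comparison; neither is confirmed by the caption. The gazetteer and the "extracted" names are made-up toy data, and `share_within` is a hypothetical helper.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance; insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def share_within(extracted, known, max_d):
    """Fraction of extracted names whose nearest known name is at distance <= max_d."""
    hits = sum(
        1 for name in extracted
        if min(levenshtein(name.lower(), k.lower()) for k in known) <= max_d
    )
    return hits / len(extracted)


# Toy gazetteer and made-up recognition output, for illustration only.
known = ["Elimäki", "Rautalampi", "Hankasalmi", "Huittinen", "Heinävesi"]
extracted = ["Elimäki", "Rautalarnpi", "Hankasalmi", "Huitinen", "Hxxxx"]
for d in range(5):
    print(f"d <= {d}: {share_within(extracted, known, d):.0%}")
```

Raising the threshold recovers more names but also increases the risk of matching the wrong parish, so in practice a cutoff has to balance coverage against false matches.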
Figure 10

Histograms of departures from and arrivals to Elimäki between 1875 and 1922.

Figure 11

Maps showing the origins and destinations of migration to and from Elimäki between 1875 and 1922.

DOI: https://doi.org/10.5334/johd.345 | Journal eISSN: 2059-481X
Language: English
Submitted on: Jun 6, 2025 | Accepted on: Jul 21, 2025 | Published on: Aug 29, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Ari Vesalainen, Jenna Kanerva, Aïda Nitsch, Kiia Korsu, Ilari Larkiola, Laura Ruotsalainen, Filip Ginter, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.