The I.Sicily Sketch Engine Corpus

Victoria Beatrix Fendel

doi:10.5334/johd.258


(1) Overview

Repository location

DOI 10.5281/zenodo.13960482

Context

The dataset was prepared in the context of the Crossreads project, which collects, re-examines (through autopsy), re-edits (as EpiDoc .xml files), and studies (petrographically, palaeographically, and linguistically) the epigraphic material from Sicily from the archaic through to the later imperial period. The dataset forms the basis of Fendel (forthcoming), Crossing thresholds: The lexicalisation and performance of memory in early imperial funerary inscriptions from Sicily, Lexis. The dataset was also used for the blogpost Fendel (2024a).

(2) Method

Steps

The source data are the .xml files (EpiDoc schematron) of the I.Sicily database (cf. Prag, 2021; Prag & Chartrand, 2019). The source data were sampled as outlined in the sampling strategy. The resulting subset of 723 inscriptions was split into those that are predominantly Latin and those that are predominantly Greek. These are the two sub-samples of the dataset. The Latin corpus consists of 371 inscriptions (3,577 tokens) and the Greek corpus of 352 inscriptions (2,763 tokens). Five inscriptions are in fact bilingual (from Catania and Syracuse). Bilingual inscriptions are assigned to the Greek or Latin portion of the corpus according to whichever language accounts for more of the text. Table 1 provides a numerical overview:

Table 1

Overview of funerary and honorific inscriptions from Roman-period Sicily by location, text type, and language.

PROVENANCE	NUMBER OF INSCRIPTIONS	INSCRIPTIONS BY TEXT TYPE	INSCRIPTIONS BY LANGUAGE
Termini	190	180 funerary (95%); 10 honorific (5%)	31 Greek (16%); 159 Latin (84%)
Syracuse	282	269 funerary (95%); 13 honorific (5%)	225 Greek (80%); 56 Latin (20%); 1 bilingual (<1%)
Catania	251	230 funerary (92%); 21 honorific (8%)	94 Greek (37%); 152 Latin (61%); 4 bilingual and 1 Latin and Hebrew (2%)
Total	723
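The rule for assigning bilingual inscriptions to one of the two sub-samples can be illustrated with a minimal sketch. It assumes that "more text" is measured in tokens and that each token carries a language label; both are assumptions made for illustration, as are the function name and the labels `grc` and `la`. The sketch does not address ties, which the description above does not cover.

```python
from collections import Counter


def assign_subcorpus(token_languages):
    """Return the language label that accounts for the most tokens.

    token_languages is a hypothetical input: one language label per token.
    Measuring "more text" by token count is an assumption of this sketch.
    """
    counts = Counter(token_languages)
    # most_common(1) returns a one-element list [(label, count)]
    language, _count = counts.most_common(1)[0]
    return language


# Invented example: two Latin tokens followed by three Greek tokens
example = ["la", "la", "grc", "grc", "grc"]
print(assign_subcorpus(example))
```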

The .xml (EpiDoc) files were tokenized by means of the PyEpiDoc tokenizer (Crellin, 2024). The tokenizer removes all Leiden diacritics and uses whitespace as the indicator of token boundaries. Manual correction ensured that each token is what we traditionally consider a word. Seeming univerbates for which no external evidence exists (see further Fendel, 2024a) and prosodic units (see Crellin, 2022) were separated into tokens. Supplied material was tokenized along with attested material. Files were subsequently lemmatized and part-of-speech tagged by means of the PROIEL models (ancient_greek-proiel-ud-2.12-230717 and latin-proiel-ud-2.12-230717) via UDPipe, using toConllu.py (Fendel, 2024b). The .conllu output files were manually corrected for lemmata and part-of-speech tags in a text editor.
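For readers who prefer to inspect the .conllu files programmatically rather than in a text editor, the following minimal sketch reads the ten tab-separated columns defined by the CoNLL-U format. It is not part of the published pipeline; the function name and the sample row (including the lemma given for bixit) are invented for illustration and are not copied from the dataset.

```python
# The ten columns of the CoNLL-U format, in order
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here
            continue
        current.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


# Invented sample row, for illustration only
sample = "# sent_id = 1\n1\tbixit\tvivo\tVERB\t_\t_\t_\t_\t_\t_\n\n"
for token in parse_conllu(sample)[0]:
    print(token["form"], token["lemma"], token["upos"])
```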

The corpus is lemmatized in Attic-Ionic Greek and classical Latin even if wordforms show alternative spellings for diachronic, diatopic, or diastratic reasons (e.g. ISic001649 bixit for vixit). Abbreviations are resolved. Calendar-related terms that were code-switched from Latin into Greek (i.e. names of months, nonae, kalendae) are lemmatized in Latin in the Greek inscriptions. Numerals written as letters are part-of-speech tagged as ADV and those written as words as NUM in all contexts (for the README file, see Fendel, 2024b). Multi-word expressions are tagged as separate items in line with the PROIEL model (yet see Fendel, 2024a). The corrected .conllu files were converted into the .vert format required by Sketch Engine (for toVert.py see Fendel, 2024b). The dataset contains the concatenated .vert and .conllu files. Separate files for further development are available in Fendel (2024b). The data in .vert format were loaded into Sketch Engine by means of a modified corpus configuration file (Fendel & Ireland, 2023). Because both subsets of the dataset contain bilingual inscriptions, best results are obtained by loading both subsets into Sketch Engine with the language set to ‘Ancient Greek’.
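The conversion from .conllu to .vert can be sketched as follows. This is not the toVert.py script from Fendel (2024b), but a minimal illustration under stated assumptions: each inscription is wrapped in a `<doc>` structure with an `id` attribute, each sentence in `<s>`, and each token line carries word form, lemma, and part-of-speech tag in that order. The actual attribute order and structure names are determined by the corpus configuration file and may differ.

```python
def conllu_to_vert(conllu_text, doc_id):
    """Convert CoNLL-U text for one inscription into vertical format.

    Assumed layout (hypothetical): <doc id="..."> wrapping <s> sentences,
    with one token per line as word<TAB>lemma<TAB>tag.
    """
    lines = [f'<doc id="{doc_id}">']
    in_sentence = False
    for raw in conllu_text.splitlines():
        if not raw.strip():
            # A blank line ends the current sentence
            if in_sentence:
                lines.append("</s>")
                in_sentence = False
            continue
        if raw.startswith("#"):
            continue
        cols = raw.split("\t")
        if len(cols) < 4:
            continue
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        if "-" in cols[0] or "." in cols[0]:
            continue
        if not in_sentence:
            lines.append("<s>")
            in_sentence = True
        form, lemma, upos = cols[1], cols[2], cols[3]
        lines.append("\t".join((form, lemma, upos)))
    if in_sentence:
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"


# Invented sample rows, for illustration only
sample = (
    "# sent_id = 1\n"
    "1\tbixit\tvivo\tVERB\t_\t_\t_\t_\t_\t_\n"
    "2\tannis\tannus\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "\n"
)
print(conllu_to_vert(sample, "ISic001649"))
```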

Sampling strategy

The I.Sicily database was searched for inscriptions that (i) date from 1 BC to AD 401, (ii) have the cities of Syracuse, Termini, or Catania as their provenance, and (iii) are labelled ‘funerary’ or ‘honorific’ under inscription type. The funerary (and honorific) inscriptions of the Roman period form a sub-corpus that is internally as homogeneous as possible within the I.Sicily corpus. Therefore, this sub-corpus was chosen to develop the tools for lemmatization and part-of-speech tagging and to create a gold standard.
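The three sampling criteria can be expressed as a simple filter. The sketch below is illustrative only: the record fields, the signed-year convention (negative values for BC), the requirement that both date bounds fall within the window, and the example records are all assumptions and do not reflect the I.Sicily database schema or its search interface.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical metadata record for one inscription."""
    ident: str
    not_before: int   # signed year, negative = BC (assumed convention)
    not_after: int
    provenance: str
    text_type: str


PROVENANCES = {"Syracuse", "Termini", "Catania"}
TEXT_TYPES = {"funerary", "honorific"}
EARLIEST, LATEST = -1, 401   # 1 BC to AD 401


def in_sample(record):
    """Apply criteria (i) date, (ii) provenance, (iii) inscription type."""
    return (
        EARLIEST <= record.not_before
        and record.not_after <= LATEST
        and record.provenance in PROVENANCES
        and record.text_type in TEXT_TYPES
    )


# Invented records, for illustration only
records = [
    Record("EXAMPLE-1", 1, 100, "Catania", "funerary"),
    Record("EXAMPLE-2", -300, -200, "Syracuse", "funerary"),
    Record("EXAMPLE-3", 150, 250, "Palermo", "honorific"),
    Record("EXAMPLE-4", 200, 300, "Termini", "dedication"),
]
selected = [r.ident for r in records if in_sample(r)]
print(selected)
```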

Quality control

The inscriptions appear in the dataset with their unique ISicXXXXXX identifier, so that they are linked to the database entries, which provide further information on material aspects, previous editions, and an image if available, along with crosslinks to the Trismegistos, EDR, and EDCS databases (Prag, 2021; Prag & Chartrand, 2019). The dataset is the third iteration of the Sketch Engine corpus built for the early imperial funerary inscriptions from Sicily. By means of Sketch Engine’s Wordlist feature, the previous versions were corrected manually for accentuation of Greek lemmata, consistency of part-of-speech tagging (especially relating to fragmented items, tagged NOLEMMA and NOPOS, calendar terms, and numerals), as well as Attic-Ionic and classical Latin lemmatization.
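A comparable consistency check can be run directly on the .conllu files outside Sketch Engine. The sketch below lists lemmata that occur with more than one part-of-speech tag; such cases are only candidates for review, since a lemma can legitimately carry several tags. It is not the procedure used for the dataset, and the function name and sample rows are invented for illustration.

```python
from collections import defaultdict


def lemmata_with_multiple_tags(conllu_text):
    """Map each lemma that has more than one UPOS tag to its sorted tags."""
    tags_by_lemma = defaultdict(set)
    for line in conllu_text.splitlines():
        # Skip blank lines and comment lines
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 4:
            continue
        # Column 3 is LEMMA, column 4 is UPOS in CoNLL-U
        tags_by_lemma[cols[2]].add(cols[3])
    return {
        lemma: sorted(tags)
        for lemma, tags in tags_by_lemma.items()
        if len(tags) > 1
    }


# Invented sample rows, for illustration only
sample = (
    "1\tannis\tannus\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "2\tannos\tannus\tADJ\t_\t_\t_\t_\t_\t_\n"
    "3\tvixit\tvivo\tVERB\t_\t_\t_\t_\t_\t_\n"
)
print(lemmata_with_multiple_tags(sample))
```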

(3) Dataset Description

Repository name

Zenodo

Object name

The I.Sicily Sketch Engine corpus (early imperial funerary inscriptions)

Format names and versions

.conllu and .vert

Creation dates

2024-05-09 to 2024-10-18

Dataset creators

Victoria Beatrix Fendel

Language

Greek, Latin, Hebrew

License

Creative Commons Attribution 4.0 International

Publication date

2024-10-21

(4) Reuse Potential

The lemmatization and part-of-speech tagging pipeline designed for this dataset will be extended to the entire I.Sicily corpus in order to implement text search on the website. Current development focusses on plugging the lemmatization into Crellin’s PyEpiDoc lemmatizer. The pipeline can also be applied to any other .xml file in EpiDoc format in order to create a Sketch Engine corpus, provided that manual correction is applied. The limitation of the current procedure is that morphological and syntactic tags were not manually corrected, owing to time constraints and the research focus. Both .conllu and .vert are tabular plain-text formats which can be opened in any text editor and searched manually. The .conllu files can also be loaded into any user interface that operates on .conllu files (see especially UDtools) or converted into other file formats, as has been done here with the .vert files for Sketch Engine. The Sketch Engine corpus enables text search of the early imperial funerary inscriptions from Sicily, which is not currently available elsewhere, and is thus of interest to linguists, classicists, and historians alike.
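As an illustration of searching the tabular files without Sketch Engine, the following minimal sketch looks up a lemma in .vert text and reports the inscription identifier together with the attested wordform. It assumes that inscriptions are wrapped in `<doc id="...">` structures and that the second column holds the lemma; both are assumptions about the layout, and the sample text is invented.

```python
import re

# Matches an opening <doc ...> tag and captures its id attribute (assumed name)
DOC_ID = re.compile(r'<doc\b[^>]*\bid="([^"]*)"')


def find_lemma(vert_text, lemma):
    """Return (doc_id, wordform) pairs for every token with the given lemma."""
    hits = []
    doc_id = None
    for line in vert_text.splitlines():
        if line.startswith("<"):
            # Structure line: track which document we are in
            match = DOC_ID.match(line)
            if match:
                doc_id = match.group(1)
            elif line.startswith("</doc"):
                doc_id = None
            continue
        cols = line.split("\t")
        if len(cols) >= 2 and cols[1] == lemma:
            hits.append((doc_id, cols[0]))
    return hits


# Invented sample, for illustration only
sample = (
    '<doc id="EXAMPLE-1">\n<s>\n'
    "bixit\tvivo\tVERB\nannis\tannus\tNOUN\n"
    "</s>\n</doc>\n"
    '<doc id="EXAMPLE-2">\n<s>\n'
    "vixit\tvivo\tVERB\n"
    "</s>\n</doc>\n"
)
print(find_lemma(sample, "vivo"))
```

Searching by lemma rather than by wordform retrieves variant spellings such as bixit alongside vixit, which is the practical benefit of the normalised lemmatization.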

Acknowledgements

Jonathan Prag & Simona Stoyanova (EpiDoc source data), Robert Crellin (PyEpiDoc tokenizer, version May 2024), Matthew T. Ireland (modified Sketch Engine corpus configuration file).

Funding Information

Crossreads: text, materiality and multiculturalism at the crossroads of the ancient Mediterranean. CROSSREADS has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 885040).

Competing Interests

The author has no competing interests to declare.

Author Contributions

Victoria Beatrix Fendel: Conceptualization, Data curation, Investigation, Methodology, Validation, Writing – original draft, Writing – review & editing.