(1) Overview
Repository location
Context
The dataset was prepared in the context of the Crossreads project, which collects, re-examines (through autopsy), re-edits (in the form of an EpiDoc .xml), and studies (petrographically, palaeographically, and linguistically) the epigraphic material from Sicily from the archaic through to the later imperial period. The dataset forms the basis of Fendel, forthcoming, Crossing thresholds: The lexicalisation and performance of memory in early imperial funerary inscriptions from Sicily, Lexis. The dataset was also used for the blogpost Fendel (2024a).
(2) Method
Steps
The source data consists of the .xml files (EpiDoc schematron) of the I.Sicily database (cf. Prag, 2021; Prag & Chartrand, 2019). The source data was sampled as outlined in the sampling strategy. The resulting subset of 723 inscriptions was split into those that are predominantly Latin and those that are predominantly Greek. These are the two sub-samples of the dataset. The Latin corpus consists of 371 inscriptions (3,577 tokens) and the Greek corpus of 352 inscriptions (2,763 tokens). Five inscriptions are in fact bilingual (from Catania and Syracuse). Bilingual inscriptions are assigned to the Greek or Latin portion of the corpus according to the language that accounts for more of the text. Table 1 provides a numerical overview:
Table 1
Overview of funerary and honorific inscriptions from Roman-period Sicily by location, text type, and language.
| PROVENANCE | NUMBER OF INSCRIPTIONS | INSCRIPTIONS BY TEXT TYPE | INSCRIPTIONS BY LANGUAGE |
|---|---|---|---|
| Termini | 190 | 180 funerary – 95%; 10 honorific – 5% | 31 Greek – 16%; 159 Latin – 84% |
| Syracuse | 282 | 269 funerary – 95%; 13 honorific – 5% | 225 Greek – 80%; 56 Latin – 20%; 1 bilingual – <1% |
| Catania | 251 | 230 funerary – 92%; 21 honorific – 8% | 94 Greek – 37%; 152 Latin – 61%; 4 bilingual & 1 Latin and Hebrew – 2% |
| Total | 723 | | |
The .xml (EpiDoc) files were tokenized by means of the PyEpiDoc tokenizer (Crellin, 2024). The tokenizer removes all Leiden diacritics and uses whitespace as the indicator of token boundaries. Manual correction ensured that each token is what we traditionally consider a word. Seeming univerbates for which no external evidence exists (see further Fendel, 2024a) and prosodic units (see Crellin, 2022) were separated into tokens. Supplied material was tokenized along with attested material. Files were subsequently lemmatized and part-of-speech tagged by means of the PROIEL models (ancient_greek-proiel-ud-2.12-230717 and latin-proiel-ud-2.12-230717) via UDPipe using toConllu.py (Fendel, 2024b). The .conllu output files were manually corrected for lemmata and part-of-speech tags in a text editor.
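The tokenization principle described above (strip Leiden sigla, then split on whitespace) can be illustrated with a minimal sketch. It is not the PyEpiDoc implementation: it works on a Leiden-style plain-text string rather than on EpiDoc .xml, and the set of sigla it removes is an assumption chosen for illustration.

```python
import unicodedata

# Assumption: a small illustrative set of Leiden sigla, not PyEpiDoc's own list.
LEIDEN_BRACKETS = set("[]()<>{}⟦⟧")
UNDERDOT = "\u0323"  # combining dot below, used for uncertain letters


def strip_leiden(text: str) -> str:
    """Remove bracket sigla and underdots while keeping the letters themselves."""
    # Decompose so that a precomposed letter with underdot exposes the dot.
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed
            if ch not in LEIDEN_BRACKETS and ch != UNDERDOT]
    return unicodedata.normalize("NFC", "".join(kept))


def tokenize(text: str) -> list[str]:
    """Split the cleaned text on whitespace; each chunk counts as one token."""
    return strip_leiden(text).split()


if __name__ == "__main__":
    print(tokenize("D(is) M(anibus) [s(acrum)]"))
```

Because whitespace alone decides token boundaries in such an approach, univerbated or prosodically grouped sequences come out as single tokens, which is why the manual correction step is needed.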
The corpus is lemmatized in Attic-Ionic Greek and classical Latin even if wordforms show alternative spellings for diachronic, diatopic, or diastratic reasons (e.g. ISic001649 bixit for vixit). Abbreviations are resolved. Calendar-related terms that were code-switched from Latin into Greek (i.e. names of months, nonae, kalendae) are lemmatized in Latin in the Greek inscriptions. Numerals written as letters are part-of-speech tagged as ADV and those written as words as NUM in all contexts (for the README file, see Fendel, 2024b). Multi-word expressions are tagged as separate items in line with the PROIEL model (yet see Fendel, 2024a). The corrected .conllu files were converted into the .vert format required by Sketch Engine (for toVert.py, see Fendel, 2024b). The dataset contains the concatenated .vert and .conllu files. Separate files for further development are available in Fendel (2024b). The data in .vert format has been implemented in Sketch Engine by means of a modified corpus configuration file (Fendel & Ireland, 2023). Because bilingual inscriptions occur in both subsets of the dataset, best results are obtained by implementing both subsets in Sketch Engine with the language selected as ‘Ancient Greek’.
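The conversion from .conllu to .vert can be sketched as follows. This is not the toVert.py script of Fendel (2024b). It is a minimal illustration that assumes the vertical file carries three positional attributes (word form, lemma, part-of-speech tag) and wraps each inscription in a doc structure and each sentence in an s structure. The function name, the choice of attributes, and the identifier ISic000000 are hypothetical.

```python
def conllu_to_vert(conllu: str, doc_id: str) -> str:
    """Turn one CoNLL-U document into a Sketch Engine style vertical text."""
    out = [f'<doc id="{doc_id}">']
    in_sentence = False
    for line in conllu.splitlines():
        if not line.strip():
            # A blank line closes the current sentence in CoNLL-U.
            if in_sentence:
                out.append("</s>")
                in_sentence = False
            continue
        if line.startswith("#"):
            continue  # sentence-level comments are not carried over
        cols = line.split("\t")
        # Keep only ordinary tokens: ten columns and a plain integer ID.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if not in_sentence:
            out.append("<s>")
            in_sentence = True
        form, lemma, upos = cols[1], cols[2], cols[3]
        out.append("\t".join((form, lemma, upos)))
    if in_sentence:
        out.append("</s>")
    out.append("</doc>")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Invented two-token sample; the identifier is a placeholder.
    sample = "\n".join([
        "# text = bixit annos",
        "\t".join(["1", "bixit", "vivo", "VERB", "_", "_", "0", "root", "_", "_"]),
        "\t".join(["2", "annos", "annus", "NOUN", "_", "_", "1", "obj", "_", "_"]),
        "",
    ])
    print(conllu_to_vert(sample, "ISic000000"))
```

Keeping the ISic identifier as an attribute of the doc structure is one way to preserve the link between a corpus hit and the corresponding database entry.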
Sampling strategy
The I.Sicily database was searched for inscriptions that (i) date from 1 BC to AD 401, (ii) have the cities of Syracuse, Termini, or Catania as their provenance, and (iii) are labelled ‘funerary’ or ‘honorific’ under inscription type. The funerary (and honorific) inscriptions of the Roman period form a sub-corpus that is internally as homogeneous as possible within the I.Sicily corpus. Therefore, this sub-corpus was chosen to develop the tools for lemmatization and part-of-speech tagging and to create a gold standard.
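The three sampling criteria can be expressed as a simple filter. The sketch below is illustrative only: the record fields, the encoding of years (negative integers for BC), the reading of criterion (i) as "the whole date range falls within the window", and the identifiers are all assumptions, not features of the I.Sicily search interface.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical metadata record for one inscription."""
    isic_id: str
    not_before: int  # earliest possible year; negative values are BC
    not_after: int   # latest possible year
    provenance: str
    text_type: str


CITIES = {"Syracuse", "Termini", "Catania"}
TEXT_TYPES = {"funerary", "honorific"}
START, END = -1, 401  # 1 BC to AD 401


def in_sample(record: Record) -> bool:
    """Apply criteria (i) to (iii); assumes the full date range must fit."""
    dated = START <= record.not_before and record.not_after <= END
    return (dated
            and record.provenance in CITIES
            and record.text_type in TEXT_TYPES)


if __name__ == "__main__":
    # Placeholder records, not real database entries.
    records = [
        Record("ISic000001", 1, 100, "Catania", "funerary"),
        Record("ISic000002", -50, 50, "Syracuse", "funerary"),
        Record("ISic000003", 100, 200, "Termini", "dedication"),
    ]
    print([r.isic_id for r in records if in_sample(r)])
```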
Quality control
The inscriptions appear in the dataset with their unique ISicXXXXXX identifier such that they are linked to the database entries which provide further information on material aspects, previous editions, an image if available, along with crosslinks to the Trismegistos, EDR, and EDCS databases (Prag, 2021; Prag & Chartrand, 2019). The dataset is the third iteration of the Sketch Engine corpus built for the early imperial funerary inscriptions from Sicily. By means of Sketch Engine’s Wordlist feature, the previous versions were corrected manually for accentuation of Greek lemmata, consistency of part-of-speech tagging especially relating to fragmented items (tagged NOLEMMA and NOPOS), calendar terms and numerals, as well as Attic-Ionic and classical Latin lemmatization.
(3) Dataset Description
Repository name
Zenodo
Object name
The I.Sicily Sketch Engine corpus (early imperial funerary inscriptions)
Format names and versions
.conllu and .vert
Creation dates
2024-05-09 to 2024-10-18
Dataset creators
Victoria Beatrix Fendel
Language
Greek, Latin, Hebrew
License
Creative Commons Attribution 4.0 International
Publication date
2024-10-21
(4) Reuse Potential
The lemmatization and part-of-speech tagging pipeline designed for this dataset will be extended to the entire I.Sicily corpus in order to implement text search on the website. Current development focusses on feeding the lemmatization into Crellin’s PyEpiDoc lemmatizer. The pipeline can also be applied to any other .xml file in EpiDoc format in order to create a Sketch Engine corpus, provided that manual correction is applied. The limitation of the current procedure is that morphological and syntactic tags were not manually corrected due to time constraints and the research focus. Both .conllu and .vert are tabular plain-text formats, which can be opened in any text editor and searched manually. The .conllu files can also be loaded into any user interface that operates on .conllu files (see especially UDtools) or can be converted into other relevant file formats, as has been done here into .vert files for Sketch Engine. The Sketch Engine corpus enables text search of the early imperial funerary inscriptions from Sicily, which is currently not available elsewhere, and is thus of interest to linguists, classicists, and historians alike.
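Because .conllu is a plain tab-separated format, simple queries do not require a dedicated interface. As a minimal sketch, the function below groups attested word forms under their lemma, which is one way to surface alternative spellings such as bixit for vixit. The function name and the sample rows are invented for illustration and are not part of the dataset or its scripts.

```python
from collections import defaultdict


def forms_by_lemma(conllu: str) -> dict[str, set[str]]:
    """Map each lemma to the set of word forms attested for it."""
    forms = defaultdict(set)
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence breaks and comments
        cols = line.split("\t")
        # Keep only ordinary tokens: ten columns and a plain integer ID.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        forms[cols[2]].add(cols[1])  # column 3 = LEMMA, column 2 = FORM
    return dict(forms)


if __name__ == "__main__":
    # Invented sample rows, not taken from the dataset.
    sample = "\n".join([
        "\t".join(["1", "bixit", "vivo", "VERB", "_", "_", "0", "root", "_", "_"]),
        "",
        "\t".join(["1", "vixit", "vivo", "VERB", "_", "_", "0", "root", "_", "_"]),
        "",
    ])
    for lemma, attested in forms_by_lemma(sample).items():
        print(lemma, sorted(attested))
```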
Acknowledgements
Jonathan Prag & Simona Stoyanova (EpiDoc source data), Robert Crellin (PyEpiDoc tokenizer, version May 2024), Matthew T. Ireland (modified Sketch Engine corpus configuration file).
Funding Information
Crossreads: text, materiality and multiculturalism at the crossroads of the ancient Mediterranean. CROSSREADS has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 885040).
Competing Interests
The author has no competing interests to declare.
Author Contributions
Victoria Beatrix Fendel: Conceptualization, Data curation, Investigation, Methodology, Validation, Writing – original draft, Writing – review & editing.
