1 Overview
1.1 Repository location
1.2 Context
Syriac is a dialect of the Aramaic language native to the ancient city of Edessa in modern-day Şanlıurfa, Türkiye.1 Its classical variety, Classical Syriac, was widely used from the 2nd century ad until its decline beginning in approximately the 8th century ad, with a brief literary resurgence taking place between the 11th and 13th centuries ad. In addition to its strong grammatical tradition and rich linguistic evidence spanning multiple millennia, Syriac is also an early adopter and champion of the Digital Humanities, making it an excellent language choice for a corpus-based study investigating how and why human language grammars change over time.2
This article proposes a bootstrapping approach to Syriac corpus linguistics that gives researchers greater control over the design aspects of the corpus building process while still leveraging open-source part-of-speech (POS) tagging technology in Syriac to access layers of grammatical information in their chosen texts. It is acknowledged that this method demands some technical skill on the part of the researcher and that the depth of grammatical information accessed depends on one’s research objectives. Nonetheless, I demonstrate the utility of the SEDRA API function in its capacity to offer digital scholars higher accuracy and customisability relative to other Syriac corpora where POS tags are absent, where manuscript choices are limited or where human proofing is absent from corpus transcription and/or POS tagging processes.3
2 Method
2.1 Sampling strategy
A manuscript selection framework was developed to support the design of a well-balanced corpus representative of native Syriac grammar. The framework takes into account eight variables in total, which are summarised in Table 1.
Open-source transcriptions supplied by the Digital Syriac Corpus (DSC) comprised the initial sampling frame of manuscripts during corpus design, both because they are accessible as open-source data and because many of them have undergone some degree of human proofing, which improves their accuracy. All manuscripts on this platform were tagged according to the variables listed in Table 1, a task partially supported by the metadata associated with these manuscripts.4 Preference was given to manuscripts whose date of original composition was known, to ensure full coverage of all significant periods of the Syriac language from the 1st to the 13th century ad, including one manuscript per century where possible. Letters were also prioritised during the manuscript selection process since they generally offer a closer approximation of native Syriac vernacular relative to more stylised writing genres; poetic and biblical texts were excluded for this reason. Translated manuscripts were also avoided wherever possible to minimise the effect of source language interference on Syriac grammar.5
Table 1
Manuscript Selection Framework.
| Tier | No. | Variable | Description |
|---|---|---|---|
| Primary | 1 | Accessibility | Restricted to manuscripts hosted by the Digital Syriac Corpus including approximately 571 records comprising 78 transcribed works by 38 known authors (and 24 anonymous authors) as at June 2024. |
| Primary | 2 | Date | Preference given to manuscripts where estimated date of original composition is known. Years rounded to the nearest century with each century represented, where possible. |
| Primary | 3 | Genre | Poetry and biblical manuscripts excluded due to known genre effects on syntax (Black, 2020). Letters prioritised, where available. |
| Primary | 4 | Translated | Translated manuscripts excluded as a cautionary measure due to problems they pose to historical syntax (Campbell, 2013, p. 400), including potential word-order effects in Aramaic (Pat-El, 2012, p. 101). |
| Secondary | 5 | Authorial attribution | In the first instance, historical manuscripts with known authors are selected; anonymous authors are included as a last resort. |
| Secondary | 6 | Textual copies | Preference is given to manuscripts based on a manuscript witness; manuscripts based on copied texts were selected as a last resort. |
| Secondary | 7 | Textual autonomy | Preference is given to autonomous manuscripts with those appearing in a larger body of work selected as a last resort. |
| Secondary | 8 | Dialectal identification | In the absence of known effects on Aramaic syntax, manuscript selection was not motivated by dialect; nonetheless, an even spread was attempted, but not prioritised. |
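The framework in Table 1 can be read as an exclusion step followed by a preference ranking. The sketch below is one possible reading of it, not the procedure actually used for this corpus: the `Manuscript` fields, the genre labels and the ordering of the secondary criteria are all assumptions made for illustration, and manuscripts with an unknown century are simply skipped when picking one text per century.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

# Genres excluded under variable 3 of Table 1 (labels are illustrative).
EXCLUDED_GENRES = {"poetry", "biblical"}


@dataclass(frozen=True)
class Manuscript:
    """Hypothetical metadata record; field names are assumptions."""
    title: str
    century: Optional[int]   # century of original composition, None if unknown
    genre: str               # e.g. "letter", "exposition", "poetry"
    translated: bool
    known_author: bool
    witness_based: bool      # transcribed from a manuscript witness
    autonomous: bool         # not part of a larger body of work


def is_eligible(ms: Manuscript) -> bool:
    """Apply the hard exclusions: genre (variable 3) and translation (variable 4)."""
    return ms.genre not in EXCLUDED_GENRES and not ms.translated


def preference_key(ms: Manuscript):
    """Sort key: False sorts before True, so preferred properties come first."""
    return (
        ms.genre != "letter",   # letters prioritised
        not ms.known_author,    # known authors before anonymous ones
        not ms.witness_based,   # witness-based before copied texts
        not ms.autonomous,      # autonomous texts before excerpts
    )


def select_per_century(candidates: Iterable[Manuscript]) -> Dict[int, Manuscript]:
    """Keep the most preferred eligible manuscript for each century."""
    chosen: Dict[int, Manuscript] = {}
    for ms in sorted(filter(is_eligible, candidates), key=preference_key):
        if ms.century is not None:
            chosen.setdefault(ms.century, ms)
    return dict(sorted(chosen.items()))
```

Keeping the exclusions separate from the ranking makes it straightforward to relax a single criterion, for example to admit translated texts when building a comparison corpus.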
The outcome of this sampling procedure is a specialised Syriac corpus comprising approximately 475,000 words and structured across nine authors spanning thirteen centuries (refer to Table 2). In corpus linguistic terms, this may be regarded as a ‘snapshot corpus’ (Hardie & McEnery, 2011, pp. 8–9) representing native Syriac syntax over multiple, consecutive centuries.
Table 2
Syriac Research Corpus.
| No. | AUTHOR | TITLE | DIALECT | GENRE | TRANSLATED | EDITION | SOURCE | AUTONOMOUS | CENTURY | WORDS (POS%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Mara bar Serapion | Letter of Mara bar Serapion | W | Lr | N | B | Cureton (1855) | Y | 1st | 1,594 (61.61%) |
| 2 | Bardaisan | Book of the Laws of Countries | W | Ex | N | B | Drijvers (1965) | N | 2nd–3rd | 5,277 (75.38%) |
| 3 | Aphrahat | The Demonstrations | E | Ex | N | B | Parisot (1894) | Y | 4th | 77,685 (77.52%) |
| 4–6 | Rabbula of Edessa | Letter from Cyril to Rabbula; Letter from Rabbula to Andrew of Samosata; Part of a Letter from Rabbula to Cyril | W | Lr | N | B | Overbeck (1865) | N | 5th | 882 (58.84%) |
| 7–8 | Barlaha | Letter from Simeon of Mart Maryam in Response to Barlaha; Letter from Barlaha to Simeon on the Translation of the Psalms | W | Lr | N | D | Vat. Syr. 135 | Y | 6th | 9,391 (63.38%) |
| 9 | Isaac of Nineveh | Ascetic Discourses | E | Ex | N | B | Bedjan (1909) | Y | Late 7th | 90,320 (51.11%) |
| 10 | Anton of Tagrit | On Divine Providence | W | Ex | N | D | BL Add. 14,726 | Y | 9th | 19,429 (59.31%) |
| 11 | Dionysius bar Salibi | Commentaries | W | Ex | N | D | SOAH 00012 | Y | 12th | 231,157 (63.32%) |
| 12 | Gregorius bar Hebraeus | Treatise of Treatises | W | Ex | N | B | N/A | Y | 13th | 39,106 (56.72%) |
| Total | | | | | | | | | | 474,841 (avg. 63.13%) |
[i] Note: a ‘word’ is defined by space boundaries.
Abbreviations
Dialects: W = West; E = East
Autonomous: N = No (i.e., part of larger body of work); Y = Yes (i.e., autonomous manuscript)
Translated: N = Not (likely) translated
Edition: D = Diplomatic (i.e., transcribed from source); B = Best-text (i.e., transcribed from print)
Genre: Lr = Letter; Ex = Exposition
Source: BL Add. = British Library Additional Manuscripts collection; SOAH = Syriac Orthodox Archdiocese of Homs; Vat. = Digital Vatican Library; N/A = No source available on the DSC platform
2.2 POS-tagging
2.2.1 Pre-Processing
Each Syriac manuscript was normalised by removing punctuation marks in order to avoid interference with the tokenisation process and maximise the probability of an exact word match during the tagging process. A dictionary file was created comprising a list of all unique words across all text files.6
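The pre-processing step can be sketched as follows. This is a minimal illustration rather than the script used for the corpus: it assumes that "punctuation" means any character in a Unicode punctuation category (which covers the Syriac punctuation marks but leaves vowel points and other combining marks untouched), and the function names are hypothetical.

```python
import unicodedata
from typing import Iterable, List


def strip_punctuation(text: str) -> str:
    """Replace every Unicode punctuation character (category P*) with a space.

    Combining marks such as Syriac vowel points belong to category Mn,
    so vocalisation is preserved.
    """
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )


def build_dictionary(texts: Iterable[str]) -> List[str]:
    """Return the sorted list of unique space-delimited words across all texts."""
    words = set()
    for text in texts:
        words.update(strip_punctuation(text).split())
    return sorted(words)
```

Replacing punctuation with a space, rather than deleting it, avoids accidentally fusing two words that were separated only by a punctuation mark.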
2.2.2 API Integration
Using Power Query in Excel, a function was created to automate the search of each word in the dictionary file using the Application Programming Interface (API) mechanism available in Syriac via the SEDRA project, a linguistic database providing open access to a corpus of Syriac lexicons and their corresponding linguistic information. The SEDRA API feature draws on over 50,000 human-tagged entries based on a crowd-sourcing model,7 and by default takes into account relevant affixal and/or clitic material on Syriac words. For each lexical entry in the list, a range of linguistic information was retrieved according to available API parameters and included nine syntactic categories—nouns, verbs, adverbs, adjectives, prepositions, particles, pronouns, numerals and idioms—or their sub-categories where available, e.g., denominatives, demonyms, substantives and proper nouns for the nominal class. The category of state, a linguistic feature of Semitic noun morphology, was further extracted (including where not applicable) and appended to the syntactic category with a hyphen. Together, the tags were appended with an underscore to the end of the corresponding Syriac word in the original text file in the following order:8
(1) <syntax category>-<state>__<syriac word>
    e.g., EMP-N__ “blessing” (Emphatic state noun)
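To make the tagging step concrete, the sketch below attaches a tag to each word by exact match. It makes several assumptions: the `tags` mapping is a hypothetical stand-in for the results retrieved from the SEDRA API (no API call is made here), the tag string is used verbatim, and the tag-first layout with a double underscore follows example (1) as printed. If the released files place the tag after the word, the two parts of the format string would simply be swapped.

```python
from typing import Mapping


def tag_text(text: str, tags: Mapping[str, str]) -> str:
    """Attach a POS tag to every word that has an exact match in `tags`.

    `tags` maps a word form to its tag string (e.g. "EMP-N").
    Words without a match are left untagged.
    """
    tagged = []
    for word in text.split():
        tag = tags.get(word)  # exact-match lookup only
        tagged.append(f"{tag}__{word}" if tag else word)
    return " ".join(tagged)
```

Because the lookup is an exact match, the same consonantal word with and without vocalisation marks counts as two different keys, which is why coverage depends on the vocalisation of the source text.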
2.3 Quality control
A total of 297,981 words in the corpus were successfully tagged, averaging 63.13% per manuscript. The POS tagging process was based on an exact-match process whose accuracy is relative to the presence of vocalisation marks in the original text.9 Among all 297,981 words in the corpus with an available POS tag, approximately 73,188 (24.56%) reflected some kind of homonymy involving word spellings or morphemes with two or more semantic and/or syntactic interpretations.10 These words were tagged as ‘DUP’ to facilitate a manual reading in their original context during corpus search procedures. Since the present research investigation focussed on nominal state morphology only, tags reflecting grammatical number and gender are not included here. Nonetheless, these features, along with other linguistic information relating to verb forms (e.g., tense), are captured in the SEDRA database and can be extracted via the SEDRA API mechanism using the same procedure described above to support future research projects. For more targeted and exhaustive searches of specific morpho-syntactic material, such as select prefixes, pronominal clitics and particles, regular expression search patterns were used instead after importing the corpus text into appropriate corpus linguistic software (e.g., LancsBox, AntConc).
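The ‘DUP’ rule and the reported proportions can be illustrated with a short sketch. It assumes that each word comes with a list of candidate tag strings (a hypothetical representation of what the lookup returns) and that a word is flagged as ‘DUP’ whenever two or more distinct values are present; the function names are illustrative only.

```python
from typing import Optional, Sequence


def resolve_tag(analyses: Sequence[str]) -> Optional[str]:
    """Collapse the candidate tags for one word into a single label.

    No candidates -> None (untagged); one distinct value -> that value;
    two or more distinct values -> "DUP" for later manual reading.
    """
    unique = set(analyses)
    if not unique:
        return None
    if len(unique) > 1:
        return "DUP"
    return next(iter(unique))


def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage (0.0 when whole is zero)."""
    return 100 * part / whole if whole else 0.0


# Figures reported in this section: 73,188 homonymous words
# out of 297,981 tagged words.
dup_share = percentage(73_188, 297_981)
print(f"{dup_share:.2f}%")  # 24.56%
```

The same `percentage` helper applies to per-manuscript coverage, i.e. tagged words divided by all words in a text.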
3 Dataset Description
Repository name
Zenodo
Object name
Specialised POS Tagged Syriac Corpus for State Morphology
Format names and versions
TXT
Creation dates
2020.11.01–2023.03.01
Dataset creators
Charbel El-Khaissi
Language
Aramaic (Syriac)
License
Creative Commons Attribution 4.0 International
Publication date
2024.06.29
4 Reuse Potential
The POS-tagged files hold significant potential for reuse across various domains. In computational linguistics, the dataset serves as a resource for training and testing POS tagging models specifically tailored to historical and low-resource languages (e.g., Vidal-Gorène & Kindt, 2020) or for ensuring representation of historical languages in cross-linguistic or typological projects (e.g., Pimentel et al., 2021; Batsuren et al., 2022).
Further, the text files can be integrated into a range of corpus analysis software, such as AntConc, which recognises the format of the POS tags used in this dataset (see example 1), enabling historical linguists and Syriac researchers, including lexicographers, to better understand usage patterns of certain (morpho-)syntactic phenomena in Syriac.11 Such frequency analysis of grammatical patterns may help scholars validate, refute or clarify existing remarks on specific patterns according to the Syriac grammatical tradition and, in so doing, offer renewed perspectives on what may be considered ‘standard’ or ‘exceptional’ from a quantitative standpoint. To take an example, nominal negation phrases in the 12th century ad manuscript of the corpus in Table 2 are attested with relic noun dependents (or the ‘Absolute’ noun) at a higher relative frequency than the same relic nouns in other texts of this later period. This kind of frequency analysis is important for understanding whether authors in certain texts are using a deliberately archaic, or classical, style of writing or whether such relic nouns are in fact grammatically “expected” in Syriac negation phrases (e.g., Muraoka, 2005, pp. 59–60). Taken together with other kinds of frequency results, these insights also enhance the overall effectiveness of educational materials in Syriac grammar (e.g., Kiraz, 2013) by helping teachers and students understand which Syriac constructions are “[un]common, [a]typical, [un]likely, and [im]probable” (Conrad, 2010, pp. 227–8) while also highlighting the influence that genre effects play in determining what constitutes a ‘standard grammatical rule’.
The diachronic dimension of this dataset further assists historical linguists and philologists in estimating the date of original composition of manuscripts by correlating specific linguistic forms with certain chronological periods. For instance, the nominal negation morpheme dəlaw may be tentatively used for dating later Syriac texts based on its attestation from the 6th–7th century ad onwards and its absence in earlier periods.12 This diachronic component is useful when dating later Syriac translations of old texts, such as those originally composed in or around the 1st–2nd century ad in Greek or Latin, to avoid anachronistically dating Syriac translations, which can undermine the accuracy of linguistic analysis in a historical context.
Finally, the manuscript selection framework in Table 1, which formed part of the sampling strategy for this study, offers a good starting point for standardising corpus design practices in Syriac linguistics based on a range of philological, textual and linguistic criteria, and can be adapted to suit the purpose of the researcher’s investigation. For instance, a second, similar corpus may be designed with translated texts, e.g., from Persian, to facilitate a controlled, comparative analysis between translated and non-translated Syriac from a corpus-based quantitative point of view.
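The kind of frequency comparison described in this section can be sketched with a regular expression over the tagged text. The example assumes the tag-first `TAG__word` token layout printed in example (1) and normalises counts per 1,000 space-delimited words; the function names and the labels used to identify texts are hypothetical.

```python
import re
from typing import Dict, Mapping, Tuple


def tag_frequency(tagged_text: str, tag: str) -> Tuple[int, float]:
    """Count tokens carrying `tag` and normalise per 1,000 words."""
    tokens = tagged_text.split()
    pattern = re.compile(rf"{re.escape(tag)}__\S+")
    hits = sum(1 for token in tokens if pattern.fullmatch(token))
    per_thousand = 1000 * hits / len(tokens) if tokens else 0.0
    return hits, per_thousand


def compare_texts(texts: Mapping[str, str], tag: str) -> Dict[str, float]:
    """Return the normalised frequency of `tag` for each labelled text."""
    return {label: tag_frequency(text, tag)[1] for label, text in texts.items()}
```

Normalising by text length matters here because the texts in Table 2 range from under 1,000 to over 230,000 words, so raw counts are not comparable across centuries.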
Notes
[2] The methodologies and resources shared here were produced as part of the author’s PhD research project on the historical syntax of Aramaic (Syriac) at The Australian National University (2020–current) in Canberra, Australia. All hyperlinks referenced throughout this article are valid as at September 2024.
[3] See El-Khaissi (In press) for a review of Syriac in the Digital Humanities.
[4] Needless to say, Syriac POS tagging technology continues to be a work in progress with noteworthy machine learning experiments conducted recently including Vidal-Gorène and Kindt (2020), Kiraz et al. (2021) and Naaijer et al. (2023). See also the History of the Digital Syriac Corpus webpage for a list of older projects.
[5] Available in Text Encoding Initiative (TEI) format. Source files: https://github.com/srophe/syriac-corpus.
[6] I acknowledge this is more of an ideal than a practical goal in light of the strongly multilingual backdrop of Syriac-speaking societies through time, e.g., Persian, Hebrew, Greek, Latin, Arabic (Butts, 2016).
[7] Where a ‘word’ is defined by word boundaries and not by lexeme. The tokenisation process of each word, including lexeme and inflectional marking identification, as well as whether the word is marked by any affixes or vocalisation marks, is automatically handled by the SEDRA API mechanism.
[8] For example, I have tagged 1,008 lexical entries myself to date. SEDRA API conforms to the following schema suitable for Semitic Languages: https://sedra.bethmardutho.org/about/openapi.
[9] Readers are directed to the Repository Location for a full list of POS tag abbreviations and their definitions.
[10] Vocalising texts in Syriac was not common practice until around the 3rd century ad onwards, and even then, may have only involved partial vocalisation on the author’s or scribes’ part. Since vocalisation practices are more important for understanding inflectional (and therefore, grammatical) differences between verbs and nominals, it is considered less problematic for the present research investigation concerned primarily with nominals.
[11] Measured here by whether two or more unique morpho-syntactic values were returned for each word.
[12] Effective utilisation of the dataset requires familiarity with the Syriac language, scripts, and corpus software, including a working knowledge of regular expressions.
[13] This corroborates a similar proposal by Joosten (1992) on the dating of Syriac negators, made with the variant law.
Acknowledgements
Thanks to Andrea Farina (King’s College London) and Dr Mathilde Bru (University College London) for the opportunity to share my research and receive valuable feedback in the Data in Historical Linguistics series. I acknowledge the Digital Syriac Corpus project team who have freely shared all Syriac transcriptions and metadata in the public domain. The open-source POS technology enabling this project would not be possible without the innovative work of the team at Beth Mardutho: The Syriac Institute led by Dr George Kiraz.
Funding Information
This study formed part of a PhD research project funded by the Australian Government Research Training Program (RTP) scholarship (2020–2023).
Competing Interests
The author has no competing interests to declare.
