(1) Overview
Repository location
JOHD Dataverse: https://doi.org/10.7910/DVN/N1ZYWS
Context
Over the last several years of research in the field of low-resource computational linguistics, one particular method for improving dataset quality has received limited scholarly attention: the implementation of era- and/or genre-specific stop word lists (from here on: stop lists). Stop lists available in toolkits for low-resource languages tend to be applicable to a broad language dataset rather than targeted to specific texts, authors, or time periods. Many such language datasets, however, include texts from a wide range of genres and dates.
Original literary output in Latin spanned several hundred years, during which significant dialectal and linguistic changes occurred. Additionally, individual genres tend to exhibit prescribed conventions in both form and content, which drive differences in word choice and usage. Applying the same stop list to, e.g., Augustan love poetry and late ecclesiastical Latin prose implies a non-existent continuity between the two (Table 1).
Table 1
Latin literature, although often used as a single corpus in computational research, in fact has a wide generic range and multi-millennial time span (See Conte 1994).
| TIME PERIOD | REPUBLICAN LATIN | AUGUSTAN LATIN | IMPERIAL LATIN | LATE ANTIQUE & CHRISTIAN LATIN |
|---|---|---|---|---|
| ‘Classical Latin’ | ||||
| Approximate Start | 2nd C. BCE | mid-1st C. BCE | 1st C. CE | 2nd C. CE |
| Approximate End | mid-1st C. BCE | early 1st C. CE | 3rd C. CE | |
| Example Authors & Works by Genre | ||||
| Epic poetry | Lucretius | Vergil Ovid | Statius Lucan | Dracontius |
| Theater | Plautus Terence | Pantomime | Seneca | Hrotsvitha |
| Small-format poetry (e.g., elegy) | Catullus | Horace Propertius Tibullus Ovid | Statius Martial | Anthologia Latina |
| Historiography & Military History | Caesar Sallust | Livy | Tacitus | Ammianus Marcellinus |
| Other prose | Cicero Varro Atticus | Flaccus Vitruvius Pompey Trogue | Petronius Pliny Quintilian Suetonius Apuleius | Nonius Jerome Tertullian Other apologists Acta Martyrum and Passiones |
In this data paper, we argue that computational researchers—particularly those working in low-resource contexts—should consult with linguistic specialists to create targeted stop lists developed with specific eras, genres, authors, or contexts in mind. We offer an exemplum of collaborative creation of stop lists targeted at Augustan poetry. Augustan texts—i.e., texts written during the Augustan Era—are composed in a dialect of Latin commonly referred to as ‘Classical Latin’ (see Table 1). Although the dataset we present is itself small, the potential impact of its implementation in computational humanities workflows is not.
(2) Method
2.1 Steps
2.1.1 Consideration & evaluation of existing Latin stop lists
In recent years, most Latin stop lists have been based on word frequency. By creating stop lists based exclusively on the frequency of words within an entire corpus of extant (digitized) literature, scholars may inadvertently remove words that are less common and/or more significant in certain eras, thereby erasing essential content from their analyses.
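The effect described above can be illustrated with a minimal sketch of a purely frequency-based stop list. The function name `frequency_stop_list` and the toy corpus are assumptions made for illustration only; the token strings are invented and are not drawn from any of the lists or corpora discussed in this paper.

```python
from collections import Counter


def frequency_stop_list(documents, top_n):
    """Return the top_n most frequent whitespace-separated tokens."""
    counts = Counter(
        token for doc in documents for token in doc.lower().split()
    )
    return [word for word, _ in counts.most_common(top_n)]


# Invented toy corpus: three ecclesiastical-style lines and one
# elegiac-style line, so the later register dominates the counts.
toy_corpus = [
    "dominus dixit et dominus fecit",
    "et dominus deus dixit",
    "dominus et deus",
    "et dominus amoris me tenet",
]

stop_list = frequency_stop_list(toy_corpus, top_n=2)
print(stop_list)
```

Because the ecclesiastical-style lines outnumber the single elegiac-style line, dominus ends up on the stop list and would be stripped from the elegiac line as well, even though it is the semantically significant word there.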
This is particularly true in the context of Latin literature given that a significant proportion of extant text comes from a Late Antique or Christian context. Over the past several years, the Classical Language Toolkit (CLTK, Johnson et al., 2021) has offered several stop lists based on different methods of computing word frequency, including overall frequency, entropy, and variance.[1] These stop lists demonstrate disproportionate influence from Late Antique linguistic contexts due to the large amount of extant text from these time periods. On the other hand, the Perseus Digital Library (PDL; Crane, accessed 2024) and the International Organization for Standardization (ISO, Diaz 2016) offer stop lists that suffer from brevity and inconsistency. A summary of previous lists is presented in Table 2; the script used to create this table is available in the GitHub repository. Our revised stop list is available below in Table 3.
Table 2
Summary of words included in existing Latin stop lists. “Y” indicates inclusion on a list; “N” indicates absence. CLTK-M: stop words by mean frequency; CLTK-V: stop words by variance; CLTK-E: stop words by entropy probability; CLTK-B: stop words by Borda count; ISO: stop words from the International Organization for Standardization; PDL: stop words from the Perseus Digital Library. More details are available in the file previous_stop_lists.py.
| WORDS | CLTK-M | CLTK-V | CLTK-E | CLTK-B | ISO | PDL |
|---|---|---|---|---|---|---|
| adhic, aliqui, aliquis, an, cur, deinde, es, etsi, fio, haud, idem, infra, interim, is, mox, necque, o, ob, possum, quare, quicumque, quilibet, quisnam, quisquam, quisque, quisquis, quoniam, sive, sui, sum, suus, trans, tum, unus | N | N | N | N | N | Y |
| a, e, erant, re, rebus, rem, tandem, vel | N | N | N | N | Y | N |
| at | N | N | N | N | Y | Y |
| contra, cuius, tantum | N | N | Y | N | N | N |
| magis | N | N | Y | N | N | Y |
| anno, deo, dicitur, dixit, dominus, ed, nummus, rex, totus | N | Y | N | N | N | N |
| super | N | Y | N | N | N | Y |
| bellum, bibit, dig, nouus, od, quaestio, uos | N | Y | N | Y | N | N |
| eorum | Y | N | N | N | N | N |
| cui, omnibus, sua | Y | N | Y | N | N | N |
| apud, igitur | Y | N | Y | N | N | Y |
| res | Y | N | Y | N | Y | N |
| ei, nobis, omnes, potest, quos, sine | Y | N | Y | Y | N | N |
| modo, quis, tam, ubi | Y | N | Y | Y | N | Y |
| dei, deus, secundum | Y | Y | N | Y | N | N |
| ea, eius, eo, esse, esset, eum, fuit, his, id, illa, mihi, nihil, nunc, omnia, quem, quid, quoque, se, sibi, sicut, sit, tibi | Y | Y | Y | Y | N | N |
| ante, ego, enim, ergo, iam, ille, inter, ipse, nam, ne, nisi, nos, post, pro, quia, sub, tu, uel, uero | Y | Y | Y | Y | N | Y |
| erat, haec, hoc, me, qua, quibus, quod, sunt, te | Y | Y | Y | Y | Y | N |
| ab, ac, ad, atque, aut, autem, cum, de, dum, est, et, etiam, ex, hic, in, ita, nec, neque, non, per, quae, quam, qui, quidem, quo, sed, si, sic, tamen, ut | Y | Y | Y | Y | Y | Y |
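A comparison of this kind can be sketched in a few lines of Python. This is not the authors' previous_stop_lists.py; the function name `inclusion_table` is hypothetical, and the two input sets are only small excerpts copied from the ISO and PDL columns of Table 2.

```python
def inclusion_table(stop_lists):
    """Group words by their Y/N inclusion pattern across named lists."""
    names = list(stop_lists)
    all_words = sorted(set().union(*stop_lists.values()))
    groups = {}
    for word in all_words:
        pattern = tuple(
            "Y" if word in stop_lists[name] else "N" for name in names
        )
        groups.setdefault(pattern, []).append(word)
    return names, groups


# Small excerpts from Table 2, not the complete lists.
excerpts = {
    "ISO": {"at", "vel", "et"},
    "PDL": {"at", "sum", "et"},
}

names, groups = inclusion_table(excerpts)
print("WORDS |", " | ".join(names))
for pattern, words in sorted(groups.items()):
    print(", ".join(words), "|", " | ".join(pattern))
```

Grouping by the full inclusion pattern, rather than printing one row per word, yields the compact row layout used in Table 2.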
Table 3
Stop list targeted to Classical Latin poetry.
| CLASSICAL LATIN POETRY STOP LIST | |||||||
|---|---|---|---|---|---|---|---|
| a | contra | ex | ita | noster | qua | quod | super |
| ab | cum | facio | magis | nunc | quam | quoque | tam |
| abs | de | fero | magnus | nullus | que | res | tamen |
| ac | deus | habeo | meus | ob | qui | se | tantus |
| ad | dico | hic | modo | omnis | quia | sed | totus |
| aliquis | do | iam | multus | per | quicumque | si | tu |
| alius | dum | idem | nam | post | quid | sic | tuus |
| ante | e | igitur | ne | possum | quidam | sicut | vel |
| apud | ego | ille | nec | praeter | quidem | sine | vero |
| at | enim | in | neque | pro | quis | sub | vester |
| atque | ergo | inter | nihil | prope | quisquam | sui | vos |
| aut | et | intra | nisi | propter | quisque | sum | ubi |
| autem | etiam | ipse | nos | qualis | quisquis | supra | ultra |
| circa | etiamnum | is | non | quantus | quo | suus | ut |
| uti | |||||||
2.1.2 Removal of stop words with high semantic relevance to target corpus
We begin with the combined list of words from Table 2 and methodically remove and add words to create a targeted Latin stop list for poetry from the Augustan era. The words we remove can be sorted into four categories.
First, we remove words with a disproportionately high frequency in Late Antique and ecclesiastical Latin relative to Classical Latin. In most cases, these are references to or titles for the Christian god (e.g., dominus, deus, or rex). The word dominus is so common in Late Antique texts that it is the 53rd most common word in all Latin literature.[2] In an ecclesiastical context, dominus means ‘lord [god]’. In a Classical Latin context, however, the word dominus (here, ‘master’ or [slave] ‘owner’) is much less frequent and carries more importance as a semantic feature for conceptual representation (as opposed to, e.g., the formulaic uses in ecclesiastical Latin). Removing dominus from the text of a Classical poet could constitute the erasure of slave narratives and essential literary constructs (servitium amoris, or love as slavery, Copley 1947; Lyne 1979). Similar considerations should be made for words like rex and deus.
Second, we remove words that are overrepresented in particular genres. Another inclusion on previous stop lists is the noun bellum (‘war’), which often occurs in historiographic texts like those of Julius Caesar. While bellum is a valid inclusion for stop lists targeted at historiography, as researchers creating a targeted stop list for Augustan poetry, we elected to remove it. Like slavery, war features in a metaphor of love as a military endeavor (militia amoris, Murgatroyd 1975). Another problem with previous stop lists is their inconsistent inclusion of prepositions and pronouns. For example, some stop lists include per (a common preposition meaning ‘through’), but not trans (a preposition meaning ‘across’). In our list, we correct these inconsistencies to include all pronouns and prepositions of similar semantic significance.
Third, we remove words that do not appear in Classical texts. This includes abbreviations such as od (a shortening for oculus dexter, ‘right eye’) and ed (possibly a shortened form of the pronoun idem, ‘the same’, or the verb edo, ‘eat’). Finally, we remove words that are redundant, inconsistent, or illogical inclusions. For example, nummus (‘coin’) has an overall frequency rank of 1560; the reason for its inclusion in any stop list is unclear. Other words (me, sit, omnes) are specific forms that would be lemmatized to their dictionary forms ego (‘I’), sum (‘be’), and omnis (‘all’).
2.1.3 Addition of stop words with low semantic relevance to target corpus
To correct inconsistencies in existing lists, we include all pronouns except nemo (‘no one’). We do not include nemo as a stop word because its usage in the nominative (a grammatical case in Latin) tends to be more pointed than that of other pronouns, which carry less, and mutually equivalent, semantic significance. The generalized use in other cases reverts to forms of nullus (‘nothing’), which is included on the list. We also add words with low semantic weight and high frequency in Classical Latin, for example facio (‘make/do’), do (‘give’), magnus (‘big/great’), multus (‘much’), and fero (‘carry’).[3]
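The removal and addition steps in Sections 2.1.2 and 2.1.3 amount to simple set operations, sketched below. The function name `target_stop_list` is hypothetical, and the three sets contain only a handful of the words named in the text, not the full lists.

```python
def target_stop_list(combined, to_remove, to_add):
    """Apply removals, then additions, and return a sorted list."""
    return sorted((set(combined) - set(to_remove)) | set(to_add))


# A few words from the combined Table 2 lists (illustrative subset).
combined = {"et", "in", "per", "dominus", "rex", "bellum", "od", "ed", "nummus"}

# Words the text describes removing for Augustan poetry.
to_remove = {"dominus", "rex", "bellum", "od", "ed", "nummus"}

# High-frequency, low-weight words the text describes adding.
to_add = {"facio", "do", "magnus", "multus", "fero"}

targeted = target_stop_list(combined, to_remove, to_add)
print(targeted)
```

Keeping the removals and additions as separate, named sets makes each editorial decision easy to review and to revise when adapting the list to another era or genre.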
2.1.4 Creation of Python script for easy implementation of specialized list
To make it easy for researchers to use our targeted stop lists, we provide a command-line-based Python script (latin_stop_list.py) that can be integrated into an existing research pipeline.
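For readers who prefer to read a stop list file directly, the following is a minimal sketch and does not reproduce the interface of latin_stop_list.py. It assumes that the .txt file holds one word per line, which should be checked against the repository; the function names `load_stop_list` and `remove_stop_words` are hypothetical, and a temporary stand-in file is used so that the example runs on its own.

```python
import tempfile
from pathlib import Path


def load_stop_list(path):
    """Read a stop list, assuming one word per line (an assumption)."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Keep only tokens that are not on the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


# Stand-in file with three words from Table 3; replace demo_path with
# the path to the downloaded stop list in real use.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_stop_list.txt"
    demo_path.write_text("et\nin\nsum\n", encoding="utf-8")
    stop_words = load_stop_list(demo_path)

filtered = remove_stop_words(["odi", "et", "amo"], stop_words)
print(filtered)
```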
2.2 A Note on Latin Lemmatization
For certain languages, including Latin, researchers may want to consider removing stop words prior to lemmatization. This allows for more control over words that have been lexicalized[4] in specific usages without eliminating other, more semantically significant forms. For example, the rhetorical adverb vero (‘indeed’) derives from the adjective verus (‘true’) and is lemmatized under it, even though the form vero has been lexicalized as an adverb distinct from other forms of verus. Removing vero after lemmatization would therefore also remove all forms of verus. Removing stop words prior to lemmatization may improve the results for texts that more frequently employ lexicalized versions of common words in specific uses. For this reason, we provide an unlemmatized stop list with all enumerated forms.
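The difference between the two orderings can be shown with a toy example. The lemma table `TOY_LEMMAS` is a hypothetical stand-in for a real Latin lemmatizer, and the helper names are illustrative assumptions.

```python
# Hypothetical stand-in for a lemmatizer: maps surface forms to lemmas.
TOY_LEMMAS = {"vero": "verus", "vera": "verus", "verum": "verus"}


def lemmatize(tokens):
    """Replace each token with its lemma if the toy table knows it."""
    return [TOY_LEMMAS.get(t, t) for t in tokens]


def drop(tokens, stop_words):
    """Remove every token that appears in stop_words."""
    return [t for t in tokens if t not in stop_words]


tokens = ["vero", "vera", "dicta"]

# Stop words removed BEFORE lemmatization: only the lexicalized
# surface form vero is dropped; vera survives and becomes verus.
before = lemmatize(drop(tokens, {"vero"}))

# Stop words removed AFTER lemmatization: the list can only name the
# lemma verus, so every form of the adjective is dropped as well.
after = drop(lemmatize(tokens), {"verus"})

print(before)
print(after)
```

In the first ordering the adjective remains available for analysis, whereas in the second it disappears along with the adverb.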
(3) Dataset Description
Repository names
JOHD Dataverse & GitHub
Object names
| FILE NAME | FORMATS | DESCRIPTION |
|---|---|---|
| classical_latin_poetry_stop_words_unlemmatized | .txt*, .csv | Unlemmatized stop list for classical Latin poetry |
| classical_latin_poetry_stop_words_lemmatized | .txt*, .csv | Lemmatized stop list for classical Latin poetry |
| latin_stop_list | .py | Python script for importing stop lists |
* .txt files are available only in the GitHub repository; they are included for the convenience of researchers using console applications in their research pipeline.
Format names and versions
data: .txt, .csv; Python script: .py (3.12)
Creation dates
2020.09.01 – 2024.09.15
Dataset creators
Rachel E. Dubit (Stanford University): Linguistic expertise; Annie K. Lamar (UCSB): Computational expertise.
Languages
Latin; English
License
MIT
Publication date
2024.09.27
(4) Reuse Potential
Our open-access stop list can serve as a starting point for other eras or genres of Latin literature. Only slight adjustments should be needed for closely related contexts, whereas those working with datasets from significantly earlier or later periods may wish to make larger changes. More broadly, the transdisciplinary and collaborative process by which these stop lists were created is of significant benefit to low-resource computational linguistics research teams. Such collaboration should begin at data pre-processing, rather than being confined to validation or analysis.
Notes
[2] All word frequencies referenced in this paper are taken from Logeion. The frequencies from Logeion were collated at The University of Chicago through a combination of manual and automated curation. Frequency rankings are based on data from major Latin dictionaries and lexica. For more, see logeion.uchicago.edu/about.
Acknowledgements
We wish to acknowledge the contributions of Quinn Dombrowski (Stanford University) and Professor Hans Bork (Department of Classics, Stanford University).
Competing Interests
The authors have no competing interests to declare.
Author Contributions
Rachel E. Dubit: Conceptualization; Formal analysis; Investigation; Project administration; Validation; Writing – original draft; Writing – review & editing.
Annie K. Lamar: Conceptualization; Data curation; Investigation; Methodology; Software; Visualization; Writing – original draft; Writing – review & editing.
