Era- and Genre-Specific Stop Word Lists for Low-Resource Computational Research: A Classical Latin Exemplum

Rachel E. Dubit; Annie K. Lamar

doi:10.5334/johd.246


(1) Overview

Repository location

JOHD Dataverse: https://doi.org/10.7910/DVN/N1ZYWS

GitHub: https://github.com/LOREL-Lab/latin_stop_words

Context

Over the last several years of research in low-resource computational linguistics, one method for improving dataset quality has received limited scholarly attention: the implementation of era- and/or genre-specific stop word lists (from here on: stop lists). Stop lists available in toolkits for low-resource languages tend to be applicable to a broad language dataset rather than targeted to specific texts, authors, or time periods. Many such language datasets, however, include texts from a wide range of genres and dates.

Original literary output in Latin spanned several hundred years, during which significant dialectal and linguistic changes occurred. Additionally, individual genres tend to exhibit prescribed conventions in both form and content, which drive differences in word choice and usage. Applying the same stop list to, e.g., Augustan love poetry and late ecclesiastical Latin prose implies a non-existent continuity between the two (Table 1).

Table 1

Latin literature, although often used as a single corpus in computational research, in fact has a wide generic range and multi-millennial time span (see Conte 1994).

TIME PERIOD	REPUBLICAN LATIN	AUGUSTAN LATIN	IMPERIAL LATIN	LATE ANTIQUE & CHRISTIAN LATIN
‘Classical Latin’
Approximate Start	2nd C. BCE	mid-1st C. BCE	1st C. CE	2nd C. CE
Approximate End	mid-1st C. BCE	early 1st C. CE	3rd C. CE
Example Authors & Works by Genre
Epic poetry	Lucretius	Vergil Ovid	Statius Lucan	Dracontius
Theater	Plautus Terence	Pantomime	Seneca	Hrotsvitha
Small-format poetry (e.g., elegy)	Catullus	Horace Propertius Tibullus Ovid	Statius Martial	Anthologia Latina
Historiography & Military History	Caesar Sallust	Livy	Tacitus	Ammianus Marcellinus
Other prose	Cicero Varro Atticus	Flaccus Vitruvius Pompey Trogue	Petronius Pliny Quintilian Suetonius Apuleius	Nonius Jerome Tertullian Other apologists Acta Martyrum and Passiones

In this data paper, we argue that computational researchers—particularly those working in low-resource contexts—should consult with linguistic specialists to create targeted stop lists developed with specific eras, genres, authors, or contexts in mind. We offer an exemplum of collaborative creation of stop lists targeted at Augustan poetry. Augustan texts—i.e., texts written during the Augustan Era—are composed in a dialect of Latin commonly referred to as ‘Classical Latin’ (see Table 1). Although the dataset we present is itself small, the potential impact of its implementation in computational humanities workflows is not.

(2) Method

2.1 Steps

2.1.1 Consideration & evaluation of existing Latin stop lists

In recent years, most Latin stop lists have been based on word frequency. By creating stop lists based exclusively on the frequency of words within an entire corpus of extant (digitized) literature, scholars may inadvertently remove words that are less common and/or more significant in certain eras, thereby erasing essential content from their analyses.

This is particularly true in the context of Latin literature, given that a significant proportion of extant text comes from a Late Antique or Christian context. Over the past several years, the Classical Language Toolkit (CLTK, Johnson et al., 2021) has offered several stop lists based on different methods of computing word frequency, including overall frequency, entropy, and variance.¹ These stop lists show disproportionate influence from Late Antique linguistic contexts because of the large amount of extant text from these periods. The stop lists offered by the Perseus Digital Library (PDL; Crane, accessed 2024) and the International Organization for Standardization (ISO, Diaz 2016), on the other hand, suffer from brevity and inconsistency. A summary of previous lists is presented in Table 2; the script used to create this table is available in the GitHub repository. Our revised stop list is available below in Table 3.
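The effect of corpus imbalance on a purely frequency-based stop list can be illustrated with a minimal sketch. The token lists below are invented toy data, and the function name top_k_by_frequency is hypothetical; this is not the CLTK procedure, only an illustration of how pooling a small Augustan sample with a much larger Late Antique sample can push a word such as dominus into the top ranks.

```python
from collections import Counter


def top_k_by_frequency(documents, k):
    """Return the k most frequent tokens across all documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Toy data (invented for illustration, not drawn from any real corpus).
augustan = [["et", "dominus", "amor", "et", "puella"]]
late_antique = [["et", "dominus", "deus", "dominus", "et"]] * 5

# A stop list computed over the pooled corpus is dominated by the larger part.
pooled = top_k_by_frequency(augustan + late_antique, k=2)

# The same computation restricted to the Augustan sample gives another result.
augustan_only = top_k_by_frequency(augustan, k=2)

print("pooled:", pooled)
print("Augustan only:", augustan_only)
```

In this toy setting dominus enters the pooled list but not the Augustan-only list, which is the kind of distortion a targeted stop list is meant to avoid.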

Table 2

Summary of words included in existing Latin stop lists. “Y” indicates inclusion on a list; “N” indicates absence. CLTK-M: stop words by mean frequency; CLTK-V: stop words by variance; CLTK-E: stop words by entropy probability; CLTK-B: stop words by Borda count; ISO: stop words from the International Organization for Standardization; PDL: stop words from the Perseus Digital Library. More details are available in the file previous_stop_lists.py.

WORDS	CLTK-M	CLTK-V	CLTK-E	CLTK-B	ISO	PDL
adhic, aliqui, aliquis, an, cur, deinde, es, etsi, fio, haud, idem, infra, interim, is, mox, necque, o, ob, possum, quare, quicumque, quilibet, quisnam, quisquam, quisque, quisquis, quoniam, sive, sui, sum, suus, trans, tum, unus	N	N	N	N	N	Y
a, e, erant, re, rebus, rem, tandem, vel	N	N	N	N	Y	N
at	N	N	N	N	Y	Y
contra, cuius, tantum	N	N	Y	N	N	N
magis	N	N	Y	N	N	Y
anno, deo, dicitur, dixit, dominus, ed, nummus, rex, totus	N	Y	N	N	N	N
super	N	Y	N	N	N	Y
bellum, bibit, dig, nouus, od, quaestio, uos	N	Y	N	Y	N	N
eorum	Y	N	N	N	N	N
cui, omnibus, sua	Y	N	Y	N	N	N
apud, igitur	Y	N	Y	N	N	Y
res	Y	N	Y	N	Y	N
ei, nobis, omnes, potest, quos, sine	Y	N	Y	Y	N	N
modo, quis, tam, ubi	Y	N	Y	Y	N	Y
dei, deus, secundum	Y	Y	N	Y	N	N
ea, eius, eo, esse, esset, eum, fuit, his, id, illa, mihi, nihil, nunc, omnia, quem, quid, quoque, se, sibi, sicut, sit, tibi	Y	Y	Y	Y	N	N
ante, ego, enim, ergo, iam, ille, inter, ipse, nam, ne, nisi, nos, post, pro, quia, sub, tu, uel, uero	Y	Y	Y	Y	N	Y
erat, haec, hoc, me, qua, quibus, quod, sunt, te	Y	Y	Y	Y	Y	N
ab, ac, ad, atque, aut, autem, cum, de, dum, est, et, etiam, ex, hic, in, ita, nec, neque, non, per, quae, quam, qui, quidem, quo, sed, si, sic, tamen, ut	Y	Y	Y	Y	Y	Y
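A summary of this shape can be produced by grouping words according to their pattern of inclusion across the lists. The following is a minimal sketch and not the authors' previous_stop_lists.py; the function name membership_rows is hypothetical, and each set contains only a handful of words excerpted from Table 2 rather than the full lists.

```python
from collections import defaultdict

# Small excerpts only; each set holds a few words taken from Table 2.
lists = {
    "CLTK-M": {"et", "res", "eorum"},
    "CLTK-V": {"et", "super"},
    "CLTK-E": {"et", "res", "magis"},
    "CLTK-B": {"et"},
    "ISO": {"et", "res", "at"},
    "PDL": {"et", "super", "magis", "at"},
}


def membership_rows(lists):
    """Group words by their Y/N inclusion pattern across the given lists."""
    names = list(lists)
    all_words = set().union(*lists.values())
    groups = defaultdict(list)
    for word in sorted(all_words):
        pattern = tuple("Y" if word in lists[name] else "N" for name in names)
        groups[pattern].append(word)
    return names, dict(groups)


names, rows = membership_rows(lists)

# Print a tab-separated table with one row per inclusion pattern.
print("WORDS", *names, sep="\t")
for pattern, words in sorted(rows.items()):
    print(", ".join(words), *pattern, sep="\t")
```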

Table 3

Stop list targeted to Classical Latin poetry.

CLASSICAL LATIN POETRY STOP LIST
a	contra	ex	ita	noster	qua	quod	super
ab	cum	facio	magis	nunc	quam	quoque	tam
abs	de	fero	magnus	nullus	que	res	tamen
ac	deus	habeo	meus	ob	qui	se	tantus
ad	dico	hic	modo	omnis	quia	sed	totus
aliquis	do	iam	multus	per	quicumque	si	tu
alius	dum	idem	nam	post	quid	sic	tuus
ante	e	igitur	ne	possum	quidam	sicut	vel
apud	ego	ille	nec	praeter	quidem	sine	vero
at	enim	in	neque	pro	quis	sub	vester
atque	ergo	inter	nihil	prope	quisquam	sui	vos
aut	et	intra	nisi	propter	quisque	sum	ubi
autem	etiam	ipse	nos	qualis	quisquis	supra	ultra
circa	etiamnum	is	non	quantus	quo	suus	ut
							uti

2.1.2 Removal of stop words with high semantic relevance to target corpus

We begin with the combined list of words from Table 2 and methodically remove and add words to create a targeted Latin stop list for poetry from the Augustan era. The words we remove can be sorted into four categories.

First, we remove words with a disproportionately high frequency in Late Antique and ecclesiastical Latin relative to Classical Latin. In most cases, these are references to or titles for the Christian god (e.g. dominus, deus, or rex). The word dominus is so common in Late Antique texts that it is the 53rd most common word in all Latin literature.² In an ecclesiastical context, dominus means ‘lord [god]’. In a Classical Latin context, however, the word dominus (here, ‘master’ or [slave] ‘owner’) is much less frequent and carries more importance as a semantic feature for conceptual representation (as opposed to, e.g., the formulaic uses in ecclesiastical Latin). Removing dominus from the text of a Classical poet could constitute the erasure of slave narratives and essential literary constructs (servitium amoris, or love as slavery, Copley 1947; Lyne 1979). Similar considerations should be made for words like rex and deus.

Second, we remove words that are overrepresented in particular genres. Another inclusion on previous stop lists is the noun bellum (‘war’), which often occurs in historiographic texts like those of Julius Caesar. While bellum is a valid inclusion for stop lists targeted at historiography, as researchers creating a targeted stop list for Augustan poetry, we elected to remove it. Like slavery, war features in a metaphor of love as a military endeavor (militia amoris, Murgatroyd 1975). Another problem with previous stop lists is their inconsistent inclusion of prepositions and pronouns. For example, some stop lists include per (a common preposition meaning ‘through’), but not trans (a preposition meaning ‘across’). In our list, we correct these inconsistencies to include all pronouns and prepositions of similar semantic significance.

Third, we remove words that do not appear in Classical texts. This includes abbreviations such as od (a shortening for oculus dexter, ‘right eye’) and ed (possibly a shortened form of the pronoun idem, ‘the same’, or the verb edo, ‘eat’). Finally, we remove words that are redundant, inconsistent, or illogical inclusions. For example, nummus (‘coin’) has an overall frequency rank of 1560; the reason for its inclusion in any stop list is unclear. Other words (me, sit, omnes) are specific forms that would be lemmatized to their dictionary forms ego (‘I’), sum (‘be’), and omnis (‘all’).

2.1.3 Addition of stop words with low semantic relevance to target corpus

To correct inconsistencies in existing lists, we include all pronouns except nemo (‘no one’). We exclude nemo because its use in the nominative (a grammatical case in Latin) tends to be more pointed than that of other pronouns, which carry less semantic weight and are roughly equivalent to one another in significance. In the other cases, generalized use reverts to forms of nullus (‘none’), which is included on the list. We also add words with low semantic weight and high frequency in Classical Latin, for example facio (‘make/do’), do (‘give’), magnus (‘big/great’), multus (‘much’), and fero (‘carry’).³
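The removal and addition steps of sections 2.1.2 and 2.1.3 amount to set operations on a combined list. The sketch below uses only a small excerpt of words named in the text; the variable names are illustrative assumptions, and the actual decisions were made by hand with linguistic expertise rather than by a script like this one.

```python
# Excerpt of the combined list from Table 2 (not the full list).
combined = {"et", "in", "per", "dominus", "rex", "bellum", "od", "ed", "nummus"}

# Words removed in section 2.1.2, grouped by the reason given in the text.
late_antique_bias = {"dominus", "rex"}  # overrepresented in ecclesiastical Latin
genre_specific = {"bellum"}             # overrepresented in historiography
not_classical = {"od", "ed"}            # abbreviations absent from Classical texts
illogical = {"nummus"}                  # no clear reason for inclusion

# Words added in section 2.1.3: low semantic weight, high frequency.
additions = {"facio", "do", "magnus", "multus", "fero"}

removals = late_antique_bias | genre_specific | not_classical | illogical
targeted = (combined - removals) | additions

print(sorted(targeted))
```

Keeping each removal category in its own named set records why a word was dropped, which makes it easier to revisit individual decisions when adapting the list to another era or genre.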

2.1.4 Creation of Python script for easy implementation of specialized list

To make it easy for researchers to use our targeted stop lists, we provide a command-line Python script (latin_stop_list.py) that can be integrated into an existing research pipeline.
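For readers who prefer to load one of the stop list files directly, a minimal sketch follows. It does not reproduce latin_stop_list.py, whose interface is not described here. The function names read_stop_words and remove_stop_words are hypothetical, and the one-word-per-line layout of the .txt file is an assumption; an in-memory stand-in is used in place of a real file.

```python
import io


def read_stop_words(lines):
    """Build a set of stop words from an iterable of lines (e.g. an open file)."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Keep only tokens that are not in the stop list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


# Stand-in for an opened .txt file; the one-word-per-line layout is an assumption.
fake_file = io.StringIO("et\nin\nper\n\n")
stop_words = read_stop_words(fake_file)

tokens = ["Odi", "et", "amo"]
filtered = remove_stop_words(tokens, stop_words)
print(filtered)
```

With a real file, the stand-in would be replaced by an ordinary open(...) call on the downloaded stop list.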

2.2 A Note on Latin Lemmatization

For certain languages, including Latin, researchers may want to consider removing stop words prior to lemmatization. This allows more control over words that have been lexicalized⁴ in specific usages, without eliminating other, more semantically significant forms. For example, the rhetorical adverb vero (‘indeed’) derives from the adjective verus (‘true’) and is lemmatized under it, even though this form has been lexicalized as an adverb distinct from the other forms of verus. Removing vero after lemmatization would therefore also remove all forms of verus. Removing stop words prior to lemmatization may improve results for texts that frequently employ lexicalized versions of common words in specific uses. For this reason, we provide an unlemmatized stop list with all enumerated forms.
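The difference between the two orderings can be shown with a toy example. The lemma table below is a hypothetical stand-in for a real lemmatizer and contains only a few entries; its mapping of vero to verus follows the behaviour described above.

```python
# Toy lemma table (an assumption for illustration; a real lemmatizer is far larger).
LEMMAS = {"vero": "verus", "vera": "verus", "verum": "verus", "amores": "amor"}


def lemmatize(tokens):
    """Replace each token with its lemma, leaving unknown tokens unchanged."""
    return [LEMMAS.get(token, token) for token in tokens]


def drop(tokens, stop_words):
    """Remove every token that appears in the stop list."""
    return [token for token in tokens if token not in stop_words]


tokens = ["vero", "vera", "amores"]

# Option A: remove the surface form vero first, then lemmatize what remains.
before = lemmatize(drop(tokens, {"vero"}))

# Option B: lemmatize first; vero is now indistinguishable from other forms of
# verus, so the only way to remove it is to stop the whole lemma.
after = drop(lemmatize(tokens), {"verus"})

print("stop words removed before lemmatization:", before)
print("stop words removed after lemmatization:", after)
```

Under option A the adjective form vera survives as the lemma verus, whereas under option B it is lost together with the adverb.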

(3) Dataset Description

Repository names

JOHD Dataverse & GitHub

Object names

FILE NAME	FORMATS	DESCRIPTION
classical_latin_poetry_stop_words_unlemmatized	.txt,* .csv	Unlemmatized stop list for classical Latin poetry
classical_latin_poetry_stop_words_lemmatized	.txt,* .csv	Lemmatized stop list for classical Latin poetry
latin_stop_list	.py	Python script for importing stop lists

*.txt files are available only in the GitHub repository; .txt files are included for the convenience of researchers using console applications in their research pipeline.

Format names and versions

data: .txt, .csv; Python script: .py (3.12)

Creation dates

2020.09.01 – 2024.09.15

Dataset creators

Rachel E. Dubit (Stanford University): Linguistic expertise; Annie K. Lamar (UCSB): Computational expertise.

Languages

Latin; English

License

MIT

Publication date

2024.09.27

(4) Reuse Potential

Our open-access stop list can serve as a starting point for other eras or genres of Latin literature. Only slight adjustments should be needed for closely related contexts, whereas those working with datasets from significantly earlier or later periods may wish to make larger changes. More broadly, the transdisciplinary and collaborative process by which these stop lists were created is of significant benefit to low-resource computational linguistics research teams. Such collaboration should begin at data pre-processing, rather than being confined to validation or analysis.

Notes

[1] Note that the CLTK has since adopted the stop list from the PDL.

[2] All word frequencies referenced in this paper are taken from Logeion. The frequencies from Logeion were collated at The University of Chicago through a combination of manual and automated curation. Frequency rankings are based on data from major Latin dictionaries and lexica. For more, see logeion.uchicago.edu/about.

[3] Note that some existing lists include fio (‘happen’ or ‘become’), a verb derived from facio, but not facio itself (despite its frequency rank of 22).

[4] Lexicalization is the process of adding words to a language’s lexicon, including the formalization of specific word-form usages or idioms.

Acknowledgements

We wish to acknowledge the contributions of Quinn Dombrowski (Stanford University) and Professor Hans Bork (Department of Classics, Stanford University).

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Rachel E. Dubit: Conceptualization; Formal analysis; Investigation; Project administration; Validation; Writing – original draft; Writing – review & editing.

Annie K. Lamar: Conceptualization; Data curation; Investigation; Methodology; Software; Visualization; Writing – original draft; Writing – review & editing.