Era- and Genre-Specific Stop Word Lists for Low-Resource Computational Research: A Classical Latin Exemplum

Open Access | Nov 2024

(1) Overview

Repository location

JOHD Dataverse: https://doi.org/10.7910/DVN/N1ZYWS

GitHub: https://github.com/LOREL-Lab/latin_stop_words

Context

Over the last several years of research in the field of low-resource computational linguistics, one particular method for improving dataset quality has received limited scholarly attention: the implementation of era- and/or genre-specific stop word lists (from here on: stop lists). Stop lists available in toolkits for low-resource languages tend to be designed for a language's dataset as a whole rather than targeted to specific texts, authors, or time periods. Many such language datasets, however, include texts from a wide range of genres and dates.

Original literary output in Latin spanned several hundred years, during which significant dialectal and linguistic changes occurred. Additionally, individual genres tend to exhibit prescribed conventions in both form and content, which drive differences in word choice and usage. Applying the same stop list to, e.g., Augustan love poetry and late ecclesiastical Latin prose implies a non-existent continuity between the two (Table 1).

Table 1

Latin literature, although often used as a single corpus in computational research, in fact has a wide generic range and multi-millennial time span (see Conte 1994).

| TIME PERIOD | REPUBLICAN LATIN | AUGUSTAN LATIN | IMPERIAL LATIN | LATE ANTIQUE & CHRISTIAN LATIN |
| --- | --- | --- | --- | --- |
| ‘Classical Latin’ | | | | |
| Approximate Start | 2nd C. BCE | mid-1st C. BCE | 1st C. CE | 2nd C. CE |
| Approximate End | mid-1st C. BCE | early 1st C. CE | 3rd C. CE | |
| Example Authors & Works by Genre | | | | |
| Epic poetry | Lucretius | Vergil; Ovid | Statius; Lucan | Dracontius |
| Theater | Plautus; Terence | Pantomime | Seneca | Hrotsvitha |
| Small-format poetry (e.g., elegy) | Catullus | Horace; Propertius; Tibullus; Ovid | Statius; Martial | Anthologia Latina |
| Historiography & Military History | Caesar; Sallust | Livy | Tacitus | Ammianus Marcellinus |
| Other prose | Cicero; Varro; Atticus | Flaccus; Vitruvius; Pompey Trogue | Petronius; Pliny; Quintilian; Suetonius; Apuleius | Nonius; Jerome; Tertullian; Other apologists; Acta Martyrum and Passiones |

In this data paper, we argue that computational researchers—particularly those working in low-resource contexts—should consult with linguistic specialists to create targeted stop lists developed with specific eras, genres, authors, or contexts in mind. We offer an exemplum of collaborative creation of stop lists targeted at Augustan poetry. Augustan texts—i.e., texts written during the Augustan Era—are composed in a dialect of Latin commonly referred to as ‘Classical Latin’ (see Table 1). Although the dataset we present is itself small, the potential impact of its implementation in computational humanities workflows is not.

(2) Method

2.1 Steps

2.1.1 Consideration & evaluation of existing Latin stop lists

In recent years, most Latin stop lists have been based on word frequency. By creating stop lists based exclusively on the frequency of words within an entire corpus of extant (digitized) literature, scholars may inadvertently remove words that are less common and/or more significant in certain eras, thereby erasing essential content from their analyses.

This is particularly true in the context of Latin literature, given that a significant proportion of extant text comes from a Late Antique or Christian context. Over the past several years, the Classical Language Toolkit (CLTK; Johnson et al., 2021) has offered several stop lists based on different methods of computing word frequency, including overall frequency, entropy, and variance.[1] These stop lists show disproportionate influence from Late Antique linguistic contexts because of the large amount of extant text from these periods. The Perseus Digital Library (PDL; Crane, accessed 2024) and the International Organization for Standardization (ISO; Diaz 2016), on the other hand, offer stop lists that suffer from brevity and inconsistency. A summary of previous lists is presented in Table 2; the script used to create this table is available in the GitHub repository. Our revised stop list is available below in Table 3.

Table 2

Summary of words included in existing Latin stop lists. “Y” indicates inclusion on a list; “N” indicates absence. CLTK-M: stop words by mean frequency; CLTK-V: stop words by variance; CLTK-E: stop words by entropy probability; CLTK-B: stop words by Borda count; ISO: stop words from the International Organization for Standardization; PDL: stop words from the Perseus Digital Library. More details are available in the file previous_stop_lists.py.

| WORDS | CLTK-M | CLTK-V | CLTK-E | CLTK-B | ISO | PDL |
| --- | --- | --- | --- | --- | --- | --- |
| adhic, aliqui, aliquis, an, cur, deinde, es, etsi, fio, haud, idem, infra, interim, is, mox, necque, o, ob, possum, quare, quicumque, quilibet, quisnam, quisquam, quisque, quisquis, quoniam, sive, sui, sum, suus, trans, tum, unus | N | N | N | N | N | Y |
| a, e, erant, re, rebus, rem, tandem, vel | N | N | N | N | Y | N |
| at | N | N | N | N | Y | Y |
| contra, cuius, tantum | N | N | Y | N | N | N |
| magis | N | N | Y | N | N | Y |
| anno, deo, dicitur, dixit, dominus, ed, nummus, rex, totus | N | Y | N | N | N | N |
| super | N | Y | N | N | N | Y |
| bellum, bibit, dig, nouus, od, quaestio, uos | N | Y | N | Y | N | N |
| eorum | Y | N | N | N | N | N |
| cui, omnibus, sua | Y | N | Y | N | N | N |
| apud, igitur | Y | N | Y | N | N | Y |
| res | Y | N | Y | N | Y | N |
| ei, nobis, omnes, potest, quos, sine | Y | N | Y | Y | N | N |
| modo, quis, tam, ubi | Y | N | Y | Y | N | Y |
| dei, deus, secundum | Y | Y | N | Y | N | N |
| ea, eius, eo, esse, esset, eum, fuit, his, id, illa, mihi, nihil, nunc, omnia, quem, quid, quoque, se, sibi, sicut, sit, tibi | Y | Y | Y | Y | N | N |
| ante, ego, enim, ergo, iam, ille, inter, ipse, nam, ne, nisi, nos, post, pro, quia, sub, tu, uel, uero | Y | Y | Y | Y | N | Y |
| erat, haec, hoc, me, qua, quibus, quod, sunt, te | Y | Y | Y | Y | Y | N |
| ab, ac, ad, atque, aut, autem, cum, de, dum, est, et, etiam, ex, hic, in, ita, nec, neque, non, per, quae, quam, qui, quidem, quo, sed, si, sic, tamen, ut | Y | Y | Y | Y | Y | Y |
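
An inclusion summary of the kind shown in Table 2 can be sketched with plain set membership. The block below is only an illustration and is not the authors' previous_stop_lists.py: the function names are hypothetical, and each set holds a handful of words copied from Table 2 rather than a full published list.

```python
# Toy excerpts only: a few words per list, copied from Table 2.
# These are NOT the complete published stop lists.
LISTS = {
    "CLTK-M": {"et", "res", "eorum"},
    "CLTK-V": {"et", "super"},
    "CLTK-E": {"et", "res", "magis"},
    "CLTK-B": {"et"},
    "ISO": {"et", "res", "at"},
    "PDL": {"et", "at", "magis", "super"},
}


def inclusion_matrix(lists):
    """Map each word to a tuple of 'Y'/'N' flags, one per list, in dict order."""
    vocabulary = sorted(set().union(*lists.values()))
    return {
        word: tuple("Y" if word in members else "N" for members in lists.values())
        for word in vocabulary
    }


def group_by_pattern(matrix):
    """Group words that share the same inclusion pattern, as Table 2 does."""
    groups = {}
    for word, flags in matrix.items():
        groups.setdefault("".join(flags), []).append(word)
    return groups


matrix = inclusion_matrix(LISTS)
for pattern, words in sorted(group_by_pattern(matrix).items()):
    print(", ".join(words), pattern)
```

Grouping by pattern makes disagreements between lists easy to see: a word flagged by only one source is a natural candidate for closer review by a linguistic specialist.
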
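
Table 3

Stop list targeted to Classical Latin poetry.

CLASSICAL LATIN POETRY STOP LIST

| a | contra | ex | ita | noster | qua | quod | super |
| ab | cum | facio | magis | nunc | quam | quoque | tam |
| abs | de | fero | magnus | nullus | que | res | tamen |
| ac | deus | habeo | meus | ob | qui | se | tantus |
| ad | dico | hic | modo | omnis | quia | sed | totus |
| aliquis | do | iam | multus | per | quicumque | si | tu |
| alius | dum | idem | nam | post | quid | sic | tuus |
| ante | e | igitur | ne | possum | quidam | sicut | vel |
| apud | ego | ille | nec | praeter | quidem | sine | vero |
| at | enim | in | neque | pro | quis | sub | vester |
| atque | ergo | inter | nihil | prope | quisquam | sui | vos |
| aut | et | intra | nisi | propter | quisque | sum | ubi |
| autem | etiam | ipse | nos | qualis | quisquis | supra | ultra |
| circa | etiamnum | is | non | quantus | quo | suus | ut |
| | | | | | | | uti |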

2.1.2 Removal of stop words with high semantic relevance to target corpus

We begin with the combined list of words from Table 2 and methodically remove and add words to create a targeted Latin stop list for poetry from the Augustan era. The words we remove can be sorted into three categories.

First, we remove words with a disproportionately high frequency in Late Antique and ecclesiastical Latin relative to Classical Latin. In most cases, these are references to or titles for the Christian god (e.g., dominus, deus, or rex). The word dominus is so common in Late Antique texts that it is the 53rd most common word in all Latin literature.[2] In an ecclesiastical context, dominus means ‘lord [god]’. In a Classical Latin context, however, the word dominus (here, ‘master’ or ‘[slave] owner’) is much less frequent and carries more importance as a semantic feature for conceptual representation (as opposed to, e.g., the formulaic uses in ecclesiastical Latin). Removing dominus from the text of a Classical poet could constitute the erasure of slave narratives and essential literary constructs (servitium amoris, or love as slavery, Copley 1947; Lyne 1979). Similar considerations should be made for words like rex and deus.

Second, we remove words that are overrepresented in particular genres. Another inclusion on previous stop lists is the noun bellum (‘war’), which often occurs in historiographic texts like those of Julius Caesar. While bellum is a valid inclusion for stop lists targeted at historiography, as researchers creating a targeted stop list for Augustan poetry, we elected to remove it. Like slavery, war features in a metaphor of love as a military endeavor (militia amoris, Murgatroyd 1975). Another problem with previous stop lists is their inconsistent inclusion of prepositions and pronouns. For example, some stop lists include per (a common preposition meaning ‘through’), but not trans (a preposition meaning ‘across’). In our list, we correct these inconsistencies to include all pronouns and prepositions of similar semantic significance.

Third, we remove words that do not appear in Classical texts. This includes abbreviations such as od (a shortening for oculus dexter, ‘right eye’) and ed (possibly a shortened form of the pronoun idem, ‘the same’, or the verb edo, ‘eat’). Finally, we remove words that are redundant, inconsistent, or illogical inclusions. For example, nummus (‘coin’) has an overall frequency rank of 1560; the reason for its inclusion in any stop list is unclear. Other words (me, sit, omnes) are specific forms that would be lemmatized to their dictionary forms ego (‘I’), sum (‘be’), and omnis (‘all’).
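
The removal steps described above can be sketched as set operations over a candidate list. This is a minimal illustration under stated assumptions, not the authors' procedure: the category labels and the function name `curate` are hypothetical, and the candidate set contains only words named in this section and in Table 2.

```python
# Words named in the text above, keyed by the reason given for removal.
# The category labels are hypothetical names chosen for this sketch.
REMOVALS = {
    "late_antique_skew": {"dominus", "rex"},
    "genre_skew": {"bellum"},
    "not_classical": {"od", "ed"},
    "unmotivated": {"nummus"},
}

# Inflected forms that a lemmatized list replaces with dictionary forms.
LEMMA_OF = {"me": "ego", "sit": "sum", "omnes": "omnis"}


def curate(candidates, removals, lemma_of):
    """Drop flagged words, then collapse inflected forms onto their lemmas."""
    flagged = set().union(*removals.values())
    kept = set()
    for word in candidates:
        if word in flagged:
            continue
        kept.add(lemma_of.get(word, word))
    return kept


# A small candidate set drawn from Table 2 (not the full combined list).
combined = {
    "et", "per", "dominus", "rex", "bellum",
    "od", "ed", "nummus", "me", "sit", "omnes",
}
curated = curate(combined, REMOVALS, LEMMA_OF)
print(sorted(curated))  # ['ego', 'et', 'omnis', 'per', 'sum']
```

Keeping the reason for each removal alongside the word documents the specialist's judgment, so the same decisions can be revisited when the list is adapted to another era or genre.
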

2.1.3 Addition of stop words with low semantic relevance to target corpus

To correct inconsistencies in existing lists, we include all pronouns except nemo (‘no one’). We did not include nemo as a stop word because its usage in the nominative (a grammatical case in Latin) tends to be more pointed than that of other pronouns, which carry less semantic weight and are roughly equivalent to one another in significance. In other cases, the generalized use reverts to forms of nullus (‘no, none’), which is included on the list. We also add words with low semantic weight and high frequency in Classical Latin, such as facio (‘make/do’), do (‘give’), magnus (‘big/great’), multus (‘much’), and fero (‘carry’).[3]

2.1.4 Creation of Python script for easy implementation of specialized list

To make it easy for researchers to use our targeted stop lists, we provide a command-line Python script (latin_stop_list.py) that can be integrated into an existing research pipeline.
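
For readers who want to see what applying a stop list looks like inside a pipeline, the following is a minimal sketch. It does not reproduce latin_stop_list.py, whose interface is not described here. It assumes a plain-text file of whitespace-separated words, and the function names and the demo file are hypothetical.

```python
import tempfile
from pathlib import Path


def load_stop_list(path):
    """Read a stop list, ASSUMING a text file of whitespace-separated words."""
    text = Path(path).read_text(encoding="utf-8")
    return {word.lower() for word in text.split()}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


# Demo: a temporary file holding two words from Table 3 stands in for
# the published .txt files, whose exact layout is an assumption here.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_stop_words.txt"
    demo_path.write_text("et\nomnis\n", encoding="utf-8")
    stop_words = load_stop_list(demo_path)

# Ovid, Amores 1.9.1, already tokenized with punctuation stripped.
tokens = ["Militat", "omnis", "amans", "et", "habet", "sua", "castra", "Cupido"]
print(remove_stop_words(tokens, stop_words))
```

Loading the list into a set keeps each membership check fast, which matters when filtering a full corpus token by token.
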

2.2 A Note on Latin Lemmatization

For certain languages, including Latin, researchers may want to consider removing stop words prior to lemmatization. This allows for more control over words that have been lexicalized[4] in specific usages without eliminating other, more semantically significant forms. For example, the rhetorical adverb vero (‘indeed’) comes from the adjective verus (‘true’) but is lemmatized under the adjective verus. The form vero of the adjective verus has been lexicalized as an adverb distinct from other forms of verus. Removing vero after lemmatization would also remove all forms of verus. Removing stop words prior to lemmatization may improve the results for texts that more frequently employ lexicalized versions of common words in specific uses. For this reason, we provide an unlemmatized stop list with all enumerated forms.
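
The difference between the two orderings can be sketched with the vero/verus example. The lemmatizer below is a hypothetical toy lookup table, not a real Latin lemmatizer, and the function names are illustrative only.

```python
# Hypothetical toy lemmatizer: a lookup table for a few surface forms.
TOY_LEMMAS = {
    "vero": "verus",
    "veri": "verus",
    "verum": "verus",
    "amoris": "amor",
}


def lemmatize(tokens):
    """Replace each token with its lemma, leaving unknown tokens unchanged."""
    return [TOY_LEMMAS.get(token, token) for token in tokens]


def stop_then_lemmatize(tokens, stop_forms):
    """Remove stop words by surface form first, then lemmatize the rest."""
    return lemmatize([token for token in tokens if token not in stop_forms])


def lemmatize_then_stop(tokens, stop_lemmas):
    """Lemmatize first, then remove every token whose lemma is a stop word."""
    return [lemma for lemma in lemmatize(tokens) if lemma not in stop_lemmas]


tokens = ["vero", "veri", "amoris"]

# Removing the form 'vero' before lemmatization keeps the other forms of verus.
print(stop_then_lemmatize(tokens, {"vero"}))   # ['verus', 'amor']

# After lemmatization, vero is only reachable through the lemma verus,
# so removing it also discards every other form of the adjective.
print(lemmatize_then_stop(tokens, {"verus"}))  # ['amor']
```
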

(3) Dataset Description

Repository names

JOHD Dataverse & GitHub

Object names

| FILE NAME | FORMATS | DESCRIPTION |
| --- | --- | --- |
| classical_latin_poetry_stop_words_unlemmatized | .txt*, .csv | Unlemmatized stop list for classical Latin poetry |
| classical_latin_poetry_stop_words_lemmatized | .txt*, .csv | Lemmatized stop list for classical Latin poetry |
| latin_stop_list | .py | Python script for importing stop lists |

* .txt files are available only in the GitHub repository; they are included for the convenience of researchers using console applications in their research pipeline.

Format names and versions

data: .txt, .csv; Python script: .py (3.12)

Creation dates

2020.09.01 – 2024.09.15

Dataset creators

Rachel E. Dubit (Stanford University): Linguistic expertise; Annie K. Lamar (UCSB): Computational expertise.

Languages

Latin; English

License

MIT

Publication date

2024.09.27

(4) Reuse Potential

Our open-access stop list can serve as a starting point for other eras or genres of Latin literature. Only slight adjustments should be needed for closely related contexts, whereas those working with datasets from significantly earlier or later periods may wish to make larger changes. More broadly, the transdisciplinary and collaborative process by which these stop lists were created is of significant benefit to low-resource computational linguistics research teams. Such collaboration should begin at data pre-processing, rather than being confined to validation or analysis.

Notes

[1] Note that the CLTK has since adopted the stop list from the PDL.

[2] All word frequencies referenced in this paper are taken from Logeion. The frequencies from Logeion were collated at The University of Chicago through a combination of manual and automated curation. Frequency rankings are based on data from major Latin dictionaries and lexica. For more, see logeion.uchicago.edu/about.

[3] Note that some existing lists include fio (‘happen’ or ‘become’), a verb derived from facio, but not facio itself (despite its frequency rank of 22).

[4] Lexicalization is the process of adding words to a language’s lexicon, including the formalization of specific word-form usages or idioms.

Acknowledgements

We wish to acknowledge the contributions of Quinn Dombrowski (Stanford University) and Professor Hans Bork (Department of Classics, Stanford University).

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Rachel E. Dubit: Conceptualization; Formal analysis; Investigation; Project administration; Validation; Writing – original draft; Writing – review & editing.

Annie K. Lamar: Conceptualization; Data curation; Investigation; Methodology; Software; Visualization; Writing – original draft; Writing – review & editing.

DOI: https://doi.org/10.5334/johd.246 | Journal eISSN: 2059-481X
Language: English
Submitted on: Sep 28, 2024
Accepted on: Nov 12, 2024
Published on: Nov 25, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Rachel E. Dubit, Annie K. Lamar, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.