A Corpus of Portuguese Historical and Metalinguistic Grammars (1536–1864)

Saulo Rogério Pacheco Rocha

doi:10.5334/johd.542

(1) Overview

Repository location

Context

The growth of computational historical linguistics has revealed a significant gap in digital infrastructure: while descriptive genres like correspondence are increasingly available, the prescriptive and metalinguistic tradition remains largely inaccessible in machine-readable formats. This scarcity is driven by technical bottlenecks, as standard Optical Character Recognition (OCR) systems frequently fail when confronted with the typographical instability, archaic ligatures, and idiosyncratic abbreviations characteristic of 16th- to 19th-century prints.

Consequently, the underrepresentation of historical grammars creates a theoretical blind spot, particularly in computational diachronic syntax. These prescriptive works provide a promising field for observing grammar competition¹ (Kroch, 1989) and the implementation of linguistic change within a highly monitored and valued textual genre. Beyond syntactic theory, the motivation for this research necessitates an engagement with the profound historiographical and ethical dimensions of these texts. Rather than existing as culturally objective ‘linguistic rulebooks’, these grammars functioned as active instruments of Portuguese imperialism. They document the systematic codification and advocate for the enforcement of the language that facilitated colonial expansion and shaped national identities in Portugal, Brazil, Guinea, and other nations under Portuguese colonial occupation across the chronological scope of this corpus.

By transforming these historically loaded texts into machine-readable formats, this project also addresses a severe data shortage, enabling researchers to train, fine-tune, and benchmark computational models specifically tailored to early modern Portuguese. To address these intersecting theoretical, historical, and technical gaps, we present the Corpus of Portuguese Historical and Metalinguistic Grammars (1536–1864), a highly curated dataset originally developed for a master’s thesis at the Federal University of Santa Catarina (UFSC) in 2025 (Pacheco Rocha, 2025).

(2) Method

Steps

The creation of the dataset involved image acquisition, automated text recognition (ATR), and database structuring. Digital facsimiles of first editions were sourced from the Biblioteca Nacional de Portugal (National Library of Portugal, BNP).² To ensure computational reliability and full control over encoding, direct transcriptions were produced via Transkribus³ (Nockels et al., 2022; Terras et al., 2025), bypassing existing mass-made digital editions. Because the pre-trained models available at the time conflated early Portuguese with Spanish patterns, a custom ATR model, Early Portuguese Printing (EPP),⁴ was developed in Transkribus using the PyLaia engine (Puigcerver, 2018; Terras et al., 2025). Its architecture integrates a 4-layer Convolutional Neural Network (CNN) for visual extraction with a 3-layer Long Short-Term Memory (LSTM) Recurrent Neural Network (256 units) for sequence analysis.

To optimize learning, an “anti-curricular” training strategy was employed, prioritizing the most typographically unstable 16th-century works (Barros, 1540; Oliveira, 1536) to build a resilient recognition base. This ground truth, supplemented by five additional historical works, comprised 142,606 words (23,045 lines). As described in the Training Parameters section on Transkribus,⁵ the model’s training was configured for a maximum of 250 epochs with an early stopping patience of 20 epochs. The model converged efficiently, halting after 43 epochs to prevent overfitting. It achieved a final Character Error Rate (CER) of 1.91% on the training set and 2.58% on the validation set. The final transcriptions and metadata were structured into a relational SQLite database (portuguese_grammars.db) to facilitate advanced computational querying.
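For readers unfamiliar with the metric, the CER reported above is conventionally the character-level edit distance between the reference transcription and the recognized text, divided by the length of the reference. The sketch below illustrates that convention only; the function names and the sample line are invented for illustration, and the way Transkribus computes its reported figures may differ in details such as Unicode normalization or whitespace handling.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character edits turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Invented example line (not taken from the corpus): one 'u' misread as 'n'.
reference = "a lingoagem portuguesa"
hypothesis = "a lingoagem portugnesa"

print(f"{cer(reference, hypothesis):.2%}")  # 4.55%
```

Because the comparison runs over code points, a base letter followed by a combining tilde counts as two characters, so both strings should be in the same Unicode normalization form before they are compared.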

Sampling strategy

The corpus is deliberately bounded to six foundational metalinguistic works published in Portugal between the 16th and 19th centuries, capturing four centuries of intense typographic and diachronic variation. The selected works are detailed in Table 1.

Table 1

The Six Portuguese Grammars included in the database.

BIRTH⁶	AUTHOR	TITLE	PUB.	WORDS
1496	João de Barros	Grammatica da Lingua Portuguesa	1540	23,873
1507	Fernão de Oliveira	Grammatica da Lingoagem Portuguesa	1536	22,974
1530	Duarte Nunes de Leão	Origem da Lingoa Portuguesa	1606	21,263
1671	Luís Caetano de Lima	Orthographia da Lingua Portugueza	1736	26,997
1736	Bernardo de Lima e Mélo Bacellar	Grammatica Philosophica e Orthographia Racional da Lingua Portugueza	1783	25,219
1823	Francisco Júlio Caldas Aulete	Grammatica Nacional	1864	21,353

These specific texts were selected because they represent critical intersections between linguistic prescription and the broader sociopolitical evolution of the Portuguese nation, as described by Pacheco Rocha (2025). Rather than isolated academic milestones, these works are cultural artifacts that reflect distinct ideological eras in the historiography of the language.

The corpus spans the efforts of Renaissance pioneers (Barros, 1540; Oliveira, 1536) to elevate and standardize the vernacular Portuguese language during the early establishment of the Portuguese maritime empire across Africa, Asia, and the Americas; the growing etymological and historical consciousness of the 17th century, a period marked by the Iberian Union of Spain and Portugal (Leão, 1606); the new rationalist and philosophical approaches characteristic of the 18th-century Enlightenment (Bacellar, 1783; Lima, 1736); and finally, the 19th-century institutionalization of language as a foundational pillar of modern national identity within the newly secular and public education systems of both Portugal and Brazil, where this specific grammar, the Grammatica Nacional, was highly influential (Aulete, 1864). This contiguous, diachronic sampling provides the necessary framework for observing long-term shifts in normative syntax, allowing researchers to investigate the complex tension between prescribed norms and actual usage, while acknowledging that the historical boundary between prescription and description is often porous, as metalinguistic texts can encode traces of actual linguistic practice.⁷

An example of the original facsimile of Oliveira’s Grammatica da Lingoagem Portuguesa (1536) alongside its corresponding transcription is provided in Figure 1:

Figure 1

Transkribus interface showing a 1536 Oliveira grammar facsimile page and transcription.

Quality control

To bridge diplomatic fidelity⁸ with computational standardization, a set of normalization protocols was established; key examples include:

Orthography and Brevigraphs: The functional Latin distinction between ‘u’ and ‘v’ was preserved. Highly localized symbols exclusive to Oliveira (1536) were handled pragmatically to optimize the training for the whole corpus: the Tironian notation et (⁊, U+204A) was transcribed as an ampersand (&), while the abbreviation symbol (ꝰ) was intentionally ignored. Similarly, rare logographic abbreviations with low diachronic recurrence, such as the r-rotunda (ꝛ) or specific p-prepositions (ꝑ, ꝓ), were transcribed as the simple letters ‘r’ and ‘p’, respectively. These normalization decisions regarding rare and idiosyncratic symbols were intended to prevent the model from misidentifying low-clarity signals and image noise, and to reduce the overall character set complexity.
Digits and Abbreviations: Original numeric forms were maintained (e.g., Roman numerals like xviij). Abbreviation marks, which in the originals sometimes appear as a superscript horizontal bar, were unified as a combining tilde (~) over both vowels and consonants (e.g., Ntõ, nũero, q̃).
Graphemes and Diacritics: Standard diacritics (’, ‘, ^, ~) were kept in original positions. The e-caudata (ę, U+0119) was explicitly preserved to prevent the model from learning to ignore subscripted marks, ensuring the accuracy of standard cedillas (ç), given the frequent typographic ambiguity between lowercase ‘c’ and ‘e’ in early modern fonts. Cedillas were especially important to this model because models trained on multi-language datasets tend to mistranscribe them, as the character was/is used in more highly represented languages, such as Spanish or French.
Ligatures and the Eszett: Most typographic ligatures (e.g., ﬆ, ﬅ) were expanded to align with Unicode standards and to reduce the character set. The eszett (ß), in this corpus a ligature of long-s (ſ) and short-s (s), was maintained as a single character due to its high graphical saliency and frequency.
Non-Latin and Ornamental Characters: Greek characters (e.g., Θ, Φ) were preserved, though visually analogous letters (Chi X, Zeta Z) were mapped to Latin equivalents, again, to reduce the character set. The Latin alpha (ɑ, U+0251) was explicitly preserved as its normalization would render Oliveira’s (1536) orthographic arguments incomprehensible. Structural symbols like the pilcrow (¶) and fleurons (❧, ☙, ❦) were maintained to preserve editorial architecture, although the capitulum marker (⸿) was united with the pilcrow’s transcription.
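The substitutions listed above can be summarized as a character map. The sketch below is an illustrative reconstruction and not the author's transcription tooling: the names NORMALIZATION_MAP, PRESERVED, and normalize are hypothetical, the code points for symbols the text does not number are supplied here, and the long-s/t ligature (ﬅ) is left out because the text does not state whether its expansion keeps the long-s.

```python
# Hypothetical map of the substitutions described in the protocol above.
NORMALIZATION_MAP = {
    "\u204A": "&",       # Tironian et (⁊) transcribed as an ampersand
    "\uA770": "",        # abbreviation sign (ꝰ) intentionally ignored
    "\uA75B": "r",       # r-rotunda (ꝛ) transcribed as plain r
    "\uA751": "p",       # p with stroke through descender (ꝑ) as plain p
    "\uA753": "p",       # p with flourish (ꝓ) as plain p
    "\uFB06": "st",      # st ligature (ﬆ) expanded
    "\u03A7": "X",       # Greek capital Chi mapped to Latin X
    "\u0396": "Z",       # Greek capital Zeta mapped to Latin Z
    "\u2E3F": "\u00B6",  # capitulum (⸿) united with the pilcrow (¶)
}

# Characters the protocol explicitly keeps: e-caudata, Latin alpha, eszett,
# Greek Theta and Phi, and the pilcrow.
PRESERVED = "\u0119\u0251\u00DF\u0398\u03A6\u00B6"

_TABLE = str.maketrans(NORMALIZATION_MAP)


def normalize(line: str) -> str:
    """Apply the substitutions; every other character passes through unchanged."""
    return line.translate(_TABLE)


print(normalize("\u204A"))                # &
print(normalize(PRESERVED) == PRESERVED)  # True
```

A table of this kind also makes the information loss easy to audit: each key is a distinction that can no longer be recovered from the transcription and has to be checked against the facsimile.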

While the model achieved an overall validation CER of 2.58%, error rates were naturally higher in the highly unstable 16th-century prints (Oliveira and Barros). To mitigate this, the automated outputs of these works underwent manual verification against the facsimiles and the established critical editions by Machado (Barros, 1957) and Buescu (Oliveira, 1975). Although a multi-annotator workflow with inter-annotator agreement metrics is standard for large-scale projects, the single-annotator curation of this corpus ensured absolute internal consistency in applying complex palaeographic normalization rules across the full chronological span of the corpus. Figure 2 provides an example of the original facsimile of Barros’ Grammatica da Lingua Portuguesa (1540), alongside its corresponding transcription:

Figure 2

Transkribus interface showing a 1540 Barros grammar facsimile page and transcription.

(3) Dataset Description

While the raw .txt files are provided in the repository for manual inspection and simple scripting, the dataset is also distributed as a relational SQLite database (portuguese_grammars.db) to facilitate seamless integration into computational pipelines. The database architecture relies on a primary grammars table. This structure binds the full transcribed text (text_content) to its historiographical metadata (author_name, author_birth_year, title, publication_year). By providing the data in this structured format, researchers can ingest the entire corpus and its associated variables directly into data analysis environments (such as Python’s pandas or R) without the need to parse filenames, write custom directory loops, or merge external metadata tables. The internal structure of the provided SQLite file is described in Table 2.

Table 2

portuguese_grammars.db Internal Structure.

TABLE	COLUMN	TYPE
grammars	id (PK)	INTEGER
grammars	author_name	TEXT
grammars	author_birth_year	INTEGER
grammars	title	TEXT
grammars	publication_year	INTEGER
corpus_info	Key (PK)	TEXT
corpus_info	Value	TEXT

The corpus_info table contains metadata, licence and readme.
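As a minimal sketch of the ingestion route described above, the following uses Python's standard sqlite3 module to read the grammars table in birth-year order. To stay runnable without the download, it builds an in-memory stand-in with the documented column names; the TEXT type given to text_content, the placeholder text values, and the function name load_grammars are assumptions made for illustration.

```python
import sqlite3

# Stand-in schema mirroring the documented grammars table.
# Assumption: text_content is stored as TEXT.
SCHEMA = """
CREATE TABLE grammars (
    id INTEGER PRIMARY KEY,
    author_name TEXT,
    author_birth_year INTEGER,
    title TEXT,
    publication_year INTEGER,
    text_content TEXT
)
"""


def load_grammars(conn: sqlite3.Connection) -> list[dict]:
    """Return one dict per grammar, ordered by the author's year of birth."""
    conn.row_factory = sqlite3.Row
    query = (
        "SELECT author_name, author_birth_year, title, "
        "publication_year, text_content "
        "FROM grammars ORDER BY author_birth_year"
    )
    return [dict(row) for row in conn.execute(query)]


# In-memory demo using metadata from Table 1 and placeholder text.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO grammars (author_name, author_birth_year, title, "
    "publication_year, text_content) VALUES (?, ?, ?, ?, ?)",
    [
        ("Fernão de Oliveira", 1507, "Grammatica da Lingoagem Portuguesa",
         1536, "placeholder text"),
        ("João de Barros", 1496, "Grammatica da Lingua Portuguesa",
         1540, "placeholder text"),
    ],
)

rows = load_grammars(conn)
for row in rows:
    print(row["author_birth_year"], row["author_name"], row["publication_year"])
```

With the real file, replacing the in-memory connection by sqlite3.connect("portuguese_grammars.db") should be the only change needed. SQLite treats column names case-insensitively, so the query works whether the column is stored as title or Title.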

Repository name

Zenodo.

Object name

The dataset is distributed as portuguese_grammars_corpus.zip containing:

portuguese_grammars.db
/facsimiles (containing 6 PDF files with facsimiles and embedded text)
/raw_txt (containing 6 plain text exports)

Format names and versions

SQLite 3 (.db), Portable Document Format (.pdf), Plain Text (.txt, UTF-8).

Creation dates

2023-06-17 to 2025-03-30

Dataset creators

Saulo Rogério Pacheco Rocha (Federal University of Santa Catarina, UFSC): Conceptualization, Data curation, Investigation, Methodology, Validation.

Language

Portuguese (16th–19th centuries, primary text data); English (metadata, database schema, and documentation).

License

Creative Commons Zero (CC0 1.0 Universal).

Publication date

2026-01-23

(4) Reuse Potential

This dataset provides a foundational resource for quantitative research in diachronic linguistics, the historiography of the Portuguese language, and broader cultural history. Researchers can leverage the structured relational database to track long-term morphological and syntactic shifts, such as changing patterns of clitic pronoun placement. By contrasting this prescriptive corpus with existing descriptive datasets (e.g., historical letters or theatrical plays), scholars can perform comparative analyses to determine how normative rules interact with actual, unmonitored language use.
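As a rough sketch of how such tracking might begin, the function below computes the relative frequency of a surface pattern per 1,000 whitespace-separated tokens for each text, ordered by author birth year. Everything here is illustrative: the function names, the toy rows, and the pattern are invented, whitespace tokenization is an assumption, and a regular expression cannot by itself identify syntactic configurations such as clitic placement, which require annotation or manual classification.

```python
import re


def per_thousand(text: str, pattern: str) -> float:
    """Matches of pattern per 1,000 whitespace-separated tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return 1000 * hits / len(tokens)


def frequency_by_birth_year(rows, pattern):
    """Return (birth year, author, rate) tuples in birth-year order."""
    ordered = sorted(rows, key=lambda row: row["author_birth_year"])
    return [
        (row["author_birth_year"], row["author_name"],
         per_thousand(row["text_content"], pattern))
        for row in ordered
    ]


# Toy rows with invented content; real rows would come from the grammars table.
toy_rows = [
    {"author_name": "Author B", "author_birth_year": 1700,
     "text_content": "que a b c"},
    {"author_name": "Author A", "author_birth_year": 1500,
     "text_content": "a b que c d que e f g h"},
]

PATTERN = r"\bque\b"  # hypothetical surface pattern, for illustration only

table = frequency_by_birth_year(toy_rows, PATTERN)
for year, author, rate in table:
    print(year, author, rate)
```

Normalizing by token count matters because the six texts differ in length, and sorting by birth year follows the ordering convention adopted in Table 1.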

To illustrate the corpus’s analytical potential, preliminary findings from the Pacheco Rocha (2025) study on clitic placement demonstrate that metalinguistic prose does not operate in an artificial vacuum. This analysis paired the dataset with literary samples from the Tycho Brahe Platform (Veronesi & Galves, 2024), specifically benchmarking the results against the diachronic trends established by Galves, Britto, and Paixão de Sousa (2005). The findings track a shift from the proclitic base of Classical Portuguese, represented in works by Barros (1540), Oliveira (1536), and Leão (1606), to the categorical enclisis of Modern European Portuguese, crystallized in authors such as Bacellar (1783) and Aulete (1864). These results indicate that historical grammars functioned as mechanisms for validating the internalized syntax of the contemporary elite, while also playing a role in the historiography of coloniality by exporting European syntactic parameters to 19th-century Brazilian education. These initial results suggest that the dataset is a promising tool for correlating long-term structural syntactic shifts with the broader sociolinguistic history of the Portuguese language.

Beyond traditional historical linguistics, the dataset holds significant value for the Digital Humanities and historical Natural Language Processing (NLP). The high-quality, machine-readable texts can be aggregated to fine-tune or benchmark large language models (LLMs), as well as to train part-of-speech taggers and lemmatizers specifically tailored to early modern Portuguese, addressing a critical resource gap for this historical period. Furthermore, the inclusion of digital facsimiles with embedded text layers makes the dataset an excellent pedagogical tool for teaching digital palaeography and historical typography. Crucially, users must also account for the ethical and historiographical context of these documents. As previously noted, these grammars are not neutral artifacts; they are deeply entangled with the history of Portuguese imperialism, representing the deliberate standardization and imposition of the language during colonial expansion and the formation of national identities in Portugal, Guinea-Bissau, São Tomé and Príncipe, Cape Verde, Brazil, Angola, Mozambique, and East Timor.

However, potential users should be aware of certain limitations and barriers to reuse. First, despite rigorous palaeographic normalization and a low Character Error Rate (2.58%), residual ATR errors may persist. While the single-annotator curation ensured absolute internal consistency, the absence of multi-annotator cross-validation means that researchers conducting highly sensitive token-level queries may need to consult the provided facsimiles for manual verification. Furthermore, certain pragmatic transcription choices, such as mapping specific brevigraphs (e.g., ꝑ and ꝓ to ‘p’), ignoring the abbreviation sign (ꝰ) in Oliveira’s (1536) grammar, and converting ‘visually analogous’ Greek letters (e.g., Chi ‘X’, Zeta ‘Z’) to Latin equivalents (X, Z), result in a degree of information loss. While these normalizations were useful to optimize ATR performance and reduce character set complexity, users aiming to utilize the dataset to generate strictly semi-diplomatic editions must be aware of these constraints. Second, because the corpus exclusively comprises metalinguistic and prescriptive texts, linguistic findings derived from this data must be contextualized appropriately: they represent the idealized, normative language of the educated elite rather than the spoken vernacular of the time. Finally, while the six selected works are highly representative milestones, the relatively small sample size (141,679 words) necessitates caution against overgeneralization. Future collaborative efforts could expand upon this database by integrating additional historical grammars to broaden its chronological and geographical scope.

AI Declaration

LLM tools were used to assist with language polishing and code refactoring under the author’s direction. The author reviewed all outputs and takes full responsibility for the final content. No Generative AI tools were used to generate or manipulate the research data, measurements, or results.

Notes

[1] Grammar competition refers to the diachronic process where two conflicting sets of syntactic rules coexist within a single linguistic system during a period of language change. Diachronic syntactic studies in this framework typically require large quantities of linguistic data spanning several centuries to ensure statistical significance.

[2] Available at bndigital.bnportugal.gov.pt/ (last accessed: 11 May 2026).

[3] Transkribus is a platform for text and structure recognition of historical documents, available at transkribus.org/ (last accessed 11 May 2025).

[4] Available at transkribus.org/model/early-portuguese-printing (last accessed 11 May 2025).

[5] Available in ‘See training parameters’ at app.transkribus.org/models/text/early-portuguese-printing (last accessed 11 May 2025).

[6] Following standard methodological practices in generative linguistics, the chronological ordering of the texts within the corpus is based on the authors’ years of birth rather than the publication dates of the works. This approach controls for author generation effects, anchoring the linguistic data to the period of the author’s vernacular acquisition rather than the often-delayed date of textual publication.

[7] Evidence of this permeability is empirically accessible within the Corpus of Portuguese Historical and Metalinguistic Grammars (1536–1864); see Pacheco Rocha (2025).

[8] In this context, diplomatic fidelity refers to a strict adherence to the exact orthography, typography, and layout of the original historical document, without modernizing it.

Acknowledgements

The author acknowledges the Biblioteca Nacional de Portugal (BNP) and its technical team for their work in digitizing and openly publishing the historical facsimiles used to build this corpus. The author also extends their gratitude to the journal editor and the reviewers for their thoughtful and highly constructive feedback, which significantly improved the final version of this manuscript.

Author Contributions

Saulo Rogério Pacheco Rocha: Conceptualization, Data curation, Investigation, Methodology, Validation, Writing – original draft.