GRAF – Gendered Reference Analysis in French

Magdalena Lemus-Serrano; Marine Cozzolino; Tessa Vermeir; Mathilde Josserand; Marc Allassonnière-Tang

doi:10.5334/johd.510

1 Overview

Repository location

Open Science Framework (OSF): https://doi.org/10.17605/OSF.IO/BXVYN

Context

Grammatical gender is found in only 20% of the world’s languages (Allassonnière-Tang et al., 2021). Among these, French is one of the languages that uses the masculine/feminine distinction, expressed by the agreement patterns of determiners (my, your, his/her) and adjectives (big, old, brave). Beyond its standard use for masculine referents, the French masculine gender is also used with a generic value to refer to mixed groups of people and, more generally, to referents of unknown gender. Despite this neutral label, studies have shown that “masculine generics” induce a strong male bias in mental representations, even from childhood (Gygax et al., 2019). This bias continues into adulthood and invisibilises women in various social contexts (Chatard et al., 2005; Sczesny et al., 2016). For example, adult speakers imagine men more often than women when reading job offers with masculine pronouns (Gaucher et al., 2011). One response to this asymmetry is gender-neutral language, which has been proposed for Spanish (Papadopoulos, 2022) and Italian (Sulis & Gheno, 2022). Several alternatives to the generic masculine have also been suggested for French (Abbou, 2011; Causse & Barasc, 2014; Coutant et al., 2015; Greco, 2018). One strategy consists in using native epicene words (gender-neutral terms, like “des universitaires” ‘university people’ instead of “des étudiants” ‘students (masculine/generic)’), but since not all nouns, adjectives and determiners have an epicene form, this strategy is of limited applicability by definition. Another strategy, feminization, consists in using two gendered forms, both masculine and feminine, within the same sentence (“des étudiantes et des étudiants” ‘students (feminine) and students (masculine)’ instead of “des étudiants” ‘students (masculine/generic)’) (Sczesny et al., 2016), which inevitably reinforces the masculine/feminine binary in speech (Touraille & Allassonnière-Tang, 2023).

To avoid these issues in a systematic manner, other approaches suggest the creation of new grammatical genders with distinct agreement patterns. Some of these options can only be used in writing, e.g., the capitalization of the feminine ‘e’ letter in “des étudiantEs” ‘students’. Other alternatives compatible with both written and spoken French involve the use of a new letter, such as ‘æ’ (“aimæ” ‘loved (neutral)’ instead of “aimé/aimée” ‘loved (masculine/feminine)’, generally pronounced as [e]), ‘ë’ (“admiréë” ‘admired (neutral)’ instead of “admiré/admirée” ‘admired (masculine/feminine)’, generally pronounced as [e]), or ‘i’ (“uni étudianti” ‘a student (neutral)’ instead of “un étudiant/une étudiante” ‘a student (masculine/feminine)’) (Allassonnière-Tang et al., 2023; Alpheratz, 2019; Borde, 2016).

These proposals stem from different regions and historical periods, and differ vastly in their levels of acceptance in their respective speaker communities, but they raise important issues for an analysis of gender-neutral phenomena in French. One such issue is the potential difficulty for users of learning new agreement patterns, although recent studies suggest that the innovative gender-neutral solutions tested so far are easily learnable (Marsolier et al., 2024). Another main issue concerns the perceived ubiquity of gender agreement in French, which raises the question of whether it is actually feasible for speakers to modify every other word in a sentence. So far, studies have quantified the ratio of generic masculines within noun phrases (Flesch et al., 2026), but they do not provide data on 1) the number of words with grammatical gender (pronouns, articles, adjectives, and participles) that refer to humans in written and spoken French and 2) the ratio of masculine generics in written and spoken French respectively. The current database aims to fill this gap.

2 Method

The source of the written corpus is the Chambers-Rostand Corpus of Journalistic French.¹ From this corpus, we extracted 44 texts published in the newspaper Le Monde in 2002. The selection was controlled to include texts from different authors and to minimize the potential effects of publication-specific guidelines. The selected general themes were national news, sport, economy, culture, and politics. The source of the spoken corpus is the CLAPI corpus,² from which 15 conversations from various registers were extracted. The selected general themes were meal discussions, phone conversations, various types of meetings, and furniture construction. Additional details about the token numbers and their distribution are available in Section 3.1.

Additional preprocessing was conducted on the data. For the spoken data, the names of the participants were removed. Annotations specific to the corpus were removed, e.g., xxxx, xxx, xx, x, pff, hm. Abbreviated versions of words were changed to full words, e.g., “p’t’être” → “peut-être” ‘maybe’, “c’t’après-midi” → “cette après-midi” ‘this afternoon’. Repetitions were also removed, e.g., “de de de” was changed to “de”. For both spoken and written data, special symbols such as square brackets were removed. All letters were transformed to lowercase.
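The cleaning steps listed above can be illustrated with a minimal Python sketch. It is not the authors' script: the function name `preprocess`, the lookup table, and the order of the steps are assumptions made for illustration, and the removal of participant names is left out because it requires the list of names.

```python
import re

# Hypothetical lookup table; only the two expansions quoted in the text are included.
ABBREVIATIONS = {
    "p’t’être": "peut-être",
    "c’t’après-midi": "cette après-midi",
}

# Corpus-specific annotations named in the text: xxxx, xxx, xx, x, pff, hm.
MARKERS = re.compile(r"x{1,4}|pff|hm")


def preprocess(utterance: str) -> str:
    """Apply the cleaning steps to one utterance (assumed order of steps)."""
    # Lowercase everything and drop square brackets.
    text = re.sub(r"[\[\]]", "", utterance.lower())
    cleaned = []
    for word in text.split():
        # Expand abbreviated forms to full words.
        word = ABBREVIATIONS.get(word, word)
        # Skip corpus-specific annotations.
        if MARKERS.fullmatch(word):
            continue
        # Collapse immediate repetitions such as "de de de".
        if cleaned and cleaned[-1] == word:
            continue
        cleaned.append(word)
    return " ".join(cleaned)


print(preprocess("[Hm] de de de p’t’être xxx"))
```

Note that this naive repetition rule would also collapse legitimate sequences such as “nous nous”; how such cases were handled in the database is not specified here.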

Each text was manually annotated to identify words that refer to gendered nouns. Each annotation was checked by four annotators to ensure the reliability of the data. The sentences were then annotated using the UDPipe Universal Dependencies parser (Wijffels, 2025) trained on the Universal Dependencies corpus (Nivre et al., 2019). The result is a table that includes the following information as columns: Document ID, Sentence ID, Token ID, Head ID, Token, Lemma, Part-of-Speech (column name UPOS), Features, Gender, Referent, and Generic. An example is shown in Table 1.

Sentence ID refers to the position of a sentence in a given document. Token ID refers to the position of a word in a given sentence. Head ID refers to the head of a word as annotated by the Universal Dependencies parser. The UPOS column gives the part-of-speech as annotated by the Universal Dependencies parser. The Features column contains additional annotation provided by the parser, which generally includes gender, number, and person, among others.

The Gender column was manually annotated to cross-check the Features column and to show which words refer to gendered entities. The Referent column was manually annotated to show whether a given word refers to a human referent (coded as “Humain” in French). For example, in Table 1 the plural article “les” ‘the’ is annotated as masculine but refers to tribunals, which are neither human nor animal entities. On the other hand, the pronoun “il” ‘he’ is annotated as referring to a human entity. Finally, the Generic column was also manually annotated to show whether the word is a masculine generic reference. For example, the word “ceux” ‘those’ is annotated as TRUE since it is a masculine generic reference.

Table 1

An example of a sentence coded in the database. The following columns are not shown in the table because of space limitations: Source = written, Sentence = ‘il s’est souvent battu devant les tribunaux contre ceux qui l’accusaient d’avoir été un tortionnaire’, Sentence ID = 19, Document ID = 1_M_C_040602_bothtranslated.txt. The abbreviations read as follows: Gen = gender, Num = number, Prs = person, Art = article.

ID	HEAD	TOKEN	LEMMA	UPOS	FEATURES	GENDER	REFERENT	GENERIC
1	6	il	il	PRON	Gen=Masc, Num=Sing, Prs=3, Type=Prs	Masc	Humain
2	1	s	s	X
3	2	’	’	PUNCT
4	6	est	être	AUX	Mood=Ind, Num=Sing, Prs=3, Tense=Pres
5	6	souvent	souvent	ADV
6	0	battu	battre	VERB	Gen=Masc, Num=Sing, Tense=Past	Masc	Humain
7	9	devant	devant	ADP
8	9	les	le	DET	Definite=Def, Gen=Masc, Num=Plur, Type=Art	Masc
9	6	tribunaux	tribunal	NOUN	Gen=Masc, Num=Plur	Masc
10	11	contre	contre	ADP
11	9	ceux	celui	PRON	Gen=Masc, Num=Plur, Type=Dem	Masc	Humain	TRUE
12	15	qui	qui	PRON	PronType=Rel
13	15	l	le	DET	Definite=Def, Gen=Masc, Num=Sing, Type=Art	Masc	Humain
14	13	’	’	PUNCT
15	11	accusaient	accus	VERB	Mood=Ind, Num=Plur, Prs=3, Tense=Imp
16	21	d	de	ADP
17	21	’	’	PUNCT
18	21	avoir	avoir	AUX	VerbForm=Inf
19	21	été	être	AUX	Gen=Masc, Num=Sing, Tense=Past	Masc
20	21	un	un	DET	Definite=Ind, Gen=Masc, Num=Sing, Type=Art	Masc	Humain
21	15	tortionnaire	tortionnaire	NOUN	Gen=Masc, Num=Sing	Masc
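Rows such as those in Table 1 can be filtered on the manually annotated columns. The following Python sketch is only an illustration under stated assumptions: the header names mirror Table 1 and may differ from those in the CSV file, the Features column is omitted for brevity, and blank cells in Table 1 are assumed to be empty strings (the file may encode them differently, for example as NA or FALSE).

```python
import csv
import io

# Three rows copied from Table 1. Header names and empty-cell encoding are
# assumptions about the CSV file, not confirmed by the dataset description.
SAMPLE = """ID,HEAD,TOKEN,LEMMA,UPOS,GENDER,REFERENT,GENERIC
1,6,il,il,PRON,Masc,Humain,
9,6,tribunaux,tribunal,NOUN,Masc,,
11,9,ceux,celui,PRON,Masc,Humain,TRUE
"""


def read_rows(handle):
    """Read all rows of a comma-separated table into dictionaries."""
    return list(csv.DictReader(handle))


def is_human(row):
    """True if the token was annotated as referring to a human."""
    return row["REFERENT"] == "Humain"


def is_generic_masculine(row):
    """True if the token is a human-referential masculine generic."""
    return is_human(row) and row["GENDER"] == "Masc" and row["GENERIC"] == "TRUE"


rows = read_rows(io.StringIO(SAMPLE))
human_tokens = [r["TOKEN"] for r in rows if is_human(r)]
generic_tokens = [r["TOKEN"] for r in rows if is_generic_masculine(r)]
print(human_tokens, generic_tokens)
```

To run the same filters on the full dataset, the `io.StringIO` handle would be replaced by an open file handle on the downloaded CSV, after checking its actual header names.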

Crucially, for the purpose of this study, nouns that refer to human groups in general, such as “enfants” ‘children’ or “personnes” ‘people’, and epicene nouns, such as “fonctionnaire” ‘government officer’ or “artiste” ‘artist’, were not marked as human, as we exclusively focused on nouns that have a masculine/feminine alternation and require a gender-neutral counterpart. There were several other cases for which the manual annotation of human referents was problematic. For example, some proper names such as “la chapelle des célestins” ‘the chapel of the Celestines’ include nouns that refer to human entities. In such cases, we decided not to mark those nouns as human under the Referent column, as they were identified as proper nouns rather than common nouns. Additionally, the term “Dieu” ‘God’ was treated as a proper noun (and thus marked as not-human) when it was in upper case, but it was annotated as referring to human entities when it was in lower case, e.g., “les dieux” ‘the gods’. Some annotations from the Universal Dependencies parser were also manually corrected. For example, the lemma of “jazz” was initially annotated as “jazr” and the UPOS was marked as verb. As another example, in the spoken corpus, some shortened versions of pronouns (e.g., “i” instead of “il” ‘he’) were not identified by the Universal Dependencies parser. Each change in the database was confirmed by the annotators.

3 Dataset description

Repository name

OSF

Object name

corpus_article-CLEAN.csv.

Format names and versions

CSV.

Creation dates

2024-03-01–2025-12-22.

Dataset creators

Magdalena Lemus-Serrano, Marine Cozzolino, Mathilde Josserand, Tessa Vermeir.

Language

Variables are named in English. The corpora are written in French.

License

CC Attribution 4.0.

Publication date

2025-12-26.

3.1 Corpus Size and Vocabulary Size

The corpus contains 101,436 tokens, representing all running words after the preprocessing (described in Section 2). Across these tokens, we observe 7,523 unique lemmas. The overall type–token ratio (TTR) is .074, typical for large corpora (Noreillie, 2019). Because the corpus contains spoken and written material, we also report vocabulary measures by source type (Table 2).

Table 2

Corpus size and diversity ratio (TTR).

SOURCE	TOKENS	UNIQUE LEMMAS	TYPE-TOKEN RATIO (TTR)
Spoken	79113	4727	.0597
Written	22323	4405	.1970

The spoken sub-corpus is about 3.5 times larger than the written sub-corpus but contains only slightly more lemmas (4,727 vs 4,405). As a result, the written sub-corpus shows a much higher TTR (.197) than the spoken sub-corpus (.0597). This is consistent with well-documented linguistic patterns, whereby spoken discourse tends to reuse common lexical items frequently, leading to a lower TTR (Noreillie, 2019).
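The ratios in this section can be recomputed from the counts reported above. The short Python sketch below uses the token and lemma counts from Table 2 and the overall counts from the text; the function and variable names are illustrative only, and the TTR is taken here to be the number of unique lemmas divided by the number of tokens, which is the definition the reported values are consistent with.

```python
def type_token_ratio(unique_lemmas: int, tokens: int) -> float:
    """Number of unique lemmas divided by number of running tokens."""
    return unique_lemmas / tokens


# (tokens, unique lemmas) per source, as reported in Table 2.
TABLE_2 = {"Spoken": (79113, 4727), "Written": (22323, 4405)}

ratios = {
    source: type_token_ratio(lemmas, tokens)
    for source, (tokens, lemmas) in TABLE_2.items()
}
# Overall figures reported in the text: 7,523 lemmas over 101,436 tokens.
ratios["Overall"] = type_token_ratio(7523, 101436)

# Size of the spoken sub-corpus relative to the written one.
size_ratio = TABLE_2["Spoken"][0] / TABLE_2["Written"][0]

for source, value in ratios.items():
    print(f"{source}: {value:.3f}")
print(f"Spoken/written token ratio: {size_ratio:.2f}")
```

The overall lemma count is lower than the sum of the two sub-corpus counts because lemmas shared by both sources are counted once.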

3.2 Zipf’s Law: Frequency–Rank Distribution

To further characterize the distribution of lexical items in the corpus, we examined Zipf’s Law, which predicts a roughly inverse relationship between the frequency of a word and its rank in the frequency table. Specifically, we regressed the logarithm of lemma frequency on the logarithm of lemma rank for all lemmas in the corpus. In the complete data set, the regression yielded a slope of –.868, indicating a strong inverse relationship between rank and frequency (R² = .972, p < 2.2e–16). This slope is consistent with the classic Zipfian pattern observed in natural language, reflecting a small number of high-frequency items (function words, common verbs, and nouns) and a long tail of low-frequency lexical items (Figure 1).

Figure 1

Log–log plot of lemma frequency as a function of rank in the full corpus.

To assess differences by modality, we computed separate frequency–rank regressions for each sub-corpus, which produced a slope of –.803 for the spoken sub-corpus versus a slope of –1.13 for the written sub-corpus. The slightly flatter slope of the spoken sub-corpus indicates higher reliance on a core vocabulary, so that frequent words remain extremely dominant. With a slope slightly steeper than –1, the written sub-corpus shows faster decay, that is, a longer tail of rare words, more lexical variety, and more specialized or less predictable vocabulary (Figure 2).

Figure 2

Log–log frequency–rank distributions of lemmas in the spoken and written sub-corpora.
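The frequency–rank regression used in this section can be sketched in Python as follows. This is a minimal illustration rather than the script used for the reported results: the function name `zipf_slope` is hypothetical, natural logarithms and ordinary least squares are assumed, log frequency is treated as the dependent variable (the orientation shown in the figures), and lemmas with tied frequencies receive consecutive ranks in arbitrary order.

```python
import math
from collections import Counter


def zipf_slope(lemmas):
    """Return (slope, r_squared) for log frequency regressed on log rank."""
    counts = sorted(Counter(lemmas).values(), reverse=True)
    if len(set(counts)) < 2:
        raise ValueError("need at least two distinct lemma frequencies")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(freq) for freq in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products for ordinary least squares.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared


# Toy input whose frequencies (12, 6, 4, 3) equal 12 / rank exactly,
# so the fitted slope should be very close to -1.
toy = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
slope, r_squared = zipf_slope(toy)
print(slope, r_squared)
```

Applied to the Lemma column of the full dataset, or to the spoken and written rows separately, the same function would give slopes comparable to those reported above, provided the ranking conventions match.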

3.3 Grammatical gender in the dataset

This section presents a broad overview of grammatical gender marking in French based on our dataset. These baseline proportions provide the reference point for the discussion of grammatical gender with respect to human and specific referents in Section 3.3.1 and Section 3.3.2 respectively.

Across all relevant parts-of-speech, masculine-marked tokens overwhelmingly dominate the corpus. Of the total gender-inflecting set, the masculine forms represent 28,683 tokens, while the feminine forms represent 12,229 tokens. Thus, masculine items constitute approximately 70% of all overtly gendered tokens in the dataset, confirming a strong overall masculine skew independent of part-of-speech. Figure 3 provides a detailed breakdown of masculine and feminine counts within each relevant UPOS category in the dataset.

Figure 3

Gender distribution across parts-of-speech.

The results show numeric superiority for masculine gender consistently across all categories, with variations in magnitude. Table 3 provides the numeric results for masculine and feminine tokens per part-of-speech, as well as their respective proportion of feminine tokens.

Table 3

Number of masculine and feminine tokens per UPOS category.

UPOS	MASC TOKENS	FEM TOKENS	TOTAL	FEM PROP.
ADJ	4633	1698	6331	.268
DET	6313	3787	10100	.375
NOUN	10882	5971	16853	.354
PRON	4606	404	5010	.081
VERB	2249	369	2618	.141

The highest feminine proportion occurs in determiners (37.5%), followed by nouns (35.4%). Adjectives show a markedly lower feminine share (26.8%). Pronouns display minimal feminine representation (8.1%), reflecting the dominance of masculine-marked pronominal forms in the corpus. Verb-related gender agreement (e.g., participles) exhibits a feminine proportion of 14.1%, consistent with the masculine predominance elsewhere in the dataset. These differences illustrate that gender marking in French is not uniform across lexical categories.
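The proportions discussed above follow directly from the counts in Table 3. As a hedged illustration, the Python sketch below recomputes the feminine proportion per UPOS category and the overall masculine share; the dictionary and function names are illustrative and not part of the dataset.

```python
# (masculine tokens, feminine tokens) per UPOS category, from Table 3.
TABLE_3 = {
    "ADJ": (4633, 1698),
    "DET": (6313, 3787),
    "NOUN": (10882, 5971),
    "PRON": (4606, 404),
    "VERB": (2249, 369),
}


def feminine_proportion(masc: int, fem: int) -> float:
    """Share of feminine tokens among gendered tokens of one category."""
    return fem / (masc + fem)


fem_props = {
    upos: feminine_proportion(masc, fem) for upos, (masc, fem) in TABLE_3.items()
}
total_masc = sum(masc for masc, _ in TABLE_3.values())
total_fem = sum(fem for _, fem in TABLE_3.values())
masc_share = total_masc / (total_masc + total_fem)

for upos, prop in fem_props.items():
    print(f"{upos}: {prop:.3f}")
print(f"Masculine share overall: {masc_share:.3f}")
```

The category totals sum to the 28,683 masculine and 12,229 feminine tokens reported earlier in this section.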

Next, we explore grammatical gender with respect to the referent type in our dataset. Our main questions concern the proportion of human vs. non-human referents among gendered tokens (Section 3.3.1), according to the ‘Referent’ column in our dataset, as well as the proportion of gendered tokens that refer to specific individuals (feminine or masculine) as opposed to an undetermined group via the generic masculine, based on the ‘Generic’ column (Section 3.3.2).

3.3.1 Human vs. non-human referents

To evaluate how frequently grammatical gender in French is used to encode social gender in humans, we first isolated all gender-inflecting tokens in the corpus. Within this subset, each token was tagged according to its semantic referent (Human, Animal, or non-human/non-animate categories). For the purposes of this analysis, we grouped all non-human referents together. Within the 40,928 gender-inflecting tokens (≈ 40% of the corpus), only 2,522 tokens (≈ 6.2%) are associated with human referents, while the vast majority, 38,406 tokens (≈ 93.8%), mark non-human entities. These results highlight that grammatical gender in French is overwhelmingly applied to nouns with non-human referents, such as objects (“table” ‘table’, “voiture” ‘car’), abstract notions (“liberté” ‘freedom’, “idée” ‘idea’), other inanimate entities, and animals.

3.3.2 Generic vs. non-generic masculine

In addition to exploring human referents, we examined the distribution of masculine gender tokens in the corpus, focusing specifically on their referents and the distinction between generic and non-generic usage. By generic masculine, we refer to the grammatical phenomenon in French whereby masculine gender agreement serves as the default form. This occurs when referring to mixed-gender groups (e.g., “les visiteurs de la galerie” ‘the visitors of the gallery’), as well as to individuals whose gender is unknown or irrelevant in the context (e.g., “faire le point avec un conseiller” ‘make an update with a consultant’). A total of 28,683 masculine tokens were identified in the dataset. Of these, 1,932 tokens (≈ 6.7%) referred to human entities. Among the human-referential masculine tokens, 775 tokens (≈ 40.1%) were classified as generic masculine, whereas 1,157 tokens (≈ 59.9%) were non-generic masculine. This distribution shows that while generic masculine forms are frequent, non-generic forms remain the majority when masculine tokens refer to humans.
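The percentages reported in Sections 3.3.1 and 3.3.2 can be checked from the counts given in the text. The Python sketch below is a simple arithmetic illustration; the constant and function names are chosen here for readability and do not correspond to anything in the dataset.

```python
def percentage(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


# Counts reported in Section 3.3.1.
GENDERED_TOKENS = 40928
HUMAN_TOKENS = 2522
NON_HUMAN_TOKENS = 38406

# Counts reported in Section 3.3.2.
MASCULINE_TOKENS = 28683
MASCULINE_HUMAN = 1932
GENERIC_MASCULINE = 775
NON_GENERIC_MASCULINE = 1157

human_share = percentage(HUMAN_TOKENS, GENDERED_TOKENS)
masc_human_share = percentage(MASCULINE_HUMAN, MASCULINE_TOKENS)
generic_share = percentage(GENERIC_MASCULINE, MASCULINE_HUMAN)
non_generic_share = percentage(NON_GENERIC_MASCULINE, MASCULINE_HUMAN)

print(f"Human referents among gendered tokens: {human_share:.1f}%")
print(f"Human referents among masculine tokens: {masc_human_share:.1f}%")
print(f"Generic among human-referential masculine tokens: {generic_share:.1f}%")
print(f"Non-generic among human-referential masculine tokens: {non_generic_share:.1f}%")
```

Note that the generic and non-generic shares are computed over the 1,932 human-referential masculine tokens, not over all masculine tokens.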

4 Reuse Potential

This study provides a dataset for a quantitative assessment of grammatical gender in French, with a focus on referent type and generic masculines. Across the corpus, tokens that refer to human entities represent a minority (less than 7%) of gender-marked forms, regardless of the modality. This indicates that grammatical gender in French is predominantly used for non-human reference. Such a ratio is highly relevant for associations and/or communities assessing the use of gender-neutral language, as it shows that only a minor part of language use would be affected in either written or spoken French.

Our results also align with previous findings on the generic masculine (Flesch et al., 2026), while providing a wider scope of data sources in both written and spoken forms of French. Among masculine tokens referring to humans, generic uses account for a minority (≈ 40%), while non-generic masculine forms are more frequent (≈ 60%). Although the dataset remains limited, these preliminary results suggest that masculine gender marking is robustly associated with male human referents, and they offer key empirical evidence for individuals and communities questioning biases in gendered speech.

Finally, beyond descriptive analysis, the dataset constitutes a useful resource for research on grammatical and computational approaches to gender-neutral language in French. While the corpus includes minor annotation noise and remains limited in size for large-scale model training, these constraints undermine neither the main empirical observations reported here nor the reuse potential of the dataset. For example, the dataset could additionally be used to study agreement phenomena on nouns, determiners, adjectives, and verbs. Furthermore, the set of nouns attested in generic masculine usage could serve as a basis for identifying which lexical items might be prioritized in the development of epicene forms within the French lexicon.
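As one possible starting point for the last use case, nouns could be ranked by how often they occur as masculine generics. The Python sketch below is an assumption-laden illustration: the column names (UPOS, LEMMA, GENERIC) follow Table 1 and may differ in the CSV file, the function name is hypothetical, and the toy rows are invented for the example rather than drawn from the dataset.

```python
from collections import Counter


def generic_noun_ranking(rows, top=10):
    """Rank noun lemmas by their number of generic masculine occurrences."""
    counts = Counter(
        row["LEMMA"]
        for row in rows
        if row["UPOS"] == "NOUN" and row["GENERIC"] == "TRUE"
    )
    return counts.most_common(top)


# Invented toy rows (not taken from the dataset) to show the expected input shape.
toy_rows = [
    {"LEMMA": "visiteur", "UPOS": "NOUN", "GENERIC": "TRUE"},
    {"LEMMA": "visiteur", "UPOS": "NOUN", "GENERIC": "TRUE"},
    {"LEMMA": "conseiller", "UPOS": "NOUN", "GENERIC": "TRUE"},
    {"LEMMA": "celui", "UPOS": "PRON", "GENERIC": "TRUE"},
    {"LEMMA": "tribunal", "UPOS": "NOUN", "GENERIC": ""},
]

ranking = generic_noun_ranking(toy_rows)
print(ranking)
```

Raw frequency is only one possible prioritization criterion; dispersion across documents or modality could be added by grouping on the document or source columns.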

Notes

[1] https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2491, last accessed 11 December 2025.

[2] https://clapi.icar.cnrs.fr, last accessed 11 December 2025.

Acknowledgements

We are thankful for the constructive comments from the editors and the reviewers.

Competing Interests

The authors have no competing interests to declare.

Author Contributions

Magdalena Lemus-Serrano: Data curation, data visualization, conceptualization, writing; Marine Cozzolino: Data curation, conceptualization, writing; Tessa Vermeir: Data curation, conceptualization; Mathilde Josserand: Data curation, conceptualization; Marc Allassonnière-Tang: conceptualization, writing.