GRAF – Gendered Reference Analysis in French

Magdalena Lemus-Serrano; Marine Cozzolino; Tessa Vermeir; Mathilde Josserand; Marc Allassonnière-Tang

doi:10.5334/johd.510


Journal of Open Humanities Data

Volume 12 (2026): Issue 1

By: Magdalena Lemus-Serrano, Marine Cozzolino, Tessa Vermeir, Mathilde Josserand and Marc Allassonnière-Tang

Open Access

Apr 2026

Figures & Tables

Table 1

An example of a sentence coded in the database. The following columns are omitted from the table for reasons of space: Source = written; Sentence = ‘il s’est souvent battu devant les tribunaux contre ceux qui l’accusaient d’avoir été un tortionnaire’ (‘he often fought in court against those who accused him of having been a torturer’); Sentence ID = 19; Document ID = 1_M_C_040602_bothtranslated.txt. Abbreviations: Gen = gender, Num = number, Prs = person, Art = article.

ID	HEAD	TOKEN	LEMMA	UPOS	FEATURES	GENDER	REFERENT	GENERIC
1	6	il	il	PRON	Gen=Masc, Num=Sing, Prs=3, Type=Prs	Masc	Humain
2	1	s	s	X
3	2	’	’	PUNCT
4	6	est	être	AUX	Mood=Ind, Num=Sing, Prs=3, Tense=Pres
5	6	souvent	souvent	ADV
6	0	battu	battre	VERB	Gen=Masc, Num=Sing, Tense=Past	Masc	Humain
7	9	devant	devant	ADP
8	9	les	le	DET	Definite=Def, Gen=Masc, Num=Plur, Type=Art	Masc
9	6	tribunaux	tribunal	NOUN	Gen=Masc, Num=Plur	Masc
10	11	contre	contre	ADP
11	9	ceux	celui	PRON	Gen=Masc, Num=Plur, Type=Dem	Masc	Humain	TRUE
12	15	qui	qui	PRON	PronType=Rel
13	15	l	le	DET	Definite=Def, Gen=Masc, Num=Sing, Type=Art	Masc	Humain
14	13	’	’	PUNCT
15	11	accusaient	accus	VERB	Mood=Ind, Num=Plur, Prs=3, Tense=Imp
16	21	d	de	ADP
17	21	’	’	PUNCT
18	21	avoir	avoir	AUX	VerbForm=Inf
19	21	été	être	AUX	Gen=Masc, Num=Sing, Tense=Past	Masc
20	21	un	un	DET	Definite=Ind, Gen=Masc, Num=Sing, Type=Art	Masc	Humain
21	15	tortionnaire	tortionnaire	NOUN	Gen=Masc, Num=Sing	Masc
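
As a reading aid, the row layout of Table 1 can be loaded with a few lines of Python. This is a minimal sketch, assuming a tab-separated export with the column order shown above and a FEATURES cell written as comma-separated Key=Value pairs; the distributed files may be laid out differently, and the names `COLUMNS`, `parse_features` and `parse_row` are hypothetical helpers, not part of the dataset.

```python
from typing import Dict, List, Optional

# Column order as displayed in Table 1 (assumed to match the export).
COLUMNS = ["ID", "HEAD", "TOKEN", "LEMMA", "UPOS",
           "FEATURES", "GENDER", "REFERENT", "GENERIC"]


def parse_features(raw: str) -> Dict[str, str]:
    """Turn 'Gen=Masc, Num=Sing' into {'Gen': 'Masc', 'Num': 'Sing'}."""
    features: Dict[str, str] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if "=" in pair:
            key, value = pair.split("=", 1)
            features[key] = value
    return features


def parse_generic(raw: str) -> Optional[bool]:
    """Map 'TRUE'/'FALSE' to booleans; an empty cell stays None (uncoded)."""
    if raw == "TRUE":
        return True
    if raw == "FALSE":
        return False
    return None


def parse_row(line: str) -> Dict[str, object]:
    cells: List[str] = line.rstrip("\n").split("\t")
    # Rows without annotations are shorter; pad them to the full width.
    cells += [""] * (len(COLUMNS) - len(cells))
    row: Dict[str, object] = dict(zip(COLUMNS, cells))
    row["ID"] = int(cells[0])
    row["HEAD"] = int(cells[1])
    row["FEATURES"] = parse_features(cells[5])
    row["GENERIC"] = parse_generic(cells[8])
    return row


# Three rows copied from Table 1.
SAMPLE = [
    "1\t6\til\til\tPRON\tGen=Masc, Num=Sing, Prs=3, Type=Prs\tMasc\tHumain",
    "5\t6\tsouvent\tsouvent\tADV",
    "11\t9\tceux\tcelui\tPRON\tGen=Masc, Num=Plur, Type=Dem\tMasc\tHumain\tTRUE",
]

rows = [parse_row(line) for line in SAMPLE]

# Tokens annotated as referring to a human.
human_tokens = [r["TOKEN"] for r in rows if r["REFERENT"] == "Humain"]
```

Keeping an empty GENERIC cell as `None` rather than `False` avoids asserting that a token is non-generic when it was simply not coded for that property.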

Table 2

Corpus size and diversity ratio (TTR).

SOURCE	TOKENS	UNIQUE LEMMAS	TYPE-TOKEN RATIO (TTR)
Spoken	79113	4727	.0597
Written	22323	4405	.1970
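
The TTR column appears to be the number of unique lemmas divided by the number of tokens. The sketch below recomputes it from the counts in Table 2 under that assumption; `type_token_ratio` is a hypothetical helper name.

```python
from typing import Dict, Tuple


def type_token_ratio(unique_lemmas: int, tokens: int) -> float:
    """Unique lemmas divided by total tokens (assumed definition of TTR)."""
    if tokens <= 0:
        raise ValueError("tokens must be a positive count")
    return unique_lemmas / tokens


# (tokens, unique lemmas) per sub-corpus, copied from Table 2.
corpus: Dict[str, Tuple[int, int]] = {
    "Spoken": (79113, 4727),
    "Written": (22323, 4405),
}

ttr = {
    source: type_token_ratio(lemmas, tokens)
    for source, (tokens, lemmas) in corpus.items()
}

for source, value in ttr.items():
    print(f"{source}: {value:.4f}")
```

Because TTR generally falls as a corpus grows, the gap between the two ratios partly reflects the difference in sub-corpus size and not only a difference in lexical diversity.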

Figure 1

Log–log plot of lemma frequency as a function of rank in the full corpus.

Figure 2

Log–log frequency–rank distributions of lemmas in the spoken and written sub-corpora.
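
The points behind a log–log frequency–rank plot can be derived from a plain list of lemmas. The following is a minimal sketch of that computation, not the authors' plotting code; the lemma list is a made-up toy example and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def rank_frequency(lemmas: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (rank, frequency) pairs, most frequent lemma first (rank 1)."""
    counts = Counter(lemmas)
    frequencies = sorted(counts.values(), reverse=True)
    return list(enumerate(frequencies, start=1))


def log_log_points(pairs: List[Tuple[int, int]]) -> List[Tuple[float, float]]:
    """Apply log10 to both axes, as on a log-log plot."""
    return [(math.log10(rank), math.log10(freq)) for rank, freq in pairs]


# Toy input, invented for illustration only.
toy_lemmas = ["le", "le", "le", "le", "être", "être", "il", "tribunal"]

pairs = rank_frequency(toy_lemmas)
points = log_log_points(pairs)
```

Running `rank_frequency` separately on the spoken and written lemma lists would give one series per sub-corpus, which is the comparison the second plot shows.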

Figure 3

Gender distribution across parts-of-speech.

Table 3

Number of masculine and feminine tokens per UPOS category.

UPOS	MASC TOKENS	FEM TOKENS	TOTAL	FEM PROP.
ADJ	4633	1698	6331	.268
DET	6313	3787	10100	.375
NOUN	10882	5971	16853	.354
PRON	4606	404	5010	.080
VERB	2249	369	2618	.141
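
The FEM PROP. column can be recomputed as feminine tokens divided by the row total. A minimal sketch using the counts in Table 3, assuming that definition; `feminine_proportion` is a hypothetical helper name.

```python
from typing import Dict, Tuple


def feminine_proportion(masc: int, fem: int) -> float:
    """Share of feminine tokens among all gendered tokens in a category."""
    total = masc + fem
    if total == 0:
        raise ValueError("category has no gendered tokens")
    return fem / total


# (masculine tokens, feminine tokens, reported total), copied from Table 3.
table3: Dict[str, Tuple[int, int, int]] = {
    "ADJ": (4633, 1698, 6331),
    "DET": (6313, 3787, 10100),
    "NOUN": (10882, 5971, 16853),
    "PRON": (4606, 404, 5010),
    "VERB": (2249, 369, 2618),
}

# Check that each reported total is the sum of the two gender counts.
totals_consistent = all(m + f == t for m, f, t in table3.values())

fem_prop = {
    upos: feminine_proportion(m, f) for upos, (m, f, _) in table3.items()
}

for upos, value in fem_prop.items():
    print(f"{upos}: {value:.3f}")
```

The recomputed values agree with the reported column to within rounding, and the check on `totals_consistent` confirms that the TOTAL column is the sum of the masculine and feminine counts.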


DOI: https://doi.org/10.5334/johd.510 | Journal eISSN: 2059-481X


Language: English

Submitted on: Jan 8, 2026

Accepted on: Mar 5, 2026

Published on: Apr 14, 2026

Published by: Ubiquity Press

In partnership with: Paradigm Publishing Services

Publication frequency: 1 issue per year

Keywords: gender, human referent, generic masculine, spoken, written

© 2026 Magdalena Lemus-Serrano, Marine Cozzolino, Tessa Vermeir, Mathilde Josserand, Marc Allassonnière-Tang, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.
