Table 1
An example of a sentence as coded in the database. The following columns and their values are omitted from the table for reasons of space: Source = written; Sentence = ‘il s’est souvent battu devant les tribunaux contre ceux qui l’accusaient d’avoir été un tortionnaire’; Sentence ID = 19; Document ID = 1_M_C_040602_bothtranslated.txt. Abbreviations: Gen = gender, Num = number, Prs = person, Art = article.
| ID | HEAD | TOKEN | LEMMA | UPOS | FEATURES | GENDER | REFERENT | GENERIC |
|---|---|---|---|---|---|---|---|---|
| 1 | 6 | il | il | PRON | Gen=Masc, Num=Sing, Prs=3, Type=Prs | Masc | Humain | |
| 2 | 1 | s | s | X | ||||
| 3 | 2 | ’ | ’ | PUNCT | ||||
| 4 | 6 | est | être | AUX | Mood=Ind, Num=Sing, Prs=3, Tense=Pres | |||
| 5 | 6 | souvent | souvent | ADV | ||||
| 6 | 0 | battu | battre | VERB | Gen=Masc, Num=Sing, Tense=Past | Masc | Humain | |
| 7 | 9 | devant | devant | ADP | ||||
| 8 | 9 | les | le | DET | Definite=Def, Gen=Masc, Num=Plur, Type=Art | Masc | ||
| 9 | 6 | tribunaux | tribunal | NOUN | Gen=Masc, Num=Plur | Masc | ||
| 10 | 11 | contre | contre | ADP | ||||
| 11 | 9 | ceux | celui | PRON | Gen=Masc, Num=Plur, Type=Dem | Masc | Humain | TRUE |
| 12 | 15 | qui | qui | PRON | PronType=Rel | |||
| 13 | 15 | l | le | DET | Definite=Def, Gen=Masc, Num=Sing, Type=Art | Masc | Humain | |
| 14 | 13 | ’ | ’ | PUNCT | ||||
| 15 | 11 | accusaient | accus | VERB | Mood=Ind, Num=Plur, Prs=3, Tense=Imp | |||
| 16 | 21 | d | de | ADP | ||||
| 17 | 21 | ’ | ’ | PUNCT | ||||
| 18 | 21 | avoir | avoir | AUX | VerbForm=Inf | |||
| 19 | 21 | été | être | AUX | Gen=Masc, Num=Sing, Tense=Past | Masc | ||
| 20 | 21 | un | un | DET | Definite=Ind, Gen=Masc, Num=Sing, Type=Art | Masc | Humain | |
| 21 | 15 | tortionnaire | tortionnaire | NOUN | Gen=Masc, Num=Sing | Masc | ||
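
To make the coding scheme in Table 1 concrete, the following is a minimal sketch of how a FEATURES cell could be unpacked into key-value pairs and combined with the GENDER, REFERENT and GENERIC columns. The function name `parse_features`, the helper `is_generic_masculine` and the dictionary layout of a row are assumptions made for illustration; they are not taken from the database itself.

```python
def parse_features(cell: str) -> dict[str, str]:
    """Split a FEATURES cell such as 'Gen=Masc, Num=Sing' into a dict."""
    features = {}
    for pair in cell.split(","):
        pair = pair.strip()
        if not pair:
            continue  # empty cell, e.g. for ADV or ADP tokens
        key, _, value = pair.partition("=")
        features[key.strip()] = value.strip()
    return features


def is_generic_masculine(row: dict[str, str]) -> bool:
    """True for a masculine token with a human referent flagged as generic."""
    return (
        row.get("GENDER") == "Masc"
        and row.get("REFERENT") == "Humain"
        and row.get("GENERIC") == "TRUE"
    )


# Token 11 ('ceux') from Table 1, written as a hypothetical row dictionary.
ceux = {
    "TOKEN": "ceux",
    "UPOS": "PRON",
    "FEATURES": "Gen=Masc, Num=Plur, Type=Dem",
    "GENDER": "Masc",
    "REFERENT": "Humain",
    "GENERIC": "TRUE",
}

ceux_features = parse_features(ceux["FEATURES"])
print(ceux_features)
print(is_generic_masculine(ceux))
```
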
Table 2
Corpus size and lexical diversity, measured as the type-token ratio (TTR: unique lemmas divided by tokens), for each source.
| SOURCE | TOKENS | UNIQUE LEMMAS | TYPE-TOKEN RATIO (TTR) |
|---|---|---|---|
| Spoken | 79113 | 4727 | .0597 |
| Written | 22323 | 4405 | .1973 |
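
As a worked illustration of the ratio reported in Table 2, the sketch below recomputes the TTR from the token and unique-lemma counts in the table. It assumes that the TTR is the number of unique lemmas divided by the number of tokens; the function name `type_token_ratio` and the `corpus` dictionary are illustrative.

```python
def type_token_ratio(unique_lemmas: int, tokens: int) -> float:
    """Return unique lemmas divided by tokens."""
    if tokens <= 0:
        raise ValueError("tokens must be a positive integer")
    return unique_lemmas / tokens


# (tokens, unique lemmas) per source, copied from Table 2.
corpus = {
    "Spoken": (79113, 4727),
    "Written": (22323, 4405),
}

ttr = {
    source: type_token_ratio(lemmas, tokens)
    for source, (tokens, lemmas) in corpus.items()
}

for source, value in ttr.items():
    print(f"{source}: {value:.4f}")
```
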

Figure 1
Log–log plot of lemma frequency as a function of rank in the full corpus.

Figure 2
Log–log frequency–rank distributions of lemmas in the spoken and written sub-corpora.
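
The points underlying a frequency-rank plot such as those in Figures 1 and 2 can be derived as in the following minimal sketch. It assumes the corpus is available as a flat list of lemmas; the toy list, the function names `rank_frequency` and `log_log_points`, and the choice of base-10 logarithms are assumptions for illustration, not the procedure used to produce the figures.

```python
import math
from collections import Counter


def rank_frequency(lemmas: list[str]) -> list[tuple[int, int]]:
    """Return (rank, frequency) pairs, with rank 1 for the most frequent lemma."""
    counts = Counter(lemmas)
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))


def log_log_points(pairs: list[tuple[int, int]]) -> list[tuple[float, float]]:
    """Map (rank, frequency) pairs to (log10 rank, log10 frequency)."""
    return [(math.log10(rank), math.log10(freq)) for rank, freq in pairs]


# Toy lemma list standing in for a real sub-corpus.
toy_lemmas = ["le", "le", "le", "le", "être", "être", "il", "tribunal"]

pairs = rank_frequency(toy_lemmas)
points = log_log_points(pairs)

for (rank, freq), (x, y) in zip(pairs, points):
    print(f"rank={rank} freq={freq} log_rank={x:.3f} log_freq={y:.3f}")
```

Plotting `points` with any charting library gives the log-log view; running the function separately on the spoken and written lemma lists gives one curve per sub-corpus.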

Figure 3
Gender distribution across parts-of-speech.
