WikiTextGraph: A Python Tool for Parsing Multilingual Wikipedia Text and Graph Extraction

Open Access | Sep 2025

Figures & Tables

Figure 1

Workflow showing how the software parses the Wikipedia text and, if requested, generates the graph.

Figure 2

The graphical user interface (GUI) of WikiTextGraph.

Table 1

Title prefixes and patterns of the pages that the algorithm removes during the text-cleaning phase, for each supported language version.

ENGLISH (en)
Wiktionary: | Category: | Draft: | File: | List of
MediaWiki: | Module: | Template: | Wikipedia: | Index of
Help: | Portal: | Image: | (disambiguation)
SPANISH (es)
Wiktionary: | Categoría: | File: | Archivo: | Image:
MediaWiki: | Plantilla: | Wikipedia: | Anexo: | Módulo:
Portal: | Help: | Ayuda: | Wikiproyecto: | Usuario:
User: | (desambiguación)
GREEK (el)
Wiktionary: | Κατηγορία: | Αρχείο: | File: | Image:
MediaWiki: | Module: | Πρότυπο: | Wikipedia: | Βικιπαίδεια:
(αποσαφήνιση) | Portal: | Πύλη: | Βοήθεια: | Topic:
Χρήστης: | User:
POLISH (pl)
Wikipedia: | Pomoc: | Szablon: | MediaWiki: | Kategoria:
Wikiprojekt: | Portal: | Plik: | Moduł: | User:
Wątek: | Topic: | (ujednoznacznienie)
ITALIAN (it)
Wikipedia: | Aiuto: | Template: | MediaWiki: | Categoria:
Progetto: | Portale: | File: | Modulo: | Topic:
(disambigua) | (disambiguazione)
DUTCH (nl)
Wikipedia: | Help: | Sjabloon: | MediaWiki: | Categorie:
Portaal: | Bestand: | Module: | (disambiguatie:) | (disambiguation)
User:
BASQUE (eu)
Wikipedia: | Laguntza: | Txantilloi: | MediaWiki: | Kategoria:
Maila: | Atari: | Ataria: | Usuario: | Modulu:
Fitxategi: | Wikiproiektua: | Eranskina: | Txikipedia: | Zerrenda:
(argipena)
HINDI (hi)
विकिपीडिया: | साँचा: | श्रेणी: | मीडियाविकि: | सहायता:
प्रवेशद्वार: | चित्र: | विकिपरियोजना: | मॉड्यूल: | (बहुविकल्पी)
GERMAN (de)
Wikipedia: | Hilfe: | Vorlage: | MediaWiki: | Kategorie:
Portal: | Benutzer: | Modul: | Datei: | Liste der
Liste des | Liste von | (begriffsklärung)
VIETNAMESE (vi)
Wikipedia: | MediaWiki: | Trợ giúp: | Bản mẫu: | Tập tin:
Thể loại: | Sách: | Danh sách: | Cổng thông tin: | Mô đun:
(định hướng)
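
The filtering that Table 1 describes can be illustrated with a minimal Python sketch. It uses only the English patterns from the table. The names `EXCLUDED_PREFIXES`, `EXCLUDED_SUFFIXES`, and `is_excluded` are hypothetical, and the matching rule (case-sensitive, prefixes at the start of the title, the disambiguation marker at the end) is an assumption rather than the tool's documented behaviour.

```python
# English patterns copied from Table 1. The split into "prefix" and "suffix"
# patterns is an assumption made for this sketch.
EXCLUDED_PREFIXES = (
    "Wiktionary:", "Category:", "Draft:", "File:", "List of",
    "MediaWiki:", "Module:", "Template:", "Wikipedia:", "Index of",
    "Help:", "Portal:", "Image:",
)
EXCLUDED_SUFFIXES = ("(disambiguation)",)


def is_excluded(title: str) -> bool:
    """Return True if a page title matches one of the removal patterns."""
    cleaned = title.strip()
    # str.startswith / str.endswith accept a tuple of candidates.
    return cleaned.startswith(EXCLUDED_PREFIXES) or cleaned.endswith(EXCLUDED_SUFFIXES)


titles = [
    "Alan Turing",
    "Category:Mathematicians",
    "List of logicians",
    "Mercury (disambiguation)",
]
kept = [t for t in titles if not is_excluded(t)]
```

In this toy input only "Alan Turing" survives the filter. Supporting another language would amount to swapping in that language's row of the table.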
Table 2

Section headings at which text extraction stops. If one of these sections appears in an article, content extraction ends at that point.

ENGLISH (en)
See also | Publications | References | Notes
Footnotes | External links | Further Reading | Draft:
SPANISH (es)
Véase también | Referencias | Notas | Enlaces externos
Bibliografía | Otra lectura
GREEK (el)
Δείτε επίσης | Παραπομπές | Σημειώσεις | Εξωτερικοί σύνδεσμοι
Προτεινόμενη βιβλιογραφία | Βιβλιογραφία | Περαιτέρω ανάγνωση
POLISH (pl)
Zobacz też | Uwagi | Bibliografia | Przypisy
Linki zewnętrzne | Literatura w języku polskim
ITALIAN (it)
Note | Bibliografia | Altri progetti | Collegamenti esterni
Riferimenti | Voci correlate
DUTCH (nl)
Noten | Bibliografie | Appendix | Externe links
Referenties | Bronnen | Zie ook | Overig
Ander | Literatuur | Voetnoten
BASQUE (eu)
Oharrak | Bibliografia | Ahultasunak | Kanpo estekak
Erreferetziak | Ikus, gainera
HINDI (hi)
टिप्पणियां | यह भी देखिए | संदर्भ सूची | बाहरी कड़ियाँ
सन्दर्भ | इन्हें भी देखें | इसके अतिरिक्त पठन
GERMAN (de)
Weblinks | Einzelnachweise | Einzelnachweise und Anmerkungen | Anmerkungen
Literatur | Siehe auch | Fußnoten | Veröffentlichungen
VIETNAMESE (vi)
Xem thêm | Tham khảo | Liên kết ngoài | Tài liệu tham khảo
Tài liệu | Ghi chú | Chú thích | Tài liệu khác
Hình ảnh | Đọc thêm
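
As a rough illustration of the stopping rule in Table 2, the sketch below cuts an article's wikitext at the first heading whose title is in the stop list. It uses a subset of the English entries. The names `STOP_SECTIONS`, `HEADING`, and `truncate_at_stop_section` are hypothetical, and the assumptions that headings use the standard `== Title ==` wikitext syntax and are compared case-insensitively are mine, not taken from the tool.

```python
import re

# Subset of the English stop sections from Table 2, lower-cased for comparison.
STOP_SECTIONS = {
    "see also", "publications", "references", "notes",
    "footnotes", "external links", "further reading",
}

# A wikitext heading: 2 to 6 '=' signs, a title, then the same run of '=' signs.
HEADING = re.compile(r"^(={2,6})[ \t]*(.*?)[ \t]*\1[ \t]*$", re.MULTILINE)


def truncate_at_stop_section(wikitext: str) -> str:
    """Return the text that precedes the first stop-section heading."""
    for match in HEADING.finditer(wikitext):
        if match.group(2).casefold() in STOP_SECTIONS:
            return wikitext[: match.start()].rstrip()
    return wikitext  # no stop section found: keep the whole article


article = (
    "Intro text.\n\n"
    "== History ==\nSome history.\n\n"
    "== See also ==\n* Other article\n\n"
    "== References ==\n"
)
body = truncate_at_stop_section(article)
```

Here `body` keeps the introduction and the History section and drops everything from "See also" onward.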
Table 3

Keywords used to detect redirects in each language version.

ENGLISH (en)
#redirect
SPANISH (es)
#redirección | #redirect
GREEK (el)
#ανακατεύθυνση | #redirect
POLISH (pl)
#patrz | #redirect | #przekieruj | #tam
ITALIAN (it)
#rinvia | #redirect
DUTCH (nl)
#doorverwijzing | #redirect
BASQUE (eu)
#birzuzendu | #redirect
HINDI (hi)
#अनुप्रेषित | #पुनर्प्रेषित | #redirect
GERMAN (de)
#weiterleitung | #redirect
VIETNAMESE (vi)
#đổi | #redirect
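
One way the keywords in Table 3 could be applied is sketched below for three of the languages. The names `REDIRECT_KEYWORDS`, `LINK`, and `redirect_target` are hypothetical. The sketch assumes that a redirect page begins with the keyword, that matching is case-insensitive (the table lists the keywords in lower case), and that the target is the first `[[...]]` link; none of this is confirmed by the source.

```python
import re

# Keywords copied from Table 3 for three of the supported languages.
REDIRECT_KEYWORDS = {
    "en": ("#redirect",),
    "es": ("#redirección", "#redirect"),
    "pl": ("#patrz", "#redirect", "#przekieruj", "#tam"),
}

# First wiki link, stopping before any '|' label or '#' section anchor.
LINK = re.compile(r"\[\[([^\]|#]+)")


def redirect_target(wikitext: str, lang: str):
    """Return the redirect target title, or None if the page is not a redirect."""
    head = wikitext.lstrip()
    if not head.casefold().startswith(REDIRECT_KEYWORDS[lang]):
        return None
    match = LINK.search(head)
    return match.group(1).strip() if match else None
```

For example, `redirect_target("#REDIRECT [[Alan Turing]]", "en")` yields the title "Alan Turing", while ordinary article text yields `None`.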
Figure 3

Log-log plot of the in-degree distribution for each language version. The x-axis represents the in-degree (i.e., the number of incoming links to a node), whereas the y-axis represents the frequency of nodes for each in-degree value. Each curve corresponds to a different language version, as indicated in the legend.

Figure 4

Log-log plot of the out-degree distribution for each language version. The x-axis represents the out-degree (i.e., the number of outgoing links from a node), whereas the y-axis represents the frequency of nodes for each out-degree value. Each curve corresponds to a different language version, as indicated in the legend.
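
The quantities plotted in Figures 3 and 4 can be derived from a directed edge list as in the following sketch. It uses only the standard library, and the function name `degree_distributions` and the toy edge list are hypothetical. It shows how "frequency of nodes per degree value" is obtained and is not the authors' plotting code.

```python
from collections import Counter


def degree_distributions(edges):
    """Map each in-/out-degree value to the number of nodes that have it."""
    in_degree = Counter()
    out_degree = Counter()
    for source, target in edges:
        out_degree[source] += 1  # one more outgoing link for the source
        in_degree[target] += 1   # one more incoming link for the target
    # Nodes with degree 0 never appear in the counters, which suits a log-log
    # plot, where zero cannot be drawn on either axis.
    return Counter(in_degree.values()), Counter(out_degree.values())


# Toy graph: A -> B, A -> C, B -> C, D -> C
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C")]
in_hist, out_hist = degree_distributions(edges)
```

Plotting the keys of each histogram against its values on logarithmic axes gives one curve of the kind shown in the figures.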

Table 4

Summary statistics of the Wikipedia language versions supported (at the time of writing) and processed by WikiTextGraph. Each row corresponds to one language version, identified by its name with its ISO 639 language code in parentheses. The columns report the following metrics, calculated with the Python library NetworkX [35]: Order is the number of nodes (Wikipedia articles) in the directed network; Size is the number of edges (links between articles); Average In/Out Degree is the average number of incoming or outgoing links per node (the two averages coincide in a directed network); Max In-Degree and Max Out-Degree are the highest numbers of incoming and outgoing links, respectively; Density is the ratio of actual links to all possible links; Date is the month and year of the Wikipedia dump that was collected and processed.

LANGUAGE | ORDER | SIZE | AVERAGE IN/OUT DEGREE | MAX IN-DEGREE | MAX OUT-DEGREE | DENSITY (×10⁻⁶) | DATE
English (en) | 6,736,622 | 159,598,688 | 23.6 | 247,992 | 4,615 | 4 | Jan. 2025
German (de) | 3,001,074 | 77,213,282 | 25.7 | 209,493 | 11,359 | 9 | Nov. 2024
Dutch (nl) | 2,169,658 | 27,487,945 | 12.6 | 163,525 | 12,263 | 6 | Nov. 2024
Spanish (es) | 1,905,582 | 39,473,628 | 20.7 | 205,662 | 4,083 | 11 | Nov. 2024
Italian (it) | 1,820,884 | 42,649,948 | 23.4 | 153,337 | 5,179 | 13 | Nov. 2024
Polish (pl) | 1,620,570 | 30,140,616 | 18.5 | 138,619 | 3,800 | 11 | Nov. 2024
Vietnamese (vi) | 1,284,928 | 8,977,353 | 6.9 | 266,137 | 5,576 | 5 | Jan. 2025
Basque (eu) | 435,429 | 4,436,022 | 10.1 | 65,339 | 1,573 | 23 | Dec. 2024
Greek (el) | 237,016 | 4,319,401 | 18.2 | 16,314 | 1,464 | 77 | Nov. 2024
Hindi (hi) | 152,990 | 1,097,849 | 7.1 | 42,123 | 3,993 | 47 | Nov. 2024
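
The relationship between the Order, Size, Average In/Out Degree, and Density columns can be checked with a few lines of arithmetic. The sketch below uses plain Python rather than NetworkX so that the formulas stay visible. The function name `summary` is hypothetical, and the density formula, size / (order × (order − 1)), is the usual definition for a directed graph without self-loops, assumed here to be the one behind the table.

```python
def summary(order: int, size: int):
    """Return (average degree, density) for a directed graph."""
    # Every edge adds one outgoing and one incoming link, so the mean
    # in-degree and the mean out-degree both equal size / order.
    average_degree = size / order
    # A directed graph on `order` nodes has at most order * (order - 1) edges.
    density = size / (order * (order - 1))
    return average_degree, density


# Order and Size of the English row in Table 4.
avg_en, density_en = summary(6_736_622, 159_598_688)
print(f"average degree: {avg_en:.2f}")
print(f"density (x 10^-6): {density_en * 1e6:.2f}")
```

For the English row this gives an average degree of about 23.7 and a density that rounds to 4 × 10⁻⁶, in line with the tabulated 23.6 and 4.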
DOI: https://doi.org/10.5334/jors.572 | Journal eISSN: 2049-9647
Language: English
Submitted on: Apr 15, 2025
Accepted on: Jul 28, 2025
Published on: Sep 12, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Paschalis Agapitos, Juan-Luis Suárez, Gustavo Ariel Schwartz, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.