The aim of this paper is to introduce the workflow of the AVOBMAT (Analysis and Visualization of Bibliographic Metadata and Texts) multilingual research tool (Péter et al., 2020; Péter et al., 2022).
1. Uploading the corpus
Users can upload metadata and texts in several formats: Zotero collections in CSV and RDF format, and EPrints (library) repository exports as XML files (metadata alone or metadata with links to the full texts). AVOBMAT can also import full texts, for example as a zip file of documents accompanied by a CSV of the metadata. Documents from external databases can be imported by providing URLs to the full texts in the CSV. AVOBMAT can process full texts in a wide range of file formats, since the Apache Tika library converts them to plain text.
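To illustrate this conversion step, the following is a minimal sketch of plain-text extraction with the Apache Tika Python bindings; the file name is hypothetical, and AVOBMAT's server-side pipeline may invoke Tika differently.

```python
# Minimal sketch of plain-text extraction with Apache Tika via its
# Python bindings (pip install tika). The file name is hypothetical;
# AVOBMAT's own pipeline may call Tika differently.
from tika import parser

parsed = parser.from_file("document.pdf")   # also handles DOCX, HTML, EPUB, etc.
plain_text = parsed.get("content") or ""
print(plain_text[:500])                     # first 500 characters of the extracted text
```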
2. Cleaning the corpus
AVOBMAT provides several options for cleaning the text corpus. For example, users can
remove non-alphabetical tokens (e.g. from OCR-ed texts);
upload a replacement list to substitute words (e.g. synonyms) and characters;
make use of regular expressions.
A context filter is implemented to keep the context of one or more keywords and remove all other parts of the document (see the sketch below).
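The following is an illustrative sketch of such a context filter; the window unit (words) and its size are assumptions rather than AVOBMAT's actual implementation.

```python
# Illustrative context filter: keep a window of words around each
# keyword occurrence and discard the rest of the document.
# Window unit and size are assumptions, not AVOBMAT's implementation.
import re

def context_filter(text, keywords, window=10):
    tokens = text.split()
    keep = set()
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    for i, token in enumerate(tokens):
        if pattern.fullmatch(token):
            keep.update(range(max(0, i - window), min(len(tokens), i + window + 1)))
    return " ".join(tokens[i] for i in sorted(keep))

print(context_filter("a long passage about ritual magic in early texts", ["magic"], window=2))
```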
3. Configuring the parameters
Users can create different configurations for each analysis, where the outcome depends on the language of the texts. There are two ways to assign a language to a document: researchers can manually select a language for the full dataset (52 languages) or choose the automatic language detection option, in which case the system detects the language of each document independently. Based on the language, it offers stopword and punctuation filtering as well as lemmatization, drawing on the spaCy library (spaCy Models and Languages). Extra stopword and punctuation lists can also be added. spaCy language models are used for lemmatization, with LemmaGen models covering languages not supported by spaCy (Juršič et al., 2010).
The following pre-processing options are implemented (a brief sketch follows the list):
choose a spaCy language model (small, large or transformer);
make text lowercase;
remove numbers;
set minimal character length.
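The sketch below shows how these options map onto a spaCy pipeline; the model name, thresholds and defaults are illustrative choices, not AVOBMAT's configuration.

```python
# Illustrative pre-processing with spaCy: stopword and punctuation
# filtering, lemmatization, lowercasing, number removal and a minimal
# character length. Model name and defaults are assumptions.
import spacy

nlp = spacy.load("en_core_web_sm")  # small model; large and transformer variants also exist

def preprocess(text, lowercase=True, remove_numbers=True, min_chars=3):
    tokens = []
    for tok in nlp(text):
        if tok.is_stop or tok.is_punct:          # language-specific stopwords/punctuation
            continue
        if remove_numbers and tok.like_num:      # drop numeric tokens
            continue
        lemma = tok.lemma_.lower() if lowercase else tok.lemma_
        if len(lemma) < min_chars:               # minimal character length filter
            continue
        tokens.append(lemma)
    return tokens

print(preprocess("The 3 wizards cast 42 spells in the Great Hall."))
```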
The metadata enrichment includes the identification of the gender of the authors (male, female, unknown gender or without author) and automatic language detection. Users can also upload a list of male and female first names, supplementing or replacing those found in the dictionaries of the programme.
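A possible realization of the name-list lookup is sketched below; the file names and the rule of matching on the first token of the author field are assumptions for demonstration.

```python
# Illustrative first-name-based gender assignment from uploaded name
# lists. File names and the matching rule (first token of the author
# field) are hypothetical.
def load_names(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

male_names = load_names("male_names.txt")      # hypothetical uploaded list
female_names = load_names("female_names.txt")  # hypothetical uploaded list

def author_gender(author):
    if not author:
        return "without author"
    first = author.split()[0].lower()
    if first in male_names:
        return "male"
    if first in female_names:
        return "female"
    return "unknown gender"

print(author_gender("James Anderson"))
```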
As for topic modelling, the user also has the option to separate the documents into sections of equal size. Users can specify the so-called window length for certain lexical diversity analyses (MSTTR, MATTR).
4. Validating the settings
AVOBMAT cleans and pre-processes a small sample of the uploaded database so that the user can check whether the set parameters are appropriate. If the configuration is acceptable, the settings can be saved as a template; if the parameters need fine-tuning, the user can restart the cleaning and configuration process. AVOBMAT also identifies missing values and gaps in the metadata.
5. Filtering the corpus
The user can search and filter the metadata and texts in faceted, advanced and command-line modes and perform all subsequent analyses on the filtered dataset (Figure 1). The NLP analyses of the documents semantically enrich the metadata: for example, recognized named entities, such as person names, appear in all types of searches, and the user can search for (disambiguated) named entities in 16 languages. The tool supports fuzzy and proximity searches.

Figure 1
AVOBMAT graphical interface.
6. Interactive metadata analysis
Having filtered the uploaded databases and selected the metadata field(s) to be explored (Figure 2), the user can, among other actions,
analyse and visualize the bibliographic data chronologically in line and area charts, in normalized and aggregated formats (Figure 4);
create an interactive network analysis of the metadata fields (Figure 3; see the sketch below);
make pie, horizontal and vertical bar charts.

Figure 2
Interactive metadata visualization settings.
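As a rough illustration of the network view, the sketch below links authors and publishers that co-occur on a record; the field names and records are toy examples, not AVOBMAT's data model.

```python
# Illustrative metadata co-occurrence network with networkx: authors
# and publishers appearing on the same record are linked.
import networkx as nx

records = [  # toy records, not AVOBMAT's data model
    {"author": "James Anderson", "publisher": "John Senex"},
    {"author": "James Anderson", "publisher": "John Hooke"},
]

G = nx.Graph()
for rec in records:
    G.add_edge(rec["author"], rec["publisher"])  # one edge per record

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```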

Figure 3
Network analysis of authors, publishers and booksellers involved in the publications of 18th-century books concerning Freemasonry with a particular focus on James Anderson (author).

Figure 4
Chronological distribution of the detected languages of the 53,411 articles and books in the University of Szeged publication repository.
7. Interactive content analysis
The following options are available for interactive content analysis.
7.1. N-gram viewer
This diachronic analysis of texts shows the yearly counts of the specified n-grams, generated at the pre-processing stage, in aggregated and normalized views (Figure 5).

Figure 5
N-gram viewer. Distribution of “katolikus egyház” [Catholic church], “református egyház” [Reformed church], “evangélikus egyház” [Lutheran church] in the Délmagyarország daily newspaper, 1911–2009.
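The computation behind such a view can be sketched as follows: count a query n-gram per year and normalize by the year's total number of n-grams. The toy documents below are illustrative.

```python
# Illustrative yearly n-gram counts, aggregated and normalized by the
# total number of n-grams per year.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def yearly_counts(docs, query):
    # docs: iterable of (year, token list) pairs; query: tuple of tokens
    counts, totals = Counter(), Counter()
    n = len(query)
    for year, tokens in docs:
        grams = ngrams(tokens, n)
        totals[year] += len(grams)
        counts[year] += grams.count(query)
    return {year: (counts[year], counts[year] / totals[year])
            for year in sorted(totals) if totals[year]}

docs = [(1911, "a katolikus egyház hírei és a katolikus egyház".split()),
        (1912, "a református egyház közgyűlése".split())]
print(yearly_counts(docs, ("katolikus", "egyház")))
```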
7.2. Frequency analysis
Frequency analyses and word clouds can be efficient tools for highlighting the prominent terms in a corpus. The significant text analytical tool shows what differentiates a subset of the documents from the rest, using four different metrics (e.g. chi-square) (Manning et al., 2009; Cilibrasi and Vitányi, 2007; see Significant text aggregation). The TagSphere analysis enables users to investigate the context of a word by creating tag clouds of the words co-occurring with a specified search term within a specified word distance (Figures 6 and 7) (Jänicke and Scheuermann, 2017). Words can be interactively removed from the clouds. Bar chart versions of the analyses present the applied scores and frequencies.

Figure 6
TagSphere analysis. Dan Brown’s novels. Keyword: god, word distance: 4 (shown in different colours), minimum word frequency: 7, lemmatized texts, stopwords removed.

Figure 7
The same TagSphere analysis as in Figure 6. Bar chart view with statistical data.
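As an illustration of the significance scoring mentioned above, the sketch below computes the chi-square statistic from a 2×2 contingency table (term presence × subset membership); the counts are invented, and AVOBMAT's exact formula may differ.

```python
# Illustrative chi-square score for a term that may characterize a
# document subset against the rest of the corpus.
def chi_square(a, b, c, d):
    # a: subset docs containing the term,   b: subset docs without it
    # c: background docs containing it,     d: background docs without it
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator if denominator else 0.0

# e.g. a term in 40 of 100 subset documents vs 60 of 9,900 background documents
print(chi_square(40, 60, 60, 9840))
```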
7.3. Lexical diversity
AVOBMAT calculates the lexical diversity of texts according to eight different metrics: type-token ratio (TTR), Guiraud, Herdan, Maas TTR, mean segmental TTR (MSTTR), moving-average TTR (MATTR), the Measure of Textual Lexical Diversity (MTLD) and the Hypergeometric Distribution Diversity (HD-D) (Figure 8) (Covington and McFall, 2010; McCarthy and Jarvis, 2010; Torruella and Capsada, 2013).

Figure 8
Lexical diversity metrics in J. K. Rowling’s Harry Potter novels.
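Four of these metrics are sketched below following their standard definitions, with V the number of types and N the number of tokens; the MATTR window corresponds to the user-set window length, and the values used here are illustrative.

```python
# Illustrative lexical diversity metrics: TTR = V/N, Guiraud = V/sqrt(N),
# Herdan = log V / log N, and MATTR as the mean TTR over sliding windows.
import math

def ttr(tokens):
    return len(set(tokens)) / len(tokens)

def guiraud(tokens):
    return len(set(tokens)) / math.sqrt(len(tokens))

def herdan(tokens):
    return math.log(len(set(tokens))) / math.log(len(tokens))

def mattr(tokens, window=100):
    if len(tokens) <= window:
        return ttr(tokens)
    windows = [tokens[i:i + window] for i in range(len(tokens) - window + 1)]
    return sum(ttr(w) for w in windows) / len(windows)

sample = "the boy who lived had never even seen the boy do magic".split()
print(ttr(sample), guiraud(sample), herdan(sample), mattr(sample, window=5))
```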
7.4. Keyword-in-context
The keyword-in-context function supports the close reading of texts (Figure 9).

Figure 9
Keyword-in-context. The word “magic” in J. K. Rowling’s Harry Potter and the Philosopher’s Stone.
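A minimal version of this function is sketched below; the window size and output format are illustrative choices.

```python
# Minimal keyword-in-context (KWIC): print each occurrence of a keyword
# with a fixed number of context words on either side.
def kwic(text, keyword, window=5):
    tokens = text.split()
    for i, token in enumerate(tokens):
        if token.lower().strip('.,;:!?"') == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>40} | {token} | {right}")

kwic("He had never even seen the boy do magic before.", "magic", window=4)
```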
7.5. Topic modelling
The Latent Dirichlet Allocation function calculates and graphically represents topic models (Blei et al., 2003). It shows the most relevant words and documents for each topic, visualizes the distribution of these topics chronologically, highlights the correlations between topics and exports the results in various formats (Figures 10 and 11). Its parameters are the minimum number of word occurrences, the number of topics and iterations, and the per-document topic distribution (alpha) and per-topic word distribution (beta) hyperparameters. Users can interactively remove stopwords.

Figure 10
Topic modelling of Dan Brown’s novels.

Figure 11
Topic modelling of the Szegedi Egyetem [University of Szeged] Magazin, 1953–2011.
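The gensim library offers a comparable parameterization, shown in the sketch below; the toy corpus and parameter values are illustrative, and AVOBMAT's topic modelling backend is not necessarily gensim (beta is called eta in gensim).

```python
# Illustrative LDA with gensim, exposing the parameters named above:
# minimum word occurrences, number of topics and iterations, alpha and
# beta (eta). Toy corpus; AVOBMAT's backend may differ.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["church", "priest", "mass"], ["wizard", "magic", "spell"],
        ["church", "bishop", "mass"], ["magic", "wand", "spell"]]

dictionary = Dictionary(docs)
dictionary.filter_extremes(no_below=1, no_above=1.0)  # minimum word occurrences
bow = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=2,
               iterations=100, alpha="auto", eta="auto")
for topic_id, words in lda.show_topics(num_topics=2, num_words=3, formatted=False):
    print(topic_id, [word for word, _ in words])
```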
7.6. Part-of-speech tagging
AVOBMAT currently identifies part-of-speech tags in 16 languages using the spaCy language models. It produces various interactive visualizations and statistical tables of the results (Figures 12 and 13).

Figure 12
Part-of-speech analysis in Dan Brown’s novels.

Figure 13
Part-of-speech analysis of J. K. Rowling’s Harry Potter novels. Statistical results.
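The underlying tagging step can be sketched with spaCy as follows; the model name is an assumption.

```python
# Illustrative part-of-speech tagging with spaCy, plus a simple
# frequency table of the coarse-grained universal POS tags.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model
doc = nlp("Robert Langdon stared at the ancient symbol on the parchment.")

for token in doc:
    print(token.text, token.pos_)

print(Counter(token.pos_ for token in doc))
```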
7.7. Named entity recognition, disambiguation and linking
AVOBMAT currently identifies named entities, such as persons and places, in 16 languages. The number and types of named entities differ by language, as seen in Table 1. AVOBMAT creates various statistical tables and visualizations of these entities, which are also displayed in the full-text view. For English, it disambiguates the entities and links them to Wikidata, VIAF and ISNI (Figure 14).

Table 1
Named entity recognition in different languages.

Figure 14
Named entity recognition and linking in Dan Brown’s Da Vinci Code.
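The recognition step can be sketched with spaCy as below; the model name is an assumption, and the disambiguation and linking to Wikidata, VIAF and ISNI requires an additional entity-linking step not shown here.

```python
# Illustrative named entity recognition with spaCy. Linking to
# Wikidata/VIAF/ISNI would require a separate entity-linking step.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model
doc = nlp("Robert Langdon arrived at the Louvre in Paris.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Robert Langdon" PERSON, "Paris" GPE
```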
8. Exporting results and configurations, and publicizing databases
The ability to import and export the parameter settings in JSON format enhances the reproducibility and transparency of the experiments and results produced with the tool. Users can create templates for the pre-processing and analytical functions on the graphical interface. The tabular statistical data and visualizations of the performed analyses can be saved in PNG and different CSV formats, including a document-topic graph file for Gephi in the case of topic modelling; this enables researchers to use the generated data in other software. Users can also share their databases and make them public.
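A JSON parameter template might look like the sketch below; all field names are hypothetical, not AVOBMAT's actual schema.

```python
# Illustrative export of analysis parameters as a JSON template.
# Field names are hypothetical, not AVOBMAT's schema.
import json

config = {
    "language": "en",
    "spacy_model": "small",
    "lowercase": True,
    "remove_numbers": True,
    "min_token_length": 3,
    "topic_model": {"num_topics": 20, "iterations": 1000,
                    "alpha": 0.1, "beta": 0.01},
}

with open("avobmat_template.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
```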
Funding Information
The creation of the AVOBMAT software was partially funded by the EFOP-3.6.1-16-2016-00008, EFOP-3.6.3-VEKOP-16-2017-0002 and 2019-1.2.1-EGYETEMI-ÖKO-2019-00018 grants, and by the Humanities and Social Sciences Cluster of the Centre of Excellence for Interdisciplinary Research, Development and Innovation of the University of Szeged.
Competing Interests
The authors have no competing interests to declare.
Author Contributions
Róbert Péter: Supervision, Conceptualization, Methodology, Funding acquisition, Writing – review & editing
Zsolt Szántó: Software, Methodology
Zoltán Biacsi: Software
Gábor Berend: Supervision, Methodology
Vilmos Bilicki: Supervision, Methodology
