(1) Overview
Repository location
https://doi.org/10.34894/GQXX3K
Alternative URL: https://github.com/GOLEM-lab/Qidian_Webnovel_DataCollection
Context
Many researchers (Evans et al., 2017; Rebora et al., 2021) have explored the various aspects of Digital Social Reading (DSR), and one of the many focuses is the reader response (Pianzola et al., 2022; Koolen et al., 2022). However, current research on storytelling and reader response has overlooked cross-cultural comparisons, with only a few exceptions (Hu et al., 2023). This research gap has been identified not only in the field of DSR but also in comparative literary studies. The primary focus of the comparison is on understanding the cultural influences behind literature by examining authors as transcultural readers rather than investigating the perspectives of readers (Zhang & Lauer, 2017). Questions such as whether cultural settings influence the understanding of topics, characters, and plots, or if culture shapes reading in certain dimensions but not others, have not been extensively investigated (Zhang & Lauer, 2017). While a few studies (Chesnokova et al., 2017; Zhang, 2022) have attempted to bridge this gap by investigating reader interpretation, they are typically conducted in controlled, laboratory settings with selected participants. As a result, our understanding of how readers from diverse cultural backgrounds engage with literature in naturalistic, real-world contexts remains limited.
At the same time, the most frequently studied books and readerships remain those with distinguished popularity, commercial success, social impact, and scholarly prestige in the Anglophone world. These selections are subject to historical and socio-cultural biases such as classism, sexism, racism, and colonialism (Antoniak & Walsh, 2020; So & Wezerek, 2020; Hu et al., 2023), and might not be inclusive enough to understand reader response more broadly.
To address this gap, we introduce the Qidian-Webnovel Corpus, a dataset comprising 110 Chinese web novels available in both the original Chinese and their English translations. These works attract a global readership, with the largest reader populations located in North America, East Asia, and South Asia, according to user profiles.
(2) Method
The corpus was created through a systematic manual search process on Webnovel.com (hereafter Webnovel), beginning with the platform’s browse function. Webnovel organises content into three main categories: novels, comics, and fanfiction. This study focused exclusively on the “novel” category, as it aligns with the research objective of analysing narrative structures and corresponds to the primary content type available on the parallel Chinese platform, Qidian.com (hereafter Qidian).
The selection process employed specific filtering criteria on Webnovel’s browse interface. We set the content type filter to “Translate” to identify works that are human translations from Qidian, deliberately excluding the separate “MT” (Machine Translation) option to ensure translation quality and consistency across the corpus. The content status filter was set to “Complete” so that only finished novels were selected, which ensures narrative completeness and avoids issues with ongoing serialisation that might affect the stability of the dataset. Finally, the results were sorted by “Popular” to capture works with substantial reader engagement and cultural impact.
Following the initial filtering, each novel underwent manual verification to ensure sufficient reader engagement: researchers manually accessed the main page of each candidate novel to verify that it had accumulated at least 200 comments. This threshold was established to ensure robust data for reader response analysis and to focus on works that had generated a moderate amount of discussion within the reader community.
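The selection itself was carried out manually on the website, but the three criteria can be expressed as a simple predicate for anyone wishing to re-check a candidate list. The following is a minimal sketch: the `Candidate` record, its field names, the “Ongoing” status value, and the sample titles are all hypothetical and are not part of the released corpus; only the “Translate”, “Complete”, and 200-comment criteria come from the text above.

```python
from dataclasses import dataclass

MIN_COMMENTS = 200  # engagement threshold stated in the text


@dataclass
class Candidate:
    # Hypothetical record layout; field names are illustrative only.
    title: str
    content_type: str   # "Translate" (human translation) or "MT"
    status: str         # "Complete", or a placeholder such as "Ongoing"
    comment_count: int


def is_eligible(c: Candidate) -> bool:
    """Return True if a candidate passes all three selection criteria."""
    return (
        c.content_type == "Translate"
        and c.status == "Complete"
        and c.comment_count >= MIN_COMMENTS
    )


# Invented sample data, used only to exercise the predicate.
candidates = [
    Candidate("Example Novel A", "Translate", "Complete", 1520),
    Candidate("Example Novel B", "MT", "Complete", 900),
    Candidate("Example Novel C", "Translate", "Ongoing", 4000),
    Candidate("Example Novel D", "Translate", "Complete", 150),
]
selected = [c.title for c in candidates if is_eligible(c)]
print(selected)
```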
After compiling the initial list of Webnovel URLs and titles, a cross-platform mapping process was implemented to identify corresponding works on Qidian.com. This mapping utilised NovelUpdates.com as an intermediary resource, as this aggregator website provides comprehensive translation title information that facilitates accurate matching between platforms.
The mapping process proceeded as follows:
Webnovel titles were cross-referenced with NovelUpdates entries
Translation variants were identified and recorded
Using the matched titles, manual searches were conducted on Qidian
Original Qidian URLs were recorded for each successfully mapped work
Both Webnovel and Qidian embed unique book identifiers within their URL structures. These IDs were extracted and used as primary keys for systematic data collection through a combination of API-based collection and automated page retrieval. The websites’ backend API endpoints, which serve the publicly available content, were used to obtain user comments and associated URLs, while Playwright (a browser automation library) and Requests (a Python HTTP library) were occasionally employed to retrieve additional webpage content when necessary.
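As an illustration of the identifier extraction step, the sketch below pulls a numeric book ID from a book URL. The URL shapes in the comments are assumptions made for this example and are not documented in this paper, the IDs are invented, and `extract_book_id` is a hypothetical helper rather than the function used in the collection scripts.

```python
import re
from typing import Optional

# Assumed URL shapes (an assumption for illustration, not taken from the paper):
#   Webnovel: https://www.webnovel.com/book/<slug>_<digits>
#   Qidian:   https://www.qidian.com/book/<digits>/
_ID_PATTERN = re.compile(r"/book/(?:[^/]*_)?(\d+)/?$")


def extract_book_id(url: str) -> Optional[str]:
    """Return the numeric book ID at the end of a book URL, or None."""
    path = url.split("?")[0]  # ignore any query string
    match = _ID_PATTERN.search(path)
    return match.group(1) if match else None


# Invented example URLs and IDs.
print(extract_book_id("https://www.webnovel.com/book/example-novel_1234567890"))
print(extract_book_id("https://www.qidian.com/book/1234567/"))
```

Keeping the ID as a string avoids losing leading zeros or overflowing fixed-width integer columns when the values are later written to CSV.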
In addition to comment content, each comment and reply is associated with a unique user identifier, which enables the collection of reader profile metadata, including account level, bookshelf information, and user-specified location data. This demographic information provides valuable context for reader reception analysis; however, in accordance with privacy regulations, this user data is maintained exclusively for research purposes and will not be made publicly available.
The corpus was collected with a timestamp of 01/09/2024, ensuring temporal consistency across the dataset.
Due to access restrictions and copyright issues, data scraping was limited to publicly available chapters from both platforms. These chapters can be useful for narratological and literary analysis inasmuch as they encompass opening narratives, including world-building passages, character introductions, and initial plot developments. As for reader response, we collected the available commenting data from both platforms and categorised it at three levels:
Book level (general discussions about the novel as a whole)
Chapter level (comments specific to individual chapters)
Paragraph level (comments targeting specific paragraphs)
This multi-level categorisation enables researchers to analyse reader engagement at different scales of narrative granularity, from macro-level story reception to micro-level textual response patterns.
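The three levels can be thought of as a function of which anchors a comment carries. The sketch below assumes, hypothetically, that each comment record has optional `chapter_id` and `paragraph_id` fields; these names are illustrative and may not match the column names in the released CSV files.

```python
from collections import Counter
from typing import Optional, TypedDict


class Comment(TypedDict):
    # Hypothetical field names, used for illustration only.
    comment_id: str
    chapter_id: Optional[str]
    paragraph_id: Optional[str]


def comment_level(c: Comment) -> str:
    """Classify a comment as 'paragraph', 'chapter', or 'book' level."""
    if c["paragraph_id"] is not None:
        return "paragraph"
    if c["chapter_id"] is not None:
        return "chapter"
    return "book"


# Invented sample records.
comments: list[Comment] = [
    {"comment_id": "c1", "chapter_id": None, "paragraph_id": None},
    {"comment_id": "c2", "chapter_id": "ch3", "paragraph_id": None},
    {"comment_id": "c3", "chapter_id": "ch3", "paragraph_id": "p12"},
    {"comment_id": "c4", "chapter_id": "ch5", "paragraph_id": "p1"},
]
level_counts = Counter(comment_level(c) for c in comments)
print(level_counts)
```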
Quality Control
To maintain data integrity, several quality control measures were applied:
Verification: Ensuring that all selected novels were actually completed and had a corresponding original Chinese edition on Qidian.
Metadata standardisation: Normalising book titles, author names, and comment structures to ensure consistency across the dataset.
Copyright check: Removing novels whose original versions were no longer available due to license expiration.
(3) Dataset Description
This data collection process yielded a total of 120 candidate novels. However, the license for 10 of these novels on Qidian had expired, and it was difficult to locate comments on these novels on the Chinese platform, rendering them ineligible for inclusion. The final corpus consists of 110 stories (Yu et al., 2024). According to Webnovel’s categorisation visible on the website interface, these 110 stories consist of 103 Male Lead stories and 7 Female Lead stories. Male Lead stories (nanpin 男频) and Female Lead stories (nüpin 女频) are the two basic categories of Chinese web novels. These categories differ in protagonist gender, narrative perspective, character preferences, and plot elements. Male Lead stories often focus on the male protagonist’s individual growth, adventure, conquest and success, and feature harem elements (one male protagonist in romantic relationships with multiple female characters), while Female Lead stories are primarily written by women, reflect female aesthetic preferences and desires, and emphasise interpersonal relationships. Though both categories share similar genres (fantasy, cultivation, science fiction, mystery, historical fiction, romance, etc.), Male Lead stories tend to feature more Historical, Games, Eastern Fantasy, Horror, and Urban stories, while Female Lead stories include more urban romance, danmei (boys’ love), and romance narratives where romantic plots serve as the primary storyline. This classification represents an internal taxonomy within the web novel community rather than a prescriptive reading standard (China Writers Network, 2021). Table 1 summarises the availability of metadata fields in the corpus collected from Webnovel (English) and Qidian (Chinese).
Table 1
Metadata of Stories in the corpus.
| SOURCE | COMMENT ID | COMMENT CONTENT | REPLY ID | REPLY CONTENT | BOOK ID | USER ID | USER LEVEL | RATING SCORE | REPLY AMOUNT | LIKE AMOUNT | CREATE TIME | QUOTE REVIEW ID | QUOTE CONTENT | QUOTE USER ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Webnovel (EN) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Qidian (CN) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | – | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Table 2 shows more details about the metadata for stories, comments and replies. Based on the information provided, the interactions between comments and their replies can also be mapped. Tables 3 and 4 contain the reader profile information that we have for this corpus. They provide more information about readers’ reading footprint on the platform and some demographic information.
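A minimal sketch of how such a comment-reply mapping might be built from the ID fields listed in Table 1 is given below. It assumes a hypothetical row layout in which a reply row carries both its own `reply_id` and the `comment_id` of the comment it answers, and a top-level comment has an empty `reply_id`; the actual CSV layout and column names may differ.

```python
from collections import defaultdict

# Invented rows in an assumed layout (not the actual CSV schema).
rows = [
    {"comment_id": "c1", "reply_id": "", "user_id": "u1"},
    {"comment_id": "c1", "reply_id": "r1", "user_id": "u2"},
    {"comment_id": "c1", "reply_id": "r2", "user_id": "u3"},
    {"comment_id": "c2", "reply_id": "", "user_id": "u2"},
]


def build_threads(rows):
    """Group replies under their parent comment, keeping author IDs."""
    authors = {}                 # comment_id -> user_id of the commenter
    replies = defaultdict(list)  # comment_id -> [(reply_id, user_id), ...]
    for row in rows:
        if row["reply_id"]:
            replies[row["comment_id"]].append((row["reply_id"], row["user_id"]))
        else:
            authors[row["comment_id"]] = row["user_id"]
    return {
        cid: {"author": author, "replies": replies.get(cid, [])}
        for cid, author in authors.items()
    }


threads = build_threads(rows)
print(threads["c1"])
```

From such a structure, user-to-user interaction edges (replier to comment author) can be derived for network analysis.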
Table 2
Metadata for comments and replies.
| SOURCE | GENRES | CATEGORY/TAG | TOTAL COMMENTS | REPLIES | PRIMARY LANGUAGE | COMMENTS LANGUAGE DISTRIBUTION | REPLIES LANGUAGE DISTRIBUTION |
|---|---|---|---|---|---|---|---|
| Qidian (CN) | 14 | 27 | 2,791,837 | 855,577 | Chinese | Chinese: 95.7%; English: 0.1% | Chinese: 97.2%; English: 0.05% |
| Webnovel (EN) | 8 | 40 | 327,988 | 96,250 | English | English: 72.7%; Others: 27.3% | English: 68.2%; Others: 31.8% |
Table 3
Reader profile metadata.
| SOURCE | USER ID | USER NAME | GENDER | LEVEL INFO | WRITING DAYS | READING HOURS | NUM READ BOOKS | DESCRIPTION | DATE JOINED | LOCATION | NUM FOLLOWERS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Webnovel (EN) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | – |
| Qidian (CN) | ✓ | ✓ | ✓ | ✓ | – | – | ✓ | ✓ | – | ✓ | ✓ |
Table 4
Demographic Information.
| LOCATION | NUMBER | PERCENTAGE (%) |
|---|---|---|
| Global | 55,100 | 42.80 |
| United States | 15,891 | 12.30 |
| Philippines | 14,425 | 11.20 |
| India | 9,591 | 7.40 |
| Indonesia | 3,231 | 2.50 |
| Nigeria | 2,972 | 2.30 |
| Malaysia | 2,277 | 1.70 |
| Canada | 2,046 | 1.50 |
| Australia | 1,602 | 1.20 |
| United Kingdom | 1,584 | 1.20 |
| Brazil | 1,478 | 1.10 |
Repository name
DataverseNL
Object name
Qidian-Webnovel Corpus 110
Format names and versions
CSV
Version 1.0
Creation dates
Start Date: 2024-04-16
End Date: 2024-08-15
Dataset creators
Data Collector: Ze Yu
Project Leader: Ze Yu
Data Collector: Emin Tatar
Supervisor: Federico Pianzola
Depositor: Christina Elsenga
Language
Chinese; English
License
License for metadata: This part of the dataset is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
License for comments: The depersonalized reader comments data included in this dataset are subject to a separate Data Transfer Agreement (DTA). Users must review and accept the DTA before accessing or using the reader comments data. Researchers affiliated with universities or not-for-profit research institutes may use the restricted files of this dataset for conducting not-for-profit scientific research, in accordance with the DTA provided by the University of Groningen. This restriction is applied because user-generated data is copyrighted and cannot be openly reshared, even though it is accessible in the EU for research purposes, according to the Text and Data Mining Exception (Directive 2019/790). We make it available for research purposes to avoid the unnecessary repetition of the complex scraping process that we performed.
Publication date
2024-10-03
(4) Reuse Potential
The Qidian-Webnovel Corpus addresses a critical gap in computational literary studies, where influential research has predominantly relied on monolingual datasets or English-dominated corpora (Fiormonte, 2017). While multilingual initiatives are emerging (Fischer et al., 2019; Schöch et al., 2021; Hamilton & Piper, 2023; Viola, 2024), this dataset provides researchers with unprecedented opportunities for systematic cross-cultural literary analysis.
Researchers can leverage this bilingual corpus to investigate how readers from different linguistic and cultural backgrounds engage with the same narrative content. The parallel structure of original Chinese novels and their English translations enables comparative analysis of reader reception of narratives, cultural elements and translation. The dataset supports detailed examination of reader interactions, including how comments evolve across narrative progression at chapter and paragraph level and how readers from different cultures engage with specific narrative elements (Yu & Pianzola, 2024). For example, the corpus offers substantial potential for sentiment analysis research, which computationally examines opinions, emotions, and attitudes expressed in text (Liu, 2020; Kim & Klinger, 2021; Jockers, 2014, 2015; Reagan et al., 2016; Hipson & Mohammad, 2021; Elkins, 2022; Bizzoni & Feldkamp, 2023). Researchers can conduct comparative sentiment analysis across languages and cultural contexts. This enables investigation of how emotions are preserved, transformed, or reinterpreted in translation, enriching comparative analysis of both narrative reception and reader response.
The dataset further enables comparative studies of narrative dynamics, such as analyses of narrative space in character movement and world settings (Wilkens et al., 2024) and of cross-cultural responses to narrative openings, which play an important part in establishing fictional worlds (Rohrbacher, 2025).
Competing Interests
The authors have no competing interests to declare.
Author Contributions
Author: Ze Yu
Supervision: Federico Pianzola
Software: Emin Tatar
