
A Multimodal Dataset of Climate Change Narratives in China and the UK


(1) Overview

Repository location

Mendeley Data, V2: https://data.mendeley.com/datasets/bd32wp6vhy/2

DOI: https://doi.org/10.17632/bd32wp6vhy.2

Context

This dataset was produced as part of the collaborative research project, “Decoding Climate Narratives: A Comparative Study of Climate Change Communication on Social Media Platforms in China and the UK,” kindly funded by the UCL-PKU Strategic Partner Fund. The project addresses a significant gap in understanding how digital storytelling about climate issues differs between the UK and China. While both countries are major actors in global climate policy, their digital media environments and patterns of online climate communication diverge substantially. This dataset provides an empirical foundation for comparing public engagement with climate-related content, levels of trust in climate science and scientists, and the narrative strategies employed on video-sharing platforms.

(2) Method

Steps

The data collection process employed platform-specific strategies combining keyword retrieval, merging of separate rankings, and geographical filtering to ensure comparable, systematic, and comprehensive coverage. Data were collected from posts published between 18 May 2007 and 9 April 2025. The start date reflects the first substantial topical video identified by our filtering strategy. It also coincides with a period when, a decade after the 1997 Kyoto Protocol, climate politics had become a major international issue: emerging video-sharing platforms such as YouTube (launched in 2005) and BiliBili (launched in 2009), together with the surge in climate coverage following Al Gore’s 2006 documentary An Inconvenient Truth, brought climate change into mainstream public debate. BiliBili and YouTube were selected as the primary platforms for examining climate communication on social media. These leading video-sharing sites were chosen over microblogging services and other forms of social media because they are particularly well-suited to comparative analysis of climate-change communication. In contrast to microblogging platforms, they combine long-form video formats that support narrative analysis with rich multimodal affordances and well-established communities of science and educational content creators. The selection of BiliBili and YouTube serves a dual purpose: as well as being especially well-suited to narrative analysis, offering distinct advantages over other forms of social media, their audience bases and geographical presence provide a clear distinction between Western and East Asian narratives.

For BiliBili, we used the Chinese search term “气候变化”, which translates as “climate change”. Because the platform limits search retrieval to 1,000 results per query, we adopted a multi-sort strategy. Using Python scripts, we harvested search results ranked under four criteria: comprehensive sort, most played, most Danmaku (弹幕, bullet comments), and most favourited. The resulting lists were merged and deduplicated using unique video identifiers, yielding a final set of 2,092 BiliBili videos. For YouTube, we used the English keyword “climate change” to identify relevant content. To focus on videos likely to be watched by UK audiences, we queried the YouTube Data API v3 with the parameter regionCode='GB', which instructs the API to return search results for videos that can be viewed in the specified country.1 As with the BiliBili procedure, we collected results sorted by relevance, view count, and rating, and merged these lists to create a dataset of 706 unique videos.
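The multi-sort merge step described above can be sketched as follows. This is a minimal illustration, not the project’s actual script: the function concatenates several ranked ID lists and keeps the first occurrence of each identifier, and the BV-prefixed values are hypothetical stand-ins for BiliBili “bvid” identifiers.

```python
def merge_rankings(*ranked_lists):
    """Merge several ranked video-ID lists, keeping the first
    occurrence of each ID (order-preserving deduplication)."""
    seen = set()
    merged = []
    for ranking in ranked_lists:
        for video_id in ranking:
            if video_id not in seen:
                seen.add(video_id)
                merged.append(video_id)
    return merged

# Illustrative IDs only; real "bvid" values differ.
by_views = ["BV1a", "BV2b", "BV3c"]
by_danmaku = ["BV2b", "BV4d"]
by_favourites = ["BV3c", "BV5e"]

print(merge_rankings(by_views, by_danmaku, by_favourites))
# ['BV1a', 'BV2b', 'BV3c', 'BV4d', 'BV5e']
```

The same order-preserving merge applies to the three YouTube sort orders (relevance, view count, rating), with `video_id` as the deduplication key.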

The 2,092 BiliBili and 706 YouTube videos capture how social media users in China and the UK engage with climate-related content in their everyday online environments. The difference in sample sizes reflects platform-imposed retrieval limits: BiliBili returns up to 1,000 results per query and allows four sorting strategies, whereas the YouTube API provides three sorting strategies and returns fewer results per query. We addressed the sample size disparity in our analysis by focusing on engagement rates and thematic proportions, and by employing statistical techniques suitable for unequal sample sizes. We enriched metadata through additional queries to the YouTube API v3 and the BiliBili API (e.g. “uploader_country”, the country of origin of the video uploader), and used Requests, Selenium, and BeautifulSoup to extract extended metadata fields such as “video_description”. Metadata for both platforms, including YouTube engagement metrics, were retrieved on 9 April 2025 and reflect the values available at the time of collection.
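The extended-metadata extraction can be illustrated with a small parsing sketch. The HTML snippet and the meta-tag name below are assumptions for demonstration only, not the actual platform markup, and the stdlib `html.parser` stands in for the BeautifulSoup workflow described above.

```python
from html.parser import HTMLParser

class MetaDescriptionParser(HTMLParser):
    """Extract a description meta tag from a fetched video page
    (illustrative; real platform pages use different markup)."""
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and attr_map.get("name") == "description":
            self.description = attr_map.get("content")

# Hypothetical page fragment standing in for a downloaded video page.
page = '<html><head><meta name="description" content="A climate explainer"></head></html>'
parser = MetaDescriptionParser()
parser.feed(page)
print(parser.description)  # A climate explainer
```

In the actual pipeline the page source would come from Requests or Selenium rather than a literal string.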

All scripts used for data collection are archived in a public GitHub repository2 to ensure full reproducibility of the dataset construction workflow.

To obtain video transcripts, we relied on different strategies for the two platforms.

For YouTube videos, we collected the platform’s built-in subtitles directly, as these are generally of high quality and widely available. In contrast, for BiliBili videos, all transcripts were generated using the Tencent Cloud speech-to-text API, as platform-provided subtitles were generally unavailable or inconsistent in quality. These transcripts were produced automatically and may contain recognition errors, particularly in videos with limited speech or non-Mandarin/English audio.

All data were collected from publicly accessible videos on BiliBili and YouTube in accordance with the respective platforms’ terms of service. The dataset contains only publicly available content and metadata and does not include any private or personally sensitive information. The study as a whole received ethics approval from the Research Committee of the Department of Information Management, Peking University (Ethics approval number: A130010).

Sampling strategy

The sampling strategy combines keyword-based retrieval with platform ranking algorithms to account for both geographic (UK) and linguistic (Simplified Chinese) contexts. The multi-rank merging strategy reduces ranking bias and includes both highly visible and niche but relevant content. No additional inclusion thresholds were applied; all videos returned under the specified search parameters were included and deduplicated, excluding only duplicate or unavailable videos at the time of collection.

Quality control

All videos were deduplicated using unique identifiers (“video_id” for YouTube or “bvid” for BiliBili). A manual relevance check was performed on a random 5% sample to confirm that the content addressed climate-related topics. Engagement metrics such as likes and shares were retained in their raw integer format. All timestamps were normalised to the standard format YYYY-MM-DD HH:MM:SS.
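The timestamp normalisation step might look like the sketch below. The input formats (ISO 8601 strings with a trailing “Z”, Unix epoch seconds) are assumptions about what the platform APIs return; the output format matches the YYYY-MM-DD HH:MM:SS convention stated above.

```python
from datetime import datetime, timezone

def normalise_timestamp(value):
    """Convert an assumed API timestamp (ISO 8601 string or Unix
    epoch seconds) to the dataset's YYYY-MM-DD HH:MM:SS format (UTC)."""
    if isinstance(value, (int, float)):
        # Epoch seconds, as returned by some platform APIs.
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        # ISO 8601; fromisoformat() needs an explicit UTC offset.
        dt = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
    return dt.strftime("%Y-%m-%d %H:%M:%S")

print(normalise_timestamp("2025-04-09T12:30:00Z"))  # 2025-04-09 12:30:00
print(normalise_timestamp(0))                       # 1970-01-01 00:00:00
```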

(3) Dataset Description

Repository name

The dataset is hosted on the Mendeley Data platform (Mendeley Data, V2: https://data.mendeley.com/datasets/bd32wp6vhy/2; DOI: https://doi.org/10.17632/bd32wp6vhy.2).

Object name

Climate_Narratives_CN_UK_v1.0_214.7MB_2025.zip.

Format names and versions

The dataset consists of three main components:

  1. Metadata tables (.csv): CSV files containing descriptive variables and video identifiers;

  2. Transcripts (.txt): text files containing the transcribed audio content of each video, named using unique video identifiers;

  3. Cover images (.jpg/.png): video thumbnail images, also named using unique identifiers.
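Because transcripts and thumbnails share identifiers with the metadata tables, a reuser can pair components by video ID. The sketch below assumes a column named “video_id” and a flat `transcripts/` directory; both are illustrative, based on the component descriptions above rather than the archive’s exact layout.

```python
import csv
import io
from pathlib import Path

# Hypothetical two-column metadata table standing in for the real CSV.
metadata_csv = io.StringIO("video_id,title\nabc123,Climate explainer\n")
rows = list(csv.DictReader(metadata_csv))

# Transcript files are named by the same unique identifier.
transcript_path = Path("transcripts") / f"{rows[0]['video_id']}.txt"
print(transcript_path.as_posix())  # transcripts/abc123.txt
```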

Creation dates

Data collection was completed: 2025-04-09.

Dataset creators

Hengyi Li (Peking University): Collected video metadata and transcripts;

Pu Yan (Peking University): Designed the data framework;

Simon Mahony, Ulrich Tiedau, Andreas Vlachidis (UCL): Provided country contexts.

Language

BiliBili Data: Chinese (Simplified);

YouTube data: English (UK);

Variable names: English (UK) and standardised across both datasets for comparison.

License

CC-BY 4.0 International.

Publication date

2025-12-15.

(4) Reuse Potential

This dataset provides a substantial resource for multidisciplinary research beyond the original project’s focus on climate communication.

Cross-Cultural Narrative Analysis: The parallel structure of the dataset enables comparative study of how climate science, climate risks and related socio-political issues are framed across distinct media ecologies (Allgaier, 2019; Boykoff, 2011). Recent comparative work has also begun to examine cross-national climate narratives using digital trace data (Yan et al., 2022). The availability of full transcripts facilitates semantic analysis, topic modelling, discourse analysis, and the identification of diverging narrative strategies across Chinese- and English-language contexts.

Multimodal Analysis: The combined availability of video URLs, thumbnails, transcripts, and metadata allows researchers in computer vision, media studies, and multimodal communication to examine climate-related imagery, develop classification tasks, or conduct image-text alignment studies. The dataset is particularly suitable for research on visual framing and multimodal argumentation.

Sentiment and Stance Detection: Engagement metrics (e.g. likes, favourites, comment counts), alongside video content, provide a foundation for analysing sentiment, stance, and patterns of audience responsiveness within environmental communication (Yan et al., 2022). These metrics support studies of public attention, topic salience, and social popularity within environmental discourses.

Pedagogical use: The dataset is well suited for teaching applications, including courses in data journalism, digital humanities, social media analytics, and computational social science. Its bilingual nature (Chinese-English) also makes it valuable for methods training in comparative or cross-linguistic research contexts.

While this dataset offers rich opportunities for analysis, users should be aware of its inherent limitations. First, it focuses exclusively on video content and therefore does not include user comments. Additionally, the data collection is limited to BiliBili and YouTube in a specific timeframe (2007–2025), which should be taken into account when generalising any findings.

Additional File

The additional file for this article can be found as follows:

Supplementary A

Table of important variables in the metadata of videos. DOI: https://doi.org/10.5334/johd.509.s1

Notes

[1] For detailed API documentation, please refer to https://developers.google.com/youtube/v3/docs/search/list.

Acknowledgements

We gratefully acknowledge the research support provided by the student research assistants: Bingjun Liu and Lu Liu at UCL, and Nuo Chen and Lingyun Li at Peking University.

Competing Interests

The authors have no competing interests to declare.

Author contributions

Hengyi Li: Investigation, Visualisation

Pu Yan: Writing – original draft, Funding acquisition

Simon Mahony: Writing – review & editing, Funding acquisition

Ulrich Tiedau: Writing – review & editing, Funding acquisition

Andreas Vlachidis: Writing – review & editing, Funding acquisition

DOI: https://doi.org/10.5334/johd.509 | Journal eISSN: 2059-481X
Language: English
Submitted on: Jan 5, 2026 | Accepted on: Feb 20, 2026 | Published on: Mar 19, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Hengyi Li, Pu Yan, Simon Mahony, Ulrich Tiedau, Andreas Vlachidis, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.