(1) Overview
Repository location: https://zenodo.org/doi/10.5281/zenodo.10971589
Context
In the information age, social media posts and comments on current events provide a vast resource for insights into public opinion and online political discourse (Acerbi, 2020). The Philippines, which consistently ranks amongst the countries with the highest degrees of social media use (Baclig, 2022) with more than 87 million Facebook users as of 2023 (NapoleonCat, 2023), thus presents an apt target for data collection; however, there are few publicly accessible datasets that centre political comments on social media.
This dataset attempts to capture a wide spectrum of political comments on the transition from the former political entity of the Autonomous Region in Muslim Mindanao (ARMM) to the Bangsamoro Autonomous Region in Muslim Mindanao (BARMM), a critical turning point in the aspirations of Moros (the collective term for various Muslim peoples in the southern Philippines) for greater political autonomy. Three key events in the last decade are the focus of this data collection: 1) the Mamasapano Clash (2015) and 2) the Marawi Siege (2017), both widely reported encounters between the Philippine police or military and armed groups in ARMM; and 3) the establishment of BARMM (2019). This data collection further contributes to online resources for studying online views on Islam and Muslims in Southeast Asia, a topic often overshadowed by the volume of research on Western settings (Hashmi, Radzuwan, & Ahmad, 2021).
(2) Method
Steps
The comments in this dataset were gathered with Facepager (Jünger and Keyling, 2019), an open-source third-party application, from the official Facebook pages of widely distributed national newspapers (The Philippine Daily Inquirer, Manila Bulletin, The Philippine Star, and The Manila Times; see Chua, 2023) and of regional newspapers such as Sunstar Cebu, Cebu Daily News, The Freeman, Sunstar Davao, MindaNews, and The Mindanao Times (Table 1).
Table 1
Word Distribution per Corpus Type and Newspaper.
| NEWSPAPERS | CORPUS | TOTAL COMMENTS | TOTAL WORDS¹ |
|---|---|---|---|
| MindaNews | Main | 88 | 5444 |
| MindaNews | Side | 25 | 941 |
| The Mindanao Times | Main | 2 | 41 |
| The Mindanao Times | Side | 0 | 0 |
| Sunstar Davao | Main | 306 | 7350 |
| Sunstar Davao | Side | 69 | 5911 |
| Cebu Daily News | Main | 55 | 2251 |
| Cebu Daily News | Side | 68 | 2488 |
| The Freeman | Main | 186 | 8057 |
| The Freeman | Side | 41 | 1399 |
| Philippine Daily Inquirer | Main | 7419 | 441517 |
| Philippine Daily Inquirer | Side | 962 | 48723 |
| Manila Bulletin | Main | 214 | 8468 |
| Manila Bulletin | Side | 231 | 19273 |
| Manila Times | Main | 433 | 31803 |
| Manila Times | Side | 73 | 6478 |
| Philippine Star | Main | 1688 | 55045 |
| Philippine Star | Side | 399 | 20123 |
| Sunstar Cebu | Main | 136 | 5504 |
| Sunstar Cebu | Side | 83 | 2456 |
Sampling strategy
The dataset is part of a larger corpus of more than 7.6 million words containing Facebook posts and comments of the newspapers above. For each target event, posts and comments were collected from the month in which the event occurred and the succeeding month, which together comprise the main corpus, and from two random months in the same year, which form the side corpus. Corpus type is recorded in the dataset amongst other variables (Table 2).
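The main/side split described above can be sketched as a small labelling function. This is only an illustration under stated assumptions: the event start months are taken from the months named in the caption of Figure 3, the BARMM (2019) window is left out because its months are not given in the text, and the side months shown (`side_2015`) are hypothetical placeholders, since the actual random months are not listed here.

```python
from datetime import date

# Start months of the two events whose months are named in the Figure 3 caption.
# The BARMM (2019) window is omitted because the text does not state its months.
EVENT_START = {
    "Mamasapano Clash": (2015, 1),
    "Marawi Siege": (2017, 5),
}


def next_month(year, month):
    """Return the (year, month) pair following the given one."""
    return (year + 1, 1) if month == 12 else (year, month + 1)


def corpus_label(created, event_start, side_months):
    """Label a comment date as 'main', 'side', or None (outside both samples)."""
    ym = (created.year, created.month)
    if ym in (event_start, next_month(*event_start)):
        return "main"
    if ym in side_months:
        return "side"
    return None


# Hypothetical side months: the paper only says "two random months in the same year".
side_2015 = {(2015, 6), (2015, 10)}

for d in (date(2015, 1, 25), date(2015, 2, 14), date(2015, 6, 3), date(2015, 3, 1)):
    print(d.isoformat(), corpus_label(d, EVENT_START["Mamasapano Clash"], side_2015))
```

Keeping the side months as an explicit argument makes the sketch reusable once the actual sampled months are known.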
Table 2
Data Sources.
| FEATURE | SOURCE |
|---|---|
| object_id, message, message_proc (processed message), from_name (public page sources), created_time, newspaper | Facepager, Graph API |
| region (Luzon/Visayas/Mindanao), corpus (main/side), administration (President Benigno Aquino III/President Rodrigo Roa Duterte), year (year of posting), month_year (month and year of posting), count (word count) | Manually entered |
| lang_label (Tagalog (Filipino), English, Cebuano, Taglish, Bislish, Bislog, Other) | Computational |
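For readers who want to work with the CSV, the following is a minimal loading sketch using only the Python standard library. The column names come from Table 2; the column order, the date formats, and the two sample rows are invented placeholders rather than records from the dataset, and `load_rows` is a hypothetical helper name.

```python
import csv
import io

# Column names as listed in Table 2; their order in the real file is an assumption.
EXPECTED_COLUMNS = [
    "object_id", "message", "message_proc", "from_name", "created_time",
    "newspaper", "region", "corpus", "administration", "year",
    "month_year", "count", "lang_label",
]

# Two made-up rows standing in for the real CSV (formats are placeholders).
SAMPLE_CSV = "\n".join([
    ",".join(EXPECTED_COLUMNS),
    "1,Sample comment text,sample comment text,NAME,2015-01-26T03:12:45+0000,"
    "Philippine Daily Inquirer,Luzon,main,President Benigno Aquino III,2015,2015-01,3,English",
    "2,Another sample,another sample,NAME,2017-05-24T10:00:00+0000,"
    "Sunstar Davao,Mindanao,main,President Rodrigo Roa Duterte,2017,2017-05,2,Cebuano",
])


def load_rows(handle):
    """Read rows from a CSV handle, checking columns and casting the word count."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        row["count"] = int(row["count"])  # 'count' holds the word count per comment
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), "rows,", sum(r["count"] for r in rows), "words")
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle, for example `open(path, newline="", encoding="utf-8")`.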
A list of target words for filtering the corpus was developed by expanding a core set of terms² through fastText (Bojanowski et al., 2016), a library for the learning of word embeddings. The resulting list of 184 related terms in Filipino, English, and Cebuano was used to extract relevant comments from the corpus to form the present dataset. Due to the presence of multilingualism and codeswitching in the comments, a multilabel classifier trained on a set of 16,156 manually-encoded comments was used for language detection (see Cruz & Kestemont, 2022).
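The general idea of expanding seed terms through embedding similarity and then filtering comments can be sketched as follows. This is not the author's pipeline: it swaps fastText for a hand-written cosine similarity over invented three-dimensional toy vectors, and the seed term, the similarity threshold, and the helper names (`expand_terms`, `filter_comments`) are all assumptions made for illustration.

```python
import math
import re

# Toy 3-dimensional vectors, invented for illustration. Real fastText vectors
# have hundreds of dimensions and would come from a trained model.
TOY_VECTORS = {
    "bangsamoro": (0.9, 0.1, 0.0),
    "barmm": (0.8, 0.2, 0.1),
    "moro": (0.85, 0.15, 0.05),
    "basketball": (0.0, 0.1, 0.9),
    "weather": (0.1, 0.9, 0.1),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def expand_terms(seeds, vectors, threshold=0.95):
    """Add every vocabulary word whose similarity to a seed meets the threshold."""
    expanded = set(seeds)
    for seed in seeds:
        for word, vec in vectors.items():
            if cosine(vectors[seed], vec) >= threshold:
                expanded.add(word)
    return expanded


def filter_comments(comments, terms):
    """Keep comments containing at least one target term (case-insensitive)."""
    kept = []
    for text in comments:
        tokens = set(re.findall(r"\w+", text.lower()))
        if tokens & terms:
            kept.append(text)
    return kept


terms = expand_terms(["bangsamoro"], TOY_VECTORS)  # hypothetical seed term
print(sorted(terms))
print(filter_comments(["Support for BARMM!", "Nice weather today"], terms))
```

In practice the expanded list would normally be reviewed by hand before filtering, since nearest neighbours in an embedding space can include unrelated but distributionally similar words.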
The classifier revealed three codeswitched varieties in the dataset: Taglish (Filipino and English), Bislish (Cebuano and English) and Bislog (Cebuano and Filipino), alongside the dominant languages Filipino, English, and Cebuano. Most of the comments were written in Filipino or English (Figure 1), although language use varied considerably by newspaper (Figure 2). The distribution of comments in the dataset shows more comments from months associated with the target events (Figure 3), the Luzon region (Figure 4), and the Aquino administration (Figure 5).

Figure 1
Distribution of Languages in the Corpus.

Figure 2
Distribution of Languages per Newspaper. Normalization occurred row-wise (per newspaper).
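Row-wise normalization of this kind can be sketched as below: language counts are divided by each newspaper's own total, so every newspaper's shares sum to 1 regardless of how many comments it contributed. The records are toy values, not figures from the dataset, and `language_shares` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def language_shares(records):
    """Return, per newspaper, the share of comments in each language.

    `records` is an iterable of (newspaper, lang_label) pairs.
    """
    counts = defaultdict(Counter)
    for newspaper, lang in records:
        counts[newspaper][lang] += 1
    shares = {}
    for newspaper, langs in counts.items():
        total = sum(langs.values())  # row total for this newspaper
        shares[newspaper] = {lang: n / total for lang, n in langs.items()}
    return shares


# Toy records for illustration only.
toy = [
    ("Paper A", "English"), ("Paper A", "English"),
    ("Paper A", "English"), ("Paper A", "Taglish"),
    ("Paper B", "Cebuano"), ("Paper B", "Bislish"),
]
for paper, row in language_shares(toy).items():
    print(paper, row)
```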

Figure 3
Distribution of Comments per Date. The data here is skewed towards the months encompassing the Mamasapano Clash (January–February 2015) and the Marawi Siege (May–June 2017), due to the high concentration of key words.

Figure 4
Distribution of Comments per Region.

Figure 5
Distribution of Comments per Administration.
Quality control
The original and pre-processed messages do not include any markup for tags or hyperlinks, and all duplicates (based on message content and creation time) were removed. Only public pages are identified in the column “from_name”: the authors of all comments other than those made by the newspapers themselves are anonymized as ‘NAME’. Some error in language detection is also present, as language labels were computationally derived from the pre-processed version of the original message, with a detection accuracy of 92.44% (Cebuano), 92.77% (Filipino/Tagalog) and 96.27% (English) as described in Cruz & Kestemont (2022). Suspected mislabels can be checked against the original message.
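The two checks described above, duplicate removal keyed on message content and creation time and anonymization of non-newspaper authors, can be sketched as follows. The helper names and the toy rows are hypothetical, and the sketch assumes rows are dictionaries with the `message`, `created_time`, and `from_name` fields from Table 2.

```python
def drop_duplicates(rows):
    """Keep the first row for each (message, created_time) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["message"], row["created_time"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def anonymize(rows, newspaper_pages):
    """Replace every author that is not a newspaper page with 'NAME'."""
    result = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left unchanged
        if row["from_name"] not in newspaper_pages:
            row["from_name"] = "NAME"
        result.append(row)
    return result


# Toy rows for illustration only.
toy_rows = [
    {"message": "hello", "created_time": "t1", "from_name": "Some User"},
    {"message": "hello", "created_time": "t1", "from_name": "Some User"},
    {"message": "hello", "created_time": "t2", "from_name": "MindaNews"},
]
cleaned = anonymize(drop_duplicates(toy_rows), {"MindaNews"})
print(cleaned)
```

Identical messages posted at different times are kept, which matches the stated duplicate criterion of content plus creation time.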
(3) Dataset Description
Repository name
Zenodo
Object name
MMB151719SOCMED
Format names and versions
CSV
Creation dates
2021-03-17–2023-06-03
Dataset creators
Frances Cruz (Conceptualization, Curation, Methodology)
Language
Tagalog (Filipino), Cebuano, English, Taglish, Bislog, Bislish
License
CC BY-NC-SA 4.0
Publication date
2024-06-05
(4) Reuse Potential
The Philippine case demonstrates that social media pages have become a key forum for political discussions (Ressa, 2019; Curato, 2021), while events such as the Cambridge Analytica scandal have revealed the global scale on which social media is instrumentalized towards political ends (Confessore, 2018). Both point to the relevance of datasets such as this one in documenting and archiving such trends. This extends to political discussions on identities, as shown by Törnberg and Törnberg’s (2016) pioneering study, which employed computer-aided textual analytical methods to study social media views on Islam and Muslims in Sweden. In line with this, the dataset contributes towards research questions on the representation of Islam and Muslims in regions outside of the West: not only are there relatively few studies on Muslim representation centering the Southeast Asian region, but there are also few that deal with the impact of the region’s attendant discourses on social media (Ahmed & Matthes, 2017; Hashmi, Radzuwan, & Ahmad, 2021).
Code-switched and informal online texts have also been the subject of linguistic studies in the Philippines (Monderin & Go, 2021), a body of work that is likely to grow as language evolves online. The various languages included in the dataset, which include examples of codeswitched texts, can thus complement other datasets for the description of the informal register in major Philippine languages and otherwise low-resourced codeswitched varieties, for NLP tasks specific to social media texts, such as lemmatization and normalization of orthographic forms (Nocon et al., 2014), and in emotion, sentiment, or hate speech classification (Lapitan, Batista-Navarro, & Albacea, 2016; Camacho-Collados et al., 2022).
Lastly, apart from acting as a social media archive of important current events, the multilingual collection of social media texts can contribute towards research on semantic fields, word frequency and network analyses for terms related to minority identities across languages, such as in Taylor (2014), where different word associations were linked to the same terms across languages. This dataset can thus expand the exploration of semantic fields on matters of minority identities and conflict within Southeast Asian languages and contexts.
Notes
Acknowledgements
The author would like to thank Mike Kestemont for reviewing earlier drafts of this paper.
Funding information
This research was funded by a FRASDP grant from the University of the Philippines.
Competing interests
The author has no competing interests to declare.
Author contributions
Frances Antoinette Cruz: Conceptualization; Methodology; Data curation; Writing.
