In the past decade, the Singapore government and the Singapore Exchange (SGX) launched three policy initiatives aimed at enhancing the quality and frequency of sustainability reporting (SR) by publicly listed companies in Singapore. Because companies in Singapore are highly responsive to government demands, we anticipate a continuous evolution in the reporting practices within their annual reports over time, reflecting a trend towards greater compliance and alignment with regulatory requirements. In particular, we would expect companies to realign their operations in response to mounting concerns about climate change.
In their annual reports (as exemplified by Figure 1), company leaders would openly discuss how their organizations are proactively tackling the challenges and capitalizing on the opportunities presented by climate change, emphasizing their commitment to sustainability, and addressing these issues directly in the introduction to shareholders. Discussions concerning climate and sustainability issues should therefore be evident in the profiles of the reports, making them detectable through suitable statistical techniques that monitor the textual content of the documents over time. This prompted us to investigate whether any changes in word profiles or topic distributions could be identified in relation to the policy initiatives undertaken by the Singapore government.

Example of an SGX annual report from Yongnam Holdings Ltd.
To ensure the completeness and consistent formatting of the SR data, our primary research focus lies on the “Company Overview” and “Leader’s Words” sections of the report. These sections are consistently included and are recognized for covering the essence of the entire report.
Using BERT word embeddings (Devlin et al., 2018), we extract keywords from the two example reports, then apply dimensionality reduction with locally linear embedding (Roweis & Saul, 2000) to visualize them in two dimensions. Figure 2 demonstrates the unsupervised separation of specific keywords, offering the possibility for efficient investigation of individual report topics. This article aims to validate the feasibility of using commonly used machine learning techniques, specifically dynamic topic modeling (DTM) proposed by Blei & Lafferty (2006) and latent Dirichlet allocation (LDA) introduced by Blei (2003), for identifying potential emerging trends within the data. In diverse managerial contexts, topic modeling has been successfully applied to explore and analyze various aspects, including innovation, stock returns, and sector identification (Bommes et al., 2018; Zhang et al., 2016). However, our specific interest lies in determining whether LDA possesses sufficient sensitivity to discern the individual trees within the vast forest of a multi-industry dataset. The research question posed in this article is whether changes in the dynamics of the topic sequences can be identified.

Locally linear embedding on SGX report. China Star Food Group Ltd. and Yongnam Holdings Ltd. are food processing and steel manufacturing companies, respectively. Source code: https://github.com/QuantLet/Financial-Report-Analysis-BERT-W2V-GLOVE/blob/master/LLE_wordEmbeddings.ipynb.
To this end, we collected annual reports from SGX-listed companies to examine the impact of regulatory changes in the 2010–2020 period on both the identity narrative (how companies portrayed themselves) and the strategic narrative (what organizational leaders discussed about their strategic outlook for the preceding year). Our ex ante expectation was that applying state-of-the-art practices in topic modeling would enable us to identify emerging sustainability trends, and we anticipated observing more noticeable impacts on the strategic narrative compared to the identity narrative. As the SR guidance matured throughout the reporting period, there should be a noticeable and consistent rise in the frequency and importance of sustainability-related subjects discussed in both narratives.
SGX was formed in 1999. As of 2019, there were 640 Mainboard listings (Figure 3), comprising public companies that must meet specific profit and revenue thresholds, alongside 215 Catalist companies, which are categorized as younger and higher-risk firms.

SGX Companies’ sector distribution.
Our study period commenced in 2010, coinciding with the initiation of a public consultation process on SR (IFRS, 2011) in August of the same year. Table 1 presents SGX’s roadmap on SR and lists the critical events. SGX released its inaugural voluntary SR guidelines (SGX, 2011a, b). The guidelines primarily centered on defining the key aspects of SR, including the scope of reporting (who), the rationale for reporting (why), the reporting methodologies (how), the reporting timeframe (when), the disclosure platform (where), and the specific information companies should include in their reports (what).
SGX’s roadmap on sustainability reporting
SGX (2011b) expressed a clear stance on the significance of SR and offered “broad principles to guide listed companies in developing their SR frameworks.” The report explicitly stated that “Sustainability reporting is not a mandatory requirement for listed companies under the Listing Manual (SGX, 2011b, p. 7).” However, it strongly encouraged all companies to engage in reporting, particularly those operating in industries that (a) face environmental and social risks, (b) generate substantial environmental pollutants, (c) heavily rely on natural resources, and (d) operate within supply chains where end customers demand responsible behavior (SGX, 2011b, p. 9). Essentially, SGX (2011b) underscored climate change disclosures and biodiversity management as critical areas of environmental concern (SGX, 2011b, p. 11). A pivotal aspect of the disclosure guidance centered around the inclusion criteria, based on Rule 703 (SGX, 2011b, p. 13), which mandates that a listed company must disclose information necessary to prevent the creation of a false market in its securities or any information that could materially impact the price or value of its securities. The disclosure of sustainability issues may fall within the ambit of Rule 703. As the SR regime was primarily voluntary, Ch’ng (2015) discovered that by the end of 2013, 160 companies, accounting for 29.8% of the 537 listed companies on the Mainboard of SGX, actively communicated their sustainability practices.
During the Singapore Compact Summit on 17 October 2014, SGX chief executive officer (CEO) Magnus Bocker announced the bourse would impose mandatory SR through a “comply or explain” approach. The CEO stated that SGX would initiate a 1-year study and consultation process, following which companies would likely be given a 2-year timeframe before the new rules for mandatory SR would be officially implemented. The consultation process comprised focus groups involving listed companies, alongside surveys distributed to institutional investors and sustainability professionals. According to SGX representative Michael Tang’s account in Schillebeeckx (2019), as of May 2015, the adoption of SR among SGX-listed firms was still limited. In January 2016, Singapore initiated the public consultation on the proposed listing rules concerning SR (Schillebeeckx, 2019), and by mid-2016, the responses were gathered and analyzed (SGX, 2016a).
Later, SGX (2016b) introduced new requirements mandating listed companies to conduct annual sustainability reviews and launched a comprehensive 30-page SR Guide on 20 June 2016. It included references to various reporting standards and frameworks, such as those of the GRI (Global Reporting Initiative), SASB (Sustainability Accounting Standards Board), and IIRC (International Integrated Reporting Council). These guidelines highlighted that issuers “may consider provisions of the Climate Disclosure Standards Board or the Carbon Disclosure Project” (SGX, 2016b, p. 11). Explicitly referencing the text from SGX (2011b), the new SR Guide (SGX, 2016b) retained both climate change disclosures and biodiversity management as salient environmental topics. The “comply or explain” approach was officially formalized and scheduled to commence from the financial year 2017, with a 1-year grace period. According to Fang & Malhotra (2016) from PwC (PricewaterhouseCoopers), companies whose financial year ended on 31 December 2017 were required to publish their inaugural SR by 31 December 2018. Starting from financial years ending in 2018 and onward, the guidance recommended that companies should publish their reports within 5 months of the closing of the financial year, which is 1 month after the annual reports are due.
SGX’s “Comply or explain” SR framework went into effect in 2018. SGX (2018b) detailed the contents of the sustainability report. Climate change was explicitly mentioned, but biodiversity was no longer included. Alongside this guide, an investor guide titled “Reading Sustainability Reports” was published on 7 December 2018 (SGX, 2018a). The proposed framework consisted of five components, including the identification of material ESG factors through stakeholder consultation, disclosure of policies, practices, and performance with quantitative data, setting targets based on investor feedback, selecting and disclosing the SR framework(s) for comparability, and providing a Board statement acknowledging compliance or explaining any deviations (SGX, 2018a, p. 4).
As of 2019, 99.8% of SGX-listed companies had published an SR, with any instances of non-compliance attributed solely to delays in releasing the report. As stated by Loh & Tang (2021), this evidence indicates that the regulation concerning SR has been highly effective.
Four student research assistants (RAs) collected annual reports from the 380 firms listed on SGX. After conducting data collection and cleaning processes, the text data were extracted from approximately 364 to 368 annual reports each year, spanning from 2010 to 2018, through a combination of optical character recognition (OCR) and manual validation methods. However, not all annual reports could be found for every selected company in every year, and in some instances, OCR extraction was unsuccessful. With data collected from 48.79% to 49.32% of the listed companies in the sample, our findings reflect the state of the art. To ensure data consistency and focus on representative content, our main analysis centers on the Company Overview and Leader’s Words sections of the report, as these typically represent the essence of the entire document.
To streamline our research, we focused on identifying keywords that represented the most significant environmental topic, “climate change.” At the outset, we performed straightforward word counts to discern trends within the data (Tables 2 and 3). The primary observation from this high-level data analysis is that there is no evident continuous increase in attention toward climate-related topics in either narrative. Instead, we observe a decrease in attention toward climate topics until 2014, followed by a slight pick-up in attention thereafter. This observation suggests that companies initially ignored the voluntary guidance and became responsive to the trend of SR only after SGX announced the forthcoming mandatory approach in 2014.
Identity narrative.
| Year | Number of reports | Sustainable | GHG | Carbon | Climate change |
|---|---|---|---|---|---|
| 2010 | 365 | 33 | 3 | 6 | 1 |
| 2011 | 366 | 35 | 2 | 5 | 0 |
| 2012 | 368 | 46 | 2 | 8 | 0 |
| 2013 | 368 | 59 | 2 | 8 | 0 |
| 2014 | 365 | 40 | 2 | 9 | 0 |
| 2015 | 366 | 55 | 2 | 8 | 0 |
| 2016 | 365 | 64 | 2 | 12 | 0 |
| 2017 | 367 | 65 | 0 | 8 | 0 |
| 2018 | 364 | 57 | 0 | 9 | 0 |
| Total | 3,294 | 454 | 15 | 73 | 1 |
Note: GHG = greenhouse gas.
Strategic narrative.
| Year | Number of reports | Sustainable | GHG | Carbon | Climate change |
|---|---|---|---|---|---|
| 2010 | 365 | 191 | 3 | 33 | 7 |
| 2011 | 366 | 211 | 4 | 28 | 4 |
| 2012 | 368 | 226 | 2 | 28 | 2 |
| 2013 | 368 | 263 | 1 | 11 | 1 |
| 2014 | 365 | 256 | 3 | 9 | 0 |
| 2015 | 366 | 297 | 1 | 10 | 7 |
| 2016 | 365 | 318 | 4 | 14 | 2 |
| 2017 | 367 | 446 | 0 | 18 | 7 |
| 2018 | 364 | 456 | 0 | 26 | 7 |
| Total | 3,294 | 2,664 | 18 | 177 | 37 |
Note: GHG = greenhouse gas.
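The keyword counts reported in Tables 2 and 3 can be approximated with a simple tally per year. The following is a minimal sketch, not the procedure used in the study: the keyword list mirrors the table columns, but the matching rule (case-insensitive, whole-word, exact term) and the function names `count_keywords` and `counts_by_year` are assumptions made for illustration, and the toy texts are hypothetical stand-ins for the extracted report sections.

```python
import re
from collections import defaultdict

# Terms taken from the table columns; exact-term matching is an assumption.
KEYWORDS = ["sustainable", "ghg", "carbon", "climate change"]


def count_keywords(text, keywords=KEYWORDS):
    """Count case-insensitive whole-word occurrences of each keyword."""
    lowered = text.lower()
    return {
        kw: len(re.findall(r"\b" + re.escape(kw) + r"\b", lowered))
        for kw in keywords
    }


def counts_by_year(reports):
    """Aggregate keyword counts over an iterable of (year, text) pairs."""
    totals = defaultdict(lambda: dict.fromkeys(KEYWORDS, 0))
    for year, text in reports:
        for kw, n in count_keywords(text).items():
            totals[year][kw] += n
    return dict(totals)


# Hypothetical toy texts, not taken from the dataset.
toy_reports = [
    (2010, "We pursue sustainable growth and track carbon emissions."),
    (2010, "Climate change is a risk; carbon pricing may affect margins."),
    (2011, "Our sustainable strategy reduced GHG output."),
]
yearly = counts_by_year(toy_reports)
for year in sorted(yearly):
    print(year, yearly[year])
```

Exact-term matching is the most conservative choice; counting stems such as "sustainability" would require a different pattern and would change the totals.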
By ceasing the dependency on dominating measure – i.e.,
Topic modeling is a potent technique for data mining, latent data discovery, and revealing relationships in text documents. Particularly, LDA topic models are increasingly used to process and comprehend large textual corpora (Chauhan & Shah, 2021).
LDA is an unsupervised classification algorithm designed for corpus modeling. It is widely acknowledged as one of the most effective techniques for text summarization and information retrieval, presenting information in a concise and understandable form (Chauhan & Shah, 2021), which is especially valuable in our case, where we deal with complex multi-industry corpora.
It is a generative probabilistic model in which each topic is a probability distribution over an underlying vocabulary and each document is a mixture of topics, so that new documents can be generated by first sampling a topic and then sampling a word from that topic, as detailed in Figure 4. LDA assumes that the words with the highest probabilities within each topic generally offer a meaningful insight into the essence of the topic. Applications can be found within financial and managerial domains, such as event studies (Dyer et al., 2017), investor sentiment analysis (Feuerriegel & Prollochs, 2021), and corporate organization research (Culasso et al., 2023).

LDA plate notation. The subscript is dropped in the diagram.
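The generative process described above can be illustrated with a minimal sketch that uses only the Python standard library. It simulates documents from an LDA-style model and does not perform inference. The symmetric Dirichlet hyperparameters `alpha` and `eta`, the helper names, and the toy vocabulary are assumptions introduced for illustration and are not taken from the article.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha=0.5, eta=0.5, seed=0):
    """Simulate documents from the LDA generative process."""
    rng = random.Random(seed)
    # Each topic is a probability distribution over the vocabulary.
    topics = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # Each document has its own mixture over topics.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]  # pick a topic
            w = rng.choices(vocab, weights=topics[z])[0]        # pick a word
            words.append(w)
        docs.append(words)
    return topics, docs


# Hypothetical toy vocabulary.
vocab = ["steel", "construction", "food", "snack",
         "revenue", "profit", "carbon", "sustainable"]
topics, docs = generate_corpus(vocab, n_topics=3, n_docs=5, doc_len=20)
print(docs[0])
```

Fitting LDA reverses this process: given only the observed words, it estimates the topic-word distributions and the per-document topic mixtures.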
Using a given text collection (corpus) containing
Topic coherence measures (or confirmation measures) aim to evaluate individual topics by quantifying the semantic similarity among the top-scoring words within each topic, approximating human judgment to determine the optimal number of topics for a given corpus, i.e., Company Overview corpus, Leader’s Words corpus. Several proxies have been proposed (Campagnolo et al., 2022). One state-of-the-art method used as a proxy for topic coherence is the
This section presents the empirical results, focusing on two corpora: Company Overview and Leader’s Words. We begin by assessing the topic coherence and then proceed with a discussion of the identified topics.
This subsection examines how the optimal number of topics for the LDA model is determined and how it evolves over time. However, the findings reveal that the topics achieving higher coherence scores for both Company Overview and Leader’s Words corpora do not exhibit any discernible pattern or consistency as time progresses. Among all the optimal topics extracted, only one is found in 2012–2013 to be related to sustainability.
Figures A3 and A4 demonstrate that in both corpora, the coherence scores exhibit a jagged pattern without showing any clear increasing or decreasing trend as the number of topics increases. This could be attributed to the high variety of industries represented in the collected annual reports, which makes it challenging to obtain coherent topics and to determine the optimal number of topics. Figure A5 shows the optimal numbers of topics. During the last 2 years (2017–2018), both corpora exhibit higher coherence scores with a larger number of topics, indicating that the writing style among companies is becoming more diverse. Furthermore, our observations reveal a rising trend in the Leader’s Words corpus, contrasted by a declining trend in the Company Overview corpus. This suggests that over time, Company Overviews become increasingly limited in their content scope, while Leader’s Words continue to exhibit a sparse coverage of topics.
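To make the notion of a coherence score concrete, the sketch below computes the UMass coherence of a single topic from document co-occurrence counts. This is an illustrative stand-in: the coherence measure actually used in the study is not recoverable from the text here, so UMass is swapped in as an assumption, and the function name `umass_coherence` and the toy documents are hypothetical.

```python
import math


def umass_coherence(top_words, documents):
    """UMass coherence for one topic.

    top_words must be ordered by descending topic probability.
    documents is a list of token lists.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all given words.
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # skip words that never occur in the corpus
            co = doc_freq(top_words[m], top_words[l])
            score += math.log((co + 1) / d_l)
    return score


# Hypothetical toy documents from two different industries.
toy_docs = [
    ["steel", "construction", "project", "contract"],
    ["steel", "construction", "engineering"],
    ["food", "snack", "retail", "brand"],
    ["food", "snack", "brand"],
]
coherent_score = umass_coherence(["steel", "construction", "project"], toy_docs)
mixed_score = umass_coherence(["steel", "snack", "brand"], toy_docs)
print(coherent_score, mixed_score)
```

A topic whose top words co-occur in the same documents scores higher than one that mixes words from unrelated industries. Averaging such scores over all topics of a fitted model, and repeating this for each candidate number of topics, yields the kind of coherence curve discussed above.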
We adhered to best practices to comprehend the LDA topics (Jelodar et al., 2019, Maier et al., 2018). Two of the three co-authors independently reviewed each topic from both the strategic and identity narratives, examining the word distribution figures for the first 20 words. After this step, we engaged three RAs to perform the same task.
The three RAs worked independently and submitted their findings individually to one co-author. The co-author then compared the three answer sets from the students and the two answer sets from the co-authors to identify commonalities and discrepancies. A week later, the lead author gathered the RAs for a discussion to address areas of disagreement and achieve consensus on a single topic for each word distribution in each year. For example, in Figure A6, topic 6 is labeled as “medical and healthcare,” topic 9 as “tourism,” and topic 10 as “fossil fuel.”
A wide range of topics is linked to specific industries, and topics related to corporate performance (i.e., topics 1 and 2 in Figure A14) are consistently present across all Leader’s Words corpora. However, sustainability-focused topics are solely evident within the Leader’s Words corpus, constrained to the sphere of the industrial sector during the years 2012–2013 and 2017–2018 (Figure 5).

Limited presence of sustainability-related topics. Note that the x-axis is represented in percentage (%).
During the period 2012–2013, SR remained a voluntary undertaking, but significant progress and more practical discussions on the matter gained momentum beyond that period. In the years 2017–2018, SGX formalized and initiated the mandatory implementation of SR using the “comply or explain” approach. The corpora reveal emerging topics on sustainability, but they are only observed during the periods of pre-deployment of SGX policies when there were many significant discussions on the subject.
As of now, SR has become mandatory, and keywords related to sustainability can be observed in corporations’ annual reports. However, our NLP analysis indicates that sustainability is not the main focus of these reports. While previous studies, such as Loh et al. (2017) and Chen (2024), highlight a positive correlation between SR and a firm’s market value, as well as SR’s potential to reduce investment risk, our analysis reveals limited substantive content to support a significant discourse on sustainability-related topics. This suggests that despite the potential benefits of SR, the actual depth of discussion around sustainability within reports remains shallow.
Over the past decade, emerging environmental topics such as sustainability and climate change have been attracting substantial attention. SGX, being one of the leading exchanges worldwide, has responded by taking numerous policy initiatives related to these crucial subjects. This article examines 11 years of annual reports from publicly listed corporations in Singapore and applies unsupervised topic modeling techniques, specifically LDA, to determine whether these subjects have been adequately reflected. We identify sustainability-related topics that only occurred within the industrial sector corpora for the years 2012–2013 and 2017–2018, which align with the pre-deployment of SGX policies. However, we do not observe any sustainability-related topics following additional regulatory changes. Specifically, the Company Overview corpora are often identified as industry-related topics (e.g., medical and healthcare, tourism), while the Leader’s Words corpora are associated with company performance topics. Despite the fact that SR has become mandatory, our approach fails to detect sustainability-related topics in either the Company Overview or the Leader’s Words.
Unsupervised learning techniques like LDA can automate report reviews, potentially improving auditing processes, corporate management analysis, and assessments of company culture. However, the complexity of datasets spanning multiple industries poses a significant challenge for unsupervised topic modeling, particularly in detecting policy shifts and identifying changes in managerial focus. Additionally, the sparsity of the data may hinder the convergence to representative topics, leading to issues like low topic coherence. As part of future work, we plan to integrate large language models to achieve more robust results, given their capability to handle complex corpora and incorporate more content from the sustainability reports.
This research was supported by Deutsche Forschungsgemeinschaft (DFG) via IRTG 1792 “High Dimensional Nonstationary Time Series,” Humboldt-Universität zu Berlin; and the project “IDA Institute of Digital Assets,” contract number CF166/15.11.2022, financed under Romania’s National Recovery and Resilience Plan, Apel nr. PNRR-III-C9-2022-I8. All the supplementary materials and source codes are found in the
Dr. Simon Schillebeeckx gratefully acknowledges the support from the ASEAN Business Research Initiative (ABRI) grant #G17C20408 which made the original data collection possible.
Xinwen Ni: Data curation, Software, Visualization, Writing – original draft. Min-Bin Lin: Conceptualization, Writing – original draft, Writing – review & editing. Simon J. D. Schillebeeckx: Data curation, Investigation, Literature review. Wolfgang Karl Härdle: Conceptualization, Methodology, Formal analysis, Writing – review & editing.
Authors state no conflict of interest.
The raw data used in this study are available upon request.