(1) Context and motivation
With the expansion of digital humanities and cultural analytics methodologies into textual analysis fields such as English literature and rhetoric and composition (Ridolfo & Hart-Davidson 2015), instructors have an ongoing need for data that supports computational text analysis pedagogy and projects. The City-Data.com Corpus addresses this need by providing instructors and students with textual data that is amenable to a range of text analysis methods and well-suited for methodological experimentation. To demonstrate the affordances of the City-Data.com Corpus for benchmarking novel computational text analysis techniques, I stage a comparison between a novel topic modeling method employing sentence embeddings and latent Dirichlet allocation (LDA) approaches.
For qualitative researchers, the City-Data.com Corpus provides opportunities for studying asynchronous online interactions about civic issues concerning the city of Philadelphia, PA. The Coronavirus discussion thread, for example, provides a snapshot of early reactions to the COVID-19 pandemic. The Plan, Crime, and Retail discussion threads, on the other hand, evidence sustained conversations over several years.
The City-Data.com Corpus can serve as the basis for digital humanities pedagogical lessons and experiments, including network modeling, timeseries analysis, processing HTML for text analysis, and the full gamut of computational text analysis techniques such as term weighting, word and sentence embedding, machine classification, named entity extraction, and topic modeling. The scope of the City-Data.com Forum data is small enough to be read fully, enabling analysts to reconcile human assessment with statistical results.
Lastly, the titles of discussion threads make the City-Data.com Corpus amenable to classification tasks, as with Ken Lang’s 20 Newsgroups Dataset. I leverage the weakly annotated forum posts to benchmark a novel topic modeling method that employs kmeans clustering and sentence embeddings against the widely used latent Dirichlet allocation (LDA).
(1.1) Limitations and Caveats
City-Data.com forum rules prohibit flaming, hate speech, and trolling; however, posts do feature frank depictions of violence and potentially controversial statements that some may find offensive.
While the discussion board format of City-Data.com gives users a direct means to quote previous posts when replying, quoting is not a strict requirement. Some replies attend to the content of previous posts without explicit attribution. Thus, some replies cannot be identified by the presence of a “Quoted” field. Moreover, users can be selective about what they cite, so the quote_body and quote content may not always present the full content of the original post. Lastly, because forum users can post across forums, a quoted post may originate on a page that was not accessed during scraping, leaving gaps in the quote_body and quote content.
(2) Dataset description
The City-Data.com Corpus consists of five discussion forum threads from the site, city-data.com (Advameg, Inc., n.d.a) scraped using the BeautifulSoup Python library (Richardson 2007). City-Data.com aggregates geographic, demographic, and historical information about major cities across the United States and Canada from public and governmental sources. The site provides data visualizations that enable users to compare cities. As such, City-Data.com targets audiences who are traveling or relocating, real estate professionals, and advertisers. City-Data.com also hosts discussion forums by region and topicality (e.g., Classified Ads or Food and Drink). City-Data.com forums are moderated and prohibit trolling, hate speech, spam, doxxing, and cross-posting (Advameg, Inc. n.d.b). Moreover, posters are advised to stay on topic and avoid personal attacks.
At the time of this writing, the City-Data.com forums house 2,940,053 threads and 62,355,814 posts authored by 2,476,620 members (Advameg, Inc., n.d.c). In comparison with other social media sites enlisted for data modeling, machine learning, and network analysis, City-Data.com Forums has a small footprint. Reddit, by contrast, boasts over 13 billion posts and comments (Reddit Inc, 2023).
The City-Data.com Corpus contains 15,008 posts across five forum topics from the US > Pennsylvania > Philadelphia forums. Table 1 presents forum title, post count, date ranges, and summary descriptions of the City-Data.com Corpus.
Table 1
City-Data.com Corpus post count, date range, and summary description per forum.
| THREAD TOPIC TITLE | LABEL | SAMPLES | DATE RANGE | SUMMARY |
|---|---|---|---|---|
| How’s everyone doing amongst the Coronavirus shut down? (home, movies) (Advameg, Inc 2020) | coronavirus | 481 | 2020/03/16 – 2020/07/27 | Posts discuss experiences with public shutdowns in the opening months of the COVID-19 pandemic in Philadelphia. Content dwells more on governmental interventions and the status of public services than on the etiology or effects of COVID-19. |
| “Official Greater Philadelphia Area Crime Thread” (Advameg, Inc, 2013) | crime | 2,402 | 2013/04/11 – 2020/01/15 | Posts discuss and share information about crime in Philadelphia. |
| Official Philadelphia Metro Crime Thread (York, Chester: apartment complexes, houses, unemployment) (Advameg, Inc, 2012a) | crime | 1,284 | 2012/01/12 – 2013/03/11 | Posts discuss and share information about crime in the Philadelphia metro area. |
| Philadelphia 2035 (Houston: foreclosure, neighborhoods, wage) (Advameg, Inc. 2011) | plan | 6,796 | 2011/06/14 – 2020/01/14 | Posts discuss Philadelphia’s 2035 civic renovation plan authored by the Philadelphia City Planning Commission (2023). |
| Retail coming to Philadelphia (Penn, Burlington: real estate, house, buying) (Advameg, Inc. 2012b) | retail | 4,045 | 2012/11/27 – 2020/01/20 | Posts discuss new retail business developments in Philadelphia. |
Post-level data captured includes the following fields (see Table 2):
Table 2
Tabular data model for City-Data.com forum posts.
| POST_ID | POST_BODY | POST | DATETIME | QUOTE_ID | QUOTE_BODY | QUOTE | FORUM |
|---|---|---|---|---|---|---|---|
| Numerical post identifier | Post HTML | Post text | YYYY-MM-DD HH:MM:SS | Numerical post identifier | Quoted reply HTML | Quoted reply text. | forum title label |
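As a minimal sketch of how the data model in Table 2 might be read for analysis, the following uses only the Python standard library. The lowercase column names, the two sample rows, and the `load_posts` helper are assumptions made for illustration; the published CSV may use different names or casing.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical two-row sample that mimics the Table 2 layout; the real
# column names and casing in the published CSV may differ.
SAMPLE_CSV = """post_id,post_body,post,datetime,quote_id,quote_body,quote,forum
101,<div>Stores are reopening.</div>,Stores are reopening.,2020-06-01 09:15:00,,,,retail
102,<div>Good news.</div>,Good news.,2020-06-01 10:30:00,101,<div>Stores are reopening.</div>,Stores are reopening.,retail
"""

def load_posts(handle):
    """Read rows into dicts, parsing the timestamp and mapping empty quote fields to None."""
    posts = []
    for row in csv.DictReader(handle):
        row["datetime"] = datetime.strptime(row["datetime"], "%Y-%m-%d %H:%M:%S")
        for field in ("quote_id", "quote_body", "quote"):
            row[field] = row[field] or None
        posts.append(row)
    return posts

posts = load_posts(io.StringIO(SAMPLE_CSV))
posts_per_forum = Counter(p["forum"] for p in posts)       # post count per label
replies = [p for p in posts if p["quote_id"] is not None]  # posts that quote another post
```

To read the actual file, the `io.StringIO` object would be replaced with an open file handle. Treating empty quote fields as missing is a design choice here, since not every post quotes another.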
Object name
City-Data.com Corpus
Format names and versions
CSV
Creation dates
07/14/2022 – 08/14/2022
Dataset creators
Ryan M. Omizo (Temple University)
Language
English
License
Creative Commons Attribution 4.0 International
Repository name
Zenodo 10.5281/zenodo.10086354
Publication date
2023-11-09
(3) Method
Topic modeling is a method through which the latent structure of documents is inferred from lists of terms (topics) that collocate with high statistical significance. For example, latent Dirichlet allocation (LDA) (Blei et al. 2003; Steyvers & Griffiths 2007; Hoffman et al. 2010) creates a topic model by generating probability distributions over words. These probability distributions designate the word features that would most likely generate the documents under the model. While some topic modeling procedures like latent semantic indexing (Deerwester et al. 1990) do not produce human-interpretable topic models, most current topic modeling procedures like LDA produce term lists that order the most influential features of a topic. There are general limitations to LDA, however. As Vayansky and Kumar (2020) note in their review of topic modeling methods, LDA performance suffers when applied to short texts. Moreover, bag-of-words document representations used in many traditional LDA approaches are less informative than more advanced embedding-based representations, which can capture the dense relationships between words in context. For this reason, embedding-based approaches have been explored (Bhatia et al. 2016; Grootendorst 2022; Angelov 2020; Aharoni & Goldberg 2020; Bianchi et al. 2021a; Bianchi et al. 2021b; Zhang et al. 2022; Limwattana & Prom-on 2021). Unlike bag-of-words approaches, which capture the presence or absence of terms in documents, sentence embeddings encode word contexts derived from large language models pretrained on vast corpora. The richness of sentence embedding representations recommends their incorporation in topic modeling approaches. The density of information should lead to more nuance when grouping documents into topics, which should in turn lead to more relevant topical term lists.
My method for performing sentence embedding-based topic models (henceforth, SE-Topics) follows Grootendorst’s (2022) pipeline that involves transforming texts into sentence embeddings, clustering sentence embeddings, and extracting representative word lists or topics. To group the thread embeddings, I use scikit-learn’s (Pedregosa et al. 2011) kmeans clustering implementation.
To extract representative words, I adapt Grootendorst’s (2022) “class based TFIDF” approach. Posts assigned to the same cluster are merged into a single document. However, prior work with these approaches indicates that frequency counts lead to better topical quality than TFIDF for SE-Topic and LDA topic modeling, so I rank terms by frequency counts. This topic modeling approach emphasizes what Grootendorst (2022) describes as pipeline “modularity.” Because the step to derive significant textual features is separate from the embedding and clustering steps, this modularity enables people to replace or extend a phase in the topic modeling process. For example, dimensionality reduction can be applied to sentence embeddings to accelerate clustering; clustering algorithms can be exchanged (e.g., spectral or density-based clustering can be used); frequency distributions can be replaced with TFIDF vectorization when deriving term topics.1
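The term extraction step described above can be sketched as follows. This is an illustrative approximation, not the author's script: the tokenizer, the short stop list, the sample posts, and the cluster labels are all hypothetical, and the function name `cluster_term_lists` is introduced here only for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop list for illustration; a real run would use a fuller one.
STOPWORDS = {"the", "a", "is", "in", "on", "and", "to", "of"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def cluster_term_lists(posts, labels, top_n=10):
    """Merge posts by cluster label and return each cluster's most frequent terms."""
    merged = defaultdict(Counter)
    for text, label in zip(posts, labels):
        merged[label].update(tokenize(text))
    return {
        label: [term for term, _ in counts.most_common(top_n)]
        for label, counts in merged.items()
    }

posts = [
    "The new store on Walnut is a retail store.",
    "Retail store openings in the mall.",
    "Crime is down and police patrols are up.",
    "Police reported crime on Broad; police responded.",
]
labels = [0, 0, 1, 1]  # hypothetical cluster assignments from the clustering step
topics = cluster_term_lists(posts, labels, top_n=2)
```

Because the counting step is independent of how `labels` were produced, the same function could follow any clustering algorithm, which is the modularity the pipeline relies on.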
(3.1) Experimental Trials
I conduct 16 topic modeling trials designed to leverage the networked structure of the City-Data.com Corpus. I evaluate topic quality of models trained on post-level segments, thread-level segments, and topics guided by prior information (see Li et al. 2018; El-Assady et al. 2019; Popa and Rebedea 2021; Gourru et al. 2018). Guided topic modeling uses prior information about the data to center model priorities. Post and thread-level topic modeling tests how the unitization of City-Data.com’s networked content influences topic modeling and is primarily a data preparation step. Guided topic modeling intervenes in the topic modeling process. To guide the kmeans clustering process (the basis of the sentence embedding-based topic model), I manually designate initial cluster centers. This initial positioning will guide subsequent re-centerings as the model converges to reduce the distance between intra-cluster datapoints (Arthur & Vassilvitskii 2007). To guide the LDA topic modeling, I adapt Li et al.’s (2018) method of injecting seed words extracted from the data into the topic modeling process. In one experiment conducted on the 20 Newsgroups dataset, Li et al. (2018) used forum label text (e.g., talk.politics.guns) as seeds with the assumption that forum labels provide distinguishing categorical information about potential topics.
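To make the idea of guided clustering concrete, the sketch below runs a plain-Python version of Lloyd's kmeans algorithm that starts from caller-supplied centers. It is a stand-in for illustration only: the study uses scikit-learn's implementation, and the toy two-dimensional points, the seed choices, and the `guided_kmeans` name are assumptions.

```python
import math

def guided_kmeans(points, initial_centers, max_iter=100):
    """Lloyd's algorithm started from caller-supplied centers instead of random ones."""
    centers = [list(c) for c in initial_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the model has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centers

# Toy 2-D "embeddings"; real sentence embeddings have hundreds of dimensions.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
# Hypothetical seeds: the embeddings of two posts chosen as topical priors.
seeds = [points[0], points[3]]
labels, centers = guided_kmeans(points, seeds)
```

In scikit-learn, a comparable effect can be obtained by passing an array of seed embeddings to the `init` parameter of `KMeans` with `n_init=1`, so that the supplied centers are used for the single run.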
I test three types of topic seeds:
Topic titles – Following Li et al. (2018), I use thread topic titles as seeds (see Table 1).
Initial forum posts – Initiating posts declare the horizons of participating in forum discussions. Empirical work by Sobkowicz and Sobkowicz (2010; see Jagarlamundi et al. 2012 for example) demonstrates that social media discussions evidence a strong first mover advantage similar to that of scientific papers (Newman 2009): papers that appear early in the rise of a discipline will outpace the citation rate of newer papers. Here, the hypothesis is that posts that appear first gain more engagement and that this increased engagement will condition the content of a sizable portion of the thread (see Table 7 in Appendix A).2
Posts with the highest degree – Leveraging the network properties of the City-Data.com corpus, I create a graph of each forum and extract posts with the most incoming and outgoing links or node degree (Gerlach et al. 2018; Duan et al. 2021; Yang et al. 2016). Posts that receive numerous quoted replies and/or are replying to other posts serve as proxies for engagement. Like the intuition behind the use of initial forum posts, posts that are bound up in more extensive conversations condition more content because respondents must stay on topic to sustain discourse (see Table 8 in Appendix A).
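The degree-based seed selection described above can be approximated with a simple count of quote links. This sketch assumes one row per post with at most one quoted post, and it uses lowercase field names modeled on Table 2; the rows and the `post_degrees` helper are hypothetical.

```python
from collections import Counter

def post_degrees(rows):
    """Count incoming plus outgoing quote links for every post id."""
    degree = Counter()
    for row in rows:
        if row.get("quote_id"):           # the post quotes an earlier post
            degree[row["post_id"]] += 1   # outgoing link from the reply
            degree[row["quote_id"]] += 1  # incoming link to the quoted post
    return degree

# Hypothetical rows; field names follow Table 2 in lowercase.
rows = [
    {"post_id": "1", "quote_id": None},
    {"post_id": "2", "quote_id": "1"},
    {"post_id": "3", "quote_id": "1"},
    {"post_id": "4", "quote_id": "1"},
]
degree = post_degrees(rows)
top_post, top_degree = degree.most_common(1)[0]
```

The post returned as `top_post` would then supply the seed embedding or seed words for its forum. Posts that quote several earlier posts would need one link per quotation, which this sketch does not model.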
I calculate topical coherence and diversity scores to measure model quality. Topical coherence includes several measures that indicate how well topical term lists reflect the underlying data. In this study, I use Mimno et al.’s (2011; see also Hinneburg et al. 2014) UMASS method. UMASS coherence is based on how often pairs of terms from the topical term list co-occur in documents relative to how often the individual terms occur in documents. Unlike other coherence measures such as UCI-coherence, which compare topical terms to a large reference corpus like Wikipedia dumps, UMASS coherence is derived from the original dataset. I employ Gensim’s coherence measures (Řehůřek & Sojka 2011) to calculate UMASS coherence. I balance coherence scores against topical diversity scores (Mimno et al. 2011). Topical diversity refers to a family of metrics that indicate the variability of topical terms and thus the range of data explainable by topical term lists. Quality topics are both coherently related to the underlying data and distinctive enough to offer thorough faceting (Dieng et al. 2020). For this study, I employ three diversity measures (Terragni 2023):
Proportion of unique words (PUW) – PUW determines the ratio of unique topical terms for all topics (Dieng et al. 2020). Scores closer to 1 indicate diverse topics; scores closer to 0 indicate repetitive topical terms.
Jaccard Distance between topical term lists (JD) – Proposed by Tran et al. (2013), this diversity measure evaluates the Jaccard distance between topical term lists. Greater distances between topical terms indicate more topical diversity (Terragni 2023).
Word embedding centroid distance (WE-CD) (Bianchi et al. 2020b) – WE-CD calculates the distance between collocations in the topical term list and a reference corpus of word embeddings. This metric determines how diverse topical term lists are in comparison to generalized usage in embedding models of large volumes of texts like Wikipedia or the Common Crawl Corpus of internet sites. For this paper, I use the FastText Common Crawl word embedding model with 300 dimensions and 2 million subword vectors (Bojanowski et al. 2016; Joulin et al. 2016a; Joulin et al. 2016b).3
In all, topical diversity measures used in this study determine intra-topical term list diversity, inter-topical term list diversity, and generalized topical term list diversity when compared to a reference corpus. To calculate each measure, I use the top-25 terms per topic.
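The coherence and corpus-internal diversity measures described above can be sketched in plain Python. These functions are simplified approximations written for illustration: the study itself relies on Gensim and Terragni's scripts, whose smoothing and aggregation details may differ, and the toy documents, topics, and function names below are assumptions. WE-CD is omitted because it requires a pretrained embedding model.

```python
import math
from itertools import combinations

def umass_coherence(topic, documents):
    """Mean of log((D(w_i, w_j) + 1) / D(w_j)) over ordered term pairs in one topic."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word passed in.
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    for i in range(1, len(topic)):
        for j in range(i):
            denom = doc_freq(topic[j])
            if denom:  # skip terms that never occur in the corpus
                scores.append(math.log((doc_freq(topic[i], topic[j]) + 1) / denom))
    return sum(scores) / len(scores) if scores else float("nan")

def proportion_unique_words(topics):
    """Unique terms across all topics divided by the total number of term slots."""
    total = sum(len(t) for t in topics)
    return len({w for t in topics for w in t}) / total

def mean_jaccard_distance(topics):
    """Average of 1 - (intersection size / union size) over every pair of term lists."""
    distances = [
        1 - len(set(a) & set(b)) / len(set(a) | set(b))
        for a, b in combinations(topics, 2)
    ]
    return sum(distances) / len(distances)

# Toy tokenized corpus and two hypothetical topical term lists.
documents = [
    ["store", "retail", "mall"],
    ["store", "retail"],
    ["crime", "police"],
    ["crime", "city"],
]
topics = [["store", "retail", "city"], ["crime", "police", "city"]]
puw = proportion_unique_words(topics)
jd = mean_jaccard_distance(topics)
coherence = umass_coherence(["store", "retail"], documents)
```

In this toy example the two term lists share one word out of five distinct words, so both diversity scores fall below 1; fully disjoint term lists would score exactly 1 on each measure.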
Because the Metro forum also discusses criminal activity, I set the topic number at 4 to correspond to the following thematic categories for topic modeling: coronavirus, crime, plan, and retail.4
Tables 3 and 4 illustrate the SE-Topics and LDA modeling coherence and diversity scores across different data segmentations.
Table 3
SE-Topics Coherence and Diversity Scores.
| TOPIC MODEL TYPE | UMASS | PUW | JD | WE-CD |
|---|---|---|---|---|
| SE-Topics post | –4.10 | 0.74 | 0.85 | 0.06 |
| SE-Topics guided topic titles | –2.76 | 0.54 | 0.63 | 0.06 |
| SE-Topics guided initial posts | –2.56 | 0.52 | 0.62 | 0.07 |
| SE-Topics guided high degree | –2.53 | 0.54 | 0.63 | 0.06 |
| SE-Topics threads | –5.46 | 0.74 | 0.86 | 0.12 |
| SE-Topics guided threads topic titles | –4.84 | 0.69 | 0.79 | 0.06 |
| SE-Topics guided threads initial posts | –4.57 | 0.67 | 0.78 | 0.21 |
| SE-Topics guided threads high degree | –4.26 | 0.67 | 0.78 | 0.21 |
| MEAN | –3.88 | 0.63 | 0.74 | 0.11 |
| MAX | –2.53 | 0.74 | 0.85 | 0.21 |
| Q3 | –2.66 | 0.71 | 0.82 | 0.17 |
| MEDIAN | –4.18 | 0.67 | 0.78 | 0.07 |
| Q1 | –4.70 | 0.54 | 0.62 | 0.06 |
| MIN | –5.46 | 0.52 | 0.62 | 0.06 |
Table 4
LDA Topic Modeling Coherence and Diversity Scores.
| TOPIC MODEL TYPE | UMASS | PUW | JD | WE-CD |
|---|---|---|---|---|
| LDA posts | –2.59 | 1.000 | 0.95 | 0.30 |
| guided LDA (topic titles) | –2.63 | 1.000 | 0.95 | 0.27 |
| guided LDA (high degree) | –2.10 | 0.950 | 0.95 | 0.29 |
| guided LDA (initial posts) | –2.36 | 0.975 | 0.96 | 0.26 |
| LDA threads | –2.10 | 0.950 | 0.95 | 0.30 |
| LDA threads (high degree) | –2.15 | 0.950 | 0.95 | 0.31 |
| LDA threads (topic titles) | –2.32 | 0.950 | 0.95 | 0.30 |
| LDA threads (initial post) | –2.22 | 0.900 | 0.94 | 0.32 |
| MEAN | –2.31 | 0.95 | 0.95 | 0.29 |
| MAX | –2.10 | 1.00 | 0.96 | 0.32 |
| Q3 | –2.12 | 0.98 | 0.95 | 0.30 |
| MEDIAN | –2.27 | 0.95 | 0.95 | 0.30 |
| Q1 | –2.47 | NaN | 0.95 | 0.28 |
| MIN | –2.63 | 0.9 | 0.94 | 0.26 |
In general, LDA topic modeling produced better topical coherence and diversity scores than the sentence embedding approaches, with an average UMASS coherence of –2.31 and PUW and JD diversity scores near 1.0. Mean WE-CD for LDA topic models is also 0.18 points greater (0.29 compared to 0.11), signifying that LDA topic models are more semantically distant from the reference corpus than SE-Topics with high coherence. Box and whisker plots of topical coherence and diversity scores (see Figures 1 and 2) also indicate that LDA topic modeling results are more consistent across segment types. SE-Topics coherence scores show a wider spread between the best performing segments (high degree posts) and the lowest performing segments (threads), with a difference of 2.93. On the other hand, there is only a 0.53 difference in coherence scores between the best performing LDA topic models (high degree posts and threads) and the worst performing topic model (topic titles).

Figure 1
Boxplots of SE-Topics and LDA Coherence Scores.

Figure 2
Boxplots of SE-Topics and LDA Diversity Scores.
Put another way, the best scoring SE-Topics models (guided initial posts and guided high degree) perform as well as the two worst scoring LDA topic models.
The discrepancies in coherence and diversity scores, however, are complicated by qualitative assessments.
Comparing SE-Topics and LDA topics guided by high degree posts, we can discern close similarities among term collocations within individual topics. Both SE-Topics and LDA Topics represent the Retail thread and the development rhetoric of the Plan thread; however, SE-Topics have agglomerated crime and coronavirus discourse (see Topic 0 in Table 5). The LDA topic model seeded by high degree posts (see Table 6), by contrast, has captured discussions about the government restriction of public services that ensued in the first months of the 2020 COVID-19 pandemic in Philadelphia with terms such as money, state, local, care, and help.
Table 5
SE-Topics Guided Post (high degree nodes).
| TOPIC | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | people | white | crime | black | time | year | murder | think | city | police |
| 1 | city | philly | philadelphia | people | street | area | year | think | center | neighborhood |
| 2 | building | city | street | project | tower | think | center | market | development | broad |
| 3 | store | retail | city | mall | retailer | market | walnut | center | think | location |
Table 6
LDA (High Degree) Topic Model.
| TOPIC | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | money | state | local | people | issue | care | neighborhood | income | help | white |
| 1 | store | retail | mall | shopping | retailer | location | center | shop | gallery | also |
| 2 | city | think | philadelphia | philly | people | year | time | even | area | much |
| 3 | street | market | building | walnut | space | chestnut | center | east | block | south |
Thread segmentation modestly improves LDA topical coherence (mean –2.2 compared to mean –2.4). Coherence scores for SE-Topics, on the other hand, worsen with thread segmentation (mean –4.41 compared to mean –3.49).
SE-Topics models benefited the most from the injection of prior information. The results of the SE-Topics informed by the embeddings bundled with the most quoted replies are particularly interesting when compared with the generally poor performance of thread-based segments. Both permutations depend on the networked structure of the City-Data.com Corpus to structure topics. Threads utilize the chained structure of posts and quoted replies to crystalize texts that emerge through interaction; high degree node priors utilize the linkages of posts with other posts. Although more information is encoded when threads are joined into single texts, the centrality of posts within conversations lends better guidance to the formation of more cohesive topics. This makes intuitive sense in that posts with numerous linkages will anchor common content. Threads, on the other hand, are more intrinsically diverse segments, striated by conversational turns and/or dissensus among interlocutors.
Guided topic modeling returned better topical coherence and diversity scores than unguided models (with the sentence embedding model guided by initial posts the exception). Posts with the most incoming and outgoing linkages (degree) produced the best scoring topic models. This finding suggests that network structure can influence the development of topical content in extended asynchronous conversations. Messages enmeshed in replies are more likely to be conserved as interlocutors attend to given information while they add commentary.
(4) Reuse potential
In this paper, I have explored the potential of sentence embedding-based topic modeling against LDA benchmarks. LDA approaches yielded better topical coherence and diversity scores in comparison to sentence embeddings. Inspection of topical term lists suggests, though, that qualitative distances between methods are less pronounced (see Appendix B for a truncated comparison to BERTopic (Grootendorst 2022)).
That said, final topical coherence and diversity scores are less important than the different topical permutations that the City-Data.com Corpus allowed us to test. The networked structure of the City-Data.com Corpus enables the unitization of data into posts or threads as well as principled means to designate topical priors based on posts with the highest engagement. Consequently, the City-Data.com Corpus is conducive to evaluating the strength of text analysis algorithms as well as aspects of research design such as text segmentation and the incorporation of topical guidance. Illustrating the effects of different modeling parameters can be a boon to data-driven pedagogies because students can witness how different data selection choices impact their topic modeling results.
Along these pedagogical lines, the City-Data.com Corpus provides opportunities for students to practice other data processing techniques such as scraping, cleaning, and parsing HTML data as well as other methods that rely upon labeled data such as machine classification.
Appendices
Appendix A. Initial Forum Posts and High Degree Post Ids
Table 7
Initial post ids per City-Data.com Corpus forum.
| FORUM | POST ID |
|---|---|
| coronavirus | 758124 |
| crime | 908813 |
| metro | 251839 |
| plan | 958364 |
| retail | 710692 |
Table 8
High degree City-Data.com Corpus forum posts used for guided topic modeling.
| FORUM | POST ID | DEGREE |
|---|---|---|
| coronavirus | 57681616 | 5 |
| crime | 50175634 | 5 |
| metro | 22518391 | 5 |
| plan | 38332691 | 5 |
| retail | 38055575 | 5 |
Appendix B. BERTopic Results on City-Data.com Corpus Posts and Threads
For completeness, I trialed BERTopic’s topic modeler on City-Data.com Corpus posts and threads. Because BERTopic assigns isolated datapoints to an outlier cluster, I generate 5 topics for each trial to yield at least 4 cohesive topics. Topics were required to store a minimum of 50 textual units (posts or threads).
Table 9 presents the coherence and diversity scores of BERTopic models of post and thread units. Both BERTopic models yield UMASS coherence and diversity scores comparable to the SE-Topics method. Qualitatively, the BERTopic Post Model produces the most legible topics. Topic –1 (the outlier topic) is characterized by general references to philadelphia and people (see Table 10). Topic 0 captures references to city infrastructure found in the plan forum. Topic 1 includes references to race and crime (white, crime, murders) indicative of the crime and metro forums.
Table 9
BERTopic Coherence and Diversity Scores for City-Data.com Corpus.
| TOPIC MODEL TYPE | UMASS | PUW | JD | WE-CD |
|---|---|---|---|---|
| BERTopic posts | –5.12 | 0.52 | 0.78 | 0.12 |
| BERTopic threads | –5.59 | 0.49 | 0.79 | 0.28 |
Table 10
BERTopic Post Topic Model. Note that Topic –1 indicates outliers.
| TOPIC | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| –1 | city | like | people | just | philly | philadelphia | think | dont | street | center |
| 0 | city | like | think | new | just | people | building | philadelphia | dont | philly |
| 1 | white | people | city | dont | like | crime | im | year | just | murders |
| 2 | like | inga | news | just | dont | people | article | think | does | thread |
| 3 | bau | hello | nasty | bart | fancy | update | haha | awful | ok | finally |
The BERTopic Thread Model is similar to the Post Model, although it more clearly features references to terms in the retail forum such as retail, store, and stores. Both BERTopic Post and Thread Topic Models feature low quality topics (Topics 3 and 2, respectively). Neither conveys interpretable information about the content of the forums (see Tables 10 and 11), indexing two short posts in the corpus.
Table 11
BERTopic Thread Level Topic Model. Topic –1 indicates outliers.
| TOPIC | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| –1 | city | like | people | philly | just | think | dont | philadelphia | new | street |
| 0 | city | like | new | store | think | just | philadelphia | retail | stores | people |
| 1 | people | crime | city | dont | just | like | white | im | know | year |
| 2 | hmmm | hello | fancy | update | haha | | | | | |
| 3 | inga | like | just | news | article | dont | read | writing | people | does |
However, I would not argue that the SE-Topic Modeling method discussed above is a clear improvement over BERTopic. These results do not represent the full range of BERTopic parameters but are useful to highlight how little tuning the SE-Topic method requires to produce legible topics that yield similar coherence and diversity scores.
Notes
[1] The embedding-based topic modeling approach illustrated here departs from Grootendorst’s (2022) BERTopic significantly. First, the kmeans clustering used to generate SE-Topics produces flat clusters; BERTopic’s default implementation uses hierarchical density-based clustering (HDBSCAN, McInnes et al., 2017). HDBSCAN forms clusters around points of density in semantic space. Data not proximal enough to these points of density are labeled as outliers. The lack of outliers among SE-Topics produces a topic model more comparable to LDA. Moreover, at the time of this writing, BERTopic’s “guided topic modeling” (equivalent to the use of topical priors) was inoperable due to dependency issues. Thus, BERTopic could not offer results comparable to LDA or the Sentence Embedding-Based routine tested in this study. Despite this offset, I include BERTopic’s modeling of City-Data.com Corpus Threads and Posts with discussion in Appendix B.
[3] I employ Terragni’s (2023) suite of diversity scripts to measure PUW, JD, and WE-CD. See also Röder (2015a; 2015b), Stevens et al. (2012), Terragni et al. (2021).
[4] Text processing, topic modeling, and evaluation scripts are available at https://github.com/rotemple/city-data-com-corpus-scripts/blob/main/sentence-embedding-tm.ipynb.
Funding Information
Ryan Omizo’s work on this project received no special funding.
Competing interests
The author has no competing interests to declare.
Author Contributions
Conceptualization
Data curation
Formal Analysis
Investigation
Methodology
Project administration
Resources
Software
Visualization
Writing – original draft
Writing – review & editing
