A Comparison of Topic Modeling Approaches Using Networked Discussion Forum Posts From the City-data.com Corpus

By: Ryan M. Omizo  
Open Access | Feb 2024

Figures & Tables

Table 1

City-Data Corpus post count, word count, and date range of postings per forum.

| THREAD TOPIC TITLE | LABEL | SAMPLES | DATE RANGE | SUMMARY |
| --- | --- | --- | --- | --- |
| How’s everyone doing amongst the Coronavirus shut down? (home, movies) (Advameg, Inc., 2020) | coronavirus | 481 | 2020/03/16 – 2020/07/27 | Posts discuss experiences with public shutdowns in the opening months of the COVID-19 pandemic in Philadelphia. Content dwells more on governmental interventions and the status of public services than on the etiology or effects of COVID-19. |
| Official Greater Philadelphia Area Crime Thread (Advameg, Inc., 2013) | crime | 2,402 | 2013/04/11 – 2020/01/15 | Posts discuss and share information about crime in Philadelphia. |
| Official Philadelphia Metro Crime Thread (York, Chester: apartment complexes, houses, unemployment) (Advameg, Inc., 2012a) | crime | 1,284 | 2012/01/12 – 2013/03/11 | Posts discuss and share information about crime in the Philadelphia metro area. |
| Philadelphia 2035 (Houston: foreclosure, neighborhoods, wage) (Advameg, Inc., 2011) | plan | 6,796 | 2011/06/14 – 2020/01/14 | Posts discuss Philadelphia’s 2035 civic renovation plan authored by the Philadelphia City Planning Commission (2023). |
| Retail coming to Philadelphia (Penn, Burlington: real estate, house, buying) (Advameg, Inc., 2012b) | retail | 4,045 | 2012/11/27 – 2020/01/20 | Posts discuss new retail business developments in Philadelphia. |

Table 2

Tabular data model for City-Data.com forum posts.

| POST_ID | POST_BODY | POST | DATETIME | QUOTE_ID | QUOTE_BODY | QUOTE | FORUM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Numerical post identifier | Post HTML | Post text | YYYY-MM-DD HH:MM:SS | Numerical post identifier | Quoted reply HTML | Quoted reply text | Forum title label |

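
The columns in Table 2 could be carried in code as a small record type. The following is a minimal sketch, not the author's implementation: the class name `ForumPost`, the helper `parse_row`, the lowercase field names, and the use of `None` for a post that quotes nothing are all assumptions made for illustration, and the sample row is invented.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ForumPost:
    """One row of the tabular data model; comments name the Table 2 column."""
    post_id: int                # POST_ID
    post_body: str              # POST_BODY (post HTML)
    post: str                   # POST (post text)
    posted_at: datetime         # DATETIME, formatted "YYYY-MM-DD HH:MM:SS"
    quote_id: Optional[int]     # QUOTE_ID; None is an assumed encoding for "no quote"
    quote_body: Optional[str]   # QUOTE_BODY (quoted reply HTML)
    quote: Optional[str]        # QUOTE (quoted reply text)
    forum: str                  # FORUM (forum title label)


def parse_row(row: dict) -> ForumPost:
    """Convert a dict keyed by the Table 2 column names into a ForumPost."""
    quote_id = row.get("QUOTE_ID")
    return ForumPost(
        post_id=int(row["POST_ID"]),
        post_body=row["POST_BODY"],
        post=row["POST"],
        posted_at=datetime.strptime(row["DATETIME"], "%Y-%m-%d %H:%M:%S"),
        quote_id=int(quote_id) if quote_id not in (None, "") else None,
        quote_body=row.get("QUOTE_BODY") or None,
        quote=row.get("QUOTE") or None,
        forum=row["FORUM"],
    )


# Hypothetical row: identifiers and text are made up for illustration.
example = parse_row({
    "POST_ID": "1002",
    "POST_BODY": "<div>I agree.</div>",
    "POST": "I agree.",
    "DATETIME": "2020-03-16 09:30:00",
    "QUOTE_ID": "1001",
    "QUOTE_BODY": "<div>Stores are closed.</div>",
    "QUOTE": "Stores are closed.",
    "FORUM": "coronavirus",
})
```

Keeping `quote_id` alongside `post_id` is what lets each row double as an edge in a reply network, which is presumably how the corpus is treated as "networked" posts.
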
Table 3

SE-Topics Coherence and Diversity Scores.

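| TOPIC MODEL TYPE | UMASS | PUW | JD | WE-CD |
| --- | --- | --- | --- | --- |
| SE-Topics post | –4.10 | 0.74 | 0.85 | 0.06 |
| SE-Topics guided topic titles | –2.76 | 0.54 | 0.63 | 0.06 |
| SE-Topics guided initial posts | –2.56 | 0.52 | 0.62 | 0.07 |
| SE-Topics guided high degree | –2.53 | 0.54 | 0.63 | 0.06 |
| SE-Topics threads | –5.46 | 0.74 | 0.86 | 0.12 |
| SE-Topics guided threads topic titles | –4.84 | 0.69 | 0.79 | 0.06 |
| SE-Topics guided threads initial posts | –4.57 | 0.67 | 0.78 | 0.21 |
| SE-Topics guided threads high degree | –4.26 | 0.67 | 0.78 | 0.21 |
| MEAN | –3.88 | 0.63 | 0.74 | 0.11 |
| MAX | –2.53 | 0.74 | 0.85 | 0.21 |
| Q3 | –2.66 | 0.71 | 0.82 | 0.17 |
| MEDIAN | –4.18 | 0.67 | 0.78 | 0.07 |
| Q1 | –4.70 | 0.54 | 0.62 | 0.06 |
| MIN | –5.46 | 0.52 | 0.62 | 0.06 |
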
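
For readers unfamiliar with the UMASS column, the sketch below shows one common way a UMass-style coherence score is computed from document co-occurrence counts: for each pair of top words, take the log of the smoothed co-document frequency divided by the document frequency of the higher-ranked word. It is an illustration only. Averaging over pairs (rather than summing) is an assumption, the function name `umass_coherence` is hypothetical, the toy documents are invented, and the paper's own tooling may differ in detail.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Mean UMass-style score over the ordered word pairs of one topic.

    top_words: words ordered from most to least probable.
    documents: iterable of token collections, one per document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    scores = []
    # combinations() preserves input order, so `earlier` always outranks `later`.
    for earlier, later in combinations(top_words, 2):
        d_earlier = doc_freq(earlier)
        if d_earlier == 0:
            raise ValueError(f"{earlier!r} does not occur in the corpus")
        scores.append(math.log((doc_freq(later, earlier) + 1) / d_earlier))
    return sum(scores) / len(scores)


# Toy corpus, invented for illustration.
toy_docs = [["city", "street"], ["city", "crime"], ["city"], ["street"]]
score = umass_coherence(["city", "street", "crime"], toy_docs)
```

Scores of this kind are at most slightly above zero and usually negative, so values closer to zero indicate top words that co-occur in documents more often.
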
Table 4

LDA Topic Modeling Coherence and Diversity Scores.

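| TOPIC MODEL TYPE | UMASS | PUW | JD | WE-CD |
| --- | --- | --- | --- | --- |
| LDA posts | –2.59 | 1.000 | 0.95 | 0.30 |
| guided LDA (topic titles) | –2.63 | 1.000 | 0.95 | 0.27 |
| guided LDA (high degree) | –2.10 | 0.950 | 0.95 | 0.29 |
| guided LDA (initial posts) | –2.36 | 0.975 | 0.96 | 0.26 |
| LDA threads | –2.10 | 0.950 | 0.95 | 0.30 |
| LDA threads (high degree) | –2.15 | 0.950 | 0.95 | 0.31 |
| LDA threads (topic titles) | –2.32 | 0.950 | 0.95 | 0.30 |
| LDA threads (initial post) | –2.22 | 0.900 | 0.94 | 0.32 |
| MEAN | –2.31 | 0.95 | 0.95 | 0.29 |
| MAX | –2.10 | 1.00 | 0.96 | 0.32 |
| Q3 | –2.12 | 0.98 | 0.95 | 0.30 |
| MEDIAN | –2.27 | 0.95 | 0.95 | 0.30 |
| Q1 | –2.47 | NaN | 0.95 | 0.28 |
| MIN | –2.63 | 0.90 | 0.94 | 0.26 |
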
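
The summary rows can be recomputed from the eight model rows. The sketch below does this for the UMASS column of Table 4 using only the standard library. The quartile rule (median of the lower and upper halves of the sorted values) is an assumption; it is used here because it appears consistent with the reported UMASS quartiles, not because the source states it. The function name `summarize` is hypothetical.

```python
from statistics import mean, median

# UMASS values of the eight LDA models, in the order listed in Table 4.
umass = [-2.59, -2.63, -2.10, -2.36, -2.10, -2.15, -2.32, -2.22]


def summarize(values):
    """Mean, extremes, median, and median-of-halves quartiles."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # for an odd count the middle value is excluded
    upper = ordered[-half:]
    return {
        "mean": mean(ordered),
        "max": ordered[-1],
        "q3": median(upper),
        "median": median(ordered),
        "q1": median(lower),
        "min": ordered[0],
    }


summary = summarize(umass)
```

Other quartile conventions, such as linear interpolation, give slightly different Q1 and Q3 values for a sample of eight, so the choice of rule matters when comparing against the table.
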
Figure 1

Boxplots of SE-Topics and LDA Coherence Scores.

Figure 2

Boxplots of SE-Topics and LDA Diversity Scores.

Table 5

SE-Topics Guided Post (high degree nodes).

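| TOPIC | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | people | white | crime | black | time | year | murder | think | city | police |
| 1 | city | philly | philadelphia | people | street | area | year | think | center | neighborhood |
| 2 | building | city | street | project | tower | think | center | market | development | broad |
| 3 | store | retail | city | mall | retailer | market | walnut | center | think | location |
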
Table 6

LDA (High Degree) Topic Model.

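| TOPIC | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | money | state | local | people | issue | care | neighborhood | income | help | white |
| 1 | store | retail | mall | shopping | retailer | location | center | shop | gallery | also |
| 2 | city | think | philadelphia | philly | people | year | time | even | area | much |
| 3 | street | market | building | walnut | space | chestnut | center | east | block | south |
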
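
The top-word lists in Tables 5 and 6 are enough to illustrate two diversity measures. The sketch below assumes that PUW is the proportion of unique words among all listed top words and that JD is the mean pairwise Jaccard distance between topics' top-word sets; both definitions are assumptions based on common usage, and the function names are hypothetical. Scores computed from the four topics shown here need not reproduce the values reported in Tables 3 and 4.

```python
from itertools import combinations


def proportion_unique_words(topics):
    """Unique top words divided by the total number of top words listed."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)


def mean_jaccard_distance(topics):
    """Average of 1 - |A & B| / |A | B| over all pairs of topics."""
    distances = []
    for a, b in combinations(topics, 2):
        a, b = set(a), set(b)
        distances.append(1 - len(a & b) / len(a | b))
    return sum(distances) / len(distances)


# Top words copied from Table 5 (SE-Topics guided post, high degree nodes).
table5 = [
    "people white crime black time year murder think city police".split(),
    "city philly philadelphia people street area year think center neighborhood".split(),
    "building city street project tower think center market development broad".split(),
    "store retail city mall retailer market walnut center think location".split(),
]

# Top words copied from Table 6 (LDA, high degree).
table6 = [
    "money state local people issue care neighborhood income help white".split(),
    "store retail mall shopping retailer location center shop gallery also".split(),
    "city think philadelphia philly people year time even area much".split(),
    "street market building walnut space chestnut center east block south".split(),
]

puw5, puw6 = proportion_unique_words(table5), proportion_unique_words(table6)
jd5, jd6 = mean_jaccard_distance(table5), mean_jaccard_distance(table6)
```

Under these definitions the LDA topics repeat fewer words across topics than the SE-Topics lists do: "city" and "think" appear in all four SE-Topics topics, while only "people" and "center" repeat in the LDA lists.
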
Table 7

Initial post ids per City-Data.com Corpus forum.

| FORUM | POST ID |
| --- | --- |
| coronavirus | 758124 |
| crime | 908813 |
| metro | 251839 |
| plan | 958364 |
| retail | 710692 |

Table 8

High degree City-Data.com Corpus forum posts used for guided topic modeling.

| FORUM | POST ID DEGREE |
| --- | --- |
| coronavirus | 576816165 |
| crime | 501756345 |
| metro | 225183915 |
| plan | 383326915 |
| retail | 380555755 |

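
One way to identify a high degree post in each forum is to treat every quote relation as an edge and count the edges touching each post. The sketch below makes several assumptions that the source does not confirm: edges are the (POST_ID, QUOTE_ID) pairs of the data model, degree is undirected, and duplicate edges count once. The function names and the toy identifiers are hypothetical.

```python
from collections import Counter


def degree_by_post(edges):
    """Undirected degree: each distinct (post_id, quote_id) edge counts once per endpoint."""
    degree = Counter()
    for post_id, quote_id in set(edges):  # set() drops duplicate edges
        if post_id == quote_id:
            continue  # ignore self-quotes
        degree[post_id] += 1
        degree[quote_id] += 1
    return degree


def highest_degree_post(edges, forum_of):
    """Return {forum: (post_id, degree)} for the top-degree post in each forum."""
    best = {}
    for post_id, deg in degree_by_post(edges).items():
        forum = forum_of[post_id]
        if forum not in best or deg > best[forum][1]:
            best[forum] = (post_id, deg)
    return best


# Toy data, invented for illustration: (replying post, quoted post).
toy_edges = [(2, 1), (3, 1), (4, 1), (4, 3), (6, 5), (7, 5)]
toy_forum_of = {1: "crime", 2: "crime", 3: "crime", 4: "crime",
                5: "retail", 6: "retail", 7: "retail"}
top_posts = highest_degree_post(toy_edges, toy_forum_of)
```

Ties are broken by iteration order in this sketch, so a real selection would need an explicit tie-breaking rule.
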
Table 9

BERTopic Coherence and Diversity Scores for City-Data.com Corpus.

| TOPIC MODEL TYPE | UMASS | PUW | JD | WE-CD |
| --- | --- | --- | --- | --- |
| BERTopic posts | –5.12 | 0.52 | 0.78 | 0.12 |
| BERTopic threads | –5.59 | 0.49 | 0.79 | 0.28 |

Table 10

BERTopic Post Topic Model. Note that Topic –1 indicates outliers.

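| TOPIC | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| –1 | city | like | people | just | philly | philadelphia | think | dont | street | center |
| 0 | city | like | think | new | just | people | building | philadelphia | dont | philly |
| 1 | white | people | city | dont | like | crime | im | year | just | murders |
| 2 | like | inga | news | just | dont | people | article | think | does | thread |
| 3 | bau | hello | nasty | bart | fancy | update | haha | awful | ok | finally |
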
Table 11

BERTopic Thread Level Topic Model. Topic –1 indicates outliers.

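| TOPIC | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| –1 | city | like | people | philly | just | think | dont | philadelphia | new | street |
| 0 | city | like | new | store | think | just | philadelphia | retail | stores | people |
| 1 | people | crime | city | dont | just | like | white | im | know | year |
| 2 | hmmm | hello | fancy | update | haha | | | | | |
| 3 | inga | like | just | news | article | dont | read | writing | people | does |
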
DOI: https://doi.org/10.5334/johd.182 | Journal eISSN: 2059-481X
Language: English
Submitted on: Nov 10, 2023
Accepted on: Jan 10, 2024
Published on: Feb 7, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Ryan M. Omizo, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.