Theme Extraction by Unraveling Clusters of Reddit Post Titles
Unsupervised Clustering on Text Data
Executive Summary
Reddit is one of the most popular social media websites. It allows users to submit and rate content such as links, text posts, images, GIFs, and videos on a variety of topics, including news, science, movies, and food. Posts are organized by subject into boards called subreddits. This study aims to identify the subreddits, or themes, behind a sample of 6000 post titles from the file reddit-dmw-sample.txt using two types of clustering methods: representative-based (𝑘-means) and hierarchical (Ward's agglomerative method) clustering.
The recommended number of clusters (𝑘) differed by clustering method, because the two methods take different approaches and use different tests to determine 𝑘. The representative-based method produced 7 clusters, while the hierarchical method produced 10. Despite this difference, the resulting themes were quite similar. The following themes were extracted: general inquiries (advice), US presidential elections (Trump, Clinton, and Sanders), New Year (2016), gaming (video and online), food, speeding tickets, tech support, and Vine compilations.
Methodology
The following steps were taken for this study:
- Data Description and Processing
  - Cleaning text data
  - Lemmatization
  - Exploratory Data Analysis
- Models
  - 𝑘-means clustering
  - Ward's method
- Results and Recommendations
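
To make the outline above more concrete, here is a minimal sketch of the cleaning and 𝑘-means steps in plain Python. It is illustrative only and is not the code used in this study. The stop-word list, the TF-IDF weighting, the farthest-point seeding, and the four toy titles are all assumptions made for the example. Lemmatization and Ward's method are omitted, and a real run would more likely rely on a library such as scikit-learn.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "of", "the", "to"}


def clean(title):
    """Lowercase a title, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf(docs):
    """Return L2-normalized TF-IDF rows (dense lists) and the sorted vocabulary."""
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    rows = []
    for doc in docs:
        tf = Counter(doc)
        row = [tf[t] * (math.log(n / df[t]) + 1.0) for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # guard empty titles
        rows.append([x / norm for x in row])
    return rows, vocab


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(rows, k, max_iter=100):
    """Lloyd's algorithm with deterministic farthest-point seeding (an assumption)."""
    centroids = [rows[0][:]]
    while len(centroids) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(far[:])
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j]))
                      for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy titles invented for illustration; they are not from reddit-dmw-sample.txt.
titles = [
    "Trump wins the election debate!",
    "Clinton and Trump election polls",
    "Best pizza recipe for dinner",
    "Easy dinner recipe: pizza at home",
]
rows, vocab = tfidf([clean(t) for t in titles])
labels, centroids = kmeans(rows, k=2)

# A cluster's "theme" can be read off the highest-weighted centroid terms.
for j, centroid in enumerate(centroids):
    top_terms = [term for _, term in sorted(zip(centroid, vocab), reverse=True)[:3]]
    print(j, top_terms)
```

Reading the top-weighted terms of each centroid is one simple way to turn a cluster into a named theme. Here 𝑘 is fixed by hand; the sketch does not reproduce the tests used in this study to choose 𝑘.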