

Clustering and Topic Analysis
Final Report

CS 5604 Information Storage and Retrieval

Virginia Polytechnic Institute and State University
Fall 2017

Submitted by

Ashish Baghudana, Aman Ahuja, Pavan Bellam, Rammohan Chintha,

Prathyush Sambaturu, Ashish Malpani, Shruti Shetty, Mo Yang

December 15, 2017
Blacksburg, Virginia 24061

Instructor: Dr. Edward A. Fox


Abstract

One of the key objectives of the CS 5604 course, Information Storage and Retrieval, is to build a pipeline for a state-of-the-art retrieval system for the Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects. The GETAR project, in collaboration with the Internet Archive, aims to develop an archive of webpages and tweets related to multiple events and trends that occur in the world, and to develop a retrieval system to extract information from that archive.

Since it is practically impossible to manually look through all the documents in a large corpus, an important component of any retrieval system is a module that is able to group and summarize meaningful information. The Clustering and Topic Analysis (CTA) team aims to build this component for the GETAR project.

Our report examines the various techniques underlying clustering and topic analysis, discusses technology choices and implementation details, and describes the results of the k-means algorithm and latent Dirichlet allocation (LDA) on different collections of webpages and tweets. Subsequently, we provide a developer manual to help set up our framework and, finally, outline a user manual describing the fields that we populate in HBase.


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Clustering
  1.3 Topic Analysis

2 Literature Survey
  2.1 Clustering
    2.1.1 Partition-based Clustering
    2.1.2 Hierarchical Clustering
    2.1.3 Density-based Clustering
    2.1.4 Grid-based Clustering
    2.1.5 Model-based Clustering
  2.2 Topic Analysis
    2.2.1 TF-IDF
    2.2.2 Latent Semantic Indexing
    2.2.3 Latent Dirichlet Allocation
    2.2.4 Twitter-LDA

3 Requirements Gathering
  3.1 Clustering
  3.2 Topic Analysis
  3.3 Outputs


4 Design and Deliverables
  4.1 System Design
  4.2 Technologies Used
  4.3 Timeline

5 Implementation and Evaluation Techniques
  5.1 Preprocessing
  5.2 Clustering
    5.2.1 Implementation Details
    5.2.2 Evaluation
  5.3 Topic Analysis
    5.3.1 Implementation Details
    5.3.2 Evaluation

6 Results
  6.1 Remember April 16 Tweets
    6.1.1 Clustering
    6.1.2 Topic Analysis
  6.2 Solar Eclipse 2017 Tweets
    6.2.1 Clustering
    6.2.2 Topic Analysis
  6.3 Solar Eclipse 2017 Webpages
    6.3.1 Clustering
    6.3.2 Topic Analysis
  6.4 Hurricane Irma Webpages
    6.4.1 Clustering
    6.4.2 Topic Analysis
  6.5 Vegas Shooting Webpages
    6.5.1 Clustering


7 User Manual
  7.1 HBase Schema
  7.2 Topic Analysis
    7.2.1 Help File
    7.2.2 Computational Complexity
  7.3 Clustering
    7.3.1 Running Clustering Algorithm
    7.3.2 Analysis

8 Developer Manual
  8.1 Clustering
  8.2 Topic Analysis
  8.3 HBase Interaction
    8.3.1 Clustering
  8.4 File Inventory

9 Future Work and Enhancements
  9.1 Clustering
  9.2 Topic Analysis

Acknowledgements

Bibliography


List of Figures

2.1 Plate notation for LDA (courtesy Wikipedia)

2.2 Plate notation for Twitter-LDA [22]

4.1 Pipeline for text processing. The CTA team now begins the preprocessing pipeline at Step 3 (remove stop words and punctuation), as the text is already tokenized and lowercased.

4.2 Latent Dirichlet allocation uses a Python-based system with three main capabilities: access to HBase, preprocessing and LDA, and visualization.

5.1 The three stages of our preprocessing pipeline: tokenization, mapping, and filtering.

6.1 Clean tweet data sample

6.2 Calinski-Harabasz index vs. number of clusters for the "Remember April 16" dataset

6.3 k-means clustering results on "Remember April 16" tweets

6.4 Tweet distribution over clusters using the hierarchical clustering algorithm

6.5 Cluster distribution for "Solar Eclipse 2017" tweets

6.6 Cluster distribution for "Solar Eclipse 2017" webpages

6.7 Cluster distribution for "Hurricane Irma" webpages

6.8 Plots showing number of topics vs. log perplexity and number of topics vs. topic coherence for the Solar Eclipse webpage and Hurricane Irma webpage collections. We attempt to choose the best number of topics based on these two plots.

6.9 Cluster distribution for "Vegas Shooting" webpages

7.1 Computational complexity of running LDA for different collections. The results were benchmarked on a single-node server with 20 cores.


List of Tables

1.1 Sample topics from a collection of Wikipedia articles collected using a keyword search for "computers", "basketball", and "economics"

4.1 Timeline of task list

5.1 A sample of collection-specific stop words for Solar Eclipse 2017 and Hurricane Irma

6.1 Dataset descriptions with category (tweet or document) and number of documents

6.2 Frequent words and events in each cluster for the "Remember April 16" dataset

6.3 Top words for topics obtained by running LDA on the "Remember April 16" dataset. The results show only the best 6 topics; the remaining 4 topics were incoherent.

6.4 Topics from Twitter-LDA that did not appear in LDA for the "Remember April 16" dataset

6.5 Cluster naming based on frequent words for the "Solar Eclipse 2017" tweet data

6.6 Keywords for topics in the collection "Solar Eclipse"

6.7 The cosine similarity analysis of the "Solar Eclipse 2017" webpage data

6.8 Cluster naming based on frequent words for the "Solar Eclipse 2017" webpage data

6.9 Keywords for topics in the collection "Solar Eclipse"

6.10 The cosine similarity analysis of the "Hurricane Irma" webpage data

6.11 Cluster naming based on frequent words for the "Hurricane Irma" webpage data

6.12 Keywords for topics in the collection "Solar Eclipse"

6.13 The cosine similarity analysis of the "Vegas Shooting" webpage data

6.14 Cluster naming based on frequent words for the "Vegas Shooting" webpage data


7.1 HBase Schema: Fields for Topic Analysis

7.2 HBase Schema: Fields for Clustering

8.1 File Inventory


Chapter 1

Introduction

The CS 5604 course project aims to build a state-of-the-art information retrieval (IR) system in support of the Integrated Digital Event Archiving and Library (IDEAL) and Global Event and Trend Archive Research (GETAR) projects. The semester-long project is divided into several subareas undertaken by different teams. These are Classification (CLA), Collection Management Tweets (CMT), Collection Management Webpages (CMW), Clustering and Topic Analysis (CTA), Database and Indexing (SOLR), and Front-end and Visualization (FE). This report focuses on the results of the Clustering and Topic Analysis (CTA) team.

1.1 Problem Statement

Building a state-of-the-art information retrieval system involves several components, each handled by a team. CMW and CMT crawl the Internet to collect event-related webpages and tweets, respectively. CLA refines the data processed by CMW and CMT to associate webpages and tweets with a specific event. Our team (CTA) takes the classified data and learns cluster assignments and keywords/topics for each document so that SOLR can index these documents using Lily. This in turn helps FE fetch documents from the computer cluster when a user performs a search.

A naive solution for extracting keywords would be to find the top-n most frequent words in the entire collection. There are two major problems with this approach. First, it assumes all documents in a collection talk about the same keywords; it would not allow users to search for different facets of an event, such as relief-related documents vs. destruction-related documents in the Hurricane Harvey collection. Second, a document may be about school shootings without ever mentioning the words "school" or "shooting".

Clustering and topic modeling algorithms help identify semantically related words not just in a single document, but across similar documents. For instance, if "gun" occurred frequently with


"school" and "shooting", a search for "gun" would yield results about school shootings. These algorithms also help us identify recurring sub-themes within the collection. This is particularly important for documents about events, as such documents often discuss different aspects of the event. As an example, these algorithms can identify different facets of the event Hurricane Irma: some articles talk about "destruction" and "damage", while others talk about "weather" and "storm".

Once we find the themes in the corpus, the documents are indexed using the discovered topics and clusters to improve the quality of search and retrieval. The front-end team uses these topics to design a faceted search. The topics can also be used to validate the results of the clustering algorithms, and vice versa.

The approaches for clustering and topic modeling are discussed in the subsequent sections. More background information is given in Chapter 2.

1.2 Clustering

Clustering can be intuitively thought of as a process of placing similar objects close to each other and dissimilar objects away from each other. It is an unsupervised form of learning, in which grouping into natural categories takes place when no class labels are available. In clustering, the goal is to reduce the distance between objects in the same cluster and to increase the distance between objects of different clusters [17].

Clustering can be used for finding latent groupings that are later useful for categorization. It may help to gain insight into the nature of the data, and it may also lead to the discovery of distinct subclasses or similarities among patterns. Clustering can be classified into two categories based on how objects are assigned.

1. Hard clustering: Each data point is assigned to only one of the given clusters.

2. Soft clustering: Instead of placing each data point in a single cluster, a probability or likelihood is assigned to each data point's membership in every cluster.
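The distinction can be sketched in a few lines of NumPy (an illustrative toy, not from this report: the two centers, three 1-D points, and the softmax used for the soft assignment are all made up).

```python
import numpy as np

# Two fixed cluster centers and three 1-D points.
centers = np.array([0.0, 10.0])
points = np.array([1.0, 4.9, 9.0])

# Squared distance from every point to every center (3 x 2 matrix).
dist2 = (points[:, None] - centers[None, :]) ** 2

# Hard assignment: index of the nearest center only.
hard = dist2.argmin(axis=1)

# Soft assignment: turn negative distances into per-cluster probabilities
# with a softmax (a stand-in for a real model such as a Gaussian mixture).
logits = -dist2
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

print(hard)  # [0 0 1]
```

Note that the middle point (4.9) is assigned wholly to cluster 0 under hard clustering, while the soft assignment records that it sits between the two centers.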

Clustering algorithms can also be grouped by the type of model they generate:

1. Centroid models: In these algorithms, similarity is derived from the closeness of a data point to the centroid of a cluster. These are iterative algorithms, and the popular k-means clustering algorithm falls under this category. It is described in Algorithm 1.


Algorithm 1 The k-means algorithm
1: procedure KMeans(k, data)
2:    centroids ← randomly select k points from data as initial centroids
3:    while centroids do not change do    ▷ This is the convergence criterion
4:        for i ← 1, k do
5:            centroids[i] ← recompute centroid for cluster i
6:        end for
7:    end while
8:    return centroids
9: end procedure

The closeness of data points in a cluster is represented by a distance measure. This could be based on the L1 distance, L2 distance, cosine similarity, correlation, or sum of squared errors.
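A minimal NumPy sketch of Algorithm 1 could look as follows (illustrative only: the two-blob toy data, the L2 distance, and the empty-cluster guard are our assumptions, not the report's implementation).

```python
import numpy as np

def kmeans(data, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Randomly select k points from the data as initial centroids.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (L2 distance).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster; keep the old
        # centroid if a cluster happens to be empty.
        new = np.array([data[labels == i].mean(axis=0) if np.any(labels == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):  # convergence: centroids stop moving
            break
        centroids = new
    return centroids, labels

# Two well-separated 2-D blobs.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
centroids, labels = kmeans(data, k=2)
```

On data this well separated, the two recovered centroids land near the blob centers regardless of the random initialization.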

2. Distribution models: In distribution models, data points are related to each other based on their likelihood of belonging to the same probability distribution. Popular approaches such as Expectation-Maximization (EM) applied to Gaussian mixture models (GMMs) fall under this category. Initially, we start with a fixed number of distributions and iteratively update them to fit the data, such that the likelihood of the data given the distributions is maximized.

3. Hierarchical models: These models hierarchically aggregate or divide points into clusters based on their distance from each other. The two main components of hierarchical models are the distance function (distance between points) and the link function (distance between clusters). Based on the recursive approach, there are two types of hierarchical models.

• Agglomerative clustering approach: Start with each data point as an individual cluster and aggregate them to form larger clusters.

• Divisive clustering approach: Start with all data points in a single cluster and partition the large cluster to form smaller clusters.

These models are very easy to interpret but lack the scalability needed for handling big datasets.

4. Density models: These models search the data space for regions of varying density, isolate the dense regions, and assign the data points within a region to the same cluster. DBSCAN is a popular example of a density model.

In this report, we focus on centroid and hierarchical clustering techniques, namely k-means and agglomerative clustering, using bag-of-words feature vectors. For each cluster, we output the n most frequent words in the cluster and set these as keywords for all documents in that cluster.
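The keyword-extraction step can be sketched as follows (a toy example with a hypothetical mini-corpus and precomputed cluster labels; the real pipeline operates on full HBase collections).

```python
from collections import Counter

# Hypothetical mini-corpus with precomputed cluster labels.
docs = ["storm damage florida", "storm flooding florida",
        "eclipse glasses viewing", "eclipse totality viewing"]
labels = [0, 0, 1, 1]

def cluster_keywords(docs, labels, n=2):
    # Count word frequencies per cluster, then keep the n most frequent
    # words of each cluster as its keywords.
    counts = {}
    for doc, c in zip(docs, labels):
        counts.setdefault(c, Counter()).update(doc.split())
    return {c: [w for w, _ in ctr.most_common(n)] for c, ctr in counts.items()}

keywords = cluster_keywords(docs, labels, n=2)
```

Every document in a cluster then receives that cluster's keyword list.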


1.3 Topic Analysis

Topic analysis, or topic modeling, aims to find latent (hidden) groups of words, called topics, in a large corpus of text documents. The topics discovered by these techniques can be defined as groups (or themes) of semantically similar words. Topic modeling uses statistical techniques to discover these topics from the co-occurrence of words in documents. Given a set of documents about "computers", "basketball", and "economics", sample words for each topic are shown in Table 1.1.

Topic 1      Topic 2      Topic 3
computer     basketball   economic
game         team         economy
ibm          league       government
program      team         investment
machine      coach        market
design       player       trade
software     nba          growth
memory       ncaa         policy

Table 1.1: Sample topics from a collection of Wikipedia articles collected using a keyword search for "computers", "basketball", and "economics"

Algorithm 2 Latent Dirichlet Allocation algorithm
1: procedure LDA(k, documents, iterations)
2:    Randomly initialize topic assignments Z
3:    for each iteration do
4:        for each document do
5:            for word w ← 1, number of words in document do
6:                z ← sampleTopic(w)    ▷ Ignore current assignment when sampling
7:                Update topic assignments Z
8:            end for
9:        end for
10:   end for
11:   return topic assignments Z
12: end procedure

Recent work in topic modeling is based on latent Dirichlet allocation (LDA) [8]. LDA is a probabilistic generative model that observes word frequencies and co-occurrences in documents and infers the topic distribution using sampling techniques. It models each document as a mixture over topics, and each topic as a mixture of words. Using this assumption, the algorithm


aims to find the top-ranked words for each topic. Since the document-topic and topic-word distributions are treated as latent variables, we use an approximation technique to tease out the probability distributions. Common techniques include Gibbs sampling and Expectation-Maximization (EM). The algorithm for LDA using Gibbs sampling is given in Algorithm 2.
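As a sketch of Algorithm 2, a compact collapsed Gibbs sampler might look like this (illustrative only: the symmetric priors, toy two-document corpus, and iteration count are assumptions, not the report's configuration).

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_vocab, iters=50, alpha=0.1, beta=0.01, seed=0):
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))  # document-topic counts
    nkw = np.zeros((n_topics, n_vocab))    # topic-word counts
    nk = np.zeros(n_topics)                # total words per topic
    # Randomly initialize topic assignments Z.
    z = [[rng.integers(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Ignore the current assignment when sampling.
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Collapsed conditional: p(topic) proportional to
                # (doc-topic count + alpha) * (topic-word count + beta).
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_vocab * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z, ndk, nkw

# Two toy "documents" over a 4-word vocabulary (word ids 0..3).
docs = [[0, 0, 1, 1, 0], [2, 3, 2, 3, 3]]
z, ndk, nkw = lda_gibbs(docs, n_topics=2, n_vocab=4)
```

The final counts ndk and nkw, smoothed by the priors, give the estimated document-topic and topic-word distributions.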


Chapter 2

Literature Survey

2.1 Clustering

Clustering algorithms can be classified into five groups: partition-based, hierarchy-based, density-based, grid-based, and model-based.

2.1.1 Partition-based Clustering

In partition-based clustering algorithms, data objects are initially divided into a number of partitions, each representing a cluster. The partitioning is optimized for a pre-specified criterion function. The most typical partition-based clustering algorithm is k-means. k-means clustering [18] represents data as real-valued vectors in d-dimensional space R^d. Initially, data points are partitioned into K clusters with K center points. The algorithm then iteratively updates the center points to minimize the mean squared distance from each data point to its nearest center point. The major challenge in the k-means algorithm is determining the number of clusters K. This is explored later in this report.

Based on k-means, fuzzy clustering algorithms such as fuzzy c-means (FCM) [7] have also been proposed. In FCM, data points are assigned to cluster centers with a degree of belief, so each data point may belong to more than one cluster with different memberships. FCM follows the same principle as k-means, iteratively searching for center points and updating the membership of each data object, but its goal is to minimize the objective function J below.

J = Σ_{i=1}^{n} Σ_{k=1}^{c} μ_{ik}^m |p_i − v_k|^2    (2.1)


Here, n is the number of data points, c is the number of defined clusters, μ_{ik} is the degree to which data point i belongs to cluster k, m is a fuzziness factor, and |p_i − v_k| is the Euclidean distance between the i-th object p_i and the k-th cluster center v_k.
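A single FCM iteration, matching the objective J above, might be sketched in NumPy as follows (the toy data and initial centers are made up; a real run would loop until the memberships stabilize).

```python
import numpy as np

def fcm_step(data, centers, m=2.0, eps=1e-9):
    # Distances |p_i - v_k| between every point and every center.
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2) + eps
    # Membership update: mu_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1)).
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    mu = 1.0 / ratio.sum(axis=2)
    # Center update: weighted mean of the data with weights mu^m.
    w = mu ** m
    centers = (w.T @ data) / w.sum(axis=0)[:, None]
    return mu, centers

data = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.0, 0.1], [5.0, 5.1]])
mu, centers = fcm_step(data, centers)
```

Each row of mu sums to 1, reflecting the "degree of belief" with which a point belongs to every cluster.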

2.1.2 Hierarchical Clustering

In hierarchical clustering algorithms, data points are organized in a hierarchical manner depending on a measure of proximity [12]. Each data point is a leaf node in the tree-like hierarchical structure. Hierarchy-based algorithms can be either bottom-up or top-down. Bottom-up methods start with one cluster per data point and recursively merge clusters. Top-down methods start with one cluster and recursively split it into multiple clusters according to a certain metric. The major drawback of hierarchical methods is that merge or split steps cannot be undone.

BIRCH [21] is an efficient hierarchy-based clustering algorithm. It builds a clustering feature tree (CF tree) by scanning the dataset in an incremental and dynamic way. When a data point is encountered, the CF tree is traversed from root to leaf by choosing the closest node at each level. After the closest leaf cluster is identified, a test is performed to check whether the current data point belongs to this leaf cluster. If not, a new leaf cluster is created. Two major advantages of BIRCH are its ability to deal with large datasets and to handle noise. However, it lacks stability and may not work well when clusters are not spherical.

Compared to BIRCH, CURE [13] is more robust in noise handling and can identify clusters with non-spherical shapes. CURE represents each cluster by a set of well-scattered points and shrinks them towards the center of the cluster by a specific function. With more than one representative point per cluster, CURE is able to adjust well to the geometry of clusters with sophisticated shapes, which suppresses the effect of noise. In addition, CURE applies a combination of random sampling and partitioning to deal with large datasets.

2.1.3 Density-based Clustering

In density-based clustering algorithms, data objects are separated based on regions of density, connectivity, and boundary. A cluster is a connected, dense component of arbitrary shape. This feature provides a natural protection against outliers by filtering out noise. Two typical density-based clustering algorithms are density-based spatial clustering of applications with noise (DBSCAN) [11] and density-based clustering (DENCLUE) [15].

In DBSCAN, a data object is assigned to a cluster when the density in its neighborhood is high enough. Clusters grow from a data object by absorbing all objects in its neighborhood. DENCLUE models cluster distributions based on the sum of the influence functions of all data objects. An influence function describes the impact of a data object in its neighborhood. DENCLUE creates clusters according to density attractors, which are the local maxima of the overall density function.


DENCLUE is much faster than DBSCAN because it uses tree-based access structures.
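As an illustration of density-based clustering, a DBSCAN run on toy data might look as follows (scikit-learn is assumed as the implementation here, and the eps and min_samples values are arbitrary).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense 2-D blobs plus one isolated point far away.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.05, (20, 2)),
               rng.normal(3, 0.05, (20, 2)),
               [[10.0, 10.0]]])  # a lone outlier

# Dense, connected regions become clusters; the outlier has no
# neighbors within eps, so DBSCAN labels it as noise (-1).
labels = DBSCAN(eps=0.3, min_samples=4).fit_predict(X)
```

Note that the number of clusters is not specified in advance; it emerges from the density structure of the data.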

2.1.4 Grid-based Clustering

In grid-based clustering algorithms, data objects are divided into grids, and clustering is performed on the grids instead of directly on the large dataset. A major advantage of grid-based methods is speed, since the size of a grid is usually much smaller than the size of the dataset. However, they are not good at handling datasets with irregular distributions. Optimal Grid (OptiGrid) [14] is a grid-based clustering algorithm that aims at achieving an optimal grid partitioning. OptiGrid constructs the best cutting hyperplanes through a set of selected projections; each cutting plane is chosen to have minimal point density. After grid construction, clusters can be found using a density function. The algorithm is then applied recursively on the clusters to achieve better clustering.

2.1.5 Model-based Clustering

Model-based clustering algorithms are designed to optimize the fit between a given dataset and a certain mathematical model. There are two types of model-based methods: statistical and neural network methods. Statistical methods use probability measures to determine clusters, while neural network methods utilize a set of weighted connections between input/output units to derive clusters. One example of a statistical method is the Expectation-Maximization (EM) algorithm [10]. As the name indicates, EM iterates between two steps. In the expectation step, data objects are fractionally assigned to each cluster according to the posterior distribution of the latent variables, derived using the current model parameters. In the maximization step, the model parameters are re-estimated from the fractional assignments with the maximum likelihood rule. However, the EM algorithm has many mathematical requirements and a slow convergence rate.
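The E and M steps described above can be sketched for a two-component, one-dimensional Gaussian mixture (illustrative only: unit variances are held fixed to keep the sketch short, and the data are synthetic).

```python
import numpy as np

# Synthetic data from two well-separated Gaussians.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)])

means = np.array([-1.0, 7.0])    # initial guesses for the two components
weights = np.array([0.5, 0.5])
for _ in range(50):
    # E-step: fractional assignment (posterior responsibility) of each
    # point to each component, using the current parameters.
    dens = weights * np.exp(-0.5 * (data[:, None] - means) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters by maximum likelihood,
    # weighting each point by its responsibilities.
    weights = resp.mean(axis=0)
    means = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)
```

After a few iterations the estimated means settle near the true component centers (0 and 6).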

2.2 Topic Analysis

Topic modeling is used to represent a set of documents or text corpora by a distribution of hidden topics. Topics refer to unobserved structure that is discovered using observed data: the words present in the documents. Thus, a distribution of related words makes up a topic. Topics help preserve the essential relationships amongst words in a document, thereby reducing the dimensionality of the documents to a small number of topics.


2.2.1 TF-IDF

Different methodologies have been used to retrieve information from text corpora. The term frequency-inverse document frequency (tf-idf) method multiplies the frequency of a word/term in a document (tf) by the term's inverse document frequency (idf) to produce a term-by-document matrix.
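A minimal tf-idf computation might be sketched as follows (idf variants differ across libraries; the unsmoothed form log(N/df) and the toy documents are assumptions made here for illustration).

```python
import math

# Three toy documents, each a list of tokens.
docs = [["storm", "florida", "storm"],
        ["eclipse", "florida"],
        ["eclipse", "viewing"]]

def tfidf(docs):
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = {}
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    # Weight per (term, document): tf(w, d) * log(N / df(w)).
    return [{w: doc.count(w) * math.log(n / df[w]) for w in set(doc)}
            for doc in docs]

weights = tfidf(docs)
```

Terms concentrated in few documents ("storm") score higher than terms spread across the corpus ("florida"), which is the point of the idf factor.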

2.2.2 Latent Semantic Indexing

The latent semantic indexing (LSI) method was later proposed, replacing the tf-idf matrix with its singular value decomposition (SVD). Following this, the probabilistic latent semantic indexing (pLSI) model was proposed by Hofmann [16]. It models a document as a mixture of topics based on the likelihood principle. However, pLSI has no generative process for the document-topic distribution, which leads to problems when assigning probabilities to documents outside the training set.

2.2.3 Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) [8] is an unsupervised machine learning technique that assumes a hierarchical Bayesian dependency between documents, topics, and words. It was first presented by David Blei, Andrew Ng, and Michael I. Jordan as a generative probabilistic model for collections of discrete data such as text corpora, and it overcomes the limitations of the pLSI method. Given a collection of documents, LDA assigns a distribution over words to every topic and a distribution over topics to every document.

Figure 2.1: Plate notation for LDA (courtesy Wikipedia)

As explained in [8], LDA assumes the following generative process:

• For each topic k ∈ K, draw topic-word distribution φk ∼ Dirichlet(β)


• For each document m,

– Draw topic distribution θm ∼ Dirichlet(α)

– For each word n in document m,

∗ Draw topic zm,n∼Multinomial(θm)

∗ Draw word wm,n∼Multinomial(φzm,n)

The model is equivalently explained through a plate notation diagram (Figure 2.1). There are M documents in the corpus. For simplicity, the diagram assumes N words in each document. Each word w has a topic z, which is generated from the document-topic distribution θ. α and β are the hyperparameters of the model (also known as Dirichlet priors).
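The generative process above can be simulated directly. The following sketch samples a toy corpus with a Gamma-based Dirichlet sampler; the vocabulary, corpus sizes, and hyperparameter values are made up for illustration and are not the report's actual configuration:

```python
import random

random.seed(0)

def dirichlet(alpha, k):
    """Sample a k-dimensional symmetric Dirichlet via normalized Gammas."""
    xs = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(xs)
    return [x / s for x in xs]

def generate_corpus(n_docs=3, n_words=8, K=2,
                    vocab=("sun", "moon", "sky", "vote", "poll", "law"),
                    alpha=0.5, beta=0.5):
    V = len(vocab)
    # Topic-word distributions phi_k ~ Dirichlet(beta), one per topic.
    phi = [dirichlet(beta, V) for _ in range(K)]
    corpus = []
    for _ in range(n_docs):
        # Document-topic distribution theta_m ~ Dirichlet(alpha).
        theta = dirichlet(alpha, K)
        doc = []
        for _ in range(n_words):
            z = random.choices(range(K), weights=theta)[0]    # topic z_mn
            w = random.choices(range(V), weights=phi[z])[0]   # word w_mn
            doc.append(vocab[w])
        corpus.append(doc)
    return corpus

corpus = generate_corpus()
```

With small α, each sampled document leans heavily toward one topic, which is exactly the sparsity the Dirichlet priors encode.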

2.2.4 Twitter-LDA

Twitter-LDA [22] is a variant of the standard LDA model that was developed for modeling tweets. This model assumes that a tweet contains a single topic, given the constraint on its length (140, later expanded to 280, characters), and models background and topic-related words separately, giving a more realistic model of Twitter text. Topic modeling on tweets has also been studied in [6] and [19].

Figure 2.2: Plate notation for Twitter-LDA [22]

The generative process for Twitter-LDA is as follows:

• Draw word-category distribution π ∼ Beta(γ)

• Draw background words distribution φB∼Dirichlet(β)


• For each topic k, draw topic-word distribution φ′k ∼ Dirichlet(β)

• For each user u,

– Draw topic distribution θu ∼ Dirichlet(α)

– For each tweet t by user u,

∗ Draw topic zu,t ∼ Multinomial(θu)

∗ For each word wu,t,n,

· Draw category yu,t,n ∼ Bernoulli(π)

· If yu,t,n = 1, draw word wu,t,n ∼ Multinomial(φ′zu,t); otherwise (yu,t,n = 0), draw wu,t,n ∼ Multinomial(φB)

This model is equivalently explained by Figure 2.2. Each user u has a topic distribution θu. Since every tweet is at most 140 (now 280) characters, there is only one topic z for a tweet t. Each word in the tweet is drawn from the corresponding multinomial distribution.


Chapter 3

Requirements Gathering

The system aims to support a helpful search experience for its users. Our team works on analyzing the topics from the clean text of tweets and webpages that have already been classified by CLA on the basis of different events. Since CMT and CMW already convert the text to lowercase, tokenize it, and remove stop words, we take in preprocessed text for our unsupervised learning algorithms. The text in these fields is UTF-8 encoded. The tokenization technique, using NLTK [2], “tokenizes a string to split off punctuation other than periods”.

The most relevant fields in the HBase schema for CTA are clean-webpage:clean-tokens and clean-tweet:clean-tokens. To use custom tokenizing, one can also directly use clean-webpage:clean-text-cta and clean-tweet:clean-text-cta.

We find that the preprocessed text for webpages is less noisy and that the default NLTK implementation might work. However, for tweets, a more customized implementation such as NLTK’s TweetTokenizer [1] or a similar implementation called tweetokenizer [20] might work better.
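To illustrate why a tweet-aware tokenizer matters, here is a minimal regex-based sketch (not the actual TweetTokenizer implementation) that keeps hashtags, @-mentions, and URLs intact, where a generic word tokenizer would split them apart:

```python
import re

# Alternation order matters: URLs first, then hashtags/mentions,
# then ordinary words (with optional apostrophe), then single punctuation.
TOKEN_RE = re.compile(r"https?://\S+|[#@]\w+|\w+(?:'\w+)?|[^\w\s]")

def tweet_tokenize(text):
    """Lowercase and tokenize a tweet, preserving social-media tokens."""
    return TOKEN_RE.findall(text.lower())

tokens = tweet_tokenize("Remembering #Sewol on April 16 @user https://t.co/abc")
```

Here the hashtag, mention, and URL each survive as single tokens, which is what downstream clustering and topic analysis need.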

3.1 Clustering

For clustering, the tokenized text is converted to a bag-of-words model, where we ignore the order of occurrence of words in a document. Each document is represented as a high-dimensional vector. We run the k-means algorithm on this representation with different numbers of clusters K. To determine the best model (i.e., the best number of clusters), we plot the Calinski-Harabasz index for each model. The plot resembles an elbow plot, and the best number of clusters is determined by the point at which the Calinski-Harabasz index drops off steeply. This is discussed in more detail in the evaluation section. The keywords for each cluster are its n most frequently used words.


3.2 Topic Analysis

For topic modeling, the tokenized text is merely mapped to a high-dimensional vector space where each of the terms is encoded as a number. We run LDA on this vector space to determine the document-topic distribution and topic-word distribution. Using these two distributions, we find the keywords for each of the documents in the corpus. The quality of our results is evaluated using two quantitative measures – perplexity and topic coherence. We favor models with low perplexity and high topic coherence. The two measures are described in the evaluation section. We also plan to do a qualitative study, where students in the class will be asked to determine whether a set of words is coherent.

3.3 Outputs

The results of the CTA team are the most probable clusters and topics for each document, along with a set of words representing each topic or cluster. The topic analysis team maps the list of topics and their probabilities to topic:topic-list and topic:probability-list, respectively. The clustering team populates the fields cluster:cluster-list and cluster-probability. To help the FE team, we also populate two fields – topic:topic-displaynames and cluster:cluster-displaynames. These correspond to the highest probability topic and cluster, respectively.


Chapter 4

Design and Deliverables

4.1 System Design

Figure 4.1: Pipeline for text processing. The CTA team now begins the preprocessing pipeline at Step 3 (remove stop words and punctuation), as the text is already tokenized and lowercased.

The CTA team uses tokenized text provided by the CMT and CMW teams, respectively. The incoming documents undergo pre-processing to normalize and filter out redundant information. In general, each document is converted to lowercase and tokenized into blocks of text based on word boundaries (each called a token). A token is nothing but a sequence of characters which acts as a useful semantic entity for processing. For example: “Harvey was a catastrophic flood disaster in southeast Texas” will be split into harvey, was, a, catastrophic, flood, disaster, in, southeast, and texas. CMT and CMW have already uploaded, tokenized, and lowercased text to HBase, thereby eliminating the first two steps in our pipeline.

The CTA team begins at Step 3, which involves removing stop words such as the, is, a, etc., as they offer no analytic value to the text mining process. Punctuation marks are another component that has little or no importance to topic analysis and can be safely eliminated without any loss of information. For example: “It’s hard to believe, but it has already been a month since Hurricane Harvey made landfall in Texas.” ⇒ it, ’s, hard, to, believe, ,, but, it, has, already, been, a, month, since, hurricane, harvey, made, landfall, in, texas.

The emphasized strings are stop words and punctuation, which can be eliminated. In this example, we removed 55% of the tokens, which would otherwise have an impact on the overall runtime of the topic analysis system. Many documents on social media contain hashtags (#hashtag). These should not be removed, as they offer key insights into the themes in the document. URLs, however, can be safely removed from documents for clustering and topic analysis. As a rule of thumb, we also eliminate all tokens with length less than 3.
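The filtering rules above (drop stop words and punctuation-only tokens, drop tokens shorter than 3 characters, keep hashtags) can be sketched as follows; the stop word list here is only an illustrative subset, not the actual lists used in the project:

```python
import string

# Illustrative subset of a stop word list.
STOP_WORDS = {"it", "'s", "to", "but", "has", "already", "been", "a", "since", "in"}

def filter_tokens(tokens, stop_words=STOP_WORDS, min_len=3):
    """Drop stop words, punctuation-only tokens, and short tokens;
    keep hashtags since they carry topical signal."""
    kept = []
    for tok in tokens:
        if tok.startswith("#"):          # hashtags are always kept
            kept.append(tok)
            continue
        if tok in stop_words:
            continue
        if all(ch in string.punctuation for ch in tok):
            continue                     # pure punctuation like ","
        if len(tok) < min_len:
            continue
        kept.append(tok)
    return kept

tokens = ["it", "'s", "hard", "to", "believe", ",", "but", "it", "has",
          "already", "been", "a", "month", "since", "hurricane", "harvey",
          "made", "landfall", "in", "texas"]
kept = filter_tokens(tokens)
```

On the Hurricane Harvey example, this keeps only the content-bearing tokens.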

An optional step in the pipeline is stemming or lemmatization. Stemming reduces words to a base form using a simple rule-based algorithm. For example, “forest”, “forests”, and “forested” all reduce to a single stem: “forest”. This helps reduce the dimensionality of the vector space and conflates words that differ only in tense or plurality.

Lemmatization is another technique to reduce words to their root forms. Stemming employs a rule-based system, whereas lemmatization takes into account the part-of-speech tag and linguistic analysis to normalize words. For example, “am”, “are”, and “is” will all be lemmatized to “be”.

Stemming or lemmatization can be helpful for webpages; however, these techniques are not suitable for tweets. Tweets often ignore normal grammar rules and may not comprise dictionary words, which makes stemming and lemmatization troublesome. Currently, we do not perform any stemming or lemmatization on any of the data.

The preprocessing step is trivially parallelizable for both webpages and tweets, which increases the performance of our design.

4.2 Technologies Used

The topic modeling system uses Python extensively. Broadly, it has three main responsibilities:

1. Database Access: We use the Python library happybase to retrieve data from and put data into HBase. However, happybase is prone to failure when querying more than one million records. To overcome this difficulty, we use a shell (.sh) script when pulling data with over a million records. We also batch our queries whenever possible to reduce the load on the database server, using happybase’s built-in batch method.

2. Preprocessing and LDA: Preprocessing generally includes removal of stop words and punctuation. We augment the stop word list with collection-specific stop words. We achieve this using nltk and a custom preprocessing pipeline. The Python library gensim performs the heavy lifting of topic modeling. Since we do this on a single node of the cluster, we parallelize the algorithm by using all 20 cores of the node. This is done using the LdaMulticore class of gensim.

3. Visualization: pyLDAvis is a bring-your-own-topic-model package for visualizing inter-topic distances and the most relevant words for each topic. We incorporate this library to understand and evaluate the coherence of our topic models.

Figure 4.2: Latent Dirichlet Allocation uses a Python-based system with three main capabilities – access to HBase, preprocessing and LDA, and visualization.

Alternate systems can be built using Scala, Spark, and MLlib (as used by the Fall 2016 team). However, we found Scala lacking the rich preprocessing and visualization support that Python-based libraries excel in. Additionally, we did not face scaling problems, as gensim is written using numpy and scipy, both of which wrap C libraries. This makes gensim adequately fast for large collections of data.

We also used PySpark to create topic models. However, the LDAModel implementation in PySpark is incomplete and lacks methods for labeling documents by topics and for evaluation. Therefore, we did not proceed with this approach.


The same technologies are used for database access and preprocessing in clustering. Scala and Spark are used to cluster the data, and matplotlib with Python scripts is used for evaluating and plotting the results.

4.3 Timeline

The major tasks for our team are highlighted in the table below. The work is broadly divided into three phases – Literature Survey, Baseline Implementation and Scaling, and Experiments (Runs) and Evaluation.

Date | Task List | Team Member | Status

Literature Survey
Sep 19, 2017 | Literature survey | Entire team | Done
Sep 26, 2017 | Interim Report 1 | Entire team | Done

Baseline Implementation and Scaling
Oct 03, 2017 | Implement baseline LDA | Ashish B, Aman | Done
Oct 05, 2017 | Data preprocessing for clustering | Mo, Prathyush | Done
Oct 08, 2017 | Generate bag of words for tweet collection | Pavan, Ram | Done
Oct 10, 2017 | Implement baseline Twitter-LDA | Ashish B, Aman | Done
Oct 12, 2017 | Implement k-means clustering | Pavan, Ram | Done
Oct 13, 2017 | Evaluate LDA and Twitter-LDA on “election” data | Ashish B | Done
Oct 15, 2017 | Interpretation of results and visualization | Mo, Prathyush | Done
Oct 15, 2017 | Find best number of clusters qualitatively | Ashish B, Shruti | Done
Oct 18, 2017 | Interim Report 2 | Entire team | Done
Oct 19, 2017 | Implement hierarchical clustering | Pavan, Ram | Done
Oct 21, 2017 | Implement quantitative measures: perplexity and topic coherence | Ashish B | Done
Oct 21, 2017 | Performance comparison and selection of final value of K | Mo, Prathyush | Done
Oct 24, 2017 | Compare LDA with Twitter-LDA for results | Ashish B, Aman | Done
Oct 31, 2017 | Package as single script to run LDA experiments and evaluations | Ashish B | Done

Experiments and Evaluation
Nov 04, 2017 | Run LDA on CMW collections – “Solar Eclipse 2017”, “Hurricane Irma”, “Las Vegas Shooting” | Ashish B | Done
Nov 6, 2017 | Run clustering algorithm on CMW collections | Mo, Prathyush | Done
Nov 08, 2017 | Interim Report 3 | Entire Team | Done
Nov 14, 2017 | Integration with HBase to pull and write data seamlessly | Shruti, Ashish M, Mo, Prathyush | Done
Nov 21, 2017 | Thanksgiving break | |
Nov 27, 2017 | Collate results between clustering and topic modeling | Ashish B, Pavan, Ram | Done
Nov 30, 2017 | Qualitative evaluation of topics and clusters | Entire team | Done
Dec 07, 2017 | Final Presentation | Entire team | Done
Dec 12, 2017 | Final Report | Entire team | Done

Table 4.1: Timeline of task list


Chapter 5

Implementation and Evaluation Techniques

5.1 Preprocessing

Each document is preprocessed before being fed into the clustering and topic analysis algorithms. The pseudocode for our approach is described in Algorithm 3.

Algorithm 3 Preprocessing Algorithm
1: procedure Preprocess(document, tokenizer, mappers, filters)
2:   tokens ← tokenize(document)
3:   for each mapper in mappers do
4:     tokens ← map(mapper.map, tokens)
5:   end for
6:   for each filter in filters do
7:     tokens ← filter(filter.filter, tokens)
8:   end for
9:   return tokens
10: end procedure

We use Python’s functional programming helper functions map and filter to keep our interface simple. Examples of mappers include LowerCaseMapper, which converts all tokens to lowercase; WordNetLemmatizer, which lemmatizes each token using WordNet; and PorterStemmer, which stems words using Porter’s algorithm.

Examples of filters include StopWordFilter, which removes stop words and PunctuationFilter

which removes all punctuation from the tokens.
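A minimal Python sketch of Algorithm 3 using the built-in map and filter functions follows; the mapper and filter implementations here are simplified stand-ins for the project's actual classes:

```python
import string

# Simplified stand-ins for the mapper/filter classes named in the text.
def lower_case_mapper(token):
    return token.lower()

def stop_word_filter(token, stop_words=frozenset({"the", "is", "a", "was", "in"})):
    return token not in stop_words

def punctuation_filter(token):
    return not all(ch in string.punctuation for ch in token)

def preprocess(document, tokenizer, mappers, filters):
    """Tokenize, then apply each mapper and each filter in turn."""
    tokens = tokenizer(document)
    for mapper in mappers:
        tokens = map(mapper, tokens)      # lazy: mappers are chained
    for flt in filters:
        tokens = filter(flt, tokens)      # lazy: filters are chained
    return list(tokens)

tokens = preprocess(
    "Harvey was a catastrophic flood disaster in southeast Texas",
    str.split,
    [lower_case_mapper],
    [stop_word_filter, punctuation_filter],
)
```

Because map and filter are lazy, the whole pipeline makes a single pass over the token stream when the final list is built.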


Figure 5.1: The three stages of our preprocessing pipeline – tokenization, mapping, and filtering.

One of the most important aspects of preprocessing is maintaining a list of stop words specific to a particular collection. For instance, webpages about Solar Eclipse 2017 mention words such as solar, eclipse, tse, etc. These words add no meaning to topic modeling or clustering results, as they are present in almost all documents. We developed these lists for each collection based on multiple runs and experiments. A sample of these words is shown in Table 5.1.

5.2 Clustering

5.2.1 Implementation Details

The implementation of the clustering algorithm is shown in Algorithm 4. The first step is to represent each document as a bag of words, which results in a corpus. The second step is to determine the number of clusters K with the elbow method, using the Calinski-Harabasz index as the metric. A higher Calinski-Harabasz score indicates a model with better-defined clusters. For K clusters, the Calinski-Harabasz score s is given as the ratio of the between-cluster dispersion and the within-cluster dispersion:

s(k) = [Tr(Bk) / Tr(Wk)] × [(N − k) / (k − 1)]

Wk = ∑(q=1..k) ∑(x ∈ Cq) (x − cq)(x − cq)ᵀ

Bk = ∑(q) nq (cq − c)(cq − c)ᵀ

Here, N is the number of points in our data, Cq is the set of points in cluster q, cq is the center of cluster q, c is the center of all the data points, and nq is the number of points in cluster q.
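In the trace form used above, Tr(Wk) and Tr(Bk) reduce to sums of squared Euclidean distances, so the score can be computed in a few lines of pure Python. This is a sketch on toy 2-D points; in practice a library routine (e.g., scikit-learn's calinski_harabasz_score) would be used:

```python
def mean(points):
    """Component-wise centroid of a list of equal-length tuples."""
    n, dim = len(points), len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def calinski_harabasz(clusters):
    """clusters: list of clusters, each a list of points (tuples).
    Tr(W_k) = sum of squared distances to cluster centroids;
    Tr(B_k) = weighted squared distances of centroids to the overall centroid."""
    all_points = [p for c in clusters for p in c]
    N, k = len(all_points), len(clusters)
    c_all = mean(all_points)
    tr_w = sum(sq_dist(p, mean(c)) for c in clusters for p in c)
    tr_b = sum(len(c) * sq_dist(mean(c), c_all) for c in clusters)
    return (tr_b / tr_w) * (N - k) / (k - 1)
```

Two tight, well-separated clusters give a large score, which is why the elbow method looks for the K after which the score stops improving.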


Solar Eclipse 2017 | Hurricane Irma
solar | hurricane
eclipse | irma
totality | national
tse | global
eclipse2017 | us
eclipses | sept
subnav | september
aug | csbn
august | reuters
account | florida
facebook | business
twitter | insider
published | guardian
username | subscribe
password | bloomberg

Table 5.1: A sample of collection-specific stop words for Solar Eclipse 2017 and Hurricane Irma

We run the k-means and hierarchical clustering algorithms with the corpus, the number of iterations, and the value K (the number of clusters) as input. The iterations parameter is the maximum number of iterations before terminating the algorithm.

In addition to the commonly known k-means, we also implement hierarchical clustering. We use a divisive method, in which each cluster is split into sub-clusters in subsequent steps. Depending on the threshold we choose, we obtain a certain number of clusters; lowering the threshold yields more clusters. Currently, for hierarchical clustering, we set the threshold so that only four clusters are generated. Further, we have to decide on the number of clusters at which to stop the divisive process.

Algorithm 4 Run Clustering Algorithm
1: procedure RunCluster(documents, iterations)
2:   corpus ← bag of words for each document
3:   ▷ Convert all documents into their bag-of-words representation
4:   K ← Elbow(documents)
5:   clusters_kmeans ← KMeans(corpus, iterations, K)
6:   clusters_hier ← Hierarchical(corpus, iterations)
7:   return clusters_kmeans, clusters_hier
8: end procedure

Frequent Word Analysis: k-means just returns the clusters, but does not name them. We could name each cluster by looking at a handful of documents in it, but this can be tedious in a Big Data scenario. Therefore, (i) we determine the most frequent words across all documents within a cluster, and (ii) name the cluster based on a few very frequent words in it. Here, we use the assumption that "the frequent words in a cluster describe the information in the documents within the cluster". Algorithm 5 describes the first step, i.e., finding frequent words in each cluster. Once we obtain the frequent words in a cluster, we use them to manually label the cluster with a suitable name related to these words.

Algorithm 5 Run Frequent Words Analysis Algorithm
Input: documents belonging to a single cluster, and a threshold on the minimum frequency required for a word to be considered "frequent"
Output: frequent words with frequency higher than the threshold

1: procedure FrequentWords(documents, threshold)
2:   frequent_words ← ∅
3:   ▷ frequent_words is a set of frequent words
4:   frequency_map ← ∅
5:   ▷ frequency_map maps each word to its frequency in the cluster
6:   for each document in documents do
7:     for each word in document do
8:       if word in frequency_map then
9:         increase word's frequency by 1 in frequency_map
10:      else
11:        add ⟨word, 1⟩ to frequency_map
12:      end if
13:    end for
14:  end for
15:  for each word in frequency_map do
16:    if frequency_map[word] ≥ threshold then
17:      add word to frequent_words
18:    end if
19:  end for
20:  return frequent_words
21: end procedure

5.2.2 Evaluation

Silhouette

This is one of the methods to validate or understand the clustering obtained. In the case of k-means, we have K clusters C1, C2, ..., CK. Let di be a document in cluster Ci, and let a(di) represent the average dissimilarity of di with all other documents in the same cluster. We also compute the average dissimilarity of di to all documents in each other cluster Cj, i ≠ j; let b(di) represent the lowest of these average dissimilarities. The silhouette of di is

s(di) = (b(di) − a(di)) / max(a(di), b(di))    (5.1)

s(di) ranges from -1 to +1. If s(di) is close to 1, then the document is said to be correctly clustered, whereas if s(di) is close to -1, then the document is wrongly clustered. This follows directly from the definitions of a(di) and b(di).

Elbow

This is used mainly in conjunction with k-means clustering to help determine the optimal number of clusters. The method plots the sum of squared errors on the Y-axis against the number of clusters on the X-axis. The elbow of this plot indicates the optimal number of clusters.

5.3 Topic Analysis

5.3.1 Implementation Details

Algorithm 6 Building Vocabulary
1: procedure BuildVocabulary(documents)
2:   id2word ← dictionary()
3:   word2id ← dictionary()
4:   counter ← 0
5:   for each document in documents do
6:     tokens ← preprocess(document)
7:     for each token in tokens do
8:       if token not in word2id then
9:         id2word[counter] ← token
10:        word2id[token] ← counter
11:        counter ← counter + 1
12:      end if
13:    end for
14:  end for
15:  return (id2word, word2id)
16: end procedure

We use Python 2.7 as our language of choice for topic modeling. We use the package gensim to train all our topic models, and its LdaMulticore module to distribute the workload across multiple cores. We follow the Gibbs sampling derivations in [5] to understand and implement Gibbs sampling in LDA and Twitter-LDA.
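As a concrete illustration of collapsed Gibbs sampling for LDA, here is a minimal pure-Python sketch. It is not the project's implementation; the toy corpus, vocabulary size, and hyperparameter values are invented, and each token's topic is resampled from the standard collapsed conditional p(z=k) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ):

```python
import random

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.

    docs: lists of word ids in [0, V). Returns the doc-topic and
    topic-word count matrices after the final sweep."""
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]       # doc-topic counts
    n_kw = [[0] * V for _ in range(K)]   # topic-word counts
    n_k = [0] * K                        # tokens per topic
    z = []                               # current topic of each token
    for d, doc in enumerate(docs):       # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
        z.append(zd)
    for _ in range(iters):               # Gibbs sweeps
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]              # remove token from counts
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                weights = [(n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                           / (n_k[j] + V * beta) for j in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][n] = k              # add it back under the new topic
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    return n_dk, n_kw
```

With small α and β, documents built from disjoint vocabularies tend to concentrate on separate topics after a few dozen sweeps.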

The input to LdaMulticore is a stream of vectors, i.e., vector-space representations of each document. Since we want our method to scale with the number of documents, we opt to operate from disk rather than main memory. This avoids loading all documents into memory and therefore scales easily.

The BuildVocabulary method helps us create a vector representation of each document by ensuring that every term gets a unique ID.

Algorithm 7 Run LDA using Gensim
1: procedure RunLDA(documents, vocabulary, iterations, topics, hyperparameters)
2:   corpus ← doc2id(documents)
3:   ▷ Convert all documents into their vector representation
4:   model ← LdaMulticore(corpus, vocabulary, topics, iterations, hyperparameters)
5:   return model
6: end procedure

The RunLDA method creates the topic model. Once trained, the model holds the document-topic distribution and the topic-word distribution.

For each topic, we extract the top 30 keywords and give the topic a human-readable name. Subsequently, for each document, we extract the top 3 topics and their corresponding probabilities. These are put in HBase as probable facets for the document. We also extract the most probable topic and populate it in topic:topic-displaynames. For quantitative evaluation, we calculate perplexity and topic coherence, which are described in the next section.

5.3.2 Evaluation

Quantitative

One of the main evaluation techniques for topic modeling is perplexity. Informally, perplexity measures the cross entropy between the empirical distribution and the predicted distribution. The perplexity of a model for a test set of M documents is given by:

Perp(Dtest) = exp{ − [ ∑(d=1..M) log p(wd) ] / [ ∑(d=1..M) Nd ] }

where Nd is the number of words in document d and p(wd) is the probability the model assigns to the words of document d. By definition, a lower perplexity score indicates a better model.
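The computation itself is direct once the per-document log-likelihoods are available. In the sketch below the log-likelihood values are made-up placeholders; in practice they come from the trained model:

```python
import math

def perplexity(test_docs, doc_log_probs):
    """Perp = exp( -(sum of document log-likelihoods) / (total word count) ).

    test_docs: tokenized test documents (only their lengths, N_d, are used)
    doc_log_probs: log p(w_d) for each document under the trained model
    """
    total_words = sum(len(doc) for doc in test_docs)
    return math.exp(-sum(doc_log_probs) / total_words)

# Hypothetical values: two 10-word documents with assumed log-likelihoods.
perp = perplexity([["w"] * 10, ["w"] * 10], [-23.0, -27.0])
```

Dividing by the total word count makes the score comparable across test sets of different sizes, and a model that assigns higher likelihood to the held-out words yields a lower perplexity.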

Qualitative

Evaluating topic models quantitatively is fraught with several challenges; often, human evaluation is necessary to determine the coherence of a topic. The authors of [9] use a word intrusion task, asking annotators to detect the odd one out of a set of words for a topic. If annotators are consistently able to find the out-of-place word, the topics are coherent. While we had initially planned to crowd-source our annotations from the class, we did not find enough time to complete this.


Chapter 6

Results

Our experiments are performed on both tweet and webpage data, as shown in Table 6.1. For example, 56,376 tweets were collected through a keyword search for “remember April 16”. 721 webpages and 2,667,720 tweets were collected using the keyword “solar eclipse 2017”, and 2,741 and 913 webpages were collected for “Hurricane Irma” and “Vegas Shooting”, respectively. A sample of the tweets is shown in Figure 6.1.

Table 6.1: Dataset descriptions with type (tweet or webpage) and number of documents

Dataset Name | Type | No. of documents
Remember April 16 | tweet | 56,376
Solar Eclipse 2017 | webpage | 721
Solar Eclipse 2017 | tweet | 2,667,720
Hurricane Irma | webpage | 2,741
Vegas Shooting | webpage | 913

Figure 6.1: Clean Tweet Data Sample


6.1 Remember April 16 Tweets

6.1.1 Clustering

We use the Calinski-Harabasz index as the metric and apply the elbow method to find the optimal value of K in k-means clustering. As illustrated in Figure 6.2, the Calinski-Harabasz index does not drop much after K = 4. Therefore, K = 4 is taken as the optimal value.

Figure 6.2: Calinski-Harabasz index vs. number of clusters for the “remember April 16” dataset

However, after performing experiments on various datasets, we observed that the cluster distribution is skewed, with one cluster containing the majority of the data, much of which is uncorrelated. To reduce this effect, we can increase the number of clusters depending on the dataset. For the experiments conducted, we generally used K = 5 or 6. If the interpreted results are not comprehensible, we can also conduct iterative clustering on the large cluster to derive more meaningful results.

A sample of the k-means clustering result is shown in Figure 6.3a. The number highlighted by a blue box is the tweet ID, while the number highlighted by a brown box is the cluster index. For example, tweet 588503698176757762 belongs to cluster 0. Figure 6.3b presents the tweet distribution among the different clusters. For experiments with webpage data, as will be shown later, we also apply cosine similarity analysis, which calculates the similarity of intra- and inter-cluster documents to evaluate the clustering results. However, we did not perform cosine similarity analysis for the tweet data because of the large data size and the prohibitive growth in pairwise analysis time.

To understand the meaning of each cluster, we find the most frequently used words in the tweets belonging to it. Using the frequent words, we identify the real-world event or news item frequently discussed in that cluster’s tweets. The frequent words and the corresponding event/news for each cluster are shown in Table 6.2. For example, the frequent words in Cluster 0 are: Remember, Sewol, door, missing, and April. When we search for these keywords on the Internet, we understand that these tweets are remembering the victims and sufferers of the Sewol Ferry Disaster that happened on April 16, 2014. While Cluster 0 is fairly coherent, Cluster 3 lacks clarity. It is not always easy to ascertain the real-world event from the frequently used words in a cluster.

(a) Sample of k-means clustering output. (b) Tweet distribution over clusters.

Figure 6.3: k-means clustering results on “remember April 16” tweets.

Table 6.2: Frequent words and events in each cluster for “Remember April 16” dataset

Cluster 0
Frequent words: April, 16, 2014, Remember, #Sewol, door, missing
Event: Tweets remembering the sinking of the MV Sewol ferry on April 16, 2014

Cluster 1
Frequent words: March, Selena, RIPSelena, Quintanilla
Event: Remembering the American singer Selena Quintanilla, who died on March 31, 1995

Cluster 2
Frequent words: Remember, 2007, April 16, Virginia, Students, Faculty, VT
Event: Remembering the victims and the suffering of the Virginia Tech shooting that took place on April 16, 2007

Cluster 3
Frequent words: Remember, Jongin, pray, hyung-ksoo
Event: Too few tweets belong to this cluster to infer the exact event

Cluster 4
Frequent words: world, voice, opportunity, elections
Event: Tweets celebrating World Voice Day and some tweets about the Scottish elections


We perform hierarchical clustering on the same data. We note that for the same number of clusters, it gives a slightly different distribution, as shown in Figure 6.4. However, since the cluster distributions are very similar for both types of clustering, we excluded hierarchical clustering from further experiments.

Figure 6.4: Tweet distribution over clusters using the hierarchical clustering algorithm.

6.1.2 Topic Analysis

LDA

Topic 1: votes, monday, elections, scottish, council
Topic 2: @louisemensch, schizophrenia, laughing, trolled, ruin
Topic 3: world, opportunity, 16, communicate, #worldvoiceday
Topic 4: sewol, thousands, ferry, stampede, family, people, died
Topic 5: selena, quintanilla-perez, love, la, always, great
Topic 6: years, never, tech, students, faculty, #neverforget, virginia

Table 6.3: Top words for topics obtained by running LDA on the “Remember April 16” dataset. The results show only the best 6 topics; the remaining 4 topics were incoherent.


Running the LDA algorithm on the same dataset yielded slightly different results than clustering. We trained the model with K = 10, α = 0.1, β = 0.01, and 500 iterations. We obtained a list of the top 30 words for each topic and labelled the topics manually. The results are shown in Table 6.3.

Topic 1:Words like votes, monday, elections, scottish, council, registered, etc. are closely associatedwith the theme “Scottish Elections”.

Topic 2:@louisemensch, schizophrenia, laughing, trolled, and ruin talk about Louise Mensch – aBritish journalist and former Conservative Member of Parliament.

Topic 3:Monday, April 16 is observed as the world voice day and the words world, opportunity, 16,communicate, #worldvoiceday, and basic group tweets that are celebrating #WorldVoiceDay.

Topic 4:Words such as sewol, thousands, ferry, stampede, family, people, and died account for thesinking of MV Sewol o� the coast of South Kore on April 16th, 2014.

Topic 5: Selena, quintanilla-pérez, love, la, always, and great refer to the American singer and songwriter Selena Quintanilla-Pérez, whose birthday is celebrated on April 16.

Topic 6: Words such as years, never, tech, students, faculty, lives, lost, #neverforget, va, and heroes are about the Virginia Tech massacre that occurred on April 16, 2007.

While some of these topics are coherent, we also obtained a few other topics that we could not annotate. We ran the Twitter-LDA algorithm as well; those results are described in the next section.

Twitter-LDA

Twitter-LDA assumes that each user tweets about certain topics. It further assumes that each tweet is about only one topic, whereas LDA models each document as a mixture over topics. While Twitter-LDA gave many of the same topics, it also identified a few different ones. These are described in Table 6.4.

Two of those topics were about Coach Frank Beamer and a certain Maverick Gamer, including words such as changed, lives, coach, beamer, #thanksfrank and maverickgamer, victims, still, vol, 1. However, neither of these topics was about a real-world event. Seeing these results, we decided to simplify our scripts and run only LDA instead of both LDA and Twitter-LDA.


Table 6.4: Topics from Twitter-LDA that did not appear in LDA for the “Remember April 16” dataset

Topic 1        Topic 2
changed        maverickgamer
lives          victims
coach          still
beamer         vol
#thanksfrank   1

6.2 Solar Eclipse 2017 Tweets

6.2.1 Clustering

We run the clustering algorithm on the “Solar Eclipse 2017” tweet data as well. The optimal number of clusters K is found to be 6. The cluster distribution is shown in Figure 6.5. The clusters are named based on the frequent word analysis; the names are presented in Table 6.5.

Figure 6.5: Cluster distribution for “Solar Eclipse 2017” tweets


Table 6.5: Cluster Naming based on frequent words for the “Solar Eclipse 2017” tweet data

Cluster   Name
0         DiamondRing
1         WatchEclipse
2         SafeEclipse
3         ExoPlanetMusic
4         MidFlightEclipse
5         NonEnglish

6.2.2 Topic Analysis

Table 6.6: Keywords for topics in the collection “Solar Eclipse” tweets. The seven topics and their most recognizable keywords are: eclipse (eclipse, moon, shadow, block, cover, totality), safety (watch, glass, eye, safely), photos and pictures (photos, timelapse, pictures, photobomb, live), experience (truly, breathtaking, remarkable, beautiful, pretty, happy, wow), midflight (catch, flight, path, mid), weather and forecast (cloud, weather, rain, forecast), and exo (exo, planet, ver, que).

We obtained the best results with the number of topics set to 10. However, we were only able to extract 7 coherent topics from the model; the remaining 3 were repeats of other topics. These topics were about – describing the eclipse, safety, photos and pictures, experience, experiencing the eclipse midflight, weather and forecast, and a music band called exo.

6.3 Solar Eclipse 2017 Webpages

6.3.1 Clustering

We run the clustering algorithm on the webpage data corresponding to the event “Solar Eclipse 2017”. The optimal number of clusters K is found to be 6 using the same method mentioned previously. The cluster distribution is shown in Figure 6.6. The cluster names based on frequent word analysis are presented in Table 6.8.

Figure 6.6: Cluster distribution for “Solar Eclipse 2017” webpages

To evaluate the result, cosine similarity analysis is performed. Cosine similarity is a metric that indicates the similarity of two documents. It is defined as

sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|)

where V(d) is the vector representation of document d.

The intra- and inter-cluster cosine similarities are computed. The intra-cluster cosine similarity is the average cosine similarity of each pair of documents within one cluster. The inter-cluster cosine similarity is computed in the same way, but each pair contains documents from two different clusters. The result is shown in Table 6.7: the intra-cluster similarities lie on the diagonal, and the inter-cluster similarities off the diagonal. On average, the intra-cluster cosine similarity is about three times the inter-cluster similarity, indicating distinct clusters.
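The intra-/inter-cluster similarity computation described above can be sketched as follows. The TF-IDF vectorization and the tiny corpus are stand-ins, not our exact pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_table(docs, labels):
    """Average pairwise cosine similarity within (diagonal) and across
    (off-diagonal) clusters, analogous to Table 6.7."""
    sims = cosine_similarity(TfidfVectorizer().fit_transform(docs))
    labels = np.asarray(labels)
    ids = np.unique(labels)
    table = np.zeros((len(ids), len(ids)))
    for i, a in enumerate(ids):
        for j, b in enumerate(ids):
            block = sims[np.ix_(labels == a, labels == b)]
            if a == b:
                # Exclude each document's similarity with itself (always 1).
                n = block.shape[0]
                table[i, j] = (block.sum() - n) / (n * (n - 1)) if n > 1 else 1.0
            else:
                table[i, j] = block.mean()
    return table

# Tiny stand-in corpus with two obvious clusters.
docs = ["solar eclipse totality sun", "eclipse sun moon glasses",
        "hurricane irma florida wind", "irma storm florida landfall"]
table = similarity_table(docs, [0, 0, 1, 1])
```

A healthy clustering shows diagonal entries well above the off-diagonal ones, which is the pattern reported in Table 6.7.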


Table 6.7: The cosine similarity analysis of “Solar Eclipse 2017” webpage data

           Cluster 0   Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5
Cluster 0  0.235892    0.057602    0.095294    0.077461    0.067748    0.087039
Cluster 1  0.057602    0.579325    0.092020    0.083038    0.081589    0.076596
Cluster 2  0.095294    0.092020    0.146127    0.105437    0.095530    0.099726
Cluster 3  0.077461    0.0830      0.1054      0.1769      0.0772      0.0870
Cluster 4  0.067748    0.0816      0.0955      0.0772      0.1116      0.0849
Cluster 5  0.087039    0.0766      0.0997      0.0870      0.0849      0.7148

Table 6.8: Cluster Naming based on frequent words for the “Solar Eclipse 2017” webpage data

Cluster   Name
0         EclipseChasers
1         AjcEclipseNews
2         EclipseScience
3         BusinessInsiderEclipseArticles
4         Eclipseville
5         MuseumEclipse

6.3.2 Topic Analysis

Topic 1    Topic 2      Topic 3
total      map          world
totality   atlanta      nasa
science    carolina     lunar
sun        washington   annular
sky        denver       earth

Table 6.9: Keywords for topics in the collection “Solar Eclipse” webpages

Using 3 as the number of topics, we ran our LDA code to obtain the following topics. The top words for each topic are shown in Table 6.9.

Topic 1: The words total, totality, science, sun, and sky convey that these documents discuss the science behind solar eclipses.

Topic 2: The words map, atlanta, carolina, washington, and denver seem to describe the locations in which the solar eclipse was observed.


Topic 3: The words world, nasa, lunar, annular, and earth are a little disjoint from each other. Unfortunately, this is one of the disadvantages of topic modeling: words that do not relate to each other are sometimes force-fit together into an artificial topic.

6.4 Hurricane Irma Webpages

6.4.1 Clustering

The clustering distribution and cosine similarity analysis results for the “Hurricane Irma” webpage data are shown in Figure 6.7 and Table 6.10. The optimal number of clusters K is found to be 6. The names of the clusters obtained are presented in Table 6.11.

Figure 6.7: Cluster distribution for “Hurricane Irma” webpages

Table 6.10: The cosine similarity analysis of “Hurricane Irma” webpage data

           Cluster 0   Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5
Cluster 0  0.511192    0.037326    0.052042    0.030241    0.046346    0.041547
Cluster 1  0.037326    0.274652    0.120389    0.060207    0.081885    0.086328
Cluster 2  0.052042    0.120389    0.369712    0.088786    0.144062    0.128138
Cluster 3  0.030241    0.060207    0.088786    0.081079    0.067102    0.071955
Cluster 4  0.046346    0.081885    0.144062    0.067102    0.907489    0.119770
Cluster 5  0.041547    0.086328    0.128138    0.071955    0.119770    0.118860


Table 6.11: Cluster naming based on frequent words for the “Hurricane Irma” web data

Cluster   Name
0         HeavyDotComIrmaUpdates
1         GlobalNewsIrmaUpdates
2         ExpressCoUkIrmaUpdates
3         FloridaIrmaUpdates
4         PicturesIrma
5         IrmasPathAndDevastation

6.4.2 Topic Analysis

Figure 6.8: Plots showing number of topics vs. log perplexity and number of topics vs. topic coherence for the collections Solar Eclipse webpages and Hurricane Irma webpages. We attempt to choose the best number of topics based on these two plots.

We were able to obtain 5 topics on the Hurricane Irma dataset. These broadly corresponded to damage and destruction, caribbean islands, weather and winds, florida, and president trump.
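The topic-coherence curve in Figure 6.8 can be computed with, for example, the UMass coherence measure; this is a minimal sketch of that measure, not necessarily the exact metric our pipeline used.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass topic coherence: sum over ranked word pairs (w_j before w_i) of
    log((D(w_i, w_j) + 1) / D(w_j)), where D counts documents containing
    the given words. Higher is better."""
    doc_sets = [set(d) for d in docs]
    def count(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))
    return sum(math.log((count(wi, wj) + 1) / count(wj))
               for wj, wi in combinations(top_words, 2))

# Toy documents: "eclipse" and "sun" co-occur often, so the topic is coherent.
docs = [["eclipse", "sun", "moon"], ["eclipse", "sun"], ["storm", "wind"]]
score = umass_coherence(["eclipse", "sun"], docs)
```

Evaluating this score over the top words of each topic, then averaging across topics, yields one point of the coherence curve for a given number of topics.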


damage        caribbean islands   weather and winds   florida   trump

damage        caribbean           weather             florida   trump
destruction   coast               winds               miami     president
surge         islands             mph                 orlando   us
water         puerto              tropical            coast     home
flood         rico                forecast            state     help

Table 6.12: Keywords for topics in the collection “Hurricane Irma” webpages

6.5 Vegas Shooting Webpages

6.5.1 Clustering

The clustering distribution and cosine similarity analysis results for the “Vegas Shooting” webpage data are shown in Figure 6.9 and Table 6.13. The optimal number of clusters K is found to be 6. The names of the clusters obtained are presented in Table 6.14.

Figure 6.9: Cluster distribution for “Vegas Shooting” webpages


Table 6.13: The cosine similarity analysis of “Vegas Shooting” webpage data

           Cluster 0   Cluster 1   Cluster 2   Cluster 3   Cluster 4   Cluster 5
Cluster 0  0.356642    0.115300    0.130751    0.042092    0.031763    0.028054
Cluster 1  0.115300    0.174116    0.138287    0.143103    0.109531    0.107970
Cluster 2  0.130751    0.138287    0.353778    0.154930    0.119397    0.113233
Cluster 3  0.042092    0.143103    0.154930    0.517013    0.036331    0.032462
Cluster 4  0.031763    0.109531    0.119397    0.036331    0.412272    0.024849
Cluster 5  0.028054    0.107970    0.113233    0.032462    0.024849    0.271590

Table 6.14: Cluster naming based on frequent words for the “Vegas Shooting” webpage data

Cluster   Name
0         ReviewJournalLasVegas
1         MandalayBayShooting
2         LasVegasShootingTheGuardian
3         DowntownShooting
4         RealEstateLasVegas
5         LocalNewsAndEntertainmentLasVegas


Chapter 7

User Manual

7.1 HBase schema

The results of clustering and topic modeling on webpages and tweets will be stored in an HBase table. The output received from the CTA team will be useful to the SOLR and FE teams for faceted search. In the HBase table we use column families to store our data; topic and cluster are the two column families our team is concerned with. As illustrated in Table 7.1, the column family topic consists of the columns topic-list, probability-list, and keyword-list. The topic-list contains a list of topic labels obtained from topic modeling on our set of webpages and tweets. The probability-list contains the topic probabilities corresponding to the topics in the topic list. The keyword-list contains the top 5 words that occur in the top 2 topics of the topic list.

Similarly, the cluster column family consists of the columns cluster-list, display-clusternames, and cluster-probability. Here, cluster-list and display-clusternames contain the name of each cluster; note that display-clusternames is for the FE team to use as a facet in their interface. cluster-probability denotes the probability that the document belongs to that cluster. Note that in this case we use hard clustering, so the probability is always 1. An example of clustering results for “Solar Eclipse 2017” tweets stored in HBase is shown in Table 7.2.

Table 7.1: HBase Schema: Fields for Topic Analysis

column family: topic

topic-list                            probability-list   display-clusternames
photos pictures, midflight, eclipse   0.53, 0.26, 0.12   photos pictures
exo, eclipse, experience              0.78, 0.14, 0.04   exo
midflight, eclipse, experience        0.40, 0.27, 0.15   midflight


Table 7.2: HBase Schema: Fields for Clustering

column family: cluster

cluster-list     cluster-probability   display-clusternames
WatchEclipse     1                     WatchEclipse
DiamondRing      1                     DiamondRing
ExoPlanetMusic   1                     ExoPlanetMusic

The schema also contains a field keywords, which is currently present only in the column family webpages. The CTA team proposes the same field be present for tweets as well, to index tweets by their topics and clusters.

7.2 Topic Analysis

7.2.1 Help File

The entire process of training a topic model can be done through a single Python script, lda.py. To view the help section for the code, use the following command:

python lda.py -h

The following parameters need to be used while running the code:

• COLLECTION_NAME: The name of the data collection on which the LDA code will run.

• FILE: Location of the file containing the dataset.

• TOPICS: Number of topics k to be given to the LDA model as input. To run multiple instances of the model with a different number of topics in each instance, use the format k1, k2, k3.

In addition to these, the user can also pass the following optional arguments while running thecode:

• ALPHA: The value of the hyperparameter α in the LDA model. If this flag is not specified, the default value of 0.1 is used.

• BETA: The value of the hyperparameter β in the LDA model. If this flag is not specified, the default value of 0.01 is used.


• ITER: The number of Gibbs sampling iterations for the LDA model. The default value of 800 is used if this is not specified.

• PREPROCESS: Preprocessing options for the dataset, if any fine-tuning is required.

• TOKENIZER: The tokenizer to use – valid options are CommaTokenizer, SemicolonTokenizer, SpaceTokenizer, and WordTokenizer.

• MAPPERS: The mappers to use for each of the tokens – valid options are WordnetLemmatizer, PorterStemmer, and LowercaseMapper.

• FILTERS: The filters to use for each of the tokens – valid options are ASCIIFilter, StopwordFilter, LengthFilter, and CollectionFilter.

• FILTER_WORDS: The file path to specific words to filter out for a collection.

The output of this script includes several files, mainly document-topics, topics-keywords, and a visualization.
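As an illustration of how such a command-line interface can be wired up, here is a minimal argparse sketch. The flag names mirror the list above with the defaults stated there, but the real lda.py interface may differ; consult python lda.py -h for the exact flags.

```python
import argparse

def build_parser():
    # Flag and argument names here are illustrative, not the real lda.py interface.
    parser = argparse.ArgumentParser(description="Train LDA topic models")
    parser.add_argument("collection_name", help="name of the data collection")
    parser.add_argument("file", help="path to the dataset file")
    parser.add_argument("topics", help="comma-separated topic counts, e.g. 5,10,15")
    parser.add_argument("--alpha", type=float, default=0.1, help="LDA alpha prior")
    parser.add_argument("--beta", type=float, default=0.01, help="LDA beta prior")
    parser.add_argument("--iter", type=int, default=800,
                        help="Gibbs sampling iterations")
    return parser

# Parse an example invocation; one model would be trained per topic count.
args = build_parser().parse_args(["SolarEclipse", "tweets.csv", "5,10"])
topic_counts = [int(k) for k in args.topics.split(",")]
```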

7.2.2 Computational Complexity

We benchmarked the time taken by our method; the results are shown in Figure 7.1. Preprocessing the text, creating a model, and running evaluations takes about one hour for approximately 2.5 million tweets. While we did not implement LDA using Spark, this is better than the Spark implementation from Fall 2016, which averaged about 3 hours for fewer tweets. However, this is only a preliminary comparison, as the datasets used might differ in composition.

Figure 7.1: Computational complexity of running LDA for different collections. The results were benchmarked on a single-node server with 20 cores.


7.3 Clustering

7.3.1 Running Clustering Algorithm

The user can use the Spark framework for clustering and can tune the algorithm by modifying parameters such as the number of clusters and iterations.

First, a build.sbt has to be created with the following configuration in order to build the package with sbt.

import sbt.Artifact

name := "Kmeans"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.0",
  "org.apache.spark" %% "spark-mllib" % "1.5.0")

Use the following command to generate a JAR file to run on Apache Spark.

sbt package

To run k-means, run the following command.

spark-submit <created jar file> -k <input file>

For large datasets, if there are any performance-related issues, the driver memory and executor memory can be increased by modifying the command as follows.

spark-submit --driver-memory 16g --executor-memory 16g <jar> -k <input file>

7.3.2 Analysis

We plot the cluster distribution using the Python script in the analysis/distribution_analysis/ directory with the following command. The <input file> should be in the format "(document_ID, cluster_ID)".

python clustering_analysis.py <input file>

The inter- and intra-cluster cosine similarity analysis is performed using the Python scripts in the "analysis/similarity_analysis/" directory with the following commands. The <input file> should be in the format "clean_document, cluster_ID".

python cos_sim_inter.py <input file>

python cos_sim_intra.py <input file>


Chapter 8

Developer Manual

8.1 Clustering

This section is intended to help developers continue development on the project. The implementation of clustering requires the following tools:

1. Python 2.7.0

2. Apache Spark 1.5.0

3. Scala >=2.10.4

4. SBT 1.0.1

5. Java 1.6.0_31

6. Pyspark >=1.5.2

7. Sklearn >=0.16.0

8. Scipy >=0.14.1

A complete reference manual for Scala and Spark 1.5.0, including guidelines for installation, can be found at [3] and [4], respectively.

The value of K was chosen based on the Calinski-Harabasz index and after examining the results. Developers can employ a different metric based on the input dataset to arrive at a value for K. This can be directly modified in both the Python and Scala implementations.
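A minimal sketch of this selection procedure, using scikit-learn's KMeans and Calinski-Harabasz score on synthetic stand-in data (older scikit-learn releases spell the metric calinski_harabaz_score):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

def best_k(X, k_range):
    """Pick the K whose KMeans labeling maximizes the Calinski-Harabasz index."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = calinski_harabasz_score(X, labels)
    return max(scores, key=scores.get), scores

# Synthetic stand-in for the TF-IDF document vectors; three clear clusters.
X, _ = make_blobs(n_samples=120, centers=3, cluster_std=0.5, random_state=0)
k, scores = best_k(X, range(2, 7))
```

Because the index rewards tight, well-separated clusters, it peaks at the true number of groups when the clusters are distinct; examining the resulting clusters manually, as we did, remains a useful sanity check.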


8.2 Topic Analysis

Topic analysis for both documents and tweets is written in Python 2.7.14. The requirements for the code are:

1. Python 2.7.14

2. Gensim 3.1.0

3. nltk 3.2.5

4. numpy 1.13.3

5. tabulate 0.8.1

6. happybase 1.1.0

LDA can also be run on Apache Spark using the mllib library. This is still under development, since that implementation uses Variational Inference rather than Collapsed Gibbs Sampling, which leads to different (and often poorer) results.

We use virtualenv to encapsulate our environment for replicability. We strongly suggest that other developers also use virtualenv for each of their projects to manage library versions.

We also use the library happybase to access the HBase database dynamically. This allows us tofetch and put rows into the database directly from our code.
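A hedged sketch of how the write path could look with happybase, using the column names from Table 7.2; the host, table name, and helper functions here are illustrative, not our exact scripts.

```python
def cluster_row(cluster_name, probability):
    """Build the payload for the 'cluster' column family of one document."""
    return {
        b"cluster:cluster-list": cluster_name.encode("utf-8"),
        b"cluster:display-clusternames": cluster_name.encode("utf-8"),
        b"cluster:cluster-probability": str(probability).encode("utf-8"),
    }

def write_clusters(host, table_name, assignments):
    """assignments: iterable of (document_id, cluster_name, probability)."""
    import happybase  # imported lazily so the payload helper is testable offline
    connection = happybase.Connection(host)
    table = connection.table(table_name)
    # Batching the puts avoids one round trip per document.
    with table.batch() as batch:
        for doc_id, name, prob in assignments:
            batch.put(doc_id.encode("utf-8"), cluster_row(name, prob))
    connection.close()

row = cluster_row("WatchEclipse", 1)
```

With hard clustering, the probability written is always "1", matching Table 7.2.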

8.3 HBase interaction

8.3.1 Clustering

To read data from HBase, run the script HBase_interaction/hbase_read_cluster.py with the following command. Please modify the table name and the corresponding column list in the Python code accordingly. An output CSV file will be generated storing the data.

python hbase_read_cluster.py

To write data to HBase, run the script HBase_interaction/hbase_write_cluster.py with the following command. The <input file> should be a CSV file in the format "document_ID, cluster_name, cluster_probability". This will fill in the HBase fields "cluster-name" and "cluster-probability", with "document_ID" as the row name.

python hbase_write_cluster.py <input file>


8.4 File Inventory

Directory                                             Description
Clustering/kmeans/src/main/scala/kmeans.scala         Code for clustering
Clustering/kmeans/src/main/scala/Preprocess.scala     Preprocessing for clustering
Clustering/kmeans/src/main/scala/cluster.scala        Supporting file for clustering
Clustering/analysis/clustering_analysis.py            Clustering results visualization
Clustering/analysis/cos_sim_inter.py                  Inter-cluster similarity calculation
Clustering/analysis/cos_sim_intra.py                  Intra-cluster similarity calculation
Clustering/HBase_interaction/hbase_read_cluster.py    Code for reading data from HBase
Clustering/HBase_interaction/hbase_write_cluster.py   Code for writing clustering results to HBase
Topic_Analysis/tokenizers.py                          Different tokenizers for the data
Topic_Analysis/mappers.py                             Mapping tokens to a different form
Topic_Analysis/filter.py                              Filtering out specific tokens
Topic_Analysis/pipeline.py                            Preprocessing pipeline
Topic_Analysis/hbase.py                               HBase interaction
Topic_Analysis/lda.py                                 Source code for LDA
Topic_Analysis/utils.py                               Miscellaneous functions
Topic_Analysis/readers.py                             Read dataset from file

Table 8.1: File Inventory


Chapter 9

Future Work and Enhancements

9.1 Clustering

In clustering, we only used hard clustering algorithms; in the future, we would like to use soft clustering algorithms as well. This might help in understanding the community structure of the documents in cases where a document could be part of more than one cluster. We did perform hierarchical clustering on some tweets and documents, but would like to compare and contrast these results with those of k-means. In the future, we would also like to improve the frequent word analysis on the resultant clusters to include techniques from community detection.

9.2 Topic Analysis

Topic analysis focused on two models – LDA and Twitter-LDA. Twitter-LDA assumes that each user has a distribution over topics. Since the data collected is event-driven and not user-driven, we realized that Twitter-LDA is not a good model, as the data violates some of the assumptions of the Twitter-LDA model. In the LDA model, however, we noticed some room for improvement in our process:

1. Automatic elimination of collection-specific words: While we currently maintain lists of collection-specific words to discard in the preprocessing step, it would be better to do this automatically by understanding the word distribution in a collection.

2. Crowd-sourcing annotations for topic names: In the current approach, we follow either automatic or manual naming of clusters. Automatic naming is fraught with challenges, as naming depends on the order of words and does not capture adequate semantic meaning. Manual naming can get equally difficult when we have several collections and topics. Moreover, neither approach helps us understand how coherent a topic really is. Therefore, we feel that topic naming is an ideal candidate for crowd-sourcing.

3. Joint model for tweets and webpages: Modeling tweets and webpages separately has its advantages; however, a joint model may better capture the latent themes in the combined collection. We found that the topics in tweets were generally different from those in webpages. This makes correlation between the two collections difficult, which may be alleviated by using a joint model.


Acknowledgments

This work was part of the Global Event and Trend Archive Research (GETAR) and Integrated Digital Event Archiving and Library (IDEAL) projects, supported by National Science Foundation grants IIS-1619028 and IIS-1319578, respectively. We would also like to thank Dr. Edward A. Fox (Instructor) and Liuqing Li (GTA) from the course CS 5604 at Virginia Tech for their support towards the completion of this work.


Bibliography

[1] NLTK Tweet Tokenizer. http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.casual.TweetTokenizer. Accessed: 2017-11-08.

[2] NLTK Word Tokenize. http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktLanguageVars.word_tokenize. Accessed: 2017-11-08.

[3] Scala Tutorial. http://www.scala-lang.org/. Accessed: 2017-11-08.

[4] Spark Tutorial. https://spark.apache.org/docs/1.5.0/. Accessed: 2017-11-08.

[5] Aman Ahuja, Wei Wei, and Kathleen M. Carley. Topic modeling in large scale social network data. Technical Report CMU-ISR-15-108, School of Computer Science, Carnegie Mellon University, 2015.

[6] Aman Ahuja, Wei Wei, and Kathleen M. Carley. Microblog sentiment topic model. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), pages 1031–1038. IEEE, 2016.

[7] James C Bezdek, Robert Ehrlich, and William Full. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2-3):191–203, 1984.

[8] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, pages 601–608, 2002.

[9] Jonathan Chang, Jordan Boyd-Graber, Chong Wang, Sean Gerrish, and David M. Blei. Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems, 2009.

[10] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (methodological), pages 1–38, 1977.

[11] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.


[12] Adil Fahad, Najlaa Alshatri, Zahir Tari, Abdullah Alamri, Ibrahim Khalil, Albert Y Zomaya, Sebti Foufou, and Abdelaziz Bouras. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing, 2(3):267–279, 2014.

[13] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: an efficient clustering algorithm for large databases. In ACM SIGMOD Record, volume 27, pages 73–84. ACM, 1998.

[14] Alexander Hinneburg and Daniel A. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In Proceedings of the 25th International Conference on Very Large Data Bases, VLDB ’99, pages 506–517, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[15] Alexander Hinneburg, Daniel A Keim, et al. An efficient approach to clustering in large multimedia databases with noise. In KDD, volume 98, pages 58–65, 1998.

[16] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57. ACM, 1999.

[17] Saurav Kaushik. An introduction to clustering. https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/, 2016 (accessed October 7, 2017).

[18] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.

[19] Rishabh Mehrotra, Scott Sanner, Wray Buntine, and Lexing Xie. Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, pages 889–892. ACM, 2013.

[20] Jared Suttles. tweetokenize. https://github.com/jaredks/tweetokenize. Accessed: 2017-11-08.

[21] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. In ACM SIGMOD Record, volume 25, pages 103–114. ACM, 1996.

[22] Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan, and Xiaoming Li. Comparing Twitter and traditional media using topic models. In European Conference on Information Retrieval, pages 338–349. Springer, 2011.