TOPIC MODELS FOR TAGGED TEXT
Our world has been experiencing dramatic and continually accelerating growth of digital textual information. This growth raises challenges in analyzing, understanding, organizing, and summarizing large bodies of text. A large portion of this text carries meta-data, such as user-annotated tags, which provides useful information and could improve current text mining results. This thesis therefore focuses on handling tagged text with topic modeling techniques.

We start from the Latent Dirichlet Allocation (LDA) model and introduce the Trivial Tag-Latent Dirichlet Allocation (TriTag-LDA) model, which directly connects tags to topics via a two-layer LDA: the bottom layer is standard LDA, while the upper layer is a constrained LDA whose topics come from the bottom layer. We then propose a new topic model, Tag-Latent Dirichlet Allocation (Tag-LDA), which integrates tags into the generative process more naturally. In Tag-LDA, a document is viewed as a mixture of tags rather than topics, and topics are drawn from multinomial distributions under the tags. TriTag-LDA and Tag-LDA thus bridge user-generated tags and latent topics. In both models, a tag is described as a mixture of shared topics, a representation that enables analysis of the relationships between tags. We provide quantitative and qualitative comparisons between our models and related work, and show that Tag-LDA is superior under the perplexity criterion. We also apply Tag-LDA to explain hashtags on Twitter and to discover their relationships.

We then develop two extensions of Tag-LDA: the Tag-Latent Dirichlet Process (Tag-LDP) and Tag-Latent Dirichlet Allocation with concepts (ConceptTag-LDA). Tag-LDP uses the Dirichlet process so that the number of topics can be determined automatically from the data.
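Tag-LDA's generative story, in which a document mixes tags and each tag mixes shared topics, can be illustrated with a minimal forward-sampling sketch. All sizes, hyperparameter values, and variable names below are hypothetical illustrations, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and symmetric priors for illustration only.
n_tags, n_topics, vocab_size, doc_len = 4, 6, 50, 20
alpha, beta, gamma = 0.1, 0.01, 0.1

# Shared topics: each topic is a multinomial over the vocabulary.
topic_word = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
# Each tag is a mixture over the shared topics.
tag_topic = rng.dirichlet(np.full(n_topics, gamma), size=n_tags)

def generate_document():
    """Sample one document under the Tag-LDA generative story (sketch)."""
    # A document is a mixture of tags rather than topics.
    doc_tag = rng.dirichlet(np.full(n_tags, alpha))
    words = []
    for _ in range(doc_len):
        t = rng.choice(n_tags, p=doc_tag)             # draw a tag
        z = rng.choice(n_topics, p=tag_topic[t])      # draw a topic under that tag
        w = rng.choice(vocab_size, p=topic_word[z])   # draw a word from the topic
        words.append(int(w))
    return words

doc = generate_document()
```

Because the topics are shared across tags, each row of `tag_topic` directly gives a tag's description as a mixture of topics, which is what enables comparing tags to one another.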
Our experiments demonstrate that Tag-LDP can infer the number of topics from the data and that the quality of its topics matches that of Tag-LDA. ConceptTag-LDA provides a mechanism for incorporating users' prior knowledge into learning the topics: knowledge represented as pre-defined concepts is modeled through a Dirichlet tree prior, which replaces the original Dirichlet prior in Tag-LDA. Our experiments study the influence of the concepts on the topics and demonstrate that the input concepts can steer the topics toward users' prior knowledge.

Finally, we present the dynamic Twitter topic model (DTTM), a temporal topic model tailored to the short messages of social media. On social media such as Twitter, people's discussions constantly evolve, with many discussions centering on events. A major event usually involves twists and turns, reflected in multiple sub-events as it develops over time, and this development is in turn reflected in people's discussions on Twitter. In DTTM, we assume that an event can be modeled by a mainstream topic plus several facets, and that each tweet is a mixture of two topics: the mainstream topic and one facet topic. To capture the temporal dynamics of the discussions, DTTM models the temporal evolution of both the mainstream topic and the facet topics. To demonstrate the effectiveness of DTTM in modeling these dynamics, we conducted two case studies on Twitter data and show that our model summarizes the discussions better than existing topic models.
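The per-tweet assumption in DTTM, that a tweet mixes exactly two topics (the mainstream topic and one facet), can be sketched by forward sampling. The sizes, the mixing weight `lam`, and the priors below are hypothetical placeholders, not parameters from the thesis, and the sketch omits the temporal evolution of the topic distributions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: one mainstream topic plus a few facet topics.
n_facets, vocab_size, tweet_len = 3, 40, 12
mainstream = rng.dirichlet(np.full(vocab_size, 0.05))
facets = rng.dirichlet(np.full(vocab_size, 0.05), size=n_facets)

def generate_tweet(lam=0.6):
    """Sketch of DTTM's per-tweet story: each tweet picks one facet and
    draws words from a two-topic mixture of the mainstream topic and
    that facet (lam is an illustrative mixing weight)."""
    f = int(rng.integers(n_facets))               # the tweet's single facet
    mix = lam * mainstream + (1 - lam) * facets[f]
    words = rng.choice(vocab_size, size=tweet_len, p=mix)
    return f, [int(w) for w in words]

facet_id, tweet = generate_tweet()
```

Restricting each tweet to two topics reflects how short Twitter messages rarely span many themes; the full model additionally evolves `mainstream` and `facets` over time to track sub-events.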