Topological Data Analysis in Text Processing
Topological data analysis (TDA) denotes a set of algorithms and methods for defining and retrieving the underlying structure of shapes in data. The use of topological inference in data mining, and in data science more generally, is recent, although computational geometry and computational topology have been studied in applied mathematics for many years. Some recent studies have shown the strength of topological data analysis on high-dimensional data sets. When dealing with noisy data, the most common goal in TDA is to isolate the underlying shapes as the most important property of the data; what remains may then be considered irrelevant information or simply noise. Topological inference has been applied to many sub-areas of pattern recognition and data mining, but it is not widely used in natural language processing and text mining. A simple reason is that defining shapes in text is not easy. In this dissertation, we introduce three algorithms for extracting topological features from textual documents. Two of them build on the two most popular representations of text, namely term frequency vectors and word embeddings, and one uses no conventional features at all: (1) To extract topological features without conventional features, we analyze the graph of appearance and co-appearance of different entities throughout long documents. We show how these topological structures in a text may effectively act as a signature or identifier of the topic, the writer, the writing itself, etc. (2) We then introduce a new algorithm for extracting topological features from text, which converts a sequence of word embeddings into a time series and analyzes the dimensions of the resulting series for topological persistence. (3) We also provide a topological method for analyzing the geometry of the term frequency space.
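To make algorithm (2) more concrete, the following is a minimal sketch of how topological persistence could be computed for each dimension of an embedding sequence. It assumes a sublevel-set filtration of each one-dimensional series and reports only 0-dimensional persistence pairs; the function names `sublevel_persistence` and `persistence_per_dimension` are hypothetical, and the filtration used in the dissertation itself may differ.

```python
import math

def sublevel_persistence(series):
    """Return sorted (birth, death) pairs of 0-dimensional persistence
    for the sublevel-set filtration of a 1-D sequence (assumed filtration)."""
    n = len(series)
    parent = [None] * n      # None means "not yet added to the filtration"
    birth = [0.0] * n        # birth value, meaningful at component roots

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    # Add positions in order of increasing value.
    for i in sorted(range(n), key=lambda k: series[k]):
        parent[i] = i
        birth[i] = series[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the component born later dies at the merge.
                young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                if birth[young] < series[i]:   # skip zero-persistence pairs
                    pairs.append((birth[young], series[i]))
                parent[young] = old
    if n:
        # The component of the global minimum never dies.
        pairs.append((min(series), math.inf))
    return sorted(pairs)

def persistence_per_dimension(embeddings):
    """Treat each embedding dimension as its own time series over word positions."""
    dims = len(embeddings[0])
    return [sublevel_persistence([vec[d] for vec in embeddings])
            for d in range(dims)]

# Toy series: local minima at 0.0, 1.0 and 0.5 separated by peaks at 2.0 and 3.0.
print(sublevel_persistence([0.0, 2.0, 1.0, 3.0, 0.5]))
# [(0.0, inf), (0.5, 3.0), (1.0, 2.0)]
```

Long-lived pairs correspond to pronounced dips in a dimension of the series, while short-lived pairs behave like noise; summary statistics of these pairs could then serve as document features.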
In all three algorithms, we apply persistent homology to reveal geometric structures at different distance resolutions. We focus on using the resulting features for text classification, though they may be useful for other natural language processing tasks as well. Our results show that even when the document representation is derived from the standard term frequency matrix or word embedding space, topological features produced from that same representation improve classification accuracy. This indicates that our topological features carry information that is not captured by conventional text analysis methods.
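As an illustration of what "different distance resolutions" means, the sketch below tracks how connected components of a point cloud merge as a distance threshold grows, which is the 0-dimensional part of a Vietoris-Rips persistence computation. The Euclidean metric, the toy points, and the function name `rips_h0_deaths` are assumptions made for this example; higher-dimensional homology would require a dedicated TDA library and is not shown.

```python
import math
from itertools import combinations

def rips_h0_deaths(points):
    """Distance thresholds at which connected components merge.
    Every component is born at threshold 0; each returned value is a death."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, processed from the finest resolution upward.
    edges = sorted((math.dist(points[a], points[b]), a, b)
                   for a, b in combinations(range(n), 2))
    deaths = []
    for d, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:          # this edge joins two separate components
            parent[ra] = rb
            deaths.append(d)
    return deaths

# Toy "documents" as points: two close together, one far away.
print(rips_h0_deaths([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]))
# [1.0, 4.0]
```

A large gap between consecutive merge distances suggests well-separated groups of points, here the pair of nearby points versus the distant one.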