Gholizadeh, Shafie
Topological Data Analysis in Text Processing
1 online resource (97 pages) : PDF
2020
University of North Carolina at Charlotte
Topological Data Analysis (TDA) denotes a set of algorithms and methods for defining and retrieving the underlying structure of shapes in data. The use of topological inference in data mining, and in data science more generally, is recent, although computational geometry and computational topology have been studied in applied mathematics for many years. Some recent studies have shown the strength of topological data analysis when dealing with high-dimensional data sets. When dealing with noisy data, the most common goal in TDA is to recover the underlying shapes as the most important property of the data; what remains may then be considered irrelevant information or simply noise. Topological inference has been applied to many sub-areas of pattern recognition and data mining, but it is not widely used in natural language processing and text mining. A simple reason is that defining shapes in text is not easy. In this dissertation, we introduce three algorithms for extracting topological features from textual documents. Two of them build on the two most popular underlying representations of text, namely term frequency vectors and word embeddings, and one uses no conventional features at all: (1) To extract topological features without using conventional features, we analyze the graph of appearance/co-appearance of different entities through long documents. We show how these topological structures in a text may effectively act as the signature or identifier of the topic, writer, writing, etc. (2) We then introduce a new algorithm for extracting topological features from text by converting a sequence of word embeddings into a time series and analyzing the dimensions of the resulting series for topological persistence. (3) We also provide a topological method to analyze the geometry of the term frequency space.
In all three algorithms, we apply homological persistence to reveal the geometric structures under different distance resolutions. We focus on utilizing our defined features for text classification, though they may be useful for other natural language processing tasks as well. Our results show that even when the document representation is derived from the standard term frequency matrix or word embedding space, topological features produced from that same representation improve classification accuracy, which indicates that they carry information not captured by conventional text analysis methods.
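As a rough illustration of what "homological persistence under different distance resolutions" means, the sketch below computes 0-dimensional persistence (connected components) for a small point cloud. It is not the dissertation's method: the function name `h0_persistence` and the toy 2-D points standing in for document vectors are assumptions made here for illustration only. Each point starts as its own component at scale 0, and a component "dies" at the distance at which it merges with another one.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Death scales of connected components in a Vietoris-Rips filtration.

    Every point is born as its own component at scale 0. Growing the
    distance threshold merges components; each merge records one death.
    Returns the n - 1 finite death scales in increasing order (one
    component survives at every scale and is not listed).
    """
    n = len(points)
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # this edge joins two separate components
            parent[root_i] = root_j
            deaths.append(dist)
    return deaths


# Hypothetical toy data: two well-separated clusters of "document" points.
docs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
deaths = h0_persistence(docs)
print(deaths)
```

Short-lived components (small death scales) correspond to points inside a cluster, while a single long-lived component signals the gap between the two clusters. Features of this kind, computed on real document representations and in higher homological dimensions, are what a classifier could consume; a dedicated TDA library would normally be used for that.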
doctoral dissertations
Computer science
Ph.D.
Persistent Homology; Text Mining; Topology
Computer Science
Zadrozny, Wlodek
Yang, Jing; Saule, Erik; Shaikh, Samira
Thesis (Ph.D.)--University of North Carolina at Charlotte, 2020.
This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s). For additional information, see http://rightsstatements.org/page/InC/1.0/.
Copyright is held by the author unless otherwise indicated.
Gholizadeh_uncc_0694D_12489
http://hdl.handle.net/20.500.13093/etd:2058