Creating New Concept-based Representations for Superior Text Analysis and Retrieval
Text analytics refers to a set of scalable techniques that mine unstructured and semi-structured textual resources in order to extract knowledge useful for a task at hand, such as document clustering and classification, entity extraction, text summarization, and semantic search. How to adequately represent input text in a machine-interpretable form that captures its syntactic and semantic structure remains an open research problem.

In this thesis, we identify and address the limitations of existing text representation models. These challenges fall into three major categories: the efficiency, effectiveness, and usability of the text representation. We propose new concept-based representations that leverage distributed representations and existing knowledge bases in order to address these challenges. Existing models such as the Bag-of-Words (BoW) and the bag of n-grams suffer from several drawbacks: 1) sparsity and the curse of dimensionality, which hurt their space and computational efficiency, and 2) vocabulary mismatch and lack of word order, which hurt their effectiveness. Distributional semantics models, which represent words as dense numerical vectors, are more efficient but uninterpretable. Explicit concept space models, which represent text as a Bag-of-Concepts (BoC), are easy to understand and interact with, but are sparse and suffer from concept mismatch.

Our objective in this thesis is to improve the analysis and retrieval of textual data, especially technical text (e.g., patents and scientific literature), using the proposed concept-based representations.
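To make the vocabulary mismatch problem concrete, the following sketch (illustrative only, not taken from the thesis) shows that two sentences with the same meaning but no shared terms get zero similarity under a plain BoW representation; the sentences and helper names are invented for this example.

```python
# Illustrative sketch: vocabulary mismatch in Bag-of-Words.
# Two semantically similar sentences that share no terms have
# zero cosine similarity under plain BoW.
from collections import Counter
import math

def bow(text):
    """Build a simple Bag-of-Words vector as a term->count dict."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

s1 = bow("automobile parked outside")   # "automobile" ...
s2 = bow("car sat in lot")              # ... never matches "car"
print(cosine(s1, s2))  # 0.0 despite the semantic overlap
```

Concept-based representations address exactly this gap by matching texts at the level of concepts rather than surface terms.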
We show through empirical evaluation that: 1) significant performance improvements can be achieved using our representations with both long technical text (patents) and short text (search queries), 2) our concept-based representations greatly facilitate interactive and visual analysis of technical text, and 3) the proposed conceptual representations are generic and applicable to many academic benchmark datasets, on which we achieve state-of-the-art performance.

First, we present a simple and efficient knowledge-based technique for reducing the dimensionality of the bag of n-grams model. Using this unsupervised technique on a benchmark dataset for patent classification, we achieve a 13-fold reduction in the number of bigram features and a 1.7% increase in classification accuracy over the BoW baseline.

Second, we address the challenge of representing short text, especially search queries, which lack context, order, and syntax (e.g., "software engineer google" vs. "google software engineer"). We propose a novel and effective representation that creates an ensemble of contextual, knowledge-based, and lexical features for the given short text. We report the performance of this ensemble representation on entity type recognition of search queries in the recruitment domain. The results show superior performance over traditional BoW and word embedding models, with a 97% micro-averaged F1 score.

Third, we present Mined Semantic Analysis (MSA), a novel concept-based representation model that uses unsupervised data mining techniques to discover concept-concept associations. These associations are subsequently used to enrich the BoC representation of the given text. Quantitative evaluation of MSA on benchmark datasets for measuring text semantic similarity shows its superior performance.
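The enrichment step behind MSA can be sketched as follows. This is a minimal illustration under assumed inputs: the concept names, association map, and discount weight are invented for the example and do not reproduce the thesis's mining procedure or data.

```python
# Hedged sketch of the MSA enrichment idea: each concept in a sparse
# Bag-of-Concepts vector also activates its mined associates at a
# discounted weight. The association map below is invented.

def enrich_boc(boc, associations, weight=0.5):
    """Return a new BoC dict where every concept propagates a
    fraction of its score to its mined associated concepts."""
    enriched = dict(boc)
    for concept, score in boc.items():
        for related in associations.get(concept, []):
            enriched[related] = enriched.get(related, 0.0) + weight * score
    return enriched

# Invented example: a text mapped to two explicit concepts.
boc = {"Machine learning": 1.0, "Patent": 0.8}
associations = {"Machine learning": ["Neural network", "Data mining"]}
print(enrich_boc(boc, associations))
# Adds "Neural network" and "Data mining" at weight 0.5 each,
# so two related texts can now overlap on the enriched concepts.
```

The enriched vector lets two texts that mention different but associated concepts achieve non-zero overlap, which is how MSA mitigates concept mismatch.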
Additionally, we demonstrate the usability of MSA representations by implementing a Web-based, semantic-driven, visual and interactive framework for innovation and patent analytics.

Fourth, we propose a neural model that learns distributed representations (embeddings) of concepts and entities from their mentions in encyclopedic knowledge bases (e.g., Wikipedia). This model has several advantages over sparse representations (i.e., BoW and BoC). First, it is space and computationally efficient. Second, it is more effective, as it helps overcome the concept mismatch problem: concepts are matched by comparing their embeddings rather than by traditional string matching. Third, it is expressive and interpretable. To enhance the learned concept embeddings, we further extend this model by combining the textual knowledge of Wikipedia with the knowledge from Microsoft's knowledge graph (Probase). We empirically evaluate the efficacy of the learned representations on benchmark datasets for measuring entity semantic relatedness, analogical reasoning, concept categorization, argument type identification for semantic parsing, and dataless classification, where we achieve state-of-the-art performance.

Finally, we address the usability of the text representation. We propose a novel interactive framework for patent retrieval, a domain-specific text retrieval task. The proposed framework leverages distributed representations of concepts and entities extracted from patent text. We also propose a simple and practical interactive relevance feedback mechanism in which the user is asked to annotate relevant/irrelevant results among the top-n hits. We then use this feedback for query reformulation and term weighting, assigning each term a weight based on how well it discriminates between the relevant and irrelevant candidates.
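The discriminative term weighting step can be sketched as below. The scoring formula here (relevant document frequency minus irrelevant document frequency) is a simple illustrative choice, not the exact scheme from the thesis, and the documents are invented.

```python
# Hedged sketch of feedback-driven term weighting: score each term
# by how well it separates user-annotated relevant hits from
# irrelevant ones. Formula and data are illustrative assumptions.

def term_weights(terms, relevant_docs, irrelevant_docs):
    """Assign each term a weight in [-1, 1]: positive if it occurs
    more often among relevant documents, negative otherwise."""
    weights = {}
    for t in terms:
        rel = sum(1 for d in relevant_docs if t in d) / max(len(relevant_docs), 1)
        irr = sum(1 for d in irrelevant_docs if t in d) / max(len(irrelevant_docs), 1)
        weights[t] = rel - irr
    return weights

# Invented feedback: each document is its set of terms.
relevant = [{"rotor", "blade", "turbine"}, {"rotor", "turbine"}]
irrelevant = [{"rotor", "engine"}]
print(term_weights({"rotor", "turbine", "engine"}, relevant, irrelevant))
# "turbine" -> 1.0 (relevant only), "engine" -> -1.0, "rotor" -> 0.0
```

Terms with high positive weight are boosted in the reformulated query, while negatively weighted terms are down-weighted or dropped.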
First, we demonstrate the efficacy of the distributed representations on the CLEF-IP 2010 dataset, where we achieve a significant improvement of 4.6% in recall over the keyword search baseline. Second, we simulate user interactivity to demonstrate the efficacy of the proposed interactive term weighting scheme. Simulation results show an additional 1.9% to 11.6% improvement in mean average precision from a single interaction iteration, outperforming previous semantic and interactive patent retrieval methods.