IDENTIFYING SERENDIPITOUS DRUG USAGES FROM PATIENT-REPORTED MEDICATION OUTCOMES ON SOCIAL MEDIA
Drug repositioning offers lower safety risk and development cost than developing new drugs, and it has attracted broad interest from the biomedical community. Over the past decades, computational approaches have examined biological, chemical, literature, and electronic health record data for systematic drug repositioning. However, because of the limitations of these data sources, none of them alone appears sufficient for drug repositioning research. In recent years, more and more patients have turned to social media to report and discuss their medication outcomes. Among these reports, we noticed mentions of serendipitous drug usages, which we hypothesize to be a new, independent data source for studying drug repositioning, one that complements existing data sources in identifying and validating drug repositioning hypotheses. In our first work, we examined the medication outcome information available on four social media sites: WebMD, PatientsLikeMe, YouTube, and Twitter. We found the patient health forum of WebMD to be the best social media site for our research in terms of data availability and quality, although colloquial patient language is challenging for computers to process. In the second phase of this dissertation, we explored state-of-the-art natural language processing (NLP) and machine learning methods to identify mentions of serendipitous drug usages in social media text. We curated a gold-standard dataset based on filtered drug reviews from WebMD. Among 15,714 sentences in total, our annotators manually identified 447 sentences mentioning novel desirable drug usages that were not listed as known drug indications by WebMD and were thus considered serendipitous drug usages. We constructed features using NLP methods and medical knowledge. We then built SVM, random forest, AdaBoost.M1, and deep learning models and evaluated their predictive power on serendipitous drug usages.
Our best model (AdaBoost.M1) achieved an AUC score of 0.937 on the independent test dataset, with a precision of 0.811 and a recall of 0.476. Our models predicted several serendipitous drug usages, including metformin and bupropion for obesity, tramadol for depression, and ondansetron for irritable bowel syndrome with diarrhea, which were also supported by evidence from the scientific literature. These results demonstrate that patient-reported medication outcomes on social media are complementary to other data sources for drug repositioning, and that NLP and machine learning methods make this new data source feasible to use. Finally, we implemented the NLP and machine learning methods explored in this dissertation in an open-source software application that enables users without extensive NLP and machine learning skills to extract serendipitous drug usages mentioned in social media text.
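As a reading aid for the figures quoted above, the following minimal sketch derives two quantities that the abstract does not itself report: the share of annotated sentences labeled as serendipitous drug usages, and the F1 score implied by the stated precision and recall. The function and variable names are illustrative assumptions. Note also that the sentence counts describe the full annotated dataset, whereas precision and recall were measured on the independent test dataset.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract.
TOTAL_SENTENCES = 15_714      # annotated sentences in the gold-standard dataset
POSITIVE_SENTENCES = 447      # sentences labeled as serendipitous drug usages
PRECISION = 0.811             # AdaBoost.M1, independent test dataset
RECALL = 0.476                # AdaBoost.M1, independent test dataset

# Derived values (not reported in the abstract; computed here for illustration).
positive_rate = POSITIVE_SENTENCES / TOTAL_SENTENCES
f1 = f1_score(PRECISION, RECALL)

print(f"Positive class share: {positive_rate:.2%}")
print(f"Implied F1 score:     {f1:.3f}")
```

The small positive share indicates a strongly imbalanced classification problem, which is one plausible reading of why the reported recall is markedly lower than the reported precision.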