Abstract
In the era of big data, identifying the right dataset for analysis has become a serious challenge in data science. In health data science especially, datasets are frequently complex and access-restricted, so users need considerable time, effort, and background knowledge to understand and select a dataset and begin analysis. This complexity hinders the development of health data science, and we believe significant effort is warranted to improve the dataset identification process. We propose that providing complete knowledge of healthcare datasets would greatly facilitate dataset identification. Just as a library catalog lets people find a desired book easily, complete knowledge of datasets should let users quickly identify the most relevant, high-quality datasets for their research purposes. Toward this goal, we start by providing both content-level and quality-level knowledge that is sufficiently comprehensive to cover the needs of one group of users: health data science novices. Specifically, we systematically examined the needs of the target users, extracted knowledge tailored to these needs, established quantifiable measurements of data quality (a Publication-based Popularity Index (PPI) and an Association-based intrinsic Quality Index (AQI)), and developed a healthcare Dataset Information Resource (DIR) framework to efficiently represent knowledge about datasets. The results from user studies indicate that the solution is promising.