Abstract
Annotating all cis-regulatory modules (CRMs) and their constituent transcription factor (TF) binding sites (TFBSs) in genomes is essential for understanding genome functions; however, the task remains highly challenging. In this dissertation, we first developed a new algorithm, dePCRM2, for predicting CRMs and TFBSs by integrating numerous TF ChIP-seq datasets using an ultra-fast motif-finding algorithm. dePCRM2 partitions the genome regions covered by extended binding peaks in the datasets into a set of CRM candidates (CRMCs) and a set of non-CRMCs, and evaluates each CRMC using a novel score that captures essential features of CRMs. Applying dePCRM2 to 6,092 datasets covering 77.47% of the human genome, we predicted 201 unique TF binding motif families and 1,404,973 CRMCs. dePCRM2 largely outperforms existing methods. Based on our predictions, we estimate that about 55% and 22% of the genome code for CRMs and TFBSs, respectively. Thus, the regulatory genome is more prevalent than originally thought. Moreover, based on the highly similar evolutionary behaviors of TFBSs and inter-TFBS spacer sequences, we provide genome-wide evidence for the continuum model of TF binding in CRMs. Additionally, because epigenomic marks determine the functional states of CRMs and thereby play crucial roles in cell fate determination and cell type maintenance during cell differentiation, they can help predict the functional states of CRMs. Although genomic sequences play a crucial role in establishing the unique epigenome of each cell type during cell differentiation, little is known about the sequence determinants that lead to these unique epigenomes. We developed two types of highly accurate deep convolutional neural networks (CNNs), for cell types and for histone marks, respectively. The results showed that they are powerful tools for uncovering the sequence determinants of the various histone modification patterns in different cell types.
We found that the sequence motifs learned by the CNN models are highly similar to known binding motifs of TFs that play important roles in cell differentiation. Using these models, we can predict the importance of the learned motifs and their interactions in determining specific histone mark patterns in the cell types. Thus, the CNNs provide a way to pinpoint the influences of the motifs on epigenomic marks. Finally, although several databases have been developed for predicted or experimentally determined enhancers/CRMs, they cover only a small portion of the CRMs encoded in the genomes, lack constituent TFBSs, have high false-positive rates, and are often dedicated to a single organism. To aid the use of the predicted CRMs and TFBSs by the research community, we developed a database, dePCRMS (de novo predicted CRMs) (https://pcrms.uncc.edu). Currently, dePCRMS contains 1,155,151, 777,409, and 19,515 CRMs, and 89,948,206, 103,718,473, and 3,758,557 TFBSs, in Homo sapiens, Mus musculus, and Caenorhabditis elegans, respectively. Users can use the web interface to quickly browse and visualize the CRMs and their constituent TFBSs at different significance levels in selected chromosomes of an organism. Moreover, the web interface provides three functional analysis modules that let the user 1) search for the CRM closest to a gene, 2) search for CRMs in a given genomic range around a gene, and 3) search for TFBSs in CRMs for a given TF. The dePCRMS database can be an informative tool for users to characterize the functions of regulatory genomes in important organisms.