Prediction of Cis-regulatory Modules in Genomes

Ni, Pengyu

2020

Abstract

Annotating all cis-regulatory modules (CRMs) and their constituent transcription factor (TF) binding sites (TFBSs) in genomes is essential to understanding genome functions; however, the task remains highly challenging. In this dissertation, we first developed a new algorithm, dePCRM2, for predicting CRMs and TFBSs by integrating numerous TF ChIP-seq datasets based on an ultra-fast motif-finding algorithm. dePCRM2 partitions the genome regions covered by extended binding peaks in the datasets into a set of CRM candidates (CRMCs) and a set of non-CRMCs, and evaluates each CRMC using a novel score that captures essential features of CRMs. Applying dePCRM2 to 6,092 datasets covering 77.47% of the human genome, we predicted 201 unique TF binding motif families and 1,404,973 CRMCs. dePCRM2 largely outperforms existing methods. Based on our predictions, we estimate that about 55% and 22% of the genome code for CRMs and TFBSs, respectively. Thus, the regulatory genome is more prevalent than originally thought. Moreover, based on the highly similar evolutionary behaviors of TFBSs and inter-TFBS spacer sequences, we provide genome-wide evidence for the continuum model of TF binding in CRMs.

Additionally, epigenomic marks determine the functional states of CRMs and thereby play crucial roles in cell fate determination and cell type maintenance during cell differentiation, so they can help to predict the functional states of CRMs. Although genomic sequences play a crucial role in establishing the unique epigenome of each cell type during cell differentiation, little is known about the sequence determinants that lead to these unique epigenomes. We developed two types of highly accurate deep convolutional neural networks (CNNs), one for cell types and one for histone marks. The results showed that these models are powerful tools for uncovering the sequence determinants of the various histone modification patterns in different cell types. We found that the sequence motifs learned by the CNN models are highly similar to the binding motifs of TFs known to play important roles in cell differentiation. Using these models, we can predict the importance of the learned motifs and of their interactions in determining specific histone mark patterns in the cell types. Thus, the CNNs provide a way to pinpoint the influence of the motifs on epigenomic marks.

Finally, although several databases have been developed for predicted or experimentally determined enhancers/CRMs, they cover only a small portion of the CRMs encoded in the genomes, lack constituent TFBSs, have high false-positive rates, and are often dedicated to a single organism. To aid the use of the predicted CRMs and TFBSs by the research community, we developed a database, dePCRMS (de novo predicted CRMs) (https://pcrms.uncc.edu). Currently, dePCRMS contains 1,155,151, 777,409 and 19,515 CRMs, and 89,948,206, 103,718,473, and 3,758,557 TFBSs, in Homo sapiens, Mus musculus and Caenorhabditis elegans, respectively. Users can use the web interface to quickly browse and visualize the CRMs and their constituent TFBSs at different significance levels in selected chromosomes of an organism. Moreover, the web interface provides three functional analysis modules that allow the user 1) to search for the closest CRM to a gene, 2) to search for CRMs in a given genomic range around a gene, and 3) to search for TFBSs in CRMs for a given TF. The dePCRMS database can be an informative tool for users to characterize the functions of regulatory genomes in important organisms.
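
The three search modules mentioned in the abstract can be illustrated with a minimal Python sketch over in-memory intervals. This is not the dePCRMS implementation or its API; the `Interval` type, the function names, the half-open coordinate convention, and the toy coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Interval:
    """Hypothetical genomic interval: 0-based start, exclusive end."""
    chrom: str
    start: int
    end: int
    name: str  # gene symbol, CRM id, or TF name (illustrative only)


def gap(a: Interval, b: Interval) -> int:
    """Distance in bp between two intervals on one chromosome (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def closest_crm(gene: Interval, crms: List[Interval]) -> Optional[Interval]:
    """Query 1 (assumed semantics): the CRM nearest to a gene on its chromosome."""
    same_chrom = [c for c in crms if c.chrom == gene.chrom]
    return min(same_chrom, key=lambda c: gap(gene, c), default=None)


def crms_near(gene: Interval, crms: List[Interval], flank: int) -> List[Interval]:
    """Query 2 (assumed semantics): CRMs within `flank` bp of a gene."""
    return [c for c in crms if c.chrom == gene.chrom and gap(gene, c) <= flank]


def tfbss_for_tf(tf: str, tfbss: List[Interval], crms: List[Interval]) -> List[Interval]:
    """Query 3 (assumed semantics): sites of one TF that lie fully inside some CRM."""
    return [
        s for s in tfbss
        if s.name == tf and any(
            c.chrom == s.chrom and c.start <= s.start and s.end <= c.end
            for c in crms
        )
    ]


if __name__ == "__main__":
    # Toy data, invented for this example only.
    gene = Interval("chr1", 1000, 2000, "GENE_A")
    crms = [
        Interval("chr1", 100, 400, "crm1"),
        Interval("chr1", 2500, 2900, "crm2"),
        Interval("chr2", 1500, 1600, "crm3"),
    ]
    tfbss = [
        Interval("chr1", 150, 160, "TF_X"),
        Interval("chr1", 2600, 2612, "TF_Y"),
        Interval("chr1", 450, 460, "TF_X"),
    ]
    print(closest_crm(gene, crms))
    print(crms_near(gene, crms, flank=600))
    print(tfbss_for_tf("TF_X", tfbss, crms))
```

The linear scans keep the sketch short; at the scale of the CRM and TFBS counts reported in the abstract, an indexed structure such as sorted start coordinates or an interval tree would likely be needed.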

Details

Title

Prediction of Cis-regulatory Modules in Genomes

Author

Ni, Pengyu (Bioinformatics)

Contributor

ProQuest (Firm) (Contributor)
University of North Carolina at Charlotte (Degree Granting Institution)
Su, Zhengchang (Thesis Advisor)
Su, Zhengchang (Committee Member)
Guo, Juntao (Committee Member)
Shi, Xinghua (Committee Member)
Song, Baohua (Committee Member)

Date

2020

Publisher

University of North Carolina at Charlotte

Subjects

Bioinformatics

Keywords

Cis-Regulatory Modules; Databases; Deep Learning; Enhancers; Epigenome

Link to This Page

Handle: http://hdl.handle.net/20.500.13093/etd:1834

Publication Type

doctoral dissertations

Pagination

1 online resource (159 pages) : PDF

File Format

application/pdf

Degree Type

Ph.D.

Usage Statement

This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s). For additional information, see http://rightsstatements.org/page/InC/1.0/.
Copyright is held by the author unless otherwise indicated.

Record Appears in

Departments and Institutes > Bioinformatics
Types > Doctoral Dissertations
Graduate Theses and Dissertations
