Abstract
Advances in affordable transcriptome sequencing combined with better exon and gene prediction have motivated many to compare transcription across the tree of life. We develop a mathematical framework to calculate complexity and compare transcript models. Structural features, i.e. intron retention (IR), donor/acceptor site variation, alternative exon cassettes, alternative 5′/3′ UTRs, are compared and the distance between transcript models is calculated with nucleotide-level precision. All metrics are implemented in a PyPI package, TranD, and output can be used to summarize splicing patterns for a transcriptome (1GTF) and between transcriptomes (2GTF). TranD output enables quantitative comparisons between: annotations augmented by empirical RNA-seq data and the original transcript models; transcript model prediction tools for long-read RNA-seq (e.g. FLAIR versus Isoseq3); alternate annotations for a species (e.g. RefSeq vs Ensembl); and between closely related species. In C. elegans, Z. mays, D. melanogaster, D. simulans and H. sapiens, alternative exons were observed more frequently in combination with an alternative donor/acceptor than alone. Transcript models in RefSeq and Ensembl are linked and both have unique transcript models with empirical support. D. melanogaster and D. simulans share many transcript models, and long-read RNA-seq data suggest that both species are under-annotated. We recommend combined references.