The effect of structure in short regions of DNA on measurements on short-oligonucleotide microarray and Ion Torrent PGM sequencing platforms
Single-stranded DNA in solution has been studied by biophysicists for many years because complex structures, both stable and dynamic, form under normal experimental conditions. Stable intra-strand structures affect enzymatic technical processes such as PCR and biological processes such as gene regulation. In the research described here we examined the effect of such structures on two high-throughput genomic assay platforms and asked whether we could predict those effects to improve the interpretation of genomic sequencing results.

Helical structures in DNA can be composed of interactions across strands or within a strand. Exclusion of the aqueous solvent provides an entropic advantage to more compact structures. Our first experiments tested whether internal helical regions in one of the two binding partners in a microarray experiment would influence the stability of the complex. Our results are novel and show, from molecular simulations and hybridization experiments, that stable secondary structures on the boundary, when they do not impede the ability of targets to access the probes, stabilize the probe-target hybridization.

High-throughput sequencing (HTS) platforms use short single-stranded DNA fragments as templates. We tested the influence of template secondary structure on the fidelity of reads generated using the Ion Torrent PGM platform. For targets with long hairpin structures (~20 bp), a high level of mis-calling occurs, particularly deletions, some of which are 20-30 bases long. These deletions are not associated with homopolymers, which are known to cause base mis-calls on the PGM, and the effect of structure on the sequencing reaction, rather than on the PCR preparative steps, has not been previously published.

As HTS technologies bring down the cost of sequencing whole genomes, a number of unexpected observations have arisen.
An example that caught our attention is the detection of far more short deletions than had been found using Sanger methods. Their prevalence is particularly high in the Korean genome. Since we showed that helical structures could disrupt the fidelity of base calls on the Ion Torrent, we examined the context of the apparent deletions to determine whether any sequence or structure pattern distinguished them. Starting with the genome provided by Kim et al. (1), we selected deletions longer than 2 bases from chromosome 1 of a Korean genome and created 70-nucleotide fragments centered on each deletion. We simulated the secondary structures using OMP software and then used the Random Forest algorithm in the WEKA modeling package to characterize the relations between the deletions and the secondary structures in or around them. After training the model on chromosome 1 deletions, we tested it on chromosome 20 deletions. We show that sequence information alone cannot predict whether a deletion will occur, while the addition of structural information improves the prediction rates. Classification rates are not yet high: additional data and a more precise structural description are likely needed to train a robust model. We are unable to state which of the structures affect in vitro platforms and which occur in vivo. A comparative genomics approach using 38 genomes recently made available for the CAMDA 2013 competition should provide the necessary information to train separate models if the important features differ between the two cases.
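
The fragment-extraction step described above (70-nucleotide fragments centered on each deletion) can be illustrated with a minimal Python sketch. This is not the pipeline used in the work: the function name `centered_fragment`, the 0-based reference coordinates, the choice to center on the midpoint of the deleted bases, and the decision to skip deletions too close to a chromosome end are all assumptions made for illustration.

```python
def centered_fragment(reference, del_start, del_len, width=70):
    """Return a `width`-nt slice of `reference` centered on a deletion.

    Assumptions (hypothetical, not from the original work):
      - `reference` is the reference chromosome sequence as a string
      - `del_start` is the 0-based index of the first deleted base
      - `del_len` is the number of deleted bases
    Returns None when the window would run off either end of the sequence.
    """
    # Midpoint of the deleted bases in reference coordinates.
    center = del_start + del_len // 2
    start = center - width // 2
    end = start + width
    if start < 0 or end > len(reference):
        return None
    return reference[start:end]


# Toy reference: a 3-base "deletion" (CGT) flanked by 100 A's on each side.
toy_reference = "A" * 100 + "CGT" + "A" * 100
fragment = centered_fragment(toy_reference, del_start=100, del_len=3)
```

Fragments produced this way would then be passed to a structure-prediction tool and a classifier; those steps are omitted here because they depend on the specific software and its settings.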