Pair HMM based gap statistics for re-evaluation of indels in alignments with affine gap penalties: Extended Version
Although computationally aligning sequences is a crucial step in the vast
majority of comparative genomics studies, our understanding of alignment biases
still needs to be improved. To infer true structural or homologous regions,
computational alignments need further evaluation. It has been shown that the
accuracy of aligned positions can drop substantially, in particular around gaps.
Here we focus on re-evaluation of score-based alignments with affine gap
penalty costs. We exploit their relationships with pair hidden Markov models
and develop efficient algorithms by which to identify gaps which are
significant in terms of length and multiplicity. We evaluate our statistics
with respect to the well-established structural alignments from SABmark and
find that indel reliability substantially increases with their significance, in
particular in worst-case twilight zone alignments. This indicates that our
statistics can reliably complement other methods, which mostly focus on the
reliability of match positions.
Comment: 17 pages, 7 figures
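To make the affine gap model concrete, here is a minimal sketch of the standard three-state dynamic program (Gotoh-style) for the optimal global alignment score. It is not the authors' gap statistics. The three tables correspond to the match state and the two insert states of a pair HMM, which is the relationship the abstract refers to. The function name and all scoring parameters are illustrative assumptions; a gap of length L is assumed to cost `gap_open + (L - 1) * gap_extend`.

```python
def affine_gap_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Optimal global alignment score of a and b under affine gap penalties.

    Illustrative parameters (assumptions, not from the paper): a gap of
    length L contributes gap_open + (L - 1) * gap_extend to the score.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # M: a[i-1] is aligned to b[j-1]   (pair-HMM match state)
    # X: a[i-1] is aligned to a gap    (insert state for a)
    # Y: b[j-1] is aligned to a gap    (insert state for b)
    M = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    X = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    Y = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Opening a gap is charged when coming from any other state;
            # staying in the same gap state is charged the extension cost.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])
```

With these assumed scores, one long gap is cheaper than several short ones of the same total length, which is why gap length and gap multiplicity are separate quantities worth testing for significance.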
Fast prediction of RNA-RNA interaction
Background: Regulatory antisense RNAs are a class of ncRNAs that regulate gene expression by prohibiting the translation of an mRNA by establishing stable interactions with a target sequence. There is great demand for efficient computational methods to predict the specific interaction between an ncRNA and its target mRNA(s). There are a number of algorithms in the literature which can predict a variety of such interactions, unfortunately at a very high computational cost. Although some existing target prediction approaches are much faster, they are specialized for interactions with a single binding site. Methods: In this paper we present a novel algorithm to accurately predict the minimum free energy structure of RNA-RNA interaction under the most general type of interactions studied in the literature. Moreover, we introduce a fast heuristic method to predict the specific (multiple) binding sites of two interacting RNAs. Results: We verify the performance of our algorithms for joint structure and binding site prediction on a set of known interacting RNA pairs. Experimental results show our algorithms are highly accurate and outperform all competitive approaches.
mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications
High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the 'best' mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose we introduce mrsFAST-Ultra, a fast, cache oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves mrsFAST, our first cache oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact the size of the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. As importantly, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to the reference genome while discounting the mismatches that occur at common SNP locations provided by dbSNP; this significantly increases the number of reads that can be mapped to the reference genome. Note that all of the above features are implemented within the index structure rather than as simple post-processing steps, and are therefore performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST.
In comparison to newly enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in the multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2GB for the entire human reference genome, which is roughly half of that of Bowtie2. mrsFAST-Ultra is open source and it can be accessed at http://mrsfast.sourceforge.net
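As a rough illustration of what SNP-aware multi-mapping means, the sketch below scans a reference by brute force and reports every locus where a read matches within a mismatch threshold, ignoring mismatches at known SNP positions. This is not how mrsFAST-Ultra works internally (the abstract states that it implements these features inside its index structure). The function names, the 0-based coordinates and the `snp_positions` set are assumptions made for this example.

```python
def snp_aware_mismatches(read, reference, start, snp_positions):
    """Count mismatches between read and reference[start:], skipping SNP sites.

    snp_positions is a hypothetical set of 0-based reference coordinates
    that are known to be polymorphic; mismatches there are not counted.
    """
    mismatches = 0
    for offset, base in enumerate(read):
        ref_pos = start + offset
        if base != reference[ref_pos] and ref_pos not in snp_positions:
            mismatches += 1
    return mismatches


def all_mapping_loci(read, reference, snp_positions, max_mismatches):
    """Return every start position where the read maps within the threshold."""
    loci = []
    for start in range(len(reference) - len(read) + 1):
        if snp_aware_mismatches(read, reference, start, snp_positions) <= max_mismatches:
            loci.append(start)
    return loci
```

For example, with reference `"ACGTACGA"` and read `"ACCT"`, no locus is reported at zero mismatches when `snp_positions` is empty, while locus 0 is reported when position 2 is marked as a SNP. This is the effect the abstract describes as increasing the number of mappable reads.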
The lncRNA landscape of breast cancer reveals a role for DSCAM-AS1 in breast cancer progression.
Molecular classification of cancers into subtypes has resulted in an advance in our understanding of tumour biology and treatment response across multiple tumour types. However, to date, cancer profiling has largely focused on protein-coding genes, which comprise <1% of the genome. Here we leverage a compendium of 58,648 long noncoding RNAs (lncRNAs) to subtype 947 breast cancer samples. We show that lncRNA-based profiling categorizes breast tumours by their known molecular subtypes in breast cancer. We identify a cohort of breast cancer-associated and oestrogen-regulated lncRNAs, and investigate the role of the top prioritized oestrogen receptor (ER)-regulated lncRNA, DSCAM-AS1. We demonstrate that DSCAM-AS1 mediates tumour progression and tamoxifen resistance and identify hnRNPL as an interacting protein involved in the mechanism of DSCAM-AS1 action. By highlighting the role of DSCAM-AS1 in breast cancer biology and treatment resistance, this study provides insight into the potential clinical implications of lncRNAs in breast cancer.
Faster Approximate String Matching for Short Patterns
We study the classical approximate string matching problem, that is, given
strings $P$ and $Q$ and an error threshold $k$, find all ending positions of
substrings of $Q$ whose edit distance to $P$ is at most $k$. Let $P$ and $Q$
have lengths $m$ and $n$, respectively. On a standard unit-cost word RAM with
word size $w \geq \log n$ we present an algorithm using time
$$O\left(nk \cdot \min\left(\frac{\log^2 m}{\log n}, \frac{\log^2 m \log w}{w}\right) + n\right).$$
When $P$ is short, namely, $m = 2^{o(\sqrt{\log n})}$ or
$m = 2^{o(\sqrt{w/\log w})}$, this improves the previously best known time
bounds for the problem. The result is achieved using a novel implementation of
the Landau-Vishkin algorithm based on tabulation and word-level parallelism.
Comment: To appear in Theory of Computing Systems
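The problem statement can be illustrated with the classical column-wise dynamic program (often attributed to Sellers), which runs in time proportional to the product of the two string lengths. This is a baseline sketch only, not the tabulated Landau-Vishkin variant the abstract describes. The function and parameter names are assumptions, the error threshold is called `k` here, and ending positions are reported as 0-based indices into the text.

```python
def approx_match_ends(pattern, text, k):
    """Return 0-based end positions j such that some substring of text
    ending at j has edit distance at most k to pattern.

    Classical O(len(pattern) * len(text)) dynamic program; a baseline
    sketch, not the word-RAM algorithm from the abstract.
    """
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and any substring of
    # the text ending at the current position (initially: the empty prefix).
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text):
        prev_diag = col[0]
        col[0] = 0  # a match may start anywhere in the text at no cost
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(prev_diag + cost,  # substitute or match
                         old + 1,           # extra character in text
                         col[i - 1] + 1)    # missing character in text
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

Only one column of the table is kept at a time, so the extra memory is linear in the pattern length.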
