A study on the correlation of nucleotide skews and the positioning of the origin of replication: different modes of replication in bacterial species
Deviations from Chargaff's 2nd parity rule, according to which A∼T and G∼C in single stranded DNA, have been associated with replication as well as with transcription in prokaryotes. Based on observations regarding mainly the transcription-replication co-linearity in a large number of prokaryotic species, we formulate the hypothesis that the replication procedure may follow different modes between genomes throughout which the skews clearly follow different patterns. We draw the conclusion that multiple functional sites of origin of replication may exist in the genomes of most archaea and in some exceptional cases of eubacteria, while in the majority of eubacteria, replication occurs through a single fixed origin.
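The kind of nucleotide skew discussed in this abstract can be illustrated with a cumulative GC skew profile. The sketch below is a generic illustration, not the authors' method: the window size, the function names `cumulative_gc_skew` and `skew_extrema`, and the reading of the minimum as a candidate origin are assumptions (the minimum of the cumulative skew is a commonly used heuristic for single-origin bacterial genomes).

```python
def cumulative_gc_skew(seq, window=1000):
    """Cumulative GC skew (G - C) / (G + C), one value per non-overlapping window."""
    seq = seq.upper()
    running = 0.0
    cumulative = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window with no G or C contributes zero skew.
        running += (g - c) / (g + c) if g + c else 0.0
        cumulative.append(running)
    return cumulative


def skew_extrema(seq, window=1000):
    """End coordinates of the windows with minimal and maximal cumulative skew.

    Assumption (heuristic, hypothetical usage): in a single-origin genome the
    minimum is a candidate origin and the maximum a candidate terminus.
    Expects a non-empty sequence.
    """
    cumulative = cumulative_gc_skew(seq, window)
    indices = range(len(cumulative))
    i_min = min(indices, key=cumulative.__getitem__)
    i_max = max(indices, key=cumulative.__getitem__)
    end = lambda i: min((i + 1) * window, len(seq))
    return end(i_min), end(i_max)
```

A profile with one clear minimum and one clear maximum is what a single fixed origin would be expected to produce under this heuristic; a profile with several local extrema would not fit that simple picture.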
Information decomposition of symbolic sequences
We developed a non-parametric method of Information Decomposition (ID) of the
content of any symbolic sequence. The method is based on the calculation of
Shannon mutual information between the analyzed sequence and artificial symbolic
sequences, and allows latent periodicity to be revealed in any symbolic
sequence. We show the stability of the ID method in the case of a large number
of random letter changes in an analyzed symbolic sequence. We demonstrate the
possibilities of the method by analyzing poems as well as DNA and protein
sequences. In DNA and protein sequences we show the existence of many DNA and
amino acid sequences with different types and lengths of latent periodicity.
The possible origin of latent periodicity for different symbolic sequences is
discussed. Comment: 18 pages, 8 figures
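The mutual-information idea in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the artificial sequence is taken to be the periodic index sequence 0, 1, ..., p-1, 0, 1, ... for a trial period p, and the function names `mutual_information` and `periodicity_spectrum` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(seq, period):
    """Shannon mutual information (bits) between seq and the artificial
    periodic sequence 0, 1, ..., period-1, 0, 1, ... of the same length."""
    n = len(seq)
    joint = Counter((i % period, s) for i, s in enumerate(seq))
    phase = Counter(i % period for i in range(n))
    symbol = Counter(seq)
    # I = sum over (phase, symbol) of p(phase, symbol) * log2(p(phase, symbol) / (p(phase) * p(symbol)))
    return sum(
        (c / n) * math.log2(c * n / (phase[p] * symbol[s]))
        for (p, s), c in joint.items()
    )


def periodicity_spectrum(seq, max_period):
    """Mutual information for every trial period from 2 to max_period."""
    return {p: mutual_information(seq, p) for p in range(2, max_period + 1)}
```

For a perfectly periodic toy sequence such as "ABC" repeated ten times, the value at period 3 equals the entropy of the symbol distribution, while period 2 gives zero. Assessing whether a peak is significant in real data would need a statistical step that this sketch leaves out.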
Optimal Computation of Overabundant Words
The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017). In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of x whose longest infix is the label of an explicit node of T is no more than 3n-4. We further show that the presented algorithm is time-optimal by proving that O(n) is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms.
Rich dynamical behaviors from a digital reversal operation
An operation that maps one natural number to another can be considered as a
dynamical system in the natural numbers. Some such systems, e.g. the mapping in
the so-called 3x+1 problem proposed by Collatz, are conjectured to have a single
global attractor, whereas other systems, e.g. linear congruence, could be
ergodic. Here we demonstrate that an operation that is based on digital
reversal has a spectrum of dynamical behaviors, including 2-cycle, 12-cycle,
periodic attractors with other cycle lengths, and diverging limiting dynamics
that escape to infinity. This dynamical system has an infinite number of cyclic
attractors, and may have an unlimited number of cycle lengths. It also has a
potentially infinite number of diverging trajectories with a recurrent pattern
repeating every 8 steps. Although the transient time before settling on a
limiting dynamics is relatively short, we speculate that transient times may
not have an upper bound. Comment: 5 figures
Information content based model for the topological properties of the gene regulatory network of Escherichia coli
Gene regulatory networks (GRN) are being studied with increasingly precise
quantitative tools and can provide a testing ground for ideas regarding the
emergence and evolution of complex biological networks. We analyze the global
statistical properties of the transcriptional regulatory network of the
prokaryote Escherichia coli, identifying each operon with a node of the
network. We propose a null model for this network using the content-based
approach applied earlier to the eukaryote Saccharomyces cerevisiae (Balcan et
al., 2007). Random sequences that represent promoter regions and binding
sequences are associated with the nodes. The length distributions of these
sequences are extracted from the relevant databases. The network is constructed
by testing for the occurrence of binding sequences within the promoter regions.
The ensemble of emergent networks yields an exponentially decaying in-degree
distribution and a putative power law dependence for the out-degree
distribution with a flat tail, in agreement with the data. The clustering
coefficient, degree-degree correlation, rich club coefficient and k-core
visualization all agree qualitatively with the empirical network to an extent
not yet achieved by any other computational model, to our knowledge. The
significant statistical differences can point the way to further research into
non-adaptive and adaptive processes in the evolution of the E. coli GRN. Comment: 58 pages, 3 tables, 22 figures. In press, Journal of Theoretical
Biology (2009)
Range-Limited Heaps' Law for Functional DNA Words in the Human Genome
Heaps' or Herdan's law is a linguistic law describing the relationship
between the vocabulary/dictionary size (type) and word count (token) as a
power-law function. Its existence in genomes with a certain definition of DNA
words is unclear, partly because the dictionary size in a genome could be much
smaller than that in a human language. We define a DNA word as a coding region
in a genome that codes for a protein domain. Using human chromosomes and
chromosome arms as individual samples, we establish the existence of Heaps' law
in the human genome within a limited range. Our definition of words in a genomic
or proteomic context is different from other definitions such as
over-represented k-mers which are much shorter in length. Although an
approximate power-law distribution of protein domain sizes due to gene
duplication and the related Zipf's law is well known, their translation to the
Heaps' law in DNA words is not automatic. Several other animal genomes are
shown herein also to exhibit range-limited Heaps' law with our definition of
DNA words, though with various exponents. When tokens were randomly sampled and
sample sizes reached the maximum level, a deviation from Heaps' law was
observed, but a quadratic regression in the log-log type-token plot fits the data
perfectly. Investigation of the type-token plot and its regression coefficients
could provide an alternative narrative of reusage and redundancy of protein
domains as well as creation of new protein domains from a linguistic
perspective. Comment: 7 figures
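The type-token relationship described in this abstract can be sketched as a running count of distinct words followed by a straight-line fit in log-log space. This is a minimal illustration under stated assumptions: the tokens stand in for protein-domain occurrences, the function names `type_token_curve` and `fit_power_law` are hypothetical, and only the linear (power-law) fit is shown, not the quadratic regression the abstract also mentions.

```python
import math


def type_token_curve(tokens):
    """Return (number of tokens seen, number of distinct types seen) after each token."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve


def fit_power_law(curve):
    """Least-squares fit of V = K * n**beta in log-log space; returns (K, beta).

    Needs at least two points with different token counts.
    """
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    return math.exp(mean_y - beta * mean_x), beta
```

An exponent close to 1 means almost every new token is a new type, while a smaller exponent indicates heavier reuse of existing types, which is the sense in which the fitted coefficients could describe domain reusage.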
On overabundant words and their application to biological sequence analysis
The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986, [1]). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017, [2]). In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of x whose longest infix is the label of an explicit node of T is no more than 3n−4. We further show that the presented algorithm is time-optimal by proving that O(n) is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms.
Optimal Computation of Avoided Words
The deviation of the observed frequency of a word w from its expected frequency in a given sequence x is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the standard deviation of w, denoted by std(w), effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word w of length k>2 is a ρ-avoided word in x if std(w)≤ρ, for a given threshold ρ<0. Notice that such a word may be completely absent from x. Hence computing all such words naïvely can be a very time-consuming procedure, in particular for large k. In this article, we propose an O(n)-time and O(n)-space algorithm to compute all ρ-avoided words of length k in a given sequence x of length n over a fixed-sized alphabet. We also present a time-optimal O(σn)-time algorithm to compute all ρ-avoided words (of any length) in a sequence of length n over an integer alphabet of size σ. We provide a tight asymptotic upper bound for the number of ρ-avoided words over an integer alphabet and the expected length of the longest one. We make available an implementation of our algorithm. Experimental results, using both real and synthetic data, show the efficiency of our implementation.
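The naive computation that this abstract calls time-consuming can be sketched directly. This is not the O(n)-time algorithm of the paper. The abstract does not spell out the formulas, so the sketch assumes the commonly cited definitions attributed to Brendel et al.: E(w) = f(prefix) · f(suffix) / f(infix) and std(w) = (f(w) − E(w)) / max(√E(w), 1), where f counts occurrences in x. The function names are hypothetical.

```python
import math
from itertools import product


def occurrences(x, w):
    """Number of (possibly overlapping) occurrences of w in x."""
    return sum(1 for i in range(len(x) - len(w) + 1) if x.startswith(w, i))


def std(x, w):
    """Assumed deviation of w in x: (f(w) - E(w)) / max(sqrt(E(w)), 1)."""
    f_infix = occurrences(x, w[1:-1])
    if f_infix == 0:
        # Prefix and suffix are then absent too; E(w) is taken as 0 (assumption).
        return 0.0
    expected = occurrences(x, w[:-1]) * occurrences(x, w[1:]) / f_infix
    return (occurrences(x, w) - expected) / max(math.sqrt(expected), 1.0)


def avoided_words(x, k, rho):
    """All words of length k > 2 over the letters of x with std(w) <= rho < 0.

    Enumerates every one of the sigma**k candidate words, including words
    absent from x, which is why this approach does not scale to large k.
    """
    if k <= 2 or rho >= 0:
        raise ValueError("requires k > 2 and rho < 0")
    alphabet = sorted(set(x))
    candidates = ("".join(letters) for letters in product(alphabet, repeat=k))
    return [w for w in candidates if std(x, w) <= rho]
```

For example, in x = "ABXBCABXBC" the word "ABC" never occurs although "AB" and "BC" each occur twice and "B" four times, so under the assumed definitions E("ABC") = 1 and std("ABC") = −1, which makes it ρ-avoided for ρ = −0.5.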
