89 research outputs found

A study on the correlation of nucleotide skews and the positioning of the origin of replication: different modes of replication in bacterial species

Author: Almirantis Yannis
Nikolaou Christoforos
Publication venue: Oxford University Press
Publication date: 30/11/2005

Deviations from Chargaff's 2nd parity rule, according to which A∼T and G∼C in single stranded DNA, have been associated with replication as well as with transcription in prokaryotes. Based on observations regarding mainly the transcription-replication co-linearity in a large number of prokaryotic species, we formulate the hypothesis that the replication procedure may follow different modes between genomes throughout which the skews clearly follow different patterns. We draw the conclusion that multiple functional sites of origin of replication may exist in the genomes of most archaea and in some exceptional cases of eubacteria, while in the majority of eubacteria, replication occurs through a single fixed origin.

Crossref

PubMed Central
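
The nucleotide skews mentioned in the abstract above can be illustrated with a cumulative GC-skew curve. This is a minimal sketch, not the authors' method: the function names, the window size, and the use of GC skew alone are assumptions made for illustration. In many bacterial genomes with a single origin, the minimum of such a curve tends to lie near the origin of replication and the maximum near the terminus.

```python
def cumulative_gc_skew(seq: str, window: int = 1000) -> list[float]:
    """Running sum of (G - C) / (G + C) over non-overlapping windows.

    `window` is an illustrative parameter, not a value from the paper.
    """
    seq = seq.upper()
    total = 0.0
    curve = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window without G or C contributes zero skew.
        total += (g - c) / (g + c) if g + c else 0.0
        curve.append(total)
    return curve


def skew_extrema(curve: list[float]) -> tuple[int, int]:
    """Return (index of minimum, index of maximum) of a non-empty curve."""
    lowest = min(range(len(curve)), key=curve.__getitem__)
    highest = max(range(len(curve)), key=curve.__getitem__)
    return lowest, highest
```

A curve with several pronounced local minima, rather than one clear minimum, would be the kind of pattern the abstract associates with multiple origins; deciding that in practice needs more care than this sketch provides.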

Information decomposition of symbolic sequences

Author: E.V. Korotkov
M.A. Korotkova
N.A. Kudryashov
Publication venue: Elsevier BV
Publication date: 17/02/2003

We developed a non-parametric method of Information Decomposition (ID) of the content of any symbolic sequence. The method is based on calculating the Shannon mutual information between the analyzed sequence and artificial symbolic sequences, and it reveals latent periodicity in any symbolic sequence. We show that the ID method remains stable when a large number of random letter changes are made in the analyzed sequence. We demonstrate the possibilities of the method by analyzing poems as well as DNA and protein sequences. We show that many DNA and amino acid sequences exhibit latent periodicity of different types and lengths. The possible origin of latent periodicity in different symbolic sequences is discussed. Comment: 18 pages, 8 figures.

arXiv.org e-Print Archive

Crossref

University of Groningen
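
The mutual-information idea in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the published ID method: it assumes the artificial sequence for a trial period p is simply the position index modulo p, and the function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(seq: str, period: int) -> float:
    """Shannon mutual information (bits) between the symbols of `seq`
    and an artificial periodic sequence 0, 1, ..., period-1, 0, 1, ...
    (the choice of artificial sequence is an assumption of this sketch).
    """
    n = len(seq)
    joint = Counter((i % period, s) for i, s in enumerate(seq))
    phase = Counter(i % period for i in range(n))
    symbol = Counter(seq)
    return sum(
        (c / n) * math.log2(c * n / (phase[p] * symbol[s]))
        for (p, s), c in joint.items()
    )


def information_spectrum(seq: str, max_period: int) -> dict[int, float]:
    """Mutual information for every trial period from 2 to max_period."""
    return {p: mutual_information(seq, p) for p in range(2, max_period + 1)}
```

Peaks in the spectrum suggest candidate periods. For short sequences the estimate is biased upward even for random text, so a significance check against shuffled sequences would be needed before drawing conclusions.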

Optimal Computation of Overabundant Words

Author: Almirantis Yannis
Charalampopoulos Panagiotis
Gao Jia
Iliopoulos Costas S.
Mohamed Manal
Pissis Solon P.
Polychronopoulos Dimitris
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 17th International Workshop on Algorithms in Bioinformatics (WABI 2017)
Publication date: 01/01/2017

The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017). In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of x whose longest infix is the label of an explicit node of T is no more than 3n-4. We further show that the presented algorithm is time-optimal by proving that O(n) is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms.

arXiv.org e-Print Archive

DROPS Dagstuhl Research Online Publication Server
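
To make the statistical model in the abstract above concrete, here is a naive brute-force sketch. It is not the O(n) suffix-tree algorithm of the paper. It assumes the commonly cited form of the model: expectation E(w) = f(w_p) f(w_s) / f(w_i) and deviation std(w) = (f(w) − E(w)) / max(√E(w), 1), where w_p, w_s and w_i are the longest proper prefix, suffix and infix of w and f counts overlapping occurrences. The function names and the threshold `rho` are illustrative.

```python
import math


def occurrences(x: str, w: str) -> int:
    """Number of (possibly overlapping) occurrences of w in x."""
    return sum(1 for i in range(len(x) - len(w) + 1) if x.startswith(w, i))


def deviation(x: str, w: str) -> float:
    """std(w) under the assumed model; w must have length at least 3."""
    infix_count = occurrences(x, w[1:-1])
    expected = (
        occurrences(x, w[:-1]) * occurrences(x, w[1:]) / infix_count
        if infix_count
        else 0.0
    )
    return (occurrences(x, w) - expected) / max(math.sqrt(expected), 1.0)


def overabundant_words(x: str, rho: float) -> list[str]:
    """All factors w of x with |w| >= 3 and std(w) >= rho (rho > 0)."""
    factors = {x[i:j] for i in range(len(x)) for j in range(i + 3, len(x) + 1)}
    return sorted(w for w in factors if deviation(x, w) >= rho)
```

This enumeration takes time polynomial in n with a large exponent and is only suitable for toy inputs; the point of the paper is that the same set can be found in linear time.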

Rich dynamical behaviors from a digital reversal operation

Author: Almirantis Yannis
Li Wentian
Publication date: 05/08/2024

An operation that maps one natural number to another can be considered a dynamical system in $\mathbb{N}^+$. Some such systems, e.g. the mapping in the so-called 3x+1 problem proposed by Collatz, are conjectured to have a single global attractor, whereas others, e.g. linear congruence, could be ergodic. Here we demonstrate that an operation based on digital reversal has a spectrum of dynamical behaviors, including a 2-cycle, a 12-cycle, periodic attractors with other cycle lengths, and diverging limiting dynamics that escape to infinity. This dynamical system has an infinite number of cyclic attractors and may have an unlimited number of cycle lengths. It also has a potentially infinite number of diverging trajectories with a recurrent pattern repeating every 8 steps. Although the transient time before settling on a limiting dynamics is relatively short, we speculate that transient times may not have an upper bound. Comment: 5 figures.

arXiv.org e-Print Archive

Information content based model for the topological properties of the gene regulatory network of Escherichia coli

Author: Ayşe Erzan
Berkin Malkoç
Duygu Balcan
Publication venue: Elsevier BV
Publication date: 29/12/2009

Gene regulatory networks (GRNs) are being studied with increasingly precise quantitative tools and can provide a testing ground for ideas regarding the emergence and evolution of complex biological networks. We analyze the global statistical properties of the transcriptional regulatory network of the prokaryote Escherichia coli, identifying each operon with a node of the network. We propose a null model for this network using the content-based approach applied earlier to the eukaryote Saccharomyces cerevisiae (Balcan et al., 2007). Random sequences that represent promoter regions and binding sequences are associated with the nodes. The length distributions of these sequences are extracted from the relevant databases. The network is constructed by testing for the occurrence of binding sequences within the promoter regions. The ensemble of emergent networks yields an exponentially decaying in-degree distribution and a putative power-law dependence for the out-degree distribution with a flat tail, in agreement with the data. The clustering coefficient, degree-degree correlation, rich-club coefficient and k-core visualization all agree qualitatively with the empirical network to an extent not yet achieved by any other computational model, to our knowledge. The significant statistical differences can point the way to further research into non-adaptive and adaptive processes in the evolution of the E. coli GRN. Comment: 58 pages, 3 tables, 22 figures. In press, Journal of Theoretical Biology (2009).

arXiv.org e-Print Archive

Crossref
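
The construction rule described in the abstract above (draw an edge when a binding sequence occurs within a promoter region) can be sketched in a few lines. This is a toy illustration, not the authors' model: it uses fixed sequence lengths instead of the database-derived length distributions, and all function names and default parameters are assumptions.

```python
import random
from collections import Counter


def random_sequence(rng: random.Random, length: int, alphabet: str = "ACGT") -> str:
    """Uniform random sequence; the alphabet is an illustrative choice."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def edges_from_sequences(binding_seqs: list[str], promoters: list[str]) -> list[tuple[int, int]]:
    """Edge i -> j whenever the binding sequence of node i occurs in the promoter of node j."""
    return [
        (i, j)
        for i, binder in enumerate(binding_seqs)
        for j, promoter in enumerate(promoters)
        if binder in promoter
    ]


def degree_distributions(edges: list[tuple[int, int]]) -> tuple[Counter, Counter]:
    """Return (in-degree, out-degree) counters keyed by node index."""
    in_deg = Counter(j for _, j in edges)
    out_deg = Counter(i for i, _ in edges)
    return in_deg, out_deg


def sample_network(n_nodes: int, promoter_len: int = 50, binding_len: int = 4, seed: int = 0):
    """One random network with fixed lengths (an assumption of this sketch)."""
    rng = random.Random(seed)
    promoters = [random_sequence(rng, promoter_len) for _ in range(n_nodes)]
    binders = [random_sequence(rng, binding_len) for _ in range(n_nodes)]
    return edges_from_sequences(binders, promoters)
```

Repeating `sample_network` over many seeds gives an ensemble whose degree statistics could then be compared with an empirical network.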

Range-Limited Heaps' Law for Functional DNA Words in the Human Genome

Author: Almirantis Yannis
Li Wentian
Provata Astero
Publication date: 17/06/2024

Heaps' or Herdan's law is a linguistic law stating that vocabulary (dictionary) size, measured in types, is a power-law function of word count, measured in tokens. Whether it holds in genomes under a given definition of DNA words is unclear, partly because the dictionary size in a genome could be much smaller than that in a human language. We define a DNA word as a coding region in a genome that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps' law in the human genome within a limited range. Our definition of words in a genomic or proteomic context differs from other definitions, such as over-represented k-mers, which are much shorter. Although an approximate power-law distribution of protein domain sizes due to gene duplication, and the related Zipf's law, are well known, their translation to Heaps' law in DNA words is not automatic. Several other animal genomes are also shown to exhibit a range-limited Heaps' law under our definition of DNA words, though with various exponents. When tokens were randomly sampled and sample sizes reached the maximum level, a deviation from Heaps' law was observed, but a quadratic regression in the log-log type-token plot fits the data perfectly. Investigating the type-token plot and its regression coefficients could provide an alternative narrative of the reuse and redundancy of protein domains, as well as the creation of new protein domains, from a linguistic perspective. Comment: 7 figures.

arXiv.org e-Print Archive
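
The type-token relationship in the abstract above can be sketched as a power-law fit V(N) ≈ K·N^β in log-log space. This is a minimal illustration, assuming the tokens are already available as a list of labels (for example, protein-domain identifiers); the function names are hypothetical and the fit is an ordinary least-squares line, not the quadratic regression the authors also report.

```python
import math


def type_token_curve(tokens: list) -> list[tuple[int, int]]:
    """(number of tokens read, number of distinct types seen) after each token."""
    seen = set()
    curve = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((n, len(seen)))
    return curve


def fit_heaps(curve: list[tuple[int, int]]) -> tuple[float, float]:
    """Least-squares fit of log V = log K + beta * log N; returns (K, beta).

    Needs at least two points with different N.
    """
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    k = math.exp(mean_y - beta * mean_x)
    return k, beta
```

An exponent β close to 1 means almost every token is a new type, while smaller values indicate heavier reuse. A range-limited law, as described in the abstract, would show up as a good linear fit over only part of the curve.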

On overabundant words and their application to biological sequence analysis

Author: Almirantis Yannis
Charalampopoulos Panagiotis
Gao Jia
Iliopoulos Costas S.
Mohamed Manal
Pissis Solon P.
Polychronopoulos Dimitris
Publication date: 12/09/2018

The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986, [1]). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017, [2]). In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of x whose longest infix is the label of an explicit node of T is no more than 3n−4. We further show that the presented algorithm is time-optimal by proving that O(n) is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms.

King's Research Portal

Optimal Computation of Avoided Words

Author: Almirantis Yannis
Charalampopoulos Panagiotis
Gao Jia
Iliopoulos Costas S.
Mohamed Manal
Pissis Solon P.
Polychronopoulos Dimitris
Publication date: 01/01/2016

The deviation of the observed frequency of a word w from its expected frequency in a given sequence x is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the standard deviation of w, denoted by std(w), effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word w of length k>2 is a ρ-avoided word in x if std(w)≤ρ, for a given threshold ρ<0. Notice that such a word may be completely absent from x. Hence computing all such words naïvely can be a very time-consuming procedure, in particular for large k. In this article, we propose an O(n)-time and O(n)-space algorithm to compute all ρ-avoided words of length k in a given sequence x of length n over a fixed-sized alphabet. We also present a time-optimal O(σn)-time algorithm to compute all ρ-avoided words (of any length) in a sequence of length n over an integer alphabet of size σ. We provide a tight asymptotic upper bound for the number of ρ-avoided words over an integer alphabet and the expected length of the longest one. We make available an implementation of our algorithm. Experimental results, using both real and synthetic data, show the efficiency of our implementation.

Crossref

King's Research Portal
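
The remark in the abstract above that a ρ-avoided word may be completely absent from x can be illustrated with a small k-mer counting sketch for a fixed length k. It is not the suffix-tree algorithm of the paper. It assumes the deviation has the form std(w) = (f(w) − E(w)) / max(√E(w), 1) with E(w) = f(w_p) f(w_s) / f(w_i), and it relies on the observation that std(w) < 0 requires E(w) > 0, so every candidate is an occurring (k−1)-mer extended by one letter. All names are illustrative.

```python
import math
from collections import Counter


def kmer_counts(x: str, k: int) -> Counter:
    """Counts of all overlapping k-mers of x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))


def avoided_words(x: str, k: int, rho: float) -> dict[str, float]:
    """Map each rho-avoided word of length k (k > 2, rho < 0) to its std value.

    Words absent from x are included when their prefix and suffix
    occur often enough to make the expectation large.
    """
    if k <= 2 or rho >= 0:
        raise ValueError("need k > 2 and rho < 0")
    f_k = kmer_counts(x, k)
    f_k1 = kmer_counts(x, k - 1)
    f_k2 = kmer_counts(x, k - 2)
    result = {}
    for prefix in f_k1:
        for letter in sorted(set(x)):
            w = prefix + letter
            # The infix w[1:-1] is a suffix of an occurring prefix, so its count is >= 1.
            expected = f_k1[prefix] * f_k1[w[1:]] / f_k2[w[1:-1]]
            std = (f_k[w] - expected) / max(math.sqrt(expected), 1.0)
            if std <= rho:
                result[w] = std
    return result
```

For example, in a sequence where "AC" and "CG" are both frequent but "ACG" never occurs, "ACG" is reported as avoided even though it is not a factor of the sequence.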