Species-level functional profiling of metagenomes and metatranscriptomes.
Functional profiles of microbial communities are typically generated using comprehensive metagenomic or metatranscriptomic sequence read searches, which are time-consuming, prone to spurious mapping, and often limited to community-level quantification. We developed HUMAnN2, a tiered search strategy that enables fast, accurate, and species-resolved functional profiling of host-associated and environmental communities. HUMAnN2 identifies a community's known species, aligns reads to their pangenomes, performs translated search on unclassified reads, and finally quantifies gene families and pathways. Relative to pure translated search, HUMAnN2 is faster and produces more accurate gene family profiles. We applied HUMAnN2 to study clinal variation in marine metabolism, ecological contribution patterns among human microbiome pathways, variation in species' genomic versus transcriptional contributions, and strain profiling. Further, we introduce 'contributional diversity' to explain patterns of ecological assembly across different microbial community types
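The tiered strategy described in this abstract can be illustrated with a toy sketch. This is not the HUMAnN2 implementation or its API: the function name `tiered_profile`, the dictionary-based "hit" tables that stand in for nucleotide alignment and translated search, and the labels `unclassified` and `UNMAPPED` are all assumptions made for illustration.

```python
from collections import Counter


def tiered_profile(reads, detected_species, pangenome_hits, translated_hits):
    """Toy two-tier read assignment (illustrative only, not HUMAnN2 itself).

    reads: iterable of read identifiers.
    detected_species: set of species assumed to have been identified first.
    pangenome_hits: dict read -> (species, gene_family); a stand-in for
        nucleotide alignment against pangenomes.
    translated_hits: dict read -> gene_family; a stand-in for translated search.
    Returns a Counter keyed by (gene_family, species).
    """
    counts = Counter()
    for read in reads:
        hit = pangenome_hits.get(read)
        # Tier 1: accept a pangenome hit only for a species that was detected.
        if hit is not None and hit[0] in detected_species:
            species, family = hit
            counts[(family, species)] += 1
            continue
        # Tier 2: fall back to translated search for still-unclassified reads.
        family = translated_hits.get(read)
        if family is not None:
            counts[(family, "unclassified")] += 1
        else:
            counts[("UNMAPPED", "unclassified")] += 1
    return counts


if __name__ == "__main__":
    profile = tiered_profile(
        reads=["r1", "r2", "r3", "r4"],
        detected_species={"sp_A"},
        pangenome_hits={"r1": ("sp_A", "fam1"), "r2": ("sp_B", "fam2")},
        translated_hits={"r2": "fam2", "r3": "fam3"},
    )
    for key, n in sorted(profile.items()):
        print(key, n)
```

The point of the ordering is that only reads left unexplained by the cheaper, species-specific tier reach the more expensive translated search, which is how the abstract motivates the gain in speed over pure translated search.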
Rule-based knowledge aggregation for large-scale protein sequence analysis of influenza A viruses
Background: The explosive growth of biological data provides opportunities for new statistical and comparative analyses of large information sets, such as alignments comprising tens of thousands of sequences. In such studies, sequence annotations frequently play an essential role, and reliable results depend on metadata quality. However, the semantic heterogeneity and annotation inconsistencies in biological databases greatly increase the complexity of aggregating and cleaning metadata. Manual curation of datasets, traditionally favoured by life scientists, is impractical for studies involving thousands of records. In this study, we investigate quality issues that affect major public databases, and quantify the effectiveness of an automated metadata extraction approach that combines structural and semantic rules. We applied this approach to more than 90,000 influenza A records, to annotate sequences with protein name, virus subtype, isolate, host, geographic origin, and year of isolation. Results: Over 40,000 annotated Influenza A protein sequences were collected by combining information from more than 90,000 documents from NCBI public databases. Metadata values were automatically extracted, aggregated and reconciled from several document fields by applying user-defined structural rules. For each property, values were recovered from ≥88.8% of records, with accuracy exceeding 96% in most cases. Because of semantic heterogeneity, each property required up to six different structural rules to be combined. Significant quality differences between databases were found: GenBank documents yield values more reliably than documents extracted from GenPept. Using a simple set of semantic rules and a reasoner, we reconstructed relationships between sequences from the same isolate, thus identifying 7640 isolates. Validation of isolate metadata against a simple ontology highlighted more than 400 inconsistencies, leading to over 3,000 property value corrections. 
Conclusion: To overcome the quality issues inherent in public databases, automated knowledge aggregation with embedded intelligence is needed for large-scale analyses. Our results show that user-controlled intuitive approaches, based on combination of simple rules, can reliably automate various curation tasks, reducing the need for manual corrections to approximately 5% of the records. Emerging semantic technologies possess desirable features to support today's knowledge aggregation tasks, with a potential to bring immediate benefits to this field. © 2006 Brahmachary et al; licensee BioMed Central Ltd
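The idea of combining several structural rules per property can be sketched as an ordered fallback over document fields. This is a minimal illustration, not the authors' rule engine: the field names (`subtype`, `organism`, `definition`), the regular expressions, and the helper `extract` are hypothetical.

```python
import re

# Hypothetical ordered rules for one property (virus subtype): each rule names
# a record field and a pattern whose first group captures the value.
SUBTYPE_RULES = [
    ("subtype", re.compile(r"^(H\d+N\d+)$")),       # dedicated field, if present
    ("organism", re.compile(r"\((H\d+N\d+)\)")),    # e.g. "...1997(H5N1))"
    ("definition", re.compile(r"\b(H\d+N\d+)\b")),  # free-text fallback
]


def extract(record, rules):
    """Return (value, source_field) from the first rule that matches."""
    for field, pattern in rules:
        text = record.get(field)
        if not text:
            continue
        match = pattern.search(text)
        if match:
            return match.group(1), field
    return None, None


if __name__ == "__main__":
    record = {
        "organism": "Influenza A virus (A/chicken/Hong Kong/220/1997(H5N1))",
        "definition": "hemagglutinin",
    }
    print(extract(record, SUBTYPE_RULES))
```

Returning the source field alongside the value makes it possible to reconcile conflicting values later and to compare how reliably each field, or each database, yields a given property.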
Large-scale comparative genomic ranking of taxonomically restricted genes (TRGs) in bacterial and archaeal genomes
BACKGROUND: Lineage-specific, or taxonomically restricted genes (TRGs), especially those that are species and strain-specific, are of special interest because they are expected to play a role in defining exclusive ecological adaptations to particular niches. Despite this, they are relatively poorly studied and little understood, in large part because many are still orphans or only have homologues in very closely related isolates. This lack of homology confounds attempts to establish the likelihood that a hypothetical gene is expressed and, if so, to determine the putative function of the protein. METHODOLOGY/PRINCIPAL FINDINGS: We have developed "QIPP" ("Quality Index for Predicted Proteins"), an index that scores the "quality" of a protein based on non-homology-based criteria. QIPP can be used to assign a value between zero and one to any protein based on comparing its features to other proteins in a given genome. We have used QIPP to rank the predicted proteins in the proteomes of Bacteria and Archaea. This ranking reveals that there is a large amount of variation in QIPP scores, and identifies many high-scoring orphans as potentially "authentic" (expressed) orphans. There are significant differences in the distributions of QIPP scores between orphan and non-orphan genes for many genomes and a trend for less well-conserved genes to have lower QIPP scores. CONCLUSIONS: The implication of this work is that QIPP scores can be used to further annotate predicted proteins with information that is independent of homology. Such information can be used to prioritize candidates for further analysis. Data generated for this study can be found in the OrphanMine at http://www.genomics.ceh.ac.uk/orphan_mine
Novel cyclic di-GMP effectors of the YajQ protein family control bacterial virulence
Bis-(3′,5′) cyclic di-guanylate (cyclic di-GMP) is a key bacterial second messenger that is implicated in the regulation of many critical processes that include motility, biofilm formation and virulence. Cyclic di-GMP influences diverse functions through interaction with a range of effectors. Our knowledge of these effectors and their different regulatory actions is far from complete, however. Here we have used an affinity pull-down assay using cyclic di-GMP-coupled magnetic beads to identify cyclic di-GMP binding proteins in the plant pathogen Xanthomonas campestris pv. campestris (Xcc). This analysis identified XC_3703, a protein of the YajQ family, as a potential cyclic di-GMP receptor. Isothermal titration calorimetry showed that the purified XC_3703 protein bound cyclic di-GMP with a high affinity (Kd ~2 μM). Mutation of XC_3703 led to reduced virulence of Xcc to plants and alteration in biofilm formation. Yeast two-hybrid and far-western analyses showed that XC_3703 was able to interact with XC_2801, a transcription factor of the LysR family. Mutation of XC_2801 and XC_3703 had partially overlapping effects on the transcriptome of Xcc, and both affected virulence. Electromobility shift assays showed that XC_3703 positively affected the binding of XC_2801 to the promoters of target virulence genes, an effect that was reversed by cyclic di-GMP. Genetic and functional analysis of YajQ family members from the human pathogens Pseudomonas aeruginosa and Stenotrophomonas maltophilia showed that they also specifically bound cyclic di-GMP and contributed to virulence in model systems. The findings thus identify a new class of cyclic di-GMP effector that regulates bacterial virulence.
Automatically extracting functionally equivalent proteins from SwissProt
In summary, FOSTA provides an automated analysis of annotations in UniProtKB/Swiss-Prot to enable groups of proteins already annotated as functionally equivalent, to be extracted. Our results demonstrate that the vast majority of UniProtKB/Swiss-Prot functional annotations are of high quality, and that FOSTA can interpret annotations successfully. Where FOSTA is not successful, we are able to highlight inconsistencies in UniProtKB/Swiss-Prot annotation. Most of these would have presented equal difficulties for manual interpretation of annotations. We discuss limitations and possible future extensions to FOSTA, and recommend changes to the UniProtKB/Swiss-Prot format, which would facilitate text-mining of UniProtKB/Swiss-Prot
Role of the PAS sensor domains in the Bacillus subtilis sporulation kinase KinA
Histidine kinases are sophisticated molecular sensors that are used by bacteria to detect and respond to a multitude of environmental signals. KinA is the major histidine kinase required for initiation of sporulation upon nutrient deprivation in Bacillus subtilis. KinA has a large N-terminal region (residues 1 to 382) that is uniquely composed of three tandem Per-ARNT-Sim (PAS) domains that have been proposed to constitute a sensor module. To further enhance our understanding of this "sensor" region, we defined the boundaries that give rise to the minimal autonomously folded PAS domains and analyzed their homo- and heteroassociation properties using analytical ultracentrifugation, nuclear magnetic resonance (NMR) spectroscopy, and multiangle laser light scattering. We show that PAS(A) self-associates very weakly, while PAS(C) is primarily a monomer. In contrast, PAS(B) forms a stable dimer (Kd [dissociation constant] o
Signatures of arithmetic simplicity in metabolic network architecture
Metabolic networks perform some of the most fundamental functions in living cells, including energy transduction and building block biosynthesis. While these are the best characterized networks in living systems, understanding their evolutionary history and complex wiring constitutes one of the most fascinating open questions in biology, intimately related to the enigma of life's origin itself. Is the evolution of metabolism subject to general principles, beyond the unpredictable accumulation of multiple historical accidents? Here we search for such principles by applying to an artificial chemical universe some of the methodologies developed for the study of genome scale models of cellular metabolism. In particular, we use metabolic flux constraint-based models to exhaustively search for artificial chemistry pathways that can optimally perform an array of elementary metabolic functions. Despite the simplicity of the model employed, we find that the ensuing pathways display a surprisingly rich set of properties, including the existence of autocatalytic cycles and hierarchical modules, the appearance of universally preferable metabolites and reactions, and a logarithmic trend of pathway length as a function of input/output molecule size. Some of these properties can be derived analytically, borrowing methods previously used in cryptography. In addition, by mapping biochemical networks onto a simplified carbon atom reaction backbone, we find that several of the properties predicted by the artificial chemistry model hold for real metabolic networks. These findings suggest that optimality principles and arithmetic simplicity might lie beneath some aspects of biochemical complexity.
The genetic organisation of prokaryotic two-component system signalling pathways
Background: Two-component systems (TCSs) are modular and diverse signalling pathways, involving a stimulus-responsive transfer of phosphoryl groups from transmitter to partner receiver domains. TCS gene and domain organisation are both potentially informative regarding biological function, interaction partnerships and molecular mechanisms. However, there is currently little understanding of the relationships between domain architecture, gene organisation and TCS pathway structure. Results: Here we classify the gene and domain organisation of TCS gene loci from 1405 prokaryotic replicons (>40,000 TCS proteins). We find that 200 bp is the most appropriate distance cut-off for defining whether two TCS genes are functionally linked. More than 90% of all TCS gene loci encode just one or two transmitter and/or receiver domains; however, numerous other geometries exist, often with large numbers of encoded TCS domains. Such information provides insights into the distribution of TCS domains between genes, and within genes. As expected, the organisation of TCS genes and domains is affected by phylogeny, and plasmid-encoded TCS exhibit differences in organisation from their chromosomally-encoded counterparts. Conclusions: We provide here an overview of the genomic and genetic organisation of TCS domains, as a resource for further research. We also propose novel metrics that build upon TCS gene/domain organisation data and allow comparisons between genomic complements of TCSs. In particular, 'percentage orphaned TCS genes' (or 'Dissemination') and 'percentage of complex loci' (or 'Sophistication') appear to be useful discriminators, and to reflect mechanistic aspects of TCS organisation not captured by existing metrics.
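The 200 bp cut-off and the 'percentage orphaned TCS genes' metric can be illustrated with a short sketch. The abstract does not give the exact procedure, so several things here are assumptions: genes are grouped on a single replicon by the gap between consecutive genes, strand is ignored, and a gene counts as orphaned when it forms a locus on its own. The names `group_loci` and `percent_orphaned` and the gene coordinates are hypothetical.

```python
def group_loci(genes, max_gap=200):
    """Group genes into loci when the gap to the previous gene is <= max_gap bp.

    genes: list of (name, start, end) tuples on one replicon (strand ignored).
    Returns a list of loci, each a list of gene names in coordinate order.
    """
    loci = []
    prev_end = None
    for name, start, end in sorted(genes, key=lambda g: g[1]):
        if prev_end is not None and start - prev_end <= max_gap:
            loci[-1].append(name)
            # Keep the furthest end seen so nested genes do not shrink the locus.
            prev_end = max(prev_end, end)
        else:
            loci.append([name])
            prev_end = end
    return loci


def percent_orphaned(loci):
    """Percentage of genes that sit alone in their locus (assumed definition)."""
    total = sum(len(locus) for locus in loci)
    orphans = sum(1 for locus in loci if len(locus) == 1)
    return 100.0 * orphans / total if total else 0.0


if __name__ == "__main__":
    genes = [
        ("hkA", 100, 1500),
        ("rrA", 1550, 2200),
        ("hkB", 5000, 6400),
        ("rrC", 9000, 9700),
        ("hkC", 9900, 11000),
    ]
    loci = group_loci(genes)
    print(loci)
    print(percent_orphaned(loci))
```

A real analysis would also need to handle circular replicons and decide how strand and intervening non-TCS genes affect linkage; the sketch leaves those choices open.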
Dimerisation induced formation of the active site and the identification of three metal sites in EAL-phosphodiesterases
The bacterial second messenger cyclic di-3′,5′-guanosine monophosphate (c-di-GMP) is a key regulator of bacterial motility and virulence. As high levels of c-di-GMP are associated with the biofilm lifestyle, c-di-GMP hydrolysing phosphodiesterases (PDEs) have been identified as key targets to aid development of novel strategies to treat chronic infection by exploiting biofilm dispersal. We have studied the EAL signature motif-containing phosphodiesterase domains from the Pseudomonas aeruginosa proteins PA3825 (PA3825EAL) and PA1727 (MucREAL). Different dimerisation interfaces allow us to identify interface independent principles of enzyme regulation. Unlike previously characterised two-metal binding EAL-phosphodiesterases, PA3825EAL in complex with pGpG provides a model for a third metal site. The third metal is positioned to stabilise the negative charge of the 5′-phosphate, and thus three metals could be required for catalysis in analogy to other nucleases. This newly uncovered variation in metal coordination may provide a further level of bacterial PDE regulation
Whole genome sequence and manual annotation of Clostridium autoethanogenum, an industrially relevant bacterium
Clostridium autoethanogenum is an acetogenic bacterium capable of producing high value commodity chemicals and biofuels from the C1 gases present in synthesis gas. This common industrial waste gas can act as the sole energy and carbon source for the bacterium that converts the low value gaseous components into cellular building blocks and industrially relevant products via the action of the reductive acetyl-CoA (Wood-Ljungdahl) pathway. Current research efforts are focused on the enhancement and extension of product formation in this organism via synthetic biology approaches. However, crucial to metabolic modelling and directed pathway engineering is a reliable and comprehensively annotated genome sequence
- …
