165 research outputs found

    Applications of Evolutionary Bioinformatics in Basic and Biomedical Research

    With the revolutionary progress in sequencing technologies, computational biology emerged as a game-changing field which is applied in understanding molecular events of life for not only complementary but also exploratory purposes. Bioinformatics resources and tools significantly help in data generation, organization and analysis. However, there is still a need for developing new approaches built based on a biologist’s point of view. In protein bioinformatics, there are several fundamental problems such as (i) determining protein function; (ii) identifying protein-protein interactions; (iii) predicting the effect of amino acid variants. Here, I present three chapters addressing these problems from an evolutionary perspective. Firstly, I describe a novel search pipeline for protein domain identification. The algorithm chain provides sensitive domain assignments with the highest possible specificity. Secondly, I present a tool enabling large-scale visualization of presences and absences of proteins in hierarchically clustered genomes. This tool visualizes multi-layer information of any kind of genome-linked data with a special focus on domain architectures, enabling identification of coevolving domains/proteins, which can eventually help in identifying functionally interacting proteins. And finally, I propose an approach for distinguishing between benign and damaging missense mutations in a human disease by establishing the precise evolutionary history of the associated gene. This part introduces new criteria on how to determine functional orthologs via phylogenetic analysis. All three parts use comparative genomics and/or sequence analyses. Taken together, this study addresses important problems in protein bioinformatics and as a whole it can be utilized to describe proteins by their domains, coevolving partners and functionally important residues

    Establishing the precise evolutionary history of a gene improves prediction of disease-causing missense mutations

    PURPOSE: Predicting the phenotypic effects of mutations has become an important application in clinical genetic diagnostics. Computational tools evaluate the behavior of the variant over evolutionary time and assume that variations seen during the course of evolution are probably benign in humans. However, current tools do not take into account orthologous/paralogous relationships. Paralogs have dramatically different roles in Mendelian diseases. For example, whereas inactivating mutations in the NPC1 gene cause the neurodegenerative disorder Niemann-Pick C, inactivating mutations in its paralog NPC1L1 are not disease-causing and, moreover, are implicated in protection from coronary heart disease. METHODS: We identified major events in NPC1 evolution and revealed and compared orthologs and paralogs of the human NPC1 gene through phylogenetic and protein sequence analyses. We predicted whether an amino acid substitution affects protein function by reducing the organism’s fitness. RESULTS: Removing the paralogs and distant homologs improved the overall performance of categorizing disease-causing and benign amino acid substitutions. CONCLUSION: The results show that a thorough evolutionary analysis followed by identification of orthologs improves the accuracy in predicting disease-causing missense mutations. We anticipate that this approach will be used as a reference in the interpretation of variants in other genetic diseases as well. Genet Med 18 10, 1029–1036

    Characterizing the potential interplay between nucleotide excision repair and R-loops

    R-loops have been a focus of interest in genomics due to their non-canonical structures and unclear roles in the genomes of many organisms. While R-loops contribute to gene expression and efficient transcriptional termination, they cause genome instability under certain conditions. They are formed when an RNA anneals with its complementary DNA strand. A DNA:RNA hybrid is formed and the other strand of DNA is left single-stranded (ssDNA). To date, there has been no clear knowledge of R-loops’ tendency for UV damage formation or of how these DNA:RNA hybrid and ssDNA structures affect nucleotide excision repair (NER), the primary mechanism to cope with UV-induced DNA damage. Therefore, we aimed to shed light on the relationship between R-loops and UV damage occurrence and repair using R-loops' positions on human and Arabidopsis genomes, and Damage-seq and XR-seq data that provided positions of UV damages and repair events, respectively. By comparing the R-loop-forming locations with damage and repair occurrences, we observed that the repair efficiency on R-loops was better than in their surrounding regions. However, when ATAC-seq read count normalization eliminated the impact of R-loops being on open chromatin, lower repair efficiency on R-loop centers and 5’ regions, but higher repair efficiency on 3’ regions, were observed. Because this repair profile might not be valid for each R-loop, we created heatmaps of relative repair where we could group R-loops with similar repair profiles. As a result, four different relative repair profiles were observed. We also checked the damage occurrence on R-loops and saw that, in general, R-loops receive less damage than their surroundings, while there were also four different damage profiles within subgroups of R-loops. Further analysis will be conducted based on these results to explain what is behind differential repair and damage profiles on R-loops and to better understand the roles of R-loops on our genomes

    A novel phylogeny-dependent coevolution algorithm

    Effect of G-quadruplex formation on UV-induced damage and nucleotide excision repair

    G-quadruplexes (G4s) are non-B DNA structures formed by four or more guanine tetrads stabilized by a charged ion. The formation and cellular functions of G4s are widely studied. G4s are known to localize at telomeres and promoters with a range of functions including maintaining genome integrity at telomeres and suppressing or initiating transcription. Moreover, G4s can cause mutagenesis if they persist during replication [1]. Although thermodynamically stable G4s are linked to double-strand breaks and genome instability, their effect on UV-induced damages and the interplay between nucleotide excision repair and G4 formation are unclear. Aiming to uncover this relationship, we have obtained genome-wide G4 maps in HeLa cells using G4P-ChIP-seq data and the motif finder algorithms Quadron [2] and G4-Miner [3]. Then, we applied the Damage-seq and XR-seq methods to HeLa cells, which map UV-induced damages [(6-4)PPs and CPDs] and nucleotide excision repair of these damages, respectively [4]. Profiles of (6-4)PP and CPD damage around G4 regions suggest that the damage formation on the G4s was lower than in the neighboring regions. In contrast, we observed higher relative repair on G4s compared to the flanking regions. We are currently working on understanding the reasons behind these profiles, which will provide insights into the co-occurrence of G4s and UV-induced damage and the response of nucleotide excision repair at these sites

    Quantifying the Temporal Dynamics between Genome Folding and UV-induced DNA Damage Response

    Chromosome conformation has been typically linked to transcription and replication, which has been the focus of many studies. However, recognizing the extremely dynamic activity of chromatin contacts has shifted the emphasis of study in recent years toward its coupling in creating adaptive repair networks. Studies have been conducted to learn how the genome is protected from double-stranded breaks and different types of damage sources [1, 2], but considerably less has been done for cancer-causing bulky DNA lesions, which are generally introduced by UV damage and repaired by nucleotide-excision repair (NER). Using Hi-C contact maps we derived from UV-irradiated HeLa cells with 12, 30, and 60 minute recovery durations, we investigated 3D genome folding and its relation to NER activity. We examined the various hierarchical 3D genome architecture layers, including TADs, that change dynamically in response to UV damage. Under matching experimental conditions, we also generated RNA-seq samples. In addition, we provide a novel method for understanding the differential chromatin contacts for mutual comparison of two Hi-C contact maps with an emphasis on short-range interactions. We employ a Graph Neural Network (GNN) model to detect whether sub-interactions of a graph created from one Hi-C matrix are also retained in the other. Using this technique, we quantify the magnitude of differential change around each bin along the diagonal of an equally divided Hi-C contact map and decide whether a genomic region is differential. As part of this multi-omics strategy, we obtained Damage-seq [3] and XR-seq [4] samples to show NER activity in the specific genomic areas identified by differential time course analysis. Our work establishes evidence for the notable effects of UV-irradiation on 3D structure and genome integrity, and offers a model in which, when exposed to UV light, differential chromatin compaction is associated with dramatically elevated repair rates

    Dynamic maps of UV damage formation and repair for the human genome

    Nucleotide excision repair removes DNA damage caused by carcinogens, such as UV and anticancer drugs, such as cisplatin. We have developed two methods, high-sensitivity damage sequencing and excision repair sequencing that map the formation and repair of damage in the human genome at single-nucleotide resolution. The combination of dynamic damage and repair maps provides a holistic perspective of UV damage and repair of the human genome and has potential applications in cancer prevention and chemotherapy

    Differential damage and repair of DNA-adducts induced by anti-cancer drug cisplatin across mouse organs

    The platinum-based drug cisplatin is a widely used first-line therapy for several cancers. Cisplatin interacts with DNA mainly in the form of Pt-d(GpG) di-adduct, which stalls cell proliferation and activates DNA damage response. Although cisplatin shows a broad spectrum of anticancer activity, its utility is limited due to acquired drug resistance and toxicity to non-targeted tissues. Here, by integrating genome-wide high-throughput Damage-seq, XR-seq, and RNA-seq approaches, along with publicly available epigenomic data, we systematically study the genome-wide profiles of cisplatin damage formation and excision repair in mouse kidney, liver, lung and spleen. We find different DNA damage and repair spectra across mouse organs, which are associated with tissue-specific transcriptomic and epigenomic profiles. The framework and the multi-omics data we present here constitute an unbiased foundation for understanding the mechanisms of cellular response to cisplatin. Our approach should be applicable for studying drug resistance and for tailoring cancer chemotherapy regimens

    Reproducible and scalable data analysis on high performance computing

    We recently presented the PHACT tool (PHylogeny-Aware Computation of Tolerance) for assessing amino acid substitutions, which achieved superior predictive performance compared to widely adopted tools [1]. PHACT scores alterations using not only their frequency in the multiple sequence alignment (MSA), as most common tools do, but also the gene-based phylogenetic trees. PHACT’s inputs include the MSA of the protein, the phylogenetic tree estimated on that MSA and the probability distribution of amino acids at each ancestral node estimated from the tree. To assess the predictive performance of PHACT, we performed various experiments over a dataset that includes 20,546 proteins and 61,662 variants. In theory, analyzing a protein takes a single day using eight cores; thus, the amount of computation is 192 CPU hours per protein. The overall computation time to finish the analyses for the whole dataset (20,546 proteins) is 3.94M CPU hours. Using a single powerful computer with 64 cores and 512GB of memory would take around seven years, which is not practical. We completed the analyses within four months by using a High-Performance Computing (HPC) cluster. Performing an extensive reproducible and scalable data analysis for multiple proteins with various parameters on an HPC is not straightforward [3]. For example, 50 proteins with ten parameters and ten consecutive tasks mean 5000 independent jobs must be executed successfully. Each job (task) has different characteristics; some are CPU-intensive, and others are memory-intensive. Some jobs are completed within hours, while others take days or weeks. Moreover, HPC is a complex environment in which hundreds of servers run together and failures are not exceptional; thus, managing such a large number of jobs is not easy. A workflow tool, in which the analysis is defined by a set of rules that produce a set of output files from a set of input files, is a must. 
In addition to scalability, reproducibility is a critical requirement: other researchers should be able to obtain the same results at any time [4]. All the tools and software used during the analyses, the input files, and the computational facilities can be defined in a text file so that all environments can be deployed without additional effort. To satisfy all these requirements, we used a Snakemake workflow with the conda package manager because of Snakemake's human-readable, Python-based language, portability, integration with conda, automatic deployment, and ability to specify software dependencies. The PHACT framework specifies rules in a Snakefile. Rules decompose the workflow into small steps such as finding homologs of each query sequence (PSI-BLAST), performing multiple sequence alignment (MAFFT), or generating a maximum-likelihood phylogenetic tree (RAXML-NG, FASTTREE). Each rule has its model parameters, which can be set via a single configuration file (config/config.yml). A dry-run parameter can be used to check whether the workflow is adequately defined and to estimate the amount of calculation remaining. It summarizes the total number of jobs (rules) to be performed and the sets of input and output files used and created, respectively. For 2 query files, as given in Fig.1, 29 jobs will be executed. In addition to allowing the workload to run on a local computer with a limited number of query IDs, the PHACT framework is designed to analyze a bulk of query IDs in parallel using HPC. Most HPC clusters have a scheduler that handles the workload on compute nodes. To interact with a scheduler, users must normally prepare a bash script and submit it to the cluster. Snakemake can perform all of these steps automatically. Within this work, a valuable dataset that contains MSAs and phylogenetic trees, which amounts to more than 1M files and 1.6 TB in size, was created and shared with other researchers. 
All details such as documentation, scripts, tools, environments, and input proteins can be found on our GitHub page [5] and all results are published on our FTP server [6]
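
The compute budget quoted in this abstract can be checked with a few lines of arithmetic. The sketch below is not part of PHACT; the function names are hypothetical, and it assumes a 365-day year with the single machine fully utilized, which the abstract does not state.

```python
# Back-of-the-envelope check of the compute figures quoted in the abstract.
# All function names are illustrative; none come from the PHACT code base.

HOURS_PER_DAY = 24
CORES_PER_PROTEIN = 8         # abstract: one day on eight cores per protein
N_PROTEINS = 20_546           # abstract: proteins in the dataset
SINGLE_MACHINE_CORES = 64     # abstract: the hypothetical single computer
HOURS_PER_YEAR = 24 * 365     # assumption: 365-day year, 100% utilization


def cpu_hours_per_protein(days=1, cores=CORES_PER_PROTEIN):
    """CPU hours consumed by one protein: wall-clock hours times cores."""
    return days * HOURS_PER_DAY * cores


def total_cpu_hours(n_proteins=N_PROTEINS):
    """CPU hours for the whole dataset."""
    return n_proteins * cpu_hours_per_protein()


def wall_clock_years(cores=SINGLE_MACHINE_CORES):
    """Years needed if the whole budget runs on one machine with `cores` cores."""
    return total_cpu_hours() / cores / HOURS_PER_YEAR


def job_count(n_proteins, n_parameters, n_tasks):
    """Independent jobs when every protein runs every parameter and task."""
    return n_proteins * n_parameters * n_tasks


if __name__ == "__main__":
    print("CPU hours per protein:", cpu_hours_per_protein())
    print("Total CPU hours:", total_cpu_hours())
    print("Years on one 64-core machine:", round(wall_clock_years(), 2))
    print("Jobs for 50 proteins x 10 parameters x 10 tasks:", job_count(50, 10, 10))
```

Under these assumptions the totals agree with the abstract: 192 CPU hours per protein, about 3.94M CPU hours overall, roughly seven years on a single 64-core machine, and 5000 jobs in the 50-protein example.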