165 research outputs found
Applications of Evolutionary Bioinformatics in Basic and Biomedical Research
With the revolutionary progress in sequencing technologies, computational biology has emerged as a game-changing field that is applied to understanding the molecular events of life for not only complementary but also exploratory purposes. Bioinformatics resources and tools significantly help in data generation, organization and analysis. However, there is still a need for new approaches built from a biologist’s point of view. In protein bioinformatics, there are several fundamental problems such as (i) determining protein function; (ii) identifying protein-protein interactions; and (iii) predicting the effect of amino acid variants. Here, I present three chapters addressing these problems from an evolutionary perspective. Firstly, I describe a novel search pipeline for protein domain identification. The algorithm chain provides sensitive domain assignments with the highest possible specificity. Secondly, I present a tool enabling large-scale visualization of the presence and absence of proteins in hierarchically clustered genomes. This tool visualizes multi-layer information of any kind of genome-linked data with a special focus on domain architectures, enabling identification of coevolving domains/proteins, which can eventually help in identifying functionally interacting proteins. Finally, I propose an approach for distinguishing between benign and damaging missense mutations in a human disease by establishing the precise evolutionary history of the associated gene. This part introduces new criteria for determining functional orthologs via phylogenetic analysis. All three parts use comparative genomics and/or sequence analyses. Taken together, this study addresses important problems in protein bioinformatics, and as a whole it can be used to describe proteins by their domains, coevolving partners and functionally important residues.
Establishing the precise evolutionary history of a gene improves prediction of disease-causing missense mutations
PURPOSE: Predicting the phenotypic effects of mutations has become an important application in clinical genetic diagnostics. Computational tools evaluate the behavior of the variant over evolutionary time and assume that variations seen during the course of evolution are probably benign in humans. However, current tools do not take into account orthologous/paralogous relationships. Paralogs have dramatically different roles in Mendelian diseases. For example, whereas inactivating mutations in the NPC1 gene cause the neurodegenerative disorder Niemann-Pick C, inactivating mutations in its paralog NPC1L1 are not disease-causing and, moreover, are implicated in protection from coronary heart disease. METHODS: We identified major events in NPC1 evolution and revealed and compared orthologs and paralogs of the human NPC1 gene through phylogenetic and protein sequence analyses. We predicted whether an amino acid substitution affects protein function by reducing the organism’s fitness. RESULTS: Removing the paralogs and distant homologs improved the overall performance of categorizing disease-causing and benign amino acid substitutions. CONCLUSION: The results show that a thorough evolutionary analysis followed by identification of orthologs improves the accuracy in predicting disease-causing missense mutations. We anticipate that this approach will be used as a reference in the interpretation of variants in other genetic diseases as well. Genet Med 18(10), 1029–1036.
Characterizing the potential interplay between nucleotide excision repair and R-loops
R-loops have been a focus of interest in genomics due to their non-canonical structures and
unclear roles on genomes of many organisms. While R-loops contribute to gene expression
and efficient transcriptional termination, they cause genome instability under certain
conditions. They are formed when an RNA anneals with its complementary DNA strand. A
DNA:RNA hybrid is formed and the other strand of DNA is left single-stranded (ssDNA). To
date, there has been no clear knowledge of R-loops’ tendency for UV damage formation or how
these DNA:RNA hybrid and ssDNA structures affect nucleotide excision repair (NER), the
primary mechanism to cope with UV-induced DNA damage. Therefore, we aimed to shed
light on the relationship between R-loops and UV damage occurrence and repair using R-loops' positions on human and Arabidopsis genomes, and Damage-seq and XR-seq data that
provided positions of UV damages and repair events, respectively. By comparing the R-loop-forming locations with damage and repair occurrences, we observed that the repair
efficiency on R-loops was better than their surrounding regions. However, when ATAC-seq
read count normalization eliminated the impact of R-loops being on open chromatin, lower
repair efficiency on R-loop centers and 5’ regions, but higher repair efficiency on 3’ regions
were observed. Because this repair profile might not be valid for each R-loop, we created
heatmaps of relative repair where we could group R-loops with similar repair profiles. As a
result, four different relative repair profiles were observed. We also checked the damage
occurrence on R-loops and saw that in general, R-loops receive less damage than their
surroundings, while there were also four different damage profiles within subgroups of R-loops. Further analysis will be conducted based on these results to explain what is behind
differential repair and damage profiles on R-loops and better understand the roles of R-loops on our genomes.
Effect of G-quadruplex formation on UV-induced damage and nucleotide excision repair
G-quadruplexes (G4s) are non-B DNA structures formed by four or more guanine tetrads
stabilized by a charged ion. The formation and cellular functions of G4s are widely studied. G4s
are known to localize at telomeres and promoters with a range of functions including maintaining
genome integrity at telomeres and suppressing or initiating transcription. Moreover, G4s can
cause mutagenesis if they stay persistent during replication [1]. Although thermodynamically
stable G4s are linked to double-strand breaks and genome instability, their effect on UV-induced
damages and the interplay between nucleotide excision repair and G4 formation are unclear.
Aiming to uncover this relationship, we have obtained genome-wide G4 maps in HeLa cells using
G4P-ChIP-seq data and motif finder algorithms Quadron [2] and G4-Miner [3]. Then, we applied
Damage-seq and XR-seq methods to HeLa cells, which map UV-induced damages [(6-4)PPs and
CPDs] and nucleotide excision repair of these damages, respectively [4]. Profiles of (6-4)PP and
CPD damage around G4 regions suggest that the damage formation on the G4s was lower than
in the neighboring regions. On the contrary, we observed higher relative repair on G4s compared
to the flanking regions. We are currently working on understanding the reasons behind these
profiles that will provide insights into co-occurrence of G4s and UV-induced damage and the
response of nucleotide excision repair at these sites.
Quantifying the Temporal Dynamics between Genome Folding and UV-induced DNA Damage Response
Chromosome conformation has been typically linked to transcription and replication,
which has been the focus of many studies. However, recognizing the extremely
dynamic activity of chromatin contacts has shifted the emphasis of study in recent
years toward its coupling in creating adaptive repair networks. Studies have been conducted
to learn how the genome is protected from double-stranded breaks and different
types of damage sources [1, 2], but considerably less has been done for cancer-causing
bulky DNA lesions, which are generally introduced by UV damage and repaired by
nucleotide-excision repair (NER).
Using Hi-C contact maps we derived from UV-irradiated HeLa cells with 12, 30,
and 60 minute recovery durations, we investigated 3D genome folding and its relation
to NER activity. We examined the various hierarchical 3D genome architecture layers,
including TADs, that change dynamically in response to UV damage. Under matching
experimental conditions, we also generated RNA-seq samples. In addition, we provide
a novel method for understanding the differential chromatin contacts for mutual
comparison of two Hi-C contact maps with an emphasis on short-range interactions.
We employ a Graph Neural Network (GNN) model to detect whether sub-interactions
of a graph created from one Hi-C matrix are also retained in the other. Using this
technique, we quantify the magnitude of differential change around each bin along the
diagonal of an equally divided Hi-C contact map and decide whether a genomic region
is differential. As part of this multi-omics strategy, we obtained Damage-seq[3] and
XR-seq[4] samples to show NER activity in the specific genomic areas identified by
differential time course analysis.
Our work establishes evidence for the notable effects of UV-irradiation on 3D structure
and genome integrity, and offers a model in which, when exposed to UV light, differential
chromatin compaction is associated with dramatically elevated repair rates.
Dynamic maps of UV damage formation and repair for the human genome
Nucleotide excision repair removes DNA damage caused by carcinogens, such as UV, and by anticancer drugs, such as cisplatin. We have developed two methods, high-sensitivity damage sequencing and excision repair sequencing, that map the formation and repair of damage in the human genome at single-nucleotide resolution. The combination of dynamic damage and repair maps provides a holistic perspective of UV damage and repair of the human genome and has potential applications in cancer prevention and chemotherapy.
Differential damage and repair of DNA-adducts induced by anti-cancer drug cisplatin across mouse organs
The platinum-based drug cisplatin is a widely used first-line therapy for several cancers. Cisplatin interacts with DNA mainly in the form of Pt-d(GpG) di-adduct, which stalls cell proliferation and activates DNA damage response. Although cisplatin shows a broad spectrum of anticancer activity, its utility is limited due to acquired drug resistance and toxicity to non-targeted tissues. Here, by integrating genome-wide high-throughput Damage-seq, XR-seq, and RNA-seq approaches, along with publicly available epigenomic data, we systematically study the genome-wide profiles of cisplatin damage formation and excision repair in mouse kidney, liver, lung and spleen. We find different DNA damage and repair spectra across mouse organs, which are associated with tissue-specific transcriptomic and epigenomic profiles. The framework and the multi-omics data we present here constitute an unbiased foundation for understanding the mechanisms of cellular response to cisplatin. Our approach should be applicable for studying drug resistance and for tailoring cancer chemotherapy regimens
Reproducible and scalable data analysis on high performance computing
We recently presented PHACT (PHylogeny-Aware Computation of Tolerance), a tool for assessing amino acid
substitutions that achieved superior predictive performance compared to widely adopted tools [1]. PHACT scores
alterations using not only their frequency in the multiple sequence alignment (MSA), as most common
tools do, but also the gene-based phylogenetic trees. PHACT’s inputs include the MSA of the protein, the
phylogenetic tree estimated on that MSA and the probability distribution of amino acids at each ancestral node
estimated from the tree. To assess the predictive performance of PHACT, we performed various experiments over a
dataset that includes 20,546 proteins and 61,662 variants.
In theory, analyzing a protein takes one day using eight cores; thus, the amount of computation is 192 CPU
hours per protein. The overall computation time to finish the analyses for the whole dataset (20,546 proteins) is 3.94M CPU hours.
On a single powerful computer with 64 cores and 512 GB of memory, this would take around seven years, which is not practical.
We completed the analyses within four months by using a High-Performance Computing (HPC) cluster.
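The compute budget quoted above can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the text; the final estimate of the average number of busy cores is an illustration that assumes 30-day months and perfect parallel efficiency, and is not a number reported by the authors.

```python
# Back-of-the-envelope compute budget, using only figures quoted in the text.
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365

cores_per_protein = 8    # cores used to analyze one protein
days_per_protein = 1     # wall-clock days per protein (theoretical)
n_proteins = 20_546      # proteins in the dataset

# 8 cores x 24 hours = 192 CPU hours per protein
cpu_hours_per_protein = cores_per_protein * days_per_protein * HOURS_PER_DAY

# Whole dataset: roughly 3.94 million CPU hours
total_cpu_hours = cpu_hours_per_protein * n_proteins


def wall_clock_years(cpu_hours, n_cores):
    """Wall-clock years needed if n_cores are kept fully busy."""
    return cpu_hours / n_cores / HOURS_PER_YEAR


# A single 64-core machine: about seven years
single_machine_years = wall_clock_years(total_cpu_hours, 64)

# ASSUMPTION (illustrative only): four months = 4 x 30 days, perfect scaling.
four_months_hours = 4 * 30 * HOURS_PER_DAY
avg_busy_cores = total_cpu_hours / four_months_hours

if __name__ == "__main__":
    print(f"CPU hours per protein : {cpu_hours_per_protein}")
    print(f"Total CPU hours       : {total_cpu_hours:,}")
    print(f"Years on 64 cores     : {single_machine_years:.2f}")
    print(f"Avg. cores for 4 mo.  : {avg_busy_cores:.0f}")
```

Under these idealized assumptions, finishing in four months corresponds to keeping on the order of a thousand or more cores busy continuously, which is why a single workstation is not an option.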
Performing an extensive, reproducible and scalable data analysis for multiple proteins with various parameters on an
HPC cluster is not straightforward [3]. For example, 50 proteins with ten parameters and ten consecutive tasks mean that 5000
independent jobs must be executed successfully. Each job (task) has different characteristics; some are CPU-intensive, and others
are memory-intensive. Some jobs are completed within hours, while others take days or weeks. Moreover,
an HPC cluster is a complex environment in which hundreds of servers run together and failures are not exceptional; thus,
managing such a large number of jobs is not easy. A workflow tool, in which an analysis is defined by a set
of rules that produce a set of output files from a set of input files, is a must. In addition to being scalable, the analysis
must be reproducible, so that other researchers can obtain the same results at any time [4]. All the
tools and software used during the analyses, the input files and the computational facilities can be defined in a text file so that all
environments can be deployed without additional effort. To satisfy all these requirements, we used a Snakemake
workflow because of its human-readable, Python-based language, portability, integration
with the conda package manager, automatic deployment, and ability to specify software dependencies.
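The job count in the example above (50 proteins, ten parameter settings, ten tasks) is simply the size of a Cartesian product. The following minimal sketch enumerates it and shows the bookkeeping a workflow tool takes over: deciding which jobs still need to run based on which output files already exist. All names and the output path pattern are hypothetical and are not taken from the PHACT repository.

```python
from itertools import product

# Hypothetical identifiers, for illustration only.
proteins = [f"protein_{i:02d}" for i in range(50)]
param_sets = [f"params_{j}" for j in range(10)]
tasks = [f"task_{k}" for k in range(10)]

# One job per (protein, parameter set, task) combination.
jobs = list(product(proteins, param_sets, tasks))


def output_path(protein, params, task):
    """Hypothetical output file that marks a job as finished."""
    return f"results/{protein}/{params}/{task}.done"


def pending_jobs(all_jobs, finished_files):
    """Jobs whose output file is not yet present (e.g. after a node failure)."""
    return [job for job in all_jobs if output_path(*job) not in finished_files]


if __name__ == "__main__":
    print(len(jobs))  # 50 * 10 * 10 combinations
    # Pretend everything finished except the jobs of one protein.
    finished = {output_path(*j) for j in jobs if j[0] != "protein_07"}
    print(len(pending_jobs(jobs, finished)))
```

Tracking completion through output files rather than through scheduler state is what lets an interrupted analysis resume from where it stopped instead of restarting every job.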
The PHACT framework specifies rules in a Snakefile. Rules decompose the workflow into small steps such as finding
homologs of each query sequence (PSI-BLAST), performing multiple sequence alignment (MAFFT), or generating a
maximum-likelihood phylogenetic tree (RAxML-NG, FastTree). Each rule has its own model parameters, which can be
set via a single configuration file (config/config.yml). A dry-run option can be used to check whether the workflow is
properly defined and to estimate the amount of computation remaining. It summarizes the total number of jobs (rules)
to be performed and the sets of input and output files used and created, respectively. For 2 query files, as given in Fig.1, 29 jobs
will be executed. In addition to running the workload on a local computer with a limited number of query IDs, the
PHACT framework is designed to analyze a large number of query IDs in parallel on an HPC cluster. Most HPC clusters have a
scheduler that handles the workload on compute nodes. To interact with the scheduler, users must normally prepare a bash script and submit it to the cluster.
Snakemake has the functionality to perform all these steps automatically.
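To make the idea of rules and a dry run concrete, here is a toy planner written in plain Python. It is not Snakemake and does not reproduce the PHACT Snakefile: the three rule names and file patterns are hypothetical, and the real workflow has more steps (29 jobs for 2 queries, as stated above, versus 6 here). The sketch only illustrates how a dry run can list the jobs that would be executed without running anything.

```python
# Hypothetical three-step chain; rule names and file patterns are illustrative.
RULES = [
    # (rule name, input pattern, output pattern)
    ("find_homologs", "input/{q}.fasta", "results/{q}/homologs.fasta"),
    ("align", "results/{q}/homologs.fasta", "results/{q}/msa.fasta"),
    ("build_tree", "results/{q}/msa.fasta", "results/{q}/tree.nwk"),
]


def dry_run(query_ids, existing_files):
    """Return the (rule, query) jobs that would run, without executing them.

    A job is planned if its output is missing, or if any upstream job for the
    same query is planned (its input would then be regenerated).
    """
    planned = []
    for q in query_ids:
        upstream_planned = False
        for name, _input_pattern, output_pattern in RULES:
            target = output_pattern.format(q=q)
            if upstream_planned or target not in existing_files:
                planned.append((name, q))
                upstream_planned = True
    return planned


if __name__ == "__main__":
    # Nothing computed yet: every rule runs for every query.
    print(len(dry_run(["Q1", "Q2"], set())))

    # Q1 already has homologs and an alignment: only its tree is missing.
    done = {"results/Q1/homologs.fasta", "results/Q1/msa.fasta"}
    for rule, query in dry_run(["Q1", "Q2"], done):
        print(rule, query)
```

In Snakemake itself the equivalent check is requested from the command line with the `--dry-run` (`-n`) option, which prints the planned jobs instead of executing them.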
Within this work, a valuable dataset that contains MSAs and phylogenetic trees, amounting to more than 1M files
and 1.6 TB in size, was created and shared with other researchers. All details such as documentation, scripts, tools,
environments, and input proteins can be found on our GitHub page [5], and all results are published on our FTP server
[6].
- …
