17 research outputs found

    SADI, SHARE, and the in silico scientific method

    Background: The emergence and uptake of Semantic Web technologies by the Life Sciences provides exciting opportunities for exploring novel ways to conduct in silico science. Web Service workflows are already becoming first-class objects in "the new way", and serve as explicit, shareable, referenceable representations of how an experiment was done. In turn, Semantic Web Service projects aim to facilitate workflow construction by biological domain experts such that workflows can be edited, re-purposed, and re-published by non-informaticians. However, the aspects of the scientific method relating to explicit discourse, disagreement, and hypothesis generation have remained relatively impervious to new technologies.

    Results: Here we present SADI and SHARE - a novel Semantic Web Service framework and a reference implementation of its client libraries. Together, SADI and SHARE allow the semi- or fully-automatic discovery and pipelining of Semantic Web Services in response to ad hoc user queries.

    Conclusions: The semantic behaviours exhibited by SADI and SHARE extend the functionality provided by Description Logic reasoners such that novel assertions can be added to a dataset automatically - not by logical reasoning, but by analytical or annotative services. This behaviour might be applied to achieve the "semantification" of those aspects of the in silico scientific method that are not yet supported by Semantic Web technologies. We support this suggestion with an example from the clinical research space.
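    The pivotal behaviour described in the conclusions - new assertions being added to a dataset by an analytical service rather than derived by a reasoner - can be sketched in a few lines of Python using the rdflib library. The namespace, the patient properties, and the bmi_service function below are hypothetical stand-ins for a SADI analytical service, not code from the SADI/SHARE projects.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    # Hypothetical namespace for illustration. SADI services advertise their
    # input/output OWL classes; a client such as SHARE matches unresolved
    # query properties against those service descriptions.
    EX = Namespace("http://example.org/clinical#")

    def bmi_service(graph: Graph) -> Graph:
        """Stand-in for a SADI analytical service: it computes new property
        values (here, BMI) rather than deriving them by logical reasoning."""
        out = Graph()
        for patient in graph.subjects(RDF.type, EX.Patient):
            weight = graph.value(patient, EX.weightKg)
            height = graph.value(patient, EX.heightM)
            if weight is not None and height is not None:
                bmi = float(weight) / float(height) ** 2
                out.add((patient, EX.hasBMI, Literal(round(bmi, 1))))
        return out

    data = Graph()
    p = URIRef("http://example.org/patient/1")
    data.add((p, RDF.type, EX.Patient))
    data.add((p, EX.weightKg, Literal(80.0)))
    data.add((p, EX.heightM, Literal(1.75)))

    # The client merges the service output back into the dataset, so a query
    # over ex:hasBMI can now be answered even though no BMI was ever asserted.
    data += bmi_service(data)
    print(data.value(p, EX.hasBMI))  # -> 26.1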

    Constructing and applying semantic models of clinical phenotypes to support web-embedded clinical research

    The adoption of the Semantic Web by the life sciences provides exciting opportunities for exploring novel ways to conduct biomedical research. In particular, the approval of the Web Ontology Language (OWL) by the World Wide Web Consortium (W3C) has provided a global standard for the shared representation of biomedical knowledge in the form of ontologies. However, though there are numerous examples of bio-ontologies being used to describe "what is" (i.e., a universal view of reality), there is a dearth of examples where ontologies are used to describe "what might be" (i.e., a hypothetical view of reality). This thesis proposes that, to achieve scientific rigor, it is important to consider approaches for explicitly representing subjective knowledge in a given domain; in particular, we examine phenotypic classifications in the clinical domain. We provide supporting evidence that OWL is suitable for formally representing these subjective perspectives. Envisioning ontologies as hypothetical, contextual, and subjective is a notable departure from the commonly-held view in the life sciences, where ontologies (ostensibly) represent some "universal truth".

    We support these arguments with both empirical and quantitative studies. We demonstrate that, when expressed in OWL, many phenotypic classification systems can be accurately modeled in silico. We then demonstrate that these knowledge models can be "personalized", and show that such models can enable the automated analysis of data in a transparent manner. This results in more rigorous clinical research, while simultaneously allowing clinicians to maintain their role as the final arbiters of decisions.

    Finally, we investigate methodologies that might facilitate the encoding and sharing of personalized expert knowledge by non-knowledge-engineers - a necessary step in making these ideas useful to the clinical community. The knowledge-acquisition bottleneck is the primary barrier to the widespread use of ontologies in the life sciences. Thus, we investigate data-driven methodologies to automatically extract knowledge from existing data systems, and show that it is possible to "bootstrap" the construction of knowledge models through various data-mining algorithms. Taken together, these studies begin to reveal a path toward Web-embedded, in silico clinical research, where knowledge is explicit, transparent, personalized, modular, globally shared, re-used, and dynamically applied to the interpretation and analysis of clinical datasets.
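    The notion of a "personalized" knowledge model is easiest to see in miniature. The following Python sketch expresses the idea outside of OWL: each expert's phenotype definition carries its own explicit cut-offs, so classifications are transparent and reproducible. The class name and thresholds are illustrative, not taken from the thesis.

    from dataclasses import dataclass

    @dataclass
    class HypertensionModel:
        """A toy 'personalized' phenotype definition: each clinician's model
        keeps its own cut-offs, so the basis for a classification is explicit
        and the model can be shared and re-applied to other datasets."""
        systolic_cutoff: float = 140.0
        diastolic_cutoff: float = 90.0

        def classify(self, systolic: float, diastolic: float) -> str:
            if systolic >= self.systolic_cutoff or diastolic >= self.diastolic_cutoff:
                return "hypertensive"
            return "normotensive"

    guideline = HypertensionModel()                      # published guideline
    dr_smith = HypertensionModel(systolic_cutoff=150.0)  # one expert's variant

    print(guideline.classify(145, 85))  # hypertensive
    print(dr_smith.classify(145, 85))   # normotensive under the personalized model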

    Automatic detection and resolution of measurement-unit conflicts in aggregated data

    Background: Measurement-unit conflicts are a perennial problem in integrative research domains such as clinical meta-analysis. As multi-national collaborations grow, as new measurement instruments appear, and as Linked Open Data infrastructures become increasingly pervasive, the number of such conflicts will similarly increase.

    Methods: We propose a generic approach to the problem of (a) encoding measurement units in datasets in a machine-readable manner, (b) detecting when a dataset contains mixtures of measurement units, and (c) automatically converting any conflicting units into a desired unit, as defined for a given study.

    Results: We utilized existing ontologies and standards for scientific data representation, measurement-unit definition, and data manipulation to build a simple and flexible Semantic Web Service-based approach to measurement-unit harmonization. A cardiovascular patient cohort, in which clinical measurements were recorded in a number of different units (e.g., mmHg and cmHg for blood pressure), was automatically classified into a number of clinical phenotypes semantically defined using different measurement units.

    Conclusions: We demonstrate that, through a combination of semantic standards and frameworks, unit-integration problems can be automatically detected and resolved.
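    A minimal sketch of steps (b) and (c) in plain Python, assuming each value already carries a machine-readable unit label (as it would if annotated against a units ontology such as UO or QUDT). The conversion table and function names are illustrative, not the paper's services.

    # Conversion factors to a canonical unit (mmHg); values are standard.
    TO_MMHG = {"mmHg": 1.0, "cmHg": 10.0, "kPa": 7.50062}

    def harmonize(measurements, target="mmHg"):
        """Detect mixed units among (value, unit) pairs and convert to the target."""
        units = {unit for _, unit in measurements}
        if len(units) > 1:
            print(f"unit conflict detected: {sorted(units)} -> converting to {target}")
        return [(value * TO_MMHG[unit] / TO_MMHG[target], target)
                for value, unit in measurements]

    bp = [(120.0, "mmHg"), (13.0, "cmHg"), (16.0, "kPa")]
    print(harmonize(bp))
    # -> [(120.0, 'mmHg'), (130.0, 'mmHg'), (120.00992, 'mmHg')]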

    Extending and encoding existing biological terminologies and datasets for use in the reasoned semantic web

    Background: Clinical phenotypes and disease-risk stratification are most often determined through the direct observations of clinicians in conjunction with published standards and guidelines, where the clinical expert is the final arbiter of the patient's classification. While this "human" approach is highly desirable in the context of personalized and optimal patient care, it is problematic in a healthcare research setting because the basis for the patient's classification is not transparent, and is likely not reproducible from one clinical expert to another. This sits in opposition to the rigor required to execute, for example, genome-wide association analyses and other high-throughput studies where a large number of variables are being compared to a complex disease phenotype. Most clinical classification systems are not structured for automated classification, and similarly, clinical data are generally not represented in a form that lends itself to automated integration and interpretation. Here we apply Semantic Web technologies to the problem of automated, transparent interpretation of clinical data for use in high-throughput research environments, and explore migration paths for existing data and legacy semantic standards.

    Results: Using a dataset from a cardiovascular cohort collected two decades ago, we present a migration path - both for the terminologies/classification systems and the data - that enables rich automated clinical classification using well-established standards. This is achieved by establishing a simple and flexible core data model, which is combined with a layered ontological framework utilizing both logical reasoning and analytical algorithms to iteratively "lift" clinical data through increasingly complex layers of interpretation and classification. We compare our automated analysis to that of the clinical expert, and discrepancies are used to refine the ontological models, finally arriving at ontologies that mirror the expert opinion of the individual clinical researcher. Other discrepancies, however, could not be as easily modeled, and we evaluate what information we lack that would allow these discrepancies to be resolved in an automated manner.

    Conclusions: We demonstrate that the combination of semantically-explicit data, logically rigorous models of clinical guidelines, and publicly-accessible Semantic Web Services can be used to execute automated, rigorous, and reproducible clinical classifications with an accuracy approaching that of an expert. Discrepancies between the manual and automatic approaches reveal, as expected, that clinicians do not always rigorously follow established guidelines for classification; however, we demonstrate that "personalized" ontologies may represent a re-usable and transparent approach to modeling individual clinical expertise, leading to more reproducible science.
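    The layered "lifting" idea can be sketched as a pipeline of interpretation functions, each adding higher-level classifications derived from the layer below. The layers, property names, and cut-offs here are illustrative stand-ins for the ontological framework described in the paper, not its actual models.

    def layer_units(record):
        # Normalize units: LDL cholesterol mg/dL -> mmol/L (divide by 38.67).
        record["ldl_mmol_l"] = record["ldl_mg_dl"] / 38.67
        return record

    def layer_measurements(record):
        # Interpret the normalized value against an illustrative cut-off.
        record["ldl_elevated"] = record["ldl_mmol_l"] >= 3.4
        return record

    def layer_phenotype(record):
        # Classify the patient from the interpreted measurements.
        record["phenotype"] = ("hyperlipidemic" if record["ldl_elevated"]
                               else "normolipidemic")
        return record

    def lift(record, layers=(layer_units, layer_measurements, layer_phenotype)):
        """Iteratively 'lift' raw clinical data through layers of interpretation."""
        for layer in layers:
            record = layer(record)
        return record

    print(lift({"ldl_mg_dl": 160.0}))
    # -> phenotype 'hyperlipidemic' (LDL 4.14 mmol/L >= 3.4)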

    Bamgineer: Introduction of simulated allele-specific copy number variants into exome and targeted sequence data sets

    Abstract: Somatic copy number variations (CNVs) play a crucial role in the development of many human cancers. The broad availability of next-generation sequencing data has enabled the development of algorithms to computationally infer CNV profiles from a variety of data types, including exome and targeted sequence data - currently the most prevalent types of cancer genomics data. However, systematic evaluation and comparison of these tools remain challenging due to a lack of ground-truth reference sets. To address this need, we have developed Bamgineer, a tool written in Python to introduce user-defined, haplotype-phased, allele-specific copy number events into an existing Binary Alignment Mapping (BAM) file, with a focus on targeted and exome sequencing experiments. As input, the tool requires a read alignment file (BAM format), lists of non-overlapping genome coordinates for the introduction of gains and losses (BED file), and an optional file defining known haplotypes (VCF format). To improve runtime performance, Bamgineer introduces the desired CNVs in parallel, using queuing and parallel processing on a local machine or on a high-performance computing cluster. As proof of principle, we applied Bamgineer to a single high-coverage (mean: 220X) exome sequence file from a blood sample to simulate copy number profiles of 3 exemplar tumors from each of 10 tumor types at 5 tumor cellularity levels (20-100%; 150 BAM files in total). To demonstrate feasibility beyond exome data, we introduced read alignments to a targeted 5-gene cell-free DNA sequencing library to simulate EGFR amplifications at frequencies consistent with circulating tumor DNA (10, 1, 0.1 and 0.01%) while retaining the multimodal insert size distribution of the original data. We expect Bamgineer to be of use for the development and systematic benchmarking of CNV calling algorithms on locally-generated data for a variety of applications. The source code is freely available at http://github.com/pughlab/bamgineer.

    Author summary: We present Bamgineer, a software program to introduce user-defined, haplotype-specific copy number variants (CNVs) at any frequency into standard Binary Alignment Mapping (BAM) files. Copy number gains are simulated by introducing new DNA sequencing read pairs sampled from existing reads and modified to contain SNPs of the haplotype of interest. This approach retains biases of the original data such as local coverage, strand bias, and insert size. Deletions are simulated by removing reads corresponding to one or both haplotypes. In our proof-of-principle study, we simulated copy number profiles from 10 cancer types at varying cellularity levels typically encountered in clinical samples. We also demonstrated the introduction of low-frequency CNVs into cell-free DNA sequencing data that retained the bimodal fragment size distribution characteristic of these data. Bamgineer is flexible: it enables users to simulate CNVs that reflect the characteristics of locally-generated sequence files, and can be used for many applications, including the development and benchmarking of CNV inference tools for a variety of data types.
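    The core trick described in the author summary - simulating a gain by re-emitting existing read pairs so that local coverage, strand bias, and insert sizes are inherited from the source data - can be sketched with pysam. This is a conceptual illustration, not Bamgineer's implementation: a real tool must also edit SNPs for allele specificity, carry over the rest of the genome, and re-sort and index the output. The file names and region below are hypothetical, and the input BAM is assumed to be coordinate-sorted and indexed.

    import pysam

    def simulate_gain(in_bam: str, out_bam: str, contig: str, start: int, end: int):
        """Double the read depth in [start, end) by writing each read twice,
        approximating one extra copy of the region."""
        src = pysam.AlignmentFile(in_bam, "rb")
        out = pysam.AlignmentFile(out_bam, "wb", template=src)
        for read in src.fetch(contig, start, end):
            out.write(read)                              # original copy
            read.query_name = read.query_name + "_gain"  # rename the extra copy
            out.write(read)                              # duplicate doubles depth
        src.close()
        out.close()

    simulate_gain("normal.bam", "tumor_gain.bam", "chr7", 55_000_000, 55_200_000)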

    Bamgineer: Introduction of simulated allele-specific copy number variants into exome and targeted sequence data sets

    Somatic copy number variations (CNVs) play a crucial role in the development of many human cancers. The broad availability of next-generation sequencing data has enabled the development of algorithms to computationally infer CNV profiles from a variety of data types, including exome and targeted sequence data - currently the most prevalent types of cancer genomics data. However, systematic evaluation and comparison of these tools remain challenging due to a lack of ground-truth reference sets. To address this need, we have developed Bamgineer, a tool written in Python to introduce user-defined, haplotype-phased, allele-specific copy number events into an existing Binary Alignment Mapping (BAM) file, with a focus on targeted and exome sequencing experiments. As input, the tool requires a read alignment file (BAM format), lists of non-overlapping genome coordinates for the introduction of gains and losses (BED file), and an optional file defining known haplotypes (VCF format). To improve runtime performance, Bamgineer introduces the desired CNVs in parallel, using queuing and parallel processing on a local machine or on a high-performance computing cluster. As proof of principle, we applied Bamgineer to a single high-coverage (mean: 220X) exome sequence file from a blood sample to simulate copy number profiles of 3 exemplar tumors from each of 10 tumor types at 5 tumor cellularity levels (20-100%; 150 BAM files in total). To demonstrate feasibility beyond exome data, we introduced read alignments to a targeted 5-gene cell-free DNA sequencing library to simulate EGFR amplifications at frequencies consistent with circulating tumor DNA (10, 1, 0.1 and 0.01%) while retaining the multimodal insert size distribution of the original data. We expect Bamgineer to be of use for the development and systematic benchmarking of CNV calling algorithms on locally-generated data for a variety of applications. The source code is freely available at http://github.com/pughlab/bamgineer.

    Exome-wide simulated copy number profiles at a range of tumor purities yield expected allelic and copy number ratios.

    Allelic ratio for allele-specific copy number gain (A) and loss (B) events at heterozygous SNP loci for haplotypes affected (blue), haplotypes not affected (red), and SNPs not in engineered CNV regions (green) as negative controls, at different tumor cellularity levels (x-axis) across all cancers. C) Tumor-to-normal log2 depth ratio boxplots of copy number gain (red) and loss (blue) segments from Sequenza across all cancers (Table 1: http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1006080#pcbi.1006080.t001). D) Accuracy of Sequenza copy segment calling for gains (red triangles) and losses (blue circles) decreases as simulated tumor content decreases.
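    For readers wanting the expected values behind panels A and B: at a heterozygous SNP, the allelic ratio is the fraction of reads carrying a given allele, and for a single-copy gain of one haplotype it moves from 0.5 toward 2/3 as cellularity rises. A short Python sketch of the arithmetic (the read counts are illustrative):

    def allelic_ratio(ref_count: int, alt_count: int) -> float:
        """Fraction of reads carrying the alternate allele at a het SNP."""
        return alt_count / (ref_count + alt_count)

    def expected_gain_ratio(cellularity: float) -> float:
        """Expected alt-allele ratio when the alt haplotype gains one copy:
        alt copies per cell average 1 + c, total copies average 2 + c."""
        return (1 + cellularity) / (2 + cellularity)

    print(allelic_ratio(100, 150))   # 0.6 observed
    print(expected_gain_ratio(1.0))  # ~0.667 at 100% cellularity
    print(expected_gain_ratio(0.2))  # ~0.545 at 20% cellularity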

    Average number of copy number gains, losses and percent genome altered (PGA) across TCGA cancer types and the three exemplar tumors selected for our study.


    Log2 ratios from simulated exemplar tumors at varying purity levels.

    To assess the ability of Bamgineer to recapitulate the CNV profiles of known cancers, we compared, for each of 10 TCGA cancer types, the profile of an exemplar tumor (top track for each case) with CNVs called from BAM files generated by Bamgineer at five purity levels: 100%, 80%, 60%, 40% and 20% (subsequent tracks for each case). Log2 ratios of copy number segments inferred using the Sequenza algorithm are shown as a heatmap (blue: loss, red: gain; data range -1.5 to 1.5) for each cancer type and tumor cellularity, for one exemplar tumor per cancer type. We generated exemplar composite seg files by combining high-ranking tumor profiles from TCGA segments for each cancer, defined by similarity score (Eq 6: http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1006080#pcbi.1006080.e010) (top track for each cancer type). The quantized segments from each representative tumor were used as the CNV input for Bamgineer along with a single normal BAM file, and reads were sampled to artificially create tumor-normal admixture at the desired purity. The purity value for each tumor type (in gray) is the median purity for that cancer type according to TCGA segments. As purity decreases, we observed a corresponding decrease in the log2 ratios of tumor segments (reduction in color intensities). Occasional discrepancies between exemplar and simulated CNV profiles appear to be due to variation in the segmentation of copy number calls by Sequenza as purity falls, rather than to inaccurate allele-specific coverage at these locations.
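    The observation that segment log2 ratios shrink toward zero as purity falls follows directly from the tumor-normal mixture model: for a segment with total copy number n in a diploid background, the tumor-to-normal depth ratio at purity p is (p*n + (1-p)*2) / 2. A short Python sketch of the expected values (the copy numbers chosen are illustrative; this is not the Sequenza algorithm itself):

    import math

    def expected_log2_ratio(copy_number: int, purity: float) -> float:
        """Expected log2 tumor/normal depth ratio for a segment of the given
        total copy number in a tumor-normal admixture at the given purity."""
        return math.log2((purity * copy_number + (1 - purity) * 2) / 2)

    for purity in (1.0, 0.8, 0.6, 0.4, 0.2):
        gain = expected_log2_ratio(3, purity)  # single-copy gain
        loss = expected_log2_ratio(1, purity)  # single-copy loss
        print(f"purity {purity:.0%}: gain {gain:+.2f}, loss {loss:+.2f}")
    # e.g. purity 100%: gain +0.58, loss -1.00; purity 20%: gain +0.14, loss -0.15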