Achieving High Accuracy Prediction of Minimotifs
The low complexity of minimotif patterns results in a high false-positive prediction rate, hampering protein function prediction. A multi-filter algorithm, trained and tested on a linear regression model, support vector machine model, and neural network model, using a large dataset of verified minimotifs, vastly improves minimotif prediction accuracy while generating few false positives. An optimal threshold for the best accuracy reaches an overall accuracy above 90%, while a stringent threshold for the best specificity generates fewer than 1% false positives, or even none, and still produces more than 90% true positives for the linear regression and neural network models. The minimotif multi-filter with its excellent accuracy represents the state of the art in minimotif prediction and is expected to be very useful to biologists investigating protein function and how missense mutations cause disease.
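The two thresholds mentioned in the abstract above (one chosen for best accuracy, one chosen for very few false positives) can be illustrated with a minimal sketch. This is not the authors' multi-filter algorithm: the function names, the toy scores, and the 1% false-positive cutoff parameter `max_fpr` are assumptions made only for illustration, and the scores stand in for the output of any scoring model.

```python
def confusion(scores, labels, threshold):
    """Count TP, FP, TN, FN when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def best_accuracy_threshold(scores, labels):
    """Return (threshold, accuracy) for the candidate with the highest accuracy.

    Candidates are the distinct observed scores; ties keep the lowest threshold.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion(scores, labels, t)
        acc = (tp + tn) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


def stringent_threshold(scores, labels, max_fpr=0.01):
    """Return (threshold, fpr, tpr) for the lowest threshold with fpr <= max_fpr.

    Returns None if no observed score satisfies the false-positive limit.
    """
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion(scores, labels, t)
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        if fpr <= max_fpr:
            tpr = tp / (tp + fn) if (tp + fn) else 0.0
            return t, fpr, tpr
    return None


# Toy data (hypothetical): model scores and verified labels (1 = true minimotif).
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(best_accuracy_threshold(scores, labels))
print(stringent_threshold(scores, labels))
```

The sketch makes the trade-off explicit: raising the threshold to suppress false positives usually costs some true positives, which is why the abstract reports the two operating points separately.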
Ga+, In+ and Tl+ Impurities in Alkali Halide Crystals: Distortion Trends
A computational study of the doping of alkali halide crystals (AX: A = Na, K;
X = Cl, Br) by ns2 cations (Ga+, In+ and Tl+) is presented. Active clusters of
increasing size (from 33 to 177 ions) are considered in order to deal with the
large scale distortions induced by the substitutional impurities. Those
clusters are embedded in accurate quantum environments representing the
surrounding crystalline lattice. The convergence of the distortion results with
the size of the active cluster is analyzed for some selected impurity systems.
The most important conclusion from this study is that distortions along the
(100) and (110) crystallographic directions are not independent. Once a
reliable cluster model is found, distortion trends as a function of impurity,
alkali cation and halide anion are identified and discussed. These trends may
be useful when analyzing other cation impurities in similar host lattices.
Comment: LaTeX file. 7 pages and 2 pictures. Accepted for publication in J.
Chem. Phy
Secondary Structure, a Missing Component of Sequence-Based Minimotif Definitions
Minimotifs are short contiguous segments of proteins that have a known biological function. The hundreds of thousands of minimotifs discovered thus far are an important part of the theoretical understanding of the specificity of protein-protein interactions, posttranslational modifications, and signal transduction that occur in cells. However, a longstanding problem is that the different abstractions of the sequence definitions do not accurately capture the specificity, despite decades of effort by many labs. We present evidence that structure is an essential component of minimotif specificity, yet is not used in minimotif definitions. Our analysis of several known minimotifs as case studies, analysis of occurrences of minimotifs in structured and disordered regions of proteins, and review of the literature support a new model for minimotif definitions that includes sequence, structure, and function.
Embedding Analytics within the Curation of Scientific Workflows
This paper reports on the ongoing activities and curation practices of the National Center for Biomolecular NMR Data Processing and Analysis. Over the past several years, the Center has been developing and extending computational workflow management software for use by a community of biomolecular NMR spectroscopists. Previous work refactored the workflow system to use the PREMIS framework for reporting retrospective provenance, for sharing workflows between scientists, and for supporting data reuse. In this paper, we report on our recent efforts to embed analytics within the workflow execution and within provenance tracking. Important metrics for each of the intermediate datasets are included within the corresponding PREMIS intellectual object, which allows for both inspection of the operation of individual actors and visualization of the changes throughout a full processing workflow.
These metrics can be viewed within the workflow management system or through standalone metadata widgets. Our approach is a hybrid one that supports both automated workflow execution and manual intervention and metadata management. In this combination, the workflow system and metadata widgets encourage the domain experts to be avid curators of the data which they create, fostering both computational reproducibility and scientific data reuse.
 
Explorations in provenance in the information sciences
Provenance is important throughout Library and Information Science and is particularly important for the information infrastructures which support the computational aspects of the natural sciences. This is highlighted by the prominence of provenance as a plank in the FAIR principles for data stewardship (principle R1.2). While traditionally focused on the history/lineage of physical objects, provenance is now commonly accepted to apply to digital objects such as the results of computation as well as to the recipes for computing; in the case of recipes this prospective provenance is critical for reproducibility. This dissertation begins with background in provenance pertaining to data curation and computational reproducibility. The second part describes attempts to “FAIRify” the reporting and execution of workflows within a domain of natural science for better data stewardship to support data reusability. The next chapters argue that there remains a gap in our ability to fully document provenance as there are more story-telling tenses than just the past (retrospective) and future (prospective). There are also the subjunctive (conditional) and perhaps many others. Supporting new flavors of provenance requires new modeling constructs. The thesis concludes with novel information modeling techniques which exploit reification of sub-class relationships suitable for modeling these many sub-classes of provenance, as well as other domains.
The student, Michael Gryk, submitted this dissertation for approval on 2024-11-24; it was approved for publication on 2024-12-01.
Foregrounding data curation to foster reproducibility of workflows and scientific data reuse
Scientific data reuse requires careful curation and annotation of the data. Late stage curation activities foster FAIR principles which include metadata standards for making data findable, accessible, interoperable and reusable. However, in scientific domains such as biomolecular nuclear magnetic resonance spectroscopy, there is a considerable time lag (usually more than a year) between data creation and data deposition. It is simply not feasible to backfill the required metadata so long after the data has been created (anything not carefully recorded is forgotten) – curation activities must begin closer to (if not at the point of) data creation. The need for foregrounding data curation activities is well known. However, scientific disciplines which rely on complex experimental design, sophisticated instrumentation, and intricate processing workflows, require extra care. The knowledge gap investigated by this research proposal is to identify classes of important metadata which are hidden within the tacit knowledge of a scientist when constructing an experiment, hidden within the operational specifications of the scientific instrumentation, and hidden within the design / execution of processing workflows. Once these classes of hidden knowledge have been identified, it will be possible to explore mechanisms for preventing the loss of key metadata, either through automated conversion from existing metadata or through curation activities at the time of data creation. The first step of the research plan is to survey artifacts of scientific data creation. That is, (i) existing data files with accompanying metadata, (ii) workflows and scripts for data processing, and (iii) documentation for software and scientific instrumentation. The second step is to group, categorize, and classify the types of "hidden" knowledge discovered. 
For example, one class of hidden knowledge already uncovered is the implicit recording of data as its reciprocal rather than the value itself, as in magnetogyric versus gyromagnetic ratios. The third step is to design/propose classes of solutions for these classes of problems. For instance, confusion over reciprocals can often be avoided by being explicit about units of measurement. Careful design of metadata display and curation widgets can help expose and document tacit knowledge which would otherwise be lost.
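The point about reciprocals and explicit units can be illustrated with a minimal sketch. It is not taken from the proposal: the `Quantity` class, the `RECIPROCAL_UNITS` table, and the frequency/period pair are hypothetical choices made only to show how an explicit unit lets software tell a value from its reciprocal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A number stored together with its unit, so the unit is never implicit."""
    value: float
    unit: str  # e.g. "Hz" (frequency) or "s" (period)


# Hypothetical table of unit pairs that are reciprocals of one another.
RECIPROCAL_UNITS = {"Hz": "s", "s": "Hz"}


def reciprocal(q: Quantity) -> Quantity:
    """Return 1/q with the matching reciprocal unit."""
    if q.unit not in RECIPROCAL_UNITS:
        raise ValueError(f"no reciprocal unit known for {q.unit!r}")
    if q.value == 0:
        raise ValueError("cannot take the reciprocal of zero")
    return Quantity(1.0 / q.value, RECIPROCAL_UNITS[q.unit])


def as_unit(q: Quantity, unit: str) -> Quantity:
    """Return q expressed in `unit`, inverting it only when the units require it."""
    if q.unit == unit:
        return q
    if RECIPROCAL_UNITS.get(q.unit) == unit:
        return reciprocal(q)
    raise ValueError(f"cannot express {q.unit!r} as {unit!r}")


# A file that stored a period where a frequency was expected is detected
# and converted instead of being silently misread.
stored = Quantity(0.25, "s")
print(as_unit(stored, "Hz"))
```

Without the `unit` field, the bare number 0.25 could be read as either quantity, which is the kind of tacit knowledge the proposal aims to surface.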
Conceptual-level workflow modeling of scientific experiments using NMR as a case study
BACKGROUND: Scientific workflows improve the process of scientific experiments by making computations explicit, underscoring data flow, and emphasizing the participation of humans in the process when intuition and human reasoning are required. Workflows for experiments also highlight transitions among experimental phases, allowing intermediate results to be verified and supporting the proper handling of semantic mismatches and different file formats among the various tools used in the scientific process. Thus, scientific workflows are important for the modeling and subsequent capture of bioinformatics-related data. While much research has been conducted on the implementation of scientific workflows, the initial process of actually designing and generating the workflow at the conceptual level has received little consideration. RESULTS: We propose a structured process to capture scientific workflows at the conceptual level that allows workflows to be documented efficiently, results in concise models of the workflow and more-correct workflow implementations, and provides insight into the scientific process itself. The approach uses three modeling techniques to model the structural, data flow, and control flow aspects of the workflow. The domain of biomolecular structure determination using Nuclear Magnetic Resonance spectroscopy is used to demonstrate the process. Specifically, we show the application of the approach to capture the workflow for the process of conducting biomolecular analysis using Nuclear Magnetic Resonance (NMR) spectroscopy. CONCLUSION: Using the approach, we were able to accurately document, in a short amount of time, numerous steps in the process of conducting an experiment using NMR spectroscopy. The resulting models are correct and precise, as outside validation of the models identified only minor omissions in the models. 
In addition, the models provide an accurate visual description of the control flow for conducting biomolecular analysis using an NMR spectroscopy experiment.
Lattice Distortions Around a Tl+ Impurity in NaI:Tl+ and CsI:Tl+ Scintillators. An Ab Initio Study Involving Large Active Clusters
Ab initio Perturbed Ion cluster-in-the-lattice calculations of the impurity
centers NaI:Tl+ and CsI:Tl+ are presented. We study several active clusters of
increasing complexity and show that the lattice relaxation around the Tl+
impurity implies the concerted movement of several shells of neighbors. The
results also reveal the importance of considering a set of ions that can
respond to the geometrical displacements of the inner shells by self-consistently
adapting their wave functions. Comparison with other calculations
involving comparatively small active clusters serves to assert the significance
of our conclusions. Contact with experiment is made by calculating absorption
energies. These are in excellent agreement with the experimental data for the
most realistic active clusters considered.
Comment: 7 pages plus 6 postscript figures, LaTeX. Submitted to Phys. Rev.
Ab Initio Calculation of the Lattice Distortions induced by Substitutional Ag- and Cu- Impurities in Alkali Halide Crystals
An ab initio study of the doping of alkali halide crystals (AX: A = Li, Na,
K, Rb; X = F, Cl, Br, I) by ns2 anions (Ag- and Cu-) is presented. Large active
clusters with 179 ions embedded in the surrounding crystalline lattice are
considered in order to describe properly the lattice relaxation induced by the
introduction of substitutional impurities. In all the cases considered, the
lattice distortions imply the concerted movement of several shells of
neighbors. The shell displacements are smaller for the smaller anion Cu-, as
expected. The study of the family of rock-salt alkali halides (excepting CsF)
allows us to extract trends that might be useful at a predictive level in the
study of other impurity systems. Those trends are presented and discussed in
terms of simple geometric arguments.
Comment: LaTeX file. 8 pages, 3 EPS pictures. New version contains
calculations of the energy of formation of the defects with model clusters of
different size
Curating Scientific Workflows for Biomolecular Nuclear Magnetic Resonance Spectroscopy
This paper describes our recent and ongoing efforts to enhance the curation of scientific workflows to improve reproducibility and reusability of biomolecular nuclear magnetic resonance (bioNMR) data. Our efforts have focused both on developing a workflow management system, called CONNJUR Workflow Builder (CWB), and on refactoring our workflow data model to make use of the PREMIS model for digital preservation. This revised workflow management system will be available through the NMRbox cloud-computing platform for bioNMR. In addition, we are implementing a new file structure which bundles the original binary data files along with PREMIS XML records describing the provenance of the data. These are packaged together using a standardized file archive utility. In this manner, the provenance and data curation information is maintained together with the scientific data. The benefits and limitations of these approaches, as well as future directions, are discussed in this paper.
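The bundling idea in the abstract above, keeping the binary data and its provenance record in one archive, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the archive format, so ZIP is an assumed choice; the XML element names are hypothetical placeholders and do not follow the actual PREMIS schema; and `build_provenance_xml` and `bundle` are invented names, not part of CONNJUR Workflow Builder.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_provenance_xml(data_name, events):
    """Serialize a placeholder provenance record (NOT the real PREMIS schema)."""
    root = ET.Element("provenance")  # hypothetical element names throughout
    obj = ET.SubElement(root, "object", name=data_name)
    for event_type in events:
        ET.SubElement(obj, "event", type=event_type)
    return ET.tostring(root, encoding="utf-8")


def bundle(data_name, data_bytes, events):
    """Return the bytes of a ZIP archive holding the data file and its record."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(data_name, data_bytes)
        archive.writestr(
            data_name + ".provenance.xml",
            build_provenance_xml(data_name, events),
        )
    return buffer.getvalue()


# Hypothetical usage: a tiny stand-in for a binary data file and two events.
archive_bytes = bundle("fid.bin", b"\x00\x01\x02", ["acquisition", "processing"])
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    print(archive.namelist())
```

Writing both members in a single step means the provenance record cannot drift away from the data it describes, which is the property the abstract emphasizes.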
