
    Securely measuring the overlap between private datasets with cryptosets

    Many scientific questions are best approached by pooling data collected by different groups or across large collaborative networks into a combined analysis. Unfortunately, some of the most interesting and powerful datasets, such as health records, genetic data, and drug discovery data, cannot be freely shared because they contain sensitive information. In many situations, knowing whether private datasets overlap determines whether it is worthwhile to navigate the institutional, ethical, and legal barriers that govern access to sensitive, private data. We report the first method of publicly measuring the overlap between private datasets that is secure under a malicious model without relying on private protocols or message passing. This method uses a publicly shareable summary of a dataset's contents, its cryptoset, to estimate its overlap with other datasets. Cryptosets approach "information-theoretic" security, the strongest type of security possible in cryptography, which cannot be cracked even with infinite computing power. We empirically and theoretically assess both the accuracy of these estimates and the security of the approach, demonstrating that cryptosets are informative, with stable accuracy, and secure.
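
    A minimal sketch of the idea, under the assumption that a cryptoset is a fixed-length histogram of hashed private identifiers and that overlap can be estimated from the dot product of two such histograms; the bin count, function names, and estimator below are illustrative, not taken from the paper.

    import hashlib

    L = 1024  # number of histogram bins (illustrative choice)

    def cryptoset(private_ids, length=L):
        """Build a publicly shareable histogram: each private ID hashes into one bin."""
        counts = [0] * length
        for pid in private_ids:
            h = int(hashlib.sha256(pid.encode()).hexdigest(), 16)
            counts[h % length] += 1
        return counts

    def estimated_overlap(counts_a, counts_b):
        """Estimate the overlap of A and B from their histograms.

        For independently hashed items the expected dot product is roughly
        n_a * n_b / L plus one extra count per shared item, so subtracting
        that background term gives an overlap estimate (a simplifying assumption).
        """
        n_a, n_b = sum(counts_a), sum(counts_b)
        dot = sum(a * b for a, b in zip(counts_a, counts_b))
        return dot - n_a * n_b / len(counts_a)

    # Usage: two sites publish only their histograms, never the raw identifiers.
    site_a = cryptoset([f"patient-{i}" for i in range(0, 6000)])
    site_b = cryptoset([f"patient-{i}" for i in range(4000, 9000)])
    print(round(estimated_overlap(site_a, site_b)))  # roughly 2000 shared records, with noise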

    Modeling reactivity to biological macromolecules with a deep multitask network

    Most small-molecule drug candidates fail before entering the market, frequently because of unexpected toxicity. Often, toxicity is detected only late in drug development, because many types of toxicities, especially idiosyncratic adverse drug reactions (IADRs), are particularly hard to predict and detect. Moreover, drug-induced liver injury (DILI) is the most frequent reason drugs are withdrawn from the market and causes 50% of acute liver failure cases in the United States. A common mechanism often underlies many types of drug toxicities, including both DILI and IADRs. Drugs are bioactivated by drug-metabolizing enzymes into reactive metabolites, which then conjugate to sites in proteins or DNA to form adducts. DNA adducts are often mutagenic and may alter the reading and copying of genes and their regulatory elements, causing gene dysregulation and even triggering cancer. Similarly, protein adducts can disrupt normal biological functions and induce harmful immune responses. Unfortunately, reactive metabolites are not reliably detected by experiments, and it is also expensive to test drug candidates for the potential to form DNA or protein adducts during the early stages of drug development. In contrast, computational methods have the potential to quickly screen for covalent binding potential, thereby flagging problematic molecules and reducing the total number of necessary experiments. Here, we train a deep convolutional neural network (the XenoSite reactivity model) using literature data to accurately predict both the sites and the probability of reactivity for molecules with glutathione, cyanide, protein, and DNA. At the site level, cross-validated predictions had area under the curve (AUC) performances of 89.8% for DNA and 94.4% for protein. Furthermore, the model separated molecules electrophilically reactive with DNA and protein from nonreactive molecules with cross-validated AUC performances of 78.7% and 79.8%, respectively. At both the site and molecule levels, the model significantly outperformed reactivity indices derived from quantum simulations reported in the literature. Moreover, we developed and applied a selectivity score to assess preferential reactions with the macromolecules as opposed to the common screening traps. For the entire data set of 2803 molecules, this approach yielded totals of 257 (9.2%) and 227 (8.1%) molecules predicted to be reactive only with DNA and protein, respectively, and hence molecules that would be missed by standard reactivity screening experiments. Site-of-reactivity data are an underutilized resource that can be used not only to predict whether molecules are reactive, but also to show where they might be modified to reduce toxicity while retaining efficacy. The XenoSite reactivity model is available at http://swami.wustl.edu/xenosite/p/reactivity.
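
    The abstract reports both site-level and molecule-level AUCs; one common way to turn per-atom (site-level) predictions into a molecule-level score is to take the maximum over a molecule's sites. The sketch below illustrates that aggregation and the AUC evaluation; the data, the max rule, and the use of scikit-learn are illustrative assumptions, not the XenoSite implementation.

    from sklearn.metrics import roc_auc_score

    def molecule_score(site_probabilities):
        """Aggregate per-atom reactivity probabilities into one molecule-level score.
        Taking the max reflects that a single reactive site suffices to form an adduct."""
        return max(site_probabilities)

    # Illustrative cross-validated outputs: per-molecule lists of site probabilities
    # and molecule-level labels (1 = electrophilically reactive, 0 = nonreactive).
    site_preds = [[0.02, 0.91, 0.10], [0.05, 0.07], [0.40, 0.88, 0.12, 0.03]]
    labels = [1, 0, 1]

    mol_scores = [molecule_score(p) for p in site_preds]
    print("molecule-level AUC:", roc_auc_score(labels, mol_scores))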

    Understanding and mitigating the impact of race with adversarial autoencoders

    BACKGROUND: Artificial intelligence carries the risk of exacerbating some of our most challenging societal problems, but it also has the potential to mitigate and address these problems. The confounding effects of race on machine learning are an ongoing subject of research. This study aims to mitigate the impact of race on data-derived models using an adversarial variational autoencoder (AVAE). In this study, race is a self-reported feature. Race is often excluded as an input variable; however, due to its high correlation with several other variables, race is still implicitly encoded in the data. METHODS: We propose building a model that (1) learns a low-dimensional latent space, (2) employs an adversarial training procedure that ensures its latent space does not encode race, and (3) retains the information necessary for reconstructing the data. We train the autoencoder to ensure the latent space does not indirectly encode race. RESULTS: In this study, the AVAE successfully removes information about race from the latent space (AUC ROC = 0.5). In contrast, latent spaces constructed using other approaches still allow the reconstruction of race with high fidelity. The AVAE's latent space does not encode race but conveys important information required to reconstruct the dataset. Furthermore, the AVAE's latent space does not predict variables related to race. CONCLUSIONS: Though we constructed a race-independent latent space, any variable could be similarly controlled. We expect AVAEs are one of many approaches that will be required to effectively manage and understand bias in ML.
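
    A minimal sketch of the adversarial training idea in PyTorch: an autoencoder is trained for reconstruction while a separate adversary tries to predict self-reported race from the latent code, and the encoder is penalized whenever the adversary succeeds. The network sizes, loss weights, and the simple loss-negation trick are illustrative assumptions, and the variational (KL) term of an AVAE is omitted for brevity.

    import torch
    import torch.nn as nn

    d_in, d_latent = 64, 8  # illustrative sizes

    encoder = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_latent))
    decoder = nn.Sequential(nn.Linear(d_latent, 32), nn.ReLU(), nn.Linear(32, d_in))
    adversary = nn.Sequential(nn.Linear(d_latent, 16), nn.ReLU(), nn.Linear(16, 1))

    opt_ae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-3)
    recon_loss, adv_loss = nn.MSELoss(), nn.BCEWithLogitsLoss()

    def training_step(x, race):
        """One step: (1) the adversary learns to predict race from the latent code,
        (2) the encoder/decoder learn to reconstruct x while pushing the adversary
        back toward chance (the AUC ROC = 0.5 reported in the abstract)."""
        # (1) update the adversary on a detached latent code
        z = encoder(x).detach()
        loss_adv = adv_loss(adversary(z), race)
        opt_adv.zero_grad(); loss_adv.backward(); opt_adv.step()

        # (2) update the autoencoder: reconstruct well, but make race unpredictable
        z = encoder(x)
        loss_rec = recon_loss(decoder(z), x)
        loss_fool = -adv_loss(adversary(z), race)  # maximize the adversary's error
        loss = loss_rec + 0.1 * loss_fool          # 0.1 is an illustrative weight
        opt_ae.zero_grad(); loss.backward(); opt_ae.step()
        return loss_rec.item(), loss_adv.item()

    x = torch.randn(128, d_in)                    # stand-in for patient features
    race = torch.randint(0, 2, (128, 1)).float()  # stand-in for self-reported race
    print(training_step(x, race))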

    Performance evaluation of flexible manufacturing systems under uncertain and dynamic situations

    The present era demands efficient modelling of any manufacturing system so that it can cope with unforeseen situations on the shop floor. One of the complex issues affecting the performance of manufacturing systems is the scheduling of part types. In this paper, the authors attempt to overcome the impact of uncertainties such as machine breakdowns and deadlocks by inserting slack that can absorb these disruptions without affecting the other scheduled activities. The impact of flexibilities in this scenario is also investigated. The objective functions are formulated so that a better trade-off between the uncertainties and flexibilities can be established. Considering automated guided vehicles (AGVs) in this scenario improves the loading and unloading of part types. A comprehensive literature survey revealed the supremacy of random search algorithms in evaluating the performance of these types of dynamic manufacturing systems. The authors use a metaheuristic known as the quick convergence simulated annealing (QCSA) algorithm to resolve the dynamic manufacturing scenario. The metaheuristic employs a Cauchy distribution as its probability function, which helps it escape local minima more effectively. Various machine breakdown scenarios are generated. A 'heuristic gap' is measured, indicating the effectiveness of the proposed methodology across varying problem complexities. Statistical validation is also carried out to confirm the effectiveness of the proposed approach. The efficacy of the proposed approach is also compared with that of deterministic priority rules.
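
    A minimal sketch of the ingredient singled out in the abstract: a simulated-annealing loop whose candidate moves are drawn from a heavy-tailed Cauchy distribution, allowing occasional long jumps out of local minima. The toy continuous objective, cooling schedule, and parameter values are illustrative assumptions, not the authors' QCSA formulation, which operates on discrete scheduling decisions.

    import math
    import random

    def objective(x):
        """Toy multimodal objective standing in for a makespan or tardiness measure."""
        return x * x + 10 * math.sin(3 * x)

    def cauchy_annealing(x0, t0=5.0, cooling=0.99, steps=5000):
        x, best = x0, x0
        t = t0
        for _ in range(steps):
            # Heavy-tailed Cauchy move: mostly small steps, occasionally a long jump
            step = t * math.tan(math.pi * (random.random() - 0.5))
            candidate = x + step
            delta = objective(candidate) - objective(x)
            # Metropolis rule: always accept improvements, sometimes accept worse moves
            if delta < 0 or random.random() < math.exp(-delta / t):
                x = candidate
                if objective(x) < objective(best):
                    best = x
            t *= cooling  # geometric cooling (illustrative schedule)
        return best

    print(cauchy_annealing(x0=8.0))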

    Opportunities and obstacles for deep learning in biology and medicine

    Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solving problems in these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes, and treatment of patients) and discuss whether deep learning will be able to transform these tasks or whether the biomedical sphere poses unique challenges. Following an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made in linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.

    Large scale study of multiple-molecule queries

    Background: In ligand-based screening, as well as in other chemoinformatics applications, one seeks to effectively search large repositories of molecules in order to retrieve molecules that are similar to a query, typically a single lead molecule. In some cases, however, multiple molecules from the same family are available to seed the query and search for other members of the same family. Multiple-molecule query methods have been less studied than single-molecule query methods. Furthermore, previous studies have relied on proprietary data and sometimes have not used proper cross-validation methods to assess the results. In contrast, here we develop and compare multiple-molecule query methods using several large publicly available data sets and backgrounds. We also create a framework based on a strict cross-validation protocol to allow unbiased benchmarking for direct comparison in future studies across several performance metrics. Results: Fourteen different multiple-molecule query methods were defined and benchmarked using: (1) 41 publicly available data sets of related molecules with similar biological activity; and (2) publicly available background data sets consisting of up to 175,000 molecules randomly extracted from the ChemDB database and other sources. Eight of the fourteen methods were parameter free, and six of them fit one or two free parameters to the data using a careful cross-validation protocol. All the methods were assessed and compared for their ability to retrieve members of the same family against the background data set using several performance metrics, including the Area Under the Accumulation Curve (AUAC), Area Under the Curve (AUC), F1-measure, and BEDROC metrics. Consistent with the previous literature, the best parameter-free methods are the MAX-SIM and MIN-RANK methods, which score a molecule to a family by the maximum similarity, or minimum ranking, obtained across the family. One new parameterized method introduced in this study and two previously defined methods, the Exponential Tanimoto Discriminant (ETD), the Tanimoto Power Discriminant (TPD), and the Binary Kernel Discriminant (BKD), outperform most other methods but are more complex, requiring one or two parameters to be fit to the data. Conclusion: Fourteen methods for multiple-molecule querying of chemical databases, including the novel ETD and TPD methods, are validated using publicly available data sets, standard cross-validation protocols, and established metrics. The best results are obtained with ETD, TPD, BKD, MAX-SIM, and MIN-RANK. These results can be replicated and compared with the results of future studies using data freely downloadable from http://cdb.ics.uci.edu/.
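
    A minimal sketch of the two best parameter-free scorers named in the results, MAX-SIM and MIN-RANK, using Tanimoto similarity on fingerprint bit sets. The fingerprints and data below are illustrative stand-ins, not the study's benchmark data.

    def tanimoto(a, b):
        """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def max_sim_score(candidate, family):
        """MAX-SIM: score a database molecule by its best similarity to any family member."""
        return max(tanimoto(candidate, q) for q in family)

    def min_rank_scores(database, family):
        """MIN-RANK: for each database molecule, take its best (lowest) rank across the
        single-query rankings produced by each family member; lower is better."""
        best_rank = {i: len(database) for i in range(len(database))}
        for q in family:
            order = sorted(range(len(database)),
                           key=lambda i: tanimoto(database[i], q), reverse=True)
            for rank, i in enumerate(order):
                best_rank[i] = min(best_rank[i], rank)
        return best_rank

    # Illustrative fingerprints: sets of "on" bit indices
    family = [{1, 4, 9, 12}, {1, 4, 8, 12}]
    database = [{1, 4, 9, 13}, {2, 5, 7}, {1, 4, 8, 12, 15}]
    print([max_sim_score(m, family) for m in database])
    print(min_rank_scores(database, family))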

    The role of manufacturing and marketing managers in strategy development: lessons from three companies

    According to researchers and managers, there is a lack of agreement between marketing and manufacturing managers on critical strategic issues. However, most of the literature on the subject is anecdotal, and little formal empirical research has been done. Three companies are investigated to study the extent of agreement and disagreement between manufacturing and marketing managers on strategy content and process. A novel method permits the study of agreement between the two groups of functional managers on the process of developing strategy. The findings consistently show that manufacturing managers operate under a wider range of strategic priorities than marketing managers, and that manufacturing managers participate less than marketing managers in the strategy development process. Further, both marketing and manufacturing managers show higher involvement in the strategy development process in the later stages of the Hayes and Wheelwright four-stage model of manufacturing's strategic role.

    Development and validation of a deep learning model to quantify glomerulosclerosis in kidney biopsy specimens

    Importance: A chronic shortage of donor kidneys is compounded by a high discard rate, and this rate is directly associated with biopsy specimen evaluation, which shows poor reproducibility among pathologists. A deep learning algorithm for measuring percent global glomerulosclerosis (an important predictor of outcome) on images of kidney biopsy specimens could enable pathologists to quantify percent global glomerulosclerosis more reproducibly and accurately, potentially saving organs that would otherwise have been discarded. Objective: To compare the performance of pathologists with that of a deep learning model on quantification of percent global glomerulosclerosis in whole-slide images of donor kidney biopsy specimens, and to determine the potential benefit of a deep learning model on organ discard rates. Design, Setting, and Participants: This prognostic study used whole-slide images acquired from 98 hematoxylin-eosin-stained frozen and 51 permanent donor biopsy specimen sections retrieved from 83 kidneys. Serial annotation by 3 board-certified pathologists served as ground truth for model training and evaluation. Images of kidney biopsy specimens were obtained from the Washington University database (retrieved between June 2015 and June 2017). Cases were selected randomly from a database of more than 1000 cases to include biopsy specimens representing an equitable distribution within 0% to 5%, 6% to 10%, 11% to 15%, 16% to 20%, and more than 20% global glomerulosclerosis. Main Outcomes and Measures: Correlation coefficient (r) and root-mean-square error (RMSE) with respect to annotations were computed for cross-validated model predictions and on-call pathologists' estimates of percent global glomerulosclerosis when using individual and pooled slide results. Data were analyzed from March 2018 to August 2020. Results: The cross-validated model results on section images retrieved from 83 donor kidneys showed higher correlation with annotations (r = 0.916; 95% CI, 0.886-0.939) than on-call pathologists' estimates did (r = 0.884; 95% CI, 0.825-0.923), a correlation that was further enhanced when pooling glomeruli counts from multiple levels (r = 0.933; 95% CI, 0.898-0.956). Model prediction error for single levels (RMSE, 5.631; 95% CI, 4.735-6.517) was 14% lower than that of on-call pathologists (RMSE, 6.523; 95% CI, 5.191-7.783), improving to 22% lower with multiple levels (RMSE, 5.094; 95% CI, 3.972-6.301). The model decreased the likelihood of unnecessary organ discard by 37% compared with pathologists. Conclusions and Relevance: The findings of this prognostic study suggest that this deep learning model provides a scalable and robust method to quantify percent global glomerulosclerosis in whole-slide images of donor kidneys. The model's performance improved when analyzing multiple levels of a section, surpassing the capacity of pathologists in the time-sensitive setting of examining donor biopsy specimens. The results indicate the potential of a deep learning model to prevent erroneous donor organ discard.
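
    A minimal sketch of the reported evaluation: percent global glomerulosclerosis is the fraction of globally sclerosed glomeruli among all glomeruli detected on a slide, and model output is compared with pathologist annotation by Pearson r and RMSE. The counts and libraries below are illustrative stand-ins, not the study data or pipeline.

    import numpy as np
    from scipy.stats import pearsonr

    def percent_global_sclerosis(sclerosed, total):
        """Percent global glomerulosclerosis for one biopsy level."""
        return 100.0 * sclerosed / total

    # Illustrative per-slide (sclerosed, total) glomeruli counts
    model_counts = [(1, 20), (3, 18), (0, 25), (5, 22)]
    annotation_counts = [(1, 21), (2, 18), (0, 24), (6, 22)]

    pred = np.array([percent_global_sclerosis(s, t) for s, t in model_counts])
    true = np.array([percent_global_sclerosis(s, t) for s, t in annotation_counts])

    r, _ = pearsonr(pred, true)
    rmse = float(np.sqrt(np.mean((pred - true) ** 2)))
    print(f"r = {r:.3f}, RMSE = {rmse:.3f}")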