1,299 research outputs found
Scalable Similarity Search for Molecular Descriptors
Similarity search over chemical compound databases is a fundamental task in
the discovery and design of novel drug-like molecules. Such databases often
encode molecules as non-negative integer vectors, called molecular descriptors,
which represent rich information on various molecular properties. While there
exist efficient indexing structures for searching databases of binary vectors,
solutions for more general integer vectors are in their infancy. In this paper
we present a time- and space-efficient index for the problem that we call the
succinct intervals-splitting tree algorithm for molecular descriptors (SITAd).
Our approach extends efficient methods for binary-vector databases, and uses
ideas from succinct data structures. Our experiments, on a large database of
over 40 million compounds, show SITAd significantly outperforms alternative
approaches in practice. Comment: To appear in the Proceedings of SISAP'1
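The abstract above does not name the similarity measure or describe the index itself. As a minimal sketch of the underlying search problem only, the following brute-force baseline assumes the generalized Jaccard (MinMax) similarity, a common choice for non-negative integer vectors. The names `minmax_similarity` and `linear_scan` are hypothetical, and this is not the SITAd index.

```python
from typing import Dict, List, Tuple

# Sparse non-negative integer vector: feature id -> count (assumed encoding).
Descriptor = Dict[int, int]


def minmax_similarity(a: Descriptor, b: Descriptor) -> float:
    """Generalized Jaccard similarity: sum of minima over sum of maxima."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    # Convention chosen here: two all-zero vectors are treated as identical.
    return num / den if den else 1.0


def linear_scan(query: Descriptor, database: List[Descriptor],
                threshold: float) -> List[Tuple[int, float]]:
    """Return (index, similarity) for every entry at or above the threshold."""
    scored = ((i, minmax_similarity(query, d)) for i, d in enumerate(database))
    return [(i, s) for i, s in scored if s >= threshold]
```

A scan like this touches every database entry for each query, which is the cost an index structure is meant to avoid.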
Prediction of Hydrate and Solvate Formation Using Statistical Models
Novel, knowledge-based models for the prediction of hydrate and solvate formation are introduced, which require only the molecular formula as input. A data set of more than 19 000 organic, nonionic, and nonpolymeric molecules was extracted from the Cambridge Structural Database. Molecules that formed solvates were compared with those that did not using molecular descriptors and statistical methods, which allowed the identification of chemical properties that contribute to solvate formation. The study was conducted for five types of solvates: ethanol, methanol, dichloromethane, chloroform, and water solvates. The identified properties were all related to the size and branching of the molecules and to their hydrogen-bonding ability. The corresponding molecular descriptors were used to fit logistic regression models that predict the probability that a given molecule forms a solvate. The established models correctly predicted the behavior of ∼80% of the data using only two descriptors in the predictive model.
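The abstract reports two-descriptor logistic regression models but gives neither the descriptors nor the fitted coefficients. The sketch below shows only the general form of such a model. The function name and the parameters `b0`, `b1`, `b2` are placeholders that the caller must supply, not values from the study.

```python
import math


def solvate_probability(d1: float, d2: float,
                        b0: float, b1: float, b2: float) -> float:
    """Logistic model with two descriptors: p = 1 / (1 + exp(-(b0 + b1*d1 + b2*d2))).

    d1, d2 are descriptor values; b0, b1, b2 are fitted coefficients.
    None of these are given in the abstract, so all are caller-supplied.
    """
    z = b0 + b1 * d1 + b2 * d2
    return 1.0 / (1.0 + math.exp(-z))
```

A molecule would be classified as solvate-forming when the returned probability exceeds a chosen cutoff, with 0.5 being the usual default.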
GTI-space: the space of generalized topological indices
A new extension of the generalized topological indices (GTI) approach is carried out to represent 'simple' and 'composite' topological indices (TIs) in a unified way. This approach defines a GTI-space of which both simple and composite TIs represent particular subspaces. Accordingly, simple TIs such as the Wiener, Balaban, Zagreb, Harary and Randić connectivity indices are expressed by means of the same GTI representation introduced for composite TIs such as the hyper-Wiener index, molecular topological index (MTI), Gutman index and reverse MTI. Using the GTI-space approach we easily identify mathematical relations between some composite and simple indices, such as the relationship between the hyper-Wiener and Wiener indices and the relation between the MTI and the first Zagreb index. The relation of the GTI-space with the sub-structural cluster expansion of property/activity is also analysed, and some routes for applying this approach to QSPR/QSAR are given.
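For readers unfamiliar with the simple indices named above, here is a minimal sketch using the standard textbook definitions of two of them: the Wiener index (sum of shortest-path distances over all unordered vertex pairs) and the first Zagreb index (sum of squared vertex degrees). It does not implement the GTI representation itself. The adjacency-list encoding and function names are assumptions, and the graph is assumed to be connected.

```python
from collections import deque
from typing import Dict, List

Graph = Dict[int, List[int]]  # hydrogen-suppressed molecular graph as adjacency lists


def bfs_distances(adj: Graph, source: int) -> Dict[int, int]:
    """Shortest-path (bond-count) distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def wiener_index(adj: Graph) -> int:
    """Sum of distances over all unordered vertex pairs (connected graph assumed)."""
    total = sum(sum(bfs_distances(adj, s).values()) for s in adj)
    return total // 2  # each pair was counted from both endpoints


def first_zagreb_index(adj: Graph) -> int:
    """Sum of squared vertex degrees."""
    return sum(len(neighbours) ** 2 for neighbours in adj.values())


n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}      # path of 4 carbons
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}     # central carbon with 3 neighbours

print(wiener_index(n_butane), first_zagreb_index(n_butane))    # 10 10
print(wiener_index(isobutane), first_zagreb_index(isobutane))  # 9 12
```

The two isomers show how both indices respond to branching: the more compact isobutane has the smaller Wiener index and the larger first Zagreb index.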
Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection
The estimation of the accuracy of predictions is a critical problem in QSAR modeling. The "distance to model" can be defined as a metric that defines the similarity between the training set molecules and the test set compound for the given property in the context of a specific model. It can be expressed in many different ways, e.g., using the Tanimoto coefficient, leverage, correlation in the space of models, etc. In this paper we have used mixtures of Gaussian distributions as well as statistical tests to evaluate six types of distances to models with respect to their ability to discriminate compounds with small and large prediction errors. The analysis was performed for twelve QSAR models of aqueous toxicity against T. pyriformis obtained with different machine-learning methods and various types of descriptors. The distance to model based on the standard deviation of predicted toxicity calculated from the ensemble of models afforded the best results. This distance also successfully discriminated molecules with small and large prediction errors for a mechanism-based model developed using log P and the Maximum Acceptor Superdelocalizability descriptors. Thus, the distance to model metric could also be used to augment mechanistic QSAR models by estimating their prediction errors. Moreover, the accuracy of prediction is mainly determined by the training set data distribution in the chemistry and activity spaces but not by the QSAR approaches used to develop the models. We have shown that incorrect validation of a model may result in the wrong estimation of its performance and suggested how this problem could be circumvented. The toxicity of 3182 and 48774 molecules from the EPA High Production Volume (HPV) Challenge Program and EINECS (European chemical Substances Information System), respectively, was predicted, and the accuracy of prediction was estimated. The developed models are available online at the http://www.qspr.org site.
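The best-performing distance to model in the abstract above is the standard deviation of the predictions made by an ensemble of models. A minimal sketch of that idea follows. The function names are hypothetical, the choice of population rather than sample standard deviation is an assumption, and the cutoff is a caller-supplied value, not one taken from the paper.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def ensemble_prediction(predictions: Sequence[float]) -> Tuple[float, float]:
    """Return (consensus prediction, distance to model) for one compound.

    predictions: the value predicted for the compound by each ensemble member.
    The distance to model is the spread of those predictions; a large spread
    suggests the compound lies far from the training data.
    """
    return mean(predictions), pstdev(predictions)


def is_reliable(predictions: Sequence[float], max_std: float) -> bool:
    """Flag a prediction as reliable when the ensemble spread is within max_std."""
    _, spread = ensemble_prediction(predictions)
    return spread <= max_std
```

In practice the cutoff would be calibrated on validation data, so that compounds below it show acceptably small prediction errors.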
Peripheral T-cell lymphoma unspecified (PTCL-U): a new prognostic model from a retrospective multicentric clinical study
To assess the prognosis of peripheral T-cell lymphoma unspecified, we retrospectively analyzed 385 cases fulfilling the criteria defined by the World Health Organization classification. Factors associated with a worse overall survival (OS) in a univariate analysis were age older than 60 years (P=.0002), 2 or more extranodal sites (P=.0002), lactic dehydrogenase (LDH) value at normal levels or above (P<.0001), performance status (PS) of 2 or more (P≤.0001), stage III or higher (P=.0001), and bone marrow involvement (P=.0001). Multivariate analysis showed that age (relative risk, 1.732; 95% CI, 1.300-2.309; P<.0001), PS (relative risk, 1.719; 95% CI, 1.269-2.327; P<.0001), LDH level (relative risk, 1.905; 95% CI, 1.415-2.564; P<.0001), and bone marrow involvement (relative risk, 1.454; 95% CI, 1.045-2.023; P=.026) were factors independently predictive for survival. Using these 4 variables we constructed a new prognostic model that singled out 4 groups at different risk: group 1, no adverse factors, with 5-year and 10-year OS of 62.3% and 54.9%, respectively; group 2, one factor, with 5-year and 10-year OS of 52.9% and 38.8%, respectively; group 3, 2 factors, with 5-year and 10-year OS of 32.9% and 18.0%, respectively; group 4, 3 or 4 factors, with 5-year and 10-year OS of 18.3% and 12.6%, respectively (P≤.0001; log-rank, 66.79).
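The grouping rule described above amounts to counting adverse factors. A minimal sketch follows: the class and function names are hypothetical, each factor is passed in as an already-evaluated boolean, and the survival figures are copied from the abstract. It illustrates the rule and is not a clinical tool.

```python
from dataclasses import dataclass
from typing import Tuple

# (5-year OS %, 10-year OS %) per risk group, as reported in the abstract.
OS_BY_GROUP = {1: (62.3, 54.9), 2: (52.9, 38.8), 3: (32.9, 18.0), 4: (18.3, 12.6)}


@dataclass
class Patient:
    age_over_60: bool
    performance_status_ge_2: bool
    ldh_adverse: bool           # LDH meets the abstract's adverse criterion
    bone_marrow_involved: bool


def risk_group(p: Patient) -> int:
    """Group 1: no factors; group 2: one; group 3: two; group 4: three or four."""
    n_factors = sum([p.age_over_60, p.performance_status_ge_2,
                     p.ldh_adverse, p.bone_marrow_involved])
    return min(n_factors + 1, 4)


def reported_survival(p: Patient) -> Tuple[float, float]:
    """Look up the reported 5-year and 10-year OS for the patient's group."""
    return OS_BY_GROUP[risk_group(p)]
```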
Pneumocystis carinii pneumonia in patients with malignant haematological diseases: 10 years' experience of infection in GIMEMA centres.
A retrospective survey was conducted over a 10-year period (1990-99) among 52 haematology divisions in order to evaluate the clinical and laboratory characteristics and outcome of patients with proven Pneumocystis carinii pneumonia (PCP) complicating haematological diseases. The study included 55 patients (18 with non-Hodgkin's lymphoma, 10 with acute lymphoblastic leukaemia, eight with acute myeloid leukaemia, five with chronic myeloid leukaemia, four with chronic lymphocytic leukaemia, four with multiple myeloma, three with myelodysplastic syndrome, two with myelofibrosis and one with thalassemia) who developed PCP. Among these, 18 (33%) underwent stem cell transplantation; only two received oral prophylaxis with trimethoprim/sulphamethoxazole. Twelve patients (22%) developed PCP despite protective isolation in a laminar airflow room. The most frequent symptoms were: fever (86%), dyspnoea (78%), non-productive cough (71%), thoracic pain (14%) and chills (5%); severe hypoxaemia was present in 39 patients (71%). Chest radiography or computerized tomography showed interstitial infiltrates in 34 patients (62%), alveolar infiltrates in 12 patients (22%), and alveolar-interstitial infiltrates in nine patients (16%). Bronchoalveolar lavage was diagnostic in 47/48 patients, induced sputum in 9/18 patients and lung biopsy in 3/8 patients. The diagnosis was made in two patients at autopsy. All patients except one started a specific treatment (52 patients trimethoprim/sulphamethoxazole, one pentamidine and one dapsone). Sixteen patients (29%) died of PCP within 30 d of diagnosis. Multivariate analysis showed that prolonged steroid treatment (P < 0.006) and a radiological picture of diffuse lung involvement (P < 0.003) were negative diagnostic factors.
Modeling complex metabolic reactions, ecological systems, and financial and legal networks with MIANN models based on Markov-Wiener node descriptors
The use of numerical parameters in Complex Network analysis is expanding to new fields of application. At a molecular level, we can use them to describe the molecular structure of chemical entities, protein interactions, or metabolic networks. However, the applications are not restricted to the world of molecules and can be extended to the study of macroscopic nonliving systems, organisms, or even legal or social networks. On the other hand, the development of the field of Artificial Intelligence has led to the formulation of computational algorithms whose design is based on the structure and functioning of networks of biological neurons. These algorithms, called Artificial Neural Networks (ANNs), can be useful for the study of complex networks, since the numerical parameters that encode information of the network (for example centralities/node descriptors) can be used as inputs for the ANNs. The Wiener index (W) is a graph invariant widely used in chemoinformatics to quantify the molecular structure of drugs and to study complex networks. In this work, we explore for the first time the possibility of using Markov chains to calculate analogues of node distance numbers/W to describe complex networks from the point of view of their nodes. These parameters are called Markov-Wiener node descriptors of order k (Wk). Note that these descriptors are not related to Markov-Wiener stochastic processes. Here, we calculated the Wk(i) values for a very high number of nodes (>100,000) in more than 100 different complex networks using the software MI-NODES. These networks were grouped according to the field of application. Molecular networks include the Metabolic Reaction Networks (MRNs) of 40 different organisms. In addition, we analyzed other biological, legal, and social networks. These include the Interaction Web Database Biological Networks (IWDBNs), with 75 food webs or ecological systems, and the Spanish Financial Law Network (SFLN).
The calculated Wk(i) values were used as inputs for different ANNs in order to discriminate correct node connectivity patterns from incorrect random patterns. The MIANN models obtained present good values of Sensitivity/Specificity (%): MRNs (78/78), IWDBNs (90/88), and SFLN (86/84). These preliminary results are very promising from the point of view of a first exploratory study and suggest that the use of these models could be extended to the high-throughput re-evaluation of connectivity in known complex networks (collation).
The use of 2D fingerprint methods to support the assessment of structural similarity in orphan drug legislation.
In the European Union, medicines are authorised for a rare disease only if they are judged to be dissimilar to authorised orphan drugs for that disease. This paper describes the use of 2D fingerprints to show the extent of the relationship between computed levels of structural similarity for pairs of molecules and expert judgments of the similarities of those pairs. The resulting relationship can be used to provide input to the assessment of new active compounds for which orphan drug authorisation is being sought.
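The abstract does not say which similarity coefficient was computed from the 2D fingerprints. As a minimal sketch, assuming the widely used Tanimoto coefficient and fingerprints represented as sets of "on" bit positions (both assumptions), the computed similarity for a pair of molecules could look like this:

```python
from typing import Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by on-bits set in either fingerprint."""
    union = len(fp_a | fp_b)
    # Convention chosen here: two empty fingerprints are treated as identical.
    return len(fp_a & fp_b) / union if union else 1.0


# Toy fingerprints (hypothetical bit positions, not real molecules).
print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5
```

Values range from 0 (no bits in common) to 1 (identical fingerprints). Mapping such scores onto expert judgments of similarity is the kind of relationship the paper examines.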
ANN multiscale model of anti-HIV Drugs activity vs AIDS prevalence in the US at county level based on information indices of molecular graphs and social networks
This work is aimed at describing the workflow for a methodology that combines chemoinformatics and pharmacoepidemiology methods and at reporting the first predictive model developed with this methodology. The new model is able to predict complex networks of AIDS prevalence in the US counties, taking into consideration the social determinants and activity/structure of anti-HIV drugs in preclinical assays. We trained different Artificial Neural Networks (ANNs) using as input information indices of social networks and molecular graphs. We used a Shannon information index based on the Gini coefficient to quantify the effect of income inequality in the social network. We obtained the data on AIDS prevalence and the Gini coefficient from the AIDSVu database of Emory University. We also used the Balaban information indices to quantify changes in the chemical structure of anti-HIV drugs. We obtained the data on anti-HIV drug activity and structure (SMILES codes) from the ChEMBL database. Last, we used Box-Jenkins moving average operators to quantify information about the deviations of drugs with respect to data subsets of reference (targets, organisms, experimental parameters, protocols). The best model found was a Linear Neural Network (LNN) with values of Accuracy, Specificity, and Sensitivity above 0.76 and AUROC > 0.80 in training and external validation series. This model generates a complex network of AIDS prevalence in the US at county level with respect to the activity of anti-HIV drugs in preclinical assays. To train/validate the model and predict the complex network we needed to analyze 43,249 data points including values of AIDS prevalence in 2,310 counties in the US vs ChEMBL results for 21,582 unique drugs, 9 viral or human protein targets, 4,856 protocols, and 10 possible experimental measures. Ministerio de Educación, Cultura y Deportes; AGL2011-30563-C03-0
