1,238 research outputs found
RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning
Anonymized electronic medical records are an increasingly popular source of
research data. However, these datasets often lack race and ethnicity
information. This creates problems for researchers modeling human disease, as
race and ethnicity are powerful confounders for many health exposures and
treatment outcomes; race and ethnicity are closely linked to
population-specific genetic variation. We showed that deep neural networks
generate more accurate estimates for missing racial and ethnic information than
competing methods (e.g., logistic regression, random forest). RIDDLE yielded
significantly better classification performance across all metrics that were
considered: accuracy, cross-entropy loss (error), and area under the curve for
receiver operating characteristic plots (all ). We made specific
efforts to interpret the trained neural network models to identify, quantify,
and visualize medical features which are predictive of race and ethnicity. We
used these characterizations of informative features to perform a systematic
comparison of differential disease patterns by race and ethnicity. The fact
that clinical histories are informative for imputing race and ethnicity could
reflect (1) a skewed distribution of blue- and white-collar professions across
racial and ethnic groups, (2) uneven accessibility and subjective importance of
prophylactic health, (3) possible variation in lifestyle, such as dietary
habits, and (4) differences in background genetic variation which predispose to
diseases
Representation of probabilistic scientific knowledge
This article is available through the Brunel Open Access Publishing Fund. Copyright © 2013 Soldatova et al; licensee BioMed Central Ltd.The theory of probability is widely used in biomedical research for data analysis and modelling. In previous work the probabilities of the research hypotheses have been recorded as experimental metadata. The ontology HELO is designed to support probabilistic reasoning, and provides semantic descriptors for reporting on research that involves operations with probabilities. HELO explicitly links research statements such as hypotheses, models, laws, conclusions, etc. to the associated probabilities of these statements being true. HELO enables the explicit semantic representation and accurate recording of probabilities in hypotheses, as well as the inference methods used to generate and update those hypotheses. We demonstrate the utility of HELO on three worked examples: changes in the probability of the hypothesis that sirtuins regulate human life span; changes in the probability of hypotheses about gene functions in the S. cerevisiae aromatic amino acid pathway; and the use of active learning in drug design (quantitative structure activity relation learning), where a strategy for the selection of compounds with the highest probability of improving on the best known compound was used. HELO is open source and available at https://github.com/larisa-soldatova/HELO.This work was partially supported by grant BB/F008228/1 from the UK Biotechnology & Biological Sciences Research Council, from the European Commission under the FP7 Collaborative Programme, UNICELLSYS, KU Leuven GOA/08/008 and ERC Starting Grant 240186
Food-chain competition influences gene's size
We have analysed an effect of the Bak-Sneppen predator-prey food-chain
self-organization on nucleotide content of evolving species. In our model,
genomes of the species under consideration have been represented by their
nucleotide genomic fraction and we have applied two-parameter Kimura model of
substitutions to include the changes of the fraction in time. The initial
nucleotide fraction and substitution rates were decided with the help of random
number generator. Deviation of the genomic nucleotide fraction from its
equilibrium value was playing the role of the fitness parameter, , in
Bak-Sneppen model. Our finding is, that the higher is the value of the
threshold fitness, during the evolution course, the more frequent are large
fluctuations in number of species with strongly differentiated nucleotide
content; and it is more often the case that the oldest species, which survive
the food-chain competition, might have specific nucleotide fraction making
possible generating long genesComment: 11 pages including 7 figure
A dynamic network approach for the study of human phenotypes
The use of networks to integrate different genetic, proteomic, and metabolic
datasets has been proposed as a viable path toward elucidating the origins of
specific diseases. Here we introduce a new phenotypic database summarizing
correlations obtained from the disease history of more than 30 million patients
in a Phenotypic Disease Network (PDN). We present evidence that the structure
of the PDN is relevant to the understanding of illness progression by showing
that (1) patients develop diseases close in the network to those they already
have; (2) the progression of disease along the links of the network is
different for patients of different genders and ethnicities; (3) patients
diagnosed with diseases which are more highly connected in the PDN tend to die
sooner than those affected by less connected diseases; and (4) diseases that
tend to be preceded by others in the PDN tend to be more connected than
diseases that precede other illnesses, and are associated with higher degrees
of mortality. Our findings show that disease progression can be represented and
studied using network methods, offering the potential to enhance our
understanding of the origin and evolution of human diseases. The dataset
introduced here, released concurrently with this publication, represents the
largest relational phenotypic resource publicly available to the research
community.Comment: 28 pages (double space), 6 figure
Recommended from our members
DiseaseConnect: a comprehensive web server for mechanism-based disease–disease connections
The DiseaseConnect (http://disease-connect.org) is a web server for analysis and visualization of a comprehensive knowledge on mechanism-based disease connectivity. The traditional disease classification system groups diseases with similar clinical symptoms and phenotypic traits. Thus, diseases with entirely different pathologies could be grouped together, leading to a similar treatment design. Such problems could be avoided if diseases were classified based on their molecular mechanisms. Connecting diseases with similar pathological mechanisms could inspire novel strategies on the effective repositioning of existing drugs and therapies. Although there have been several studies attempting to generate disease connectivity networks, they have not yet utilized the enormous and rapidly growing public repositories of disease-related omics data and literature, two primary resources capable of providing insights into disease connections at an unprecedented level of detail. Our DiseaseConnect, the first public web server, integrates comprehensive omics and literature data, including a large amount of gene expression data, Genome-Wide Association Studies catalog, and text-mined knowledge, to discover disease–disease connectivity via common molecular mechanisms. Moreover, the clinical comorbidity data and a comprehensive compilation of known drug–disease relationships are additionally utilized for advancing the understanding of the disease landscape and for facilitating the mechanism-based development of new drug treatments
Data incongruence and the problem of avian louse phylogeny
Recent studies based on different types of data (i.e. morphological and molecular) have supported conflicting phylogenies for the genera of avian feather lice (Ischnocera: Phthiraptera). We analyse new and published data from morphology and from mitochondrial (12S rRNA and COI) and nuclear (EF1-) genes to explore the sources of this incongruence and explain these conflicts. Character convergence, multiple substitutions at high divergences, and ancient radiation over a short period of time have contributed to the problem of resolving louse phylogeny with the data currently available. We show that apparent incongruence between the molecular datasets is largely attributable to rate variation and nonstationarity of base composition. In contrast, highly significant character incongruence leads to topological incongruence between the molecular and morphological data. We consider ways in which biases in the sequence data could be misleading, using several maximum likelihood models and LogDet corrections. The hierarchical structure of the data is explored using likelihood mapping and SplitsTree methods. Ultimately, we concede there is strong discordance between the molecular and morphological data and apply the conditional combination approach in this case. We conclude that higher level phylogenetic relationships within avian Ischnocera remain extremely problematic. However, consensus between datasets is beginning to converge on a stable phylogeny for avian lice, at and below the familial rank
Representation of research hypotheses
BACKGROUND: Hypotheses are now being automatically produced on an industrial scale by computers in biology, e.g. the annotation of a genome is essentially a large set of hypotheses generated by sequence similarity programs; and robot scientists enable the full automation of a scientific investigation, including generation and testing of research hypotheses. RESULTS: This paper proposes a logically defined way for recording automatically generated hypotheses in machine amenable way. The proposed formalism allows the description of complete hypotheses sets as specified input and output for scientific investigations. The formalism supports the decomposition of research hypotheses into more specialised hypotheses if that is required by an application. Hypotheses are represented in an operational way – it is possible to design an experiment to test them. The explicit formal description of research hypotheses promotes the explicit formal description of the results and conclusions of an investigation. The paper also proposes a framework for automated hypotheses generation. We demonstrate how the key components of the proposed framework are implemented in the Robot Scientist “Adam”. CONCLUSIONS: A formal representation of automatically generated research hypotheses can help to improve the way humans produce, record, and validate research hypotheses. AVAILABILITY: http://www.aber.ac.uk/en/cs/research/cb/projects/robotscientist/results
Imitating Manual Curation of Text-Mined Facts in Biomedicine
Text-mining algorithms make mistakes in extracting facts from natural-language texts. In biomedical applications, which rely on use of text-mined data, it is critical to assess the quality (the probability that the message is correctly extracted) of individual facts—to resolve data conflicts and inconsistencies. Using a large set of almost 100,000 manually produced evaluations (most facts were independently reviewed more than once, producing independent evaluations), we implemented and tested a collection of algorithms that mimic human evaluation of facts provided by an automated information-extraction system. The performance of our best automated classifiers closely approached that of our human evaluators (ROC score close to 0.95). Our hypothesis is that, were we to use a larger number of human experts to evaluate any given sentence, we could implement an artificial-intelligence curator that would perform the classification job at least as accurately as an average individual human evaluator. We illustrated our analysis by visualizing the predicted accuracy of the text-mined relations involving the term cocaine
- …
