Overlap-based undersampling method for classification of imbalanced medical datasets.
Early diagnosis of some life-threatening diseases, such as cancers and heart disease, is crucial for effective treatment. Supervised machine learning has proved to be a very useful tool for this purpose. Historical patient data, including clinical and demographic information, is used to train learning algorithms. This builds predictive models that provide initial diagnoses. However, in the medical domain, it is common for the positive class to be under-represented in a dataset. In such a scenario, a typical learning algorithm tends to be biased towards the negative class, which is the majority class, and to misclassify positive cases. This is known as the class imbalance problem. In this paper, a framework for predictive diagnostics of diseases with imbalanced records is presented. To reduce the classification bias, we propose the use of an overlap-based undersampling method to improve the visibility of minority class samples in the region where the two classes overlap. This is achieved by detecting and removing negative class instances from the overlapping region, which improves class separability in the data space. Experimental results show high accuracy on the positive class, which is highly preferable in the medical domain, while good trade-offs between sensitivity and specificity were obtained. Results also show that the method often outperformed other state-of-the-art and well-established techniques.
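The abstract does not specify how the overlapping region is detected, so the following is only a minimal sketch of the general idea, assuming a nearest-neighbour criterion: a negative (majority) instance is treated as lying in the overlap if any of its k nearest neighbours is positive. The function name `overlap_undersample`, the parameter `k`, and the toy data are illustrative assumptions, not the authors' method.

```python
import math

def overlap_undersample(X, y, k=3, majority=0, minority=1):
    """Drop majority-class points that have a minority-class point among
    their k nearest neighbours (an assumed proxy for 'lies in the overlap')."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority:
            keep.append(i)  # minority instances are always kept
            continue
        # Indices of the k nearest other points, by Euclidean distance.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        if all(y[j] != minority for j in neighbours):
            keep.append(i)  # no minority neighbour: outside the overlap
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy 1-D data (hypothetical): negatives on the left, positives on the right,
# with two negatives sitting next to a positive around x = 5.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.8,), (5.2,), (5.0,), (6.0,)]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_res, y_res = overlap_undersample(X, y, k=2)
print(X_res, y_res)
```

In this toy example the two negatives at 4.8 and 5.2 are removed, since each has the positive at 5.0 as a nearest neighbour, while all positives and the remaining negatives are kept.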
An insight into imbalanced Big Data classification: outcomes and challenges
Big Data applications have been emerging in recent years, and researchers from many disciplines are aware of the considerable advantages of extracting knowledge from this type of problem. However, traditional learning approaches cannot be directly applied due to scalability issues. To overcome this issue, the MapReduce framework has arisen as a “de facto” solution. Basically, it carries out a “divide-and-conquer” distributed procedure in a fault-tolerant way that is suited to commodity hardware. As this is still a recent discipline, little research has been conducted on imbalanced classification for Big Data. The main reason is the difficulty of adapting standard techniques to the MapReduce programming style. Additionally, intrinsic problems of imbalanced data, namely lack of data and small disjuncts, are accentuated when the data is partitioned to fit the MapReduce programming style. This paper is built on three main pillars. First, to present the first outcomes for imbalanced classification in Big Data problems, introducing the current state of research in this area. Second, to analyze the behavior of standard pre-processing techniques in this particular framework. Finally, taking into account the experimental results obtained throughout this work, we will carry out a discussion on the challenges and future directions for the topic. This work has been partially supported by the Spanish Ministry of Science and Technology under Projects TIN2014-57251-P and TIN2015-68454-R, the Andalusian Research Plan P11-TIC-7765, the Foundation BBVA Project 75/2016 BigDaPTOOLS, and the National Science Foundation (NSF) Grant IIS-1447795.
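The effect of partitioning on the minority class can be illustrated with a small single-machine sketch. It only imitates the map and reduce steps with plain Python functions and does not use any real MapReduce framework. The helper names (`partition`, `map_undersample`, `reduce_concat`), the 2% positive rate, and the choice of random undersampling as the pre-processing step are all assumptions made for illustration.

```python
import random
from collections import Counter

def partition(records, n_parts):
    """Split records round-robin into n_parts chunks (stand-in for data splits)."""
    return [records[i::n_parts] for i in range(n_parts)]

def map_undersample(chunk, rng):
    """Map step: balance one chunk locally by randomly undersampling negatives."""
    pos = [r for r in chunk if r[1] == 1]
    neg = [r for r in chunk if r[1] == 0]
    return pos + rng.sample(neg, min(len(neg), len(pos)))

def reduce_concat(chunks):
    """Reduce step: merge the locally balanced chunks into one dataset."""
    return [r for chunk in chunks for r in chunk]

rng = random.Random(0)
# 1000 hypothetical (feature, label) records, 2% of them positive.
records = [(i, 1 if i % 50 == 0 else 0) for i in range(1000)]
rng.shuffle(records)

parts = partition(records, n_parts=8)
# Each mapper sees only a handful of the 20 positives: the "lack of data" issue.
positives_per_part = [sum(label for _, label in p) for p in parts]
balanced = reduce_concat(map_undersample(p, rng) for p in parts)
label_counts = Counter(label for _, label in balanced)
print(positives_per_part, label_counts)
```

With only 20 positives spread over 8 partitions, each mapper has very few minority examples to learn from or resample, which is the kind of data scarcity the abstract says is accentuated by partitioning.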
Predicting no-show medical appointments using machine learning
Health care centers face many issues due to the limited availability of resources, such as funds, equipment, beds, physicians, and nurses. Appointment absences lead to a waste of hospital resources as well as endangering patient health. This fact makes unattended medical appointments both socially expensive and economically costly. This research aimed to build a predictive model to identify whether an appointment would be a no-show or not in order to reduce its consequences. This paper proposes a multi-stage framework to build an accurate predictor that also tackles the imbalanced property that the data exhibits. The first stage includes dimensionality reduction to compress the data into its most important components. The second stage deals with the imbalanced nature of the data. Different machine learning algorithms were used to build the classifiers in the third stage. Various evaluation metrics are also discussed and an evaluation scheme that fits the problem at hand is described. The work presented in this paper will help decision makers at health care centers to implement effective strategies to reduce the number of no-shows.
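The abstract does not name the evaluation metrics it discusses. As a hedged illustration of why the choice of metric matters for imbalanced no-show data, the sketch below computes accuracy, sensitivity, specificity, and their geometric mean (G-mean) for a hypothetical classifier that predicts "show" for every appointment. The metric selection, function names, and the 10% no-show rate are assumptions, not taken from the paper.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn) with `positive` marking the no-show class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fn, fp, tn

def imbalance_metrics(y_true, y_pred):
    """Accuracy plus metrics that look at each class separately."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on no-shows
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall on attended
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),
    }

# 100 hypothetical appointments, 10 of them no-shows (label 1).
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100  # trivial classifier: everyone is predicted to attend
metrics = imbalance_metrics(y_true, y_pred)
print(metrics)
```

The trivial classifier reaches 90% accuracy while detecting no no-shows at all, so its sensitivity and G-mean are zero. This is why accuracy alone is a poor fit for this kind of problem.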
Ultra-high dimensional variable selection with application to normative aging study: DNA methylation and metabolic syndrome
Optimal robust reinsurance-investment strategies for insurers with mean reversion and mispricing
Solid-Phase Peptide Synthesis Using a Four-Dimensional (Safety-Catch) Protecting Group Scheme
Peptides of importance to both academia and industry are mostly synthesized in the solid-phase mode using a two-dimensional scheme. The so-called Fmoc/tBu strategy, where the groups are removed by piperidine and TFA, respectively, is currently the method of choice for peptide synthesis. However, as the molecular diversity of cyclic and branched peptides becomes a challenging interest, a high level of orthogonal dimensionality is required, such as through triorthogonal protection schemes. Here we present a fourth category of orthogonal protecting groups that are stable under cleavage conditions, including the TFA treatment that removes the tBu-based groups. At the end of the synthetic process and upon some chemical manipulation, the groups in this fourth category were removed with TFA. This new concept of protecting groups could facilitate the synthesis and manipulation of difficult peptides. This work was partially funded by the National Research Foundation (NRF) (Blue Sky’s Research Program no. 120386). We thank Geraldo A. Acosta, University of Barcelona, for the HRMS and NMR characterization. Peer reviewed.
Prediction of Road Accidents’ Severity on Russian Roads Using Machine Learning Techniques
Mechanistic understanding of the relationships between molecular structure and emulsification properties of octenyl succinic anhydride (OSA) modified starches
Privacy Preserving in Data Stream Mining Using Statistical Learning Methods for Building Ensemble Classifier
