
    Case-base retrieval of childhood leukaemia patients using gene expression data

    Full text link
    University of Technology, Sydney. Faculty of Engineering and Information Technology. Acute Lymphoblastic Leukaemia (ALL) is the most common childhood malignancy. ALL is currently diagnosed by a full blood count and a bone marrow biopsy. With microarray technology, it is becoming feasible to approach the problem from a genetic point of view and to perform an individual assessment for each patient. This thesis proposes a case-base retrieval framework for ALL using a nearest neighbour classifier that can retrieve previously treated patients based on their gene expression data. However, the wealth of gene expression values generated by high throughput microarray technologies leads to complex, high dimensional datasets, and there is a critical need to apply data-mining and computational intelligence techniques to analyse these datasets efficiently. Gene expression datasets are typically noisy and have very high dimensionality. Moreover, gene expression microarray datasets often consist of a limited number of observations relative to the large number of gene expression values (thousands of genes). These characteristics adversely affect the analysis of microarray datasets and pose a challenge for building an efficient gene-based similarity model. Four problems are associated with calculating the similarity between cancer patients on the basis of their gene expression data: feature selection, dimensionality reduction, feature weighting and imbalanced classes. The main contributions of this thesis are: (i) a case-base retrieval framework, (ii) a Balanced Iterative Random Forest algorithm for feature selection, (iii) a Local Principal Component algorithm for dimensionality reduction and visualization and (iv) a Weight Learning Genetic algorithm for feature weighting. This thesis introduces the Balanced Iterative Random Forest (BIRF) algorithm for selecting the features most relevant to the disease and discarding the non-relevant genes.
Balanced Iterative Random Forest is applied to four cancer microarray datasets: the Childhood Leukaemia dataset, the Golub Leukaemia dataset, the Colon dataset and the Lung cancer dataset. The Childhood Leukaemia dataset is the main target of this project and was collected from The Children's Hospital at Westmead. Patients are classified by the cancer's risk type (Medium, Standard and High risk); Colon cancer (cancer vs. normal); Golub Leukaemia (acute lymphoblastic leukaemia vs. acute myeloid leukaemia); and Lung cancer (malignant pleural mesothelioma vs. adenocarcinoma). The results obtained by BIRF are compared to those of Support Vector Machine-Recursive Feature Elimination (SVM-RFE) and Naive Bayes (NB) classifiers. The BIRF results are competitive with these state-of-the-art methods and better in some cases. The Local Principal Component (LPC) algorithm introduced in this thesis for visualization is validated on three datasets: the Childhood Leukaemia, Swiss-roll and Iris datasets. The LPC algorithm achieves significant results in comparison to other methods, including local linear embedding and principal component analysis. This thesis also introduces a Weight Learning Genetic algorithm, based on genetic algorithms, for feature weighting in the nearest neighbour classifier. The results show that a weighted nearest neighbour classifier, with weights generated by the Weight Learning Genetic algorithm, produces better results than the unweighted nearest neighbour algorithm. Finally, this thesis applies the synthetic minority over-sampling technique (SMOTE) to increase the number of points in the minority classes and reduce the effect of imbalanced classes. The results show that the minority class becomes recognised by the nearest neighbour classifier. SMOTE also reduces the effect of imbalanced classes in predicting the class of new queries, especially when the query sample should be classified into the minority class.
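    The SMOTE step described above works by interpolating between a minority-class sample and one of its nearest minority-class neighbours. A minimal NumPy sketch of that interpolation idea (the `smote` function, its parameters and the toy data are illustrative, not the thesis's implementation):

    ```python
    import numpy as np

    def smote(X_min, n_new, k=3, rng=None):
        """Minimal SMOTE sketch: create n_new synthetic minority samples by
        interpolating between a random sample and one of its k nearest
        minority neighbours. X_min holds minority-class rows only."""
        if rng is None:
            rng = np.random.default_rng(0)
        synthetic = []
        for _ in range(n_new):
            i = rng.integers(len(X_min))
            # distances from sample i to every other minority sample
            d = np.linalg.norm(X_min - X_min[i], axis=1)
            d[i] = np.inf
            neighbours = np.argsort(d)[:k]      # k nearest minority neighbours
            j = rng.choice(neighbours)
            gap = rng.random()                  # interpolation factor in [0, 1)
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
        return np.array(synthetic)

    # toy minority class: 4 samples in 2-D
    X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    X_new = smote(X_min, n_new=6)
    print(X_new.shape)  # (6, 2)
    ```

    Each synthetic point lies on a segment between two real minority samples, so the oversampled class spreads through its own region of feature space rather than duplicating points.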

    SVM-based association rules for knowledge discovery and classification

    Full text link
    © 2015 IEEE. Improving the analysis of market basket data requires approaches that lead to recommendation systems tailored to specifically benefit grocery chains. The main purpose is to find relationships among product sales that can help retailers identify new opportunities for cross-selling their products to customers. This paper aims to discover knowledge patterns hidden in large datasets that can give data holders more understanding and identify new opportunities for imperative tasks, including strategic planning and decision making. This paper delivers a strategy for the implementation of a systematic analysis framework built on established principles of data mining and machine learning. The primary goal is to form the foundation of what we envisage will be a new recommendation system in the market. Uniquely, our strategy seeks to implement data mining tools that allow the analyst to interact with the data and address business questions such as promotion advertising. We employ the Apriori algorithm and a support vector machine to implement our recommendation system. Experiments are conducted on a real market dataset, and the 0.632+ bootstrap method is used to evaluate our framework. The results suggest that the proposed framework can generate benefits for grocery chains using real-world grocery store data.
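    The Apriori step mines frequent itemsets: sets of products whose joint support (the fraction of baskets containing all of them) exceeds a threshold, pruning candidates whose subsets are already infrequent. A minimal sketch in plain Python (function name, threshold and toy baskets are illustrative, not the paper's code):

    ```python
    from itertools import combinations

    def apriori(transactions, min_support):
        """Minimal Apriori sketch: return all itemsets whose support
        (fraction of transactions containing them) is >= min_support."""
        n = len(transactions)
        items = sorted({i for t in transactions for i in t})
        frequent = {}
        k = 1
        candidates = [frozenset([i]) for i in items]
        while candidates:
            # count support of each candidate itemset
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
            frequent.update(level)
            # build (k+1)-itemsets only from frequent k-itemsets (Apriori pruning)
            keys = list(level)
            candidates = list({a | b for a, b in combinations(keys, 2)
                               if len(a | b) == k + 1})
            k += 1
        return frequent

    baskets = [frozenset(t) for t in
               [{"milk", "bread"}, {"milk", "bread", "eggs"},
                {"bread", "eggs"}, {"milk", "eggs"}]]
    freq = apriori(baskets, min_support=0.5)
    print(freq[frozenset({"milk", "bread"})])  # 0.5
    ```

    Association rules such as "milk → bread" are then read off the frequent itemsets; the SVM component of the framework handles the classification side separately.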

    A framework for high dimensional data reduction in the microarray domain

    Full text link
    Microarray analysis and visualization are very helpful for biologists and clinicians in understanding gene expression in cells and in facilitating the diagnosis and treatment of patients. However, a typical microarray dataset has thousands of features and a very small number of observations. This very high dimensional data carries a massive amount of information, which often contains noise and non-useful information, with only a small number of features relevant to the disease or genotype. This paper proposes a framework for very high dimensional data reduction based on three techniques: feature selection, linear dimensionality reduction and non-linear dimensionality reduction. Feature selection based on mutual information is proposed for filtering features and selecting the most relevant features with minimum redundancy. A kernel linear dimensionality reduction method is also used to extract the latent variables from a high dimensional dataset. In addition, a non-linear dimensionality reduction method based on local linear embedding is used to reduce the dimension and visualize the data. Experimental results are presented to show the outputs of each step and the efficiency of this framework. © 2010 IEEE.
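    A filter of this kind scores each feature by its mutual information with the class label and keeps the top-ranked ones. A small pure-NumPy sketch for discrete features (names and toy data are illustrative, and the minimum-redundancy term is omitted for brevity):

    ```python
    import numpy as np

    def mutual_info(x, y):
        """Mutual information I(X;Y) in nats between two discrete vectors,
        estimated from empirical joint and marginal frequencies."""
        mi = 0.0
        for xv in np.unique(x):
            for yv in np.unique(y):
                pxy = np.mean((x == xv) & (y == yv))
                px, py = np.mean(x == xv), np.mean(y == yv)
                if pxy > 0:
                    mi += pxy * np.log(pxy / (px * py))
        return mi

    # toy dataset: feature 0 is a copy of the label, feature 1 is pure noise
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=200)
    X = np.column_stack([y, rng.integers(0, 2, size=200)])

    scores = [mutual_info(X[:, j], y) for j in range(X.shape[1])]
    ranking = np.argsort(scores)[::-1]   # most informative feature first
    print(ranking[0])  # 0 — the label-copy feature ranks highest
    ```

    In practice continuous expression values would first be discretised (or a continuous MI estimator used) before applying such a filter.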

    ABC-sampling for balancing imbalanced datasets based on artificial bee colony algorithm

    Full text link
    © 2015 IEEE. Class-imbalanced data is a common problem for predictive modelling in domains such as bioinformatics. It occurs when the distribution of classes among samples is not uniform, which biases the learned prediction towards the majority classes. In this study, we propose the ABC-Sampling algorithm, based on a swarm optimization method called Artificial Bee Colony, which models the natural foraging behaviour of honeybees. Our algorithm lessens the effects of imbalanced classes by selecting the most informative majority samples using a forward search and storing them in a ranked subset. We then construct a balanced dataset with a planned undersampling strategy that extracts the most frequently selected majority samples from the top of the ranked subset and combines them with all minority samples. Our algorithm is superior to a state-of-the-art method on nine benchmark datasets with various imbalance ratios.
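    The rank-then-undersample idea can be illustrated without the bee-colony search itself. The sketch below substitutes a much simpler heuristic for ABC: it ranks majority samples by distance to the nearest minority sample and keeps the closest ones (a nearest-to-boundary stand-in, not the authors' ABC forward search; all names and data are illustrative):

    ```python
    import numpy as np

    def ranked_undersample(X, y, majority=0):
        """Simplified informed undersampling: rank majority samples by their
        distance to the nearest minority sample and keep the closest ones,
        so the retained majority points lie near the class boundary.
        Returns a balanced subset of (X, y)."""
        maj_idx = np.where(y == majority)[0]
        min_idx = np.where(y != majority)[0]
        # distance from each majority sample to its nearest minority sample
        d = np.array([np.min(np.linalg.norm(X[min_idx] - X[i], axis=1))
                      for i in maj_idx])
        keep = maj_idx[np.argsort(d)[:len(min_idx)]]  # match minority size
        sel = np.concatenate([keep, min_idx])
        return X[sel], y[sel]

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (5, 2))])
    y = np.array([0] * 20 + [1] * 5)
    Xb, yb = ranked_undersample(X, y)
    print(np.bincount(yb))  # [5 5]
    ```

    ABC-Sampling replaces this fixed heuristic with a learned ranking: the bee colony searches over candidate majority subsets and scores them by how informative they are for the classifier.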

    Feature selection of imbalanced gene expression microarray data

    Full text link
    Gene expression data is a very complex kind of data set, characterised by an abundant number of features but a low number of observations. However, only a small number of these features are relevant to an outcome of interest. With this kind of data set, feature selection becomes a real prerequisite. This paper proposes a methodology for feature selection on an imbalanced leukaemia gene expression data set based on the random forest algorithm. It presents the importance of feature selection in terms of reducing the number of features, enhancing the quality of machine learning and providing better understanding for biologists in diagnosis and prediction. Algorithms are presented to show the methodology and strategy for feature selection, taking care to avoid overfitting. Moreover, experiments are conducted using imbalanced leukaemia gene expression data, and a special measure is used to evaluate the quality of feature selection and the performance of classification. © 2011 IEEE.
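    Random-forest-based gene ranking of this kind can be sketched with scikit-learn, assuming it is available; `class_weight="balanced"` is shown as one simple way to account for the imbalance, and the toy data and parameters are illustrative rather than the paper's setup:

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # toy "microarray": 60 samples x 50 genes, classes imbalanced 50 vs 10,
    # with only gene 0 carrying signal for the minority class
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 50))
    y = np.array([0] * 50 + [1] * 10)
    X[y == 1, 0] += 3.0   # inject an informative gene

    # class_weight="balanced" reweights classes inversely to their frequency
    rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                random_state=0).fit(X, y)
    top = np.argsort(rf.feature_importances_)[::-1][:5]  # top-ranked genes
    print(top[0])  # 0 — the injected informative gene is ranked first
    ```

    The paper's iterative strategy repeats such ranking rounds, discarding low-importance genes each time, to guard against the overfitting that a single pass on thousands of genes invites.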

    A balanced iterative random forest for gene selection from microarray data

    Get PDF
    Background: The wealth of gene expression values being generated by high throughput microarray technologies leads to complex high dimensional datasets. Moreover, many cohorts have the problem of imbalanced classes where the number of patients belonging

    Ensemble feature learning of genomic data using support vector machine

    Full text link
    © 2016 Anaissi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. The identification of a subset of genes with the ability to capture the necessary information to distinguish classes of patients is crucial in bioinformatics applications. Ensemble and bagging methods have been shown to work effectively in the process of gene selection and classification. Testament to that is random forest, which combines random decision trees with bagging to improve overall feature selection and classification accuracy. Surprisingly, the adoption of these methods in support vector machines has only recently received attention, but mostly for classification rather than gene selection. This paper introduces an ensemble SVM-Recursive Feature Elimination (ESVM-RFE) algorithm for gene selection that follows the concepts of ensemble and bagging used in random forest but adopts the backward elimination strategy that is the rationale of the RFE algorithm. The rationale is that building ensemble SVM models on randomly drawn bootstrap samples from the training set produces different feature rankings, which are subsequently aggregated into one feature ranking. As a result, the decision to eliminate a feature is based on the rankings of multiple SVM models instead of the choice of one particular model. Moreover, this approach addresses the problem of imbalanced datasets by constructing nearly balanced bootstrap samples. Our experiments show that ESVM-RFE for gene selection substantially increased the classification performance on five microarray datasets compared to state-of-the-art methods. Experiments on the childhood leukaemia dataset show that ESVM-RFE achieves on average 9% better accuracy than SVM-RFE, and 5% better than a random forest based approach. The genes selected by the ESVM-RFE algorithm were further explored with Singular Value Decomposition (SVD), which reveals significant clusters in the selected data.
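    The ensemble ranking idea can be sketched as follows, assuming scikit-learn's `LinearSVC`. For brevity the sketch aggregates one ranking pass per bootstrap instead of the full recursive elimination, and it uses a plain bootstrap rather than the paper's nearly balanced one:

    ```python
    import numpy as np
    from sklearn.svm import LinearSVC

    def esvm_rfe_scores(X, y, n_models=20, seed=0):
        """Sketch of the ensemble idea behind ESVM-RFE: fit a linear SVM on
        each bootstrap sample, rank features by |weight|, and sum the ranks
        across models. Higher aggregated rank = more important feature."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        rank_sum = np.zeros(p)
        for _ in range(n_models):
            idx = rng.integers(0, n, size=n)          # bootstrap sample
            svm = LinearSVC(dual=False).fit(X[idx], y[idx])
            order = np.argsort(np.abs(svm.coef_[0]))  # weakest feature first
            ranks = np.empty(p)
            ranks[order] = np.arange(p)               # 0 = weakest, p-1 = strongest
            rank_sum += ranks                         # aggregate the rankings
        return rank_sum

    rng = np.random.default_rng(0)
    X = rng.normal(size=(80, 10))
    y = (X[:, 3] > 0).astype(int)   # feature 3 alone determines the class
    scores = esvm_rfe_scores(X, y)
    print(np.argmax(scores))  # 3
    ```

    In the full algorithm this aggregation would be repeated inside the RFE loop, eliminating the lowest-ranked features at each round until the desired gene subset remains.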

    Testing clay-modified electrodes for hydrogen production

    Get PDF
    VII Seminário de Extensão Universitária da UNILA (SEUNI); VIII Encontro de Iniciação Científica e IV Encontro de Iniciação em Desenvolvimento Tecnológico e Inovação (EICTI 2019) e Seminário de Atividades Formativas da UNILA (SAFOR). Alkaline water electrolysis is an alternative route for producing hydrogen sustainably. In this work, we evaluated the effect of coating three electrodes acting as cathodes (Pt, Ni and AISI940L steel) with a clay/Ni(OH)2 nanocomposite acting as an electrocatalyst for hydrogen production. We investigated the influence on charge production and current density by chronoamperometry performed at a potential of -1.7 V for 1 hour. As a result, the nanocomposite provided significant increases in current density during hydrogen production. We thank UNILA for the scholarship granted and for supporting the project, the Hydrogen Research Centre (NuPHI) of the Itaipu Technological Park (PTI) for their collaboration, and the whole LabMat team at UNICENTRO for providing the material tested.
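    The charge passed during such a chronoamperometry run is the time integral of the current, Q = ∫ I dt, which is how current density translates into hydrogen produced. A minimal numerical sketch with made-up current values (the experiment's real trace is not reproduced here):

    ```python
    import numpy as np

    # hypothetical chronoamperometry trace: time in seconds, current in amperes
    t = np.array([0.0, 900.0, 1800.0, 2700.0, 3600.0])      # 1-hour run
    i = np.array([-0.050, -0.048, -0.047, -0.047, -0.046])  # cathodic current

    # charge passed: Q = integral of I dt, trapezoidal rule
    q = np.sum((i[1:] + i[:-1]) / 2.0 * np.diff(t))
    print(round(q, 1))  # -171.0 coulombs over the hour
    ```

    Dividing Q by the Faraday constant and the two electrons per H2 molecule would then give the moles of hydrogen evolved, assuming 100% faradaic efficiency.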

    Synthesis of iron oxides using metallic waste: application in paints and cemented materials

    Get PDF
    Anais do 35º Seminário de Extensão Universitária da Região Sul - thematic area: Environment. Solid waste from stainless steel blades (LAM) has become a social and, above all, an environmental problem, since it is discarded in inappropriate places. This waste contains significant amounts of chromium (Cr) in its composition. Cr in the hexavalent oxidation state has a cumulative effect in the human body and can trigger carcinogenic or teratogenic diseases. In order to recycle iron (Fe) and chromium (Cr) from this waste, we developed a simple and viable methodology for preparing inorganic materials. The methodology is viable because it is fast and economical. The powders were characterised by X-ray Diffraction (XRD) and X-ray Fluorescence (XRF).