PROBLEMS IN STATISTICAL GENETICS: CLASSIFICATION AND TESTING FOR NETWORK CHANGES
This thesis addresses two problems: the classification of microarray data and the
statistical integration of molecular data to test for network changes. For the
classification problem, we consider both unpreprocessed and preprocessed microarray
data sets. We implement two extensions of partial least squares generalized
linear regression (PLSGLR; Bastien et al., 2005). Combining it with
logistic regression gives the partial least squares generalized linear regression-logistic
regression model (PLSGLR-log), and combining it with linear discriminant analysis
gives the partial least squares generalized linear regression-linear discriminant analysis
model (PLSGLRDA). These two classifiers are then compared
with classical methods, namely k-nearest neighbours (KNN), linear
discriminant analysis (LDA), partial least squares discriminant analysis (PLSDA),
ridge partial least squares (RPLS), and the support vector machine (SVM). Furthermore,
we implement a recent algorithm by Dalmau et al. (2015) known as the kernel multilogit
algorithm (KMA). The results indicate that for the noisy unpreprocessed data, the
KMA emerged as the clear “winner” based on its low misclassification
error rate. For the preprocessed, normalized data, there was no clear “winner”, since
no single method performed markedly better than the rest.
KNN emerged as a clear “loser”, since it consistently had a relatively higher
misclassification rate on both the unpreprocessed and preprocessed data
sets.
The statistical integration of molecular data to test for network changes considers
an experiment involving two main groups: healthy (H) subjects and acute
rheumatic fever (ARF) subjects. Within each group, each specimen is divided into
two portions: one is stimulated with group A streptococcus (GAS) and
the other is left unstimulated. This yields four subgroups: healthy GAS-stimulated,
healthy unstimulated, ARF GAS-stimulated, and ARF unstimulated.
As a result, observations are dependent within groups and independent between
groups. For all groups, the expression of p genes is measured. We identify a
prior network from the curated literature and online sources. The genes considered
in the experiment are then matched with those in the prior network, so that
the prior network is reduced to only the genes found in the experimental
data. We then construct two networks, one for the healthy and th
Anomalies Detection Using the Benford's Law: Application to the Kenyan Presidential Elections of 2017
In modern times, the populace in most African countries is left wondering whether the declared election winner actually got the most votes. The validity of the declared election results in most cases remains questionable. To assess that validity, an empirical statistical methodology can be used to give some hint or evidence of anomalies in the declared election count data. This paper therefore considers a statistical method based on the pattern of digits in vote counts, known as the 2-digit Benford's Law (2BL), that is useful for detecting fraud or other anomalies. The 2BL methodology and its extensions are applied to detect possible anomalies and fraud in the 2017 Kenyan presidential election results. The analysis shows that the data for the top two presidential candidates, Uhuru Kenyatta and Raila Odinga, do not follow the 2BL distribution. The digit frequencies are significantly different at the 5% significance level when tested using the chi-square and Euclidean tests. The mean absolute deviation (M.A.D.) also confirms the non-conformity of the data to the 2BL distribution. Further tests, namely the second-order test, the summation test, and the duplication test, are used to detect any other anomalies and fraud that could be present. All three additional tests confirm the presence of fraud and anomalies in the data. These are red flags on the credibility of the presidential election results published by the Independent Electoral and Boundaries Commission (IEBC).
Keywords: Anomalies, Benford's Law, Kenyan Presidential Elections 201
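As a rough illustration of the digit test described in the abstract above, the following Python sketch computes the expected second-digit frequencies under Benford's Law and a chi-square statistic against them. It assumes that 2BL refers to the second-digit test used in the election-forensics literature. The function names and the sample vote counts are hypothetical and are not taken from the paper or from IEBC data.

```python
import math
from collections import Counter

# 95th percentile of the chi-square distribution with 9 degrees of freedom
# (ten digit classes minus one), i.e. the 5% critical value.
CRITICAL_5PCT_DF9 = 16.919


def second_digit_probs():
    """Expected probability of each second digit 0..9 under Benford's Law."""
    return [
        sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
        for d in range(10)
    ]


def second_digits(counts):
    """Second significant digit of every count that has at least two digits."""
    return [int(str(c)[1]) for c in counts if c >= 10]


def chi_square_2bl(counts):
    """Return (chi-square statistic, number of counts used)."""
    digits = second_digits(counts)
    n = len(digits)
    observed = Counter(digits)
    expected = [n * p for p in second_digit_probs()]
    stat = sum((observed.get(d, 0) - e) ** 2 / e for d, e in enumerate(expected))
    return stat, n


# Hypothetical polling-station vote counts, far too few for a real test.
sample_counts = [1523, 987, 2310, 4475, 312, 8890, 1204, 765]
stat, n = chi_square_2bl(sample_counts)
print(f"n = {n}, chi-square = {stat:.3f}, reject at 5%: {stat > CRITICAL_5PCT_DF9}")
```

With real data the chi-square approximation needs many counts per digit class, so a sample as small as the one above only demonstrates the mechanics.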
Singular Spectrum Analysis: An Application to Kenya’s Industrial Inputs Price Index
Time series modeling and forecasting techniques serve as gauging tools to understand the time-related properties of a given time series and its future course. Most financial and economic time series do not meet the restrictive assumptions of normality, linearity, and stationarity, which limits the application of classical models without data transformation. As a non-parametric method, Singular Spectrum Analysis (SSA) is data-adaptive and hence does not require these restrictive assumptions. The current study employed a longitudinal research design to evaluate how well SSA fits Kenya’s monthly industrial inputs price index from January 1992 to April 2022. Since 2018, reducing the costs of industrial inputs has been one of Kenya’s manufacturing agendas to level the playing field and foster Kenya’s manufacturing sector. Kenya’s Manufacturing Value Added was expected to reach 22% by 2022. The study results showed that the SSA (L = 12, r = 7) (MAPE = 0.707%) provides more reliable forecasts. The 24-period forecasts showed that the industrial inputs price index remains well above its 2017 level, which preceded the agenda targeting a reduction in the cost of industrial inputs. Thus, industrial input prices should be reduced to a sustainable level.
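The basic SSA steps behind a specification such as SSA (L = 12, r = 7) can be sketched as follows: embed the series in a trajectory matrix, find the leading r-dimensional subspace of its lag-covariance matrix, project onto it, and average along anti-diagonals. This is a minimal pure-Python sketch, assuming L is the window length and r the number of leading components. The function names and the synthetic series are hypothetical, and the subspace is found by simple orthogonal iteration rather than a full SVD.

```python
import math
import random


def trajectory_columns(series, L):
    """Lagged windows of length L: the columns of the trajectory matrix."""
    K = len(series) - L + 1
    return [list(series[j:j + L]) for j in range(K)]


def orthonormalize(vectors):
    """Modified Gram-Schmidt; assumes the vectors are linearly independent."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            dot = sum(x * y for x, y in zip(w, b))
            w = [x - dot * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        basis.append([x / norm for x in w])
    return basis


def leading_subspace(cols, r, iters=100):
    """Orthonormal basis of the top-r eigenspace of S = X X^T."""
    L = len(cols[0])
    S = [[sum(c[i] * c[j] for c in cols) for j in range(L)] for i in range(L)]
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    Q = orthonormalize([[rng.uniform(-1, 1) for _ in range(L)] for _ in range(r)])
    for _ in range(iters):
        SQ = [[sum(S[i][j] * q[j] for j in range(L)) for i in range(L)] for q in Q]
        Q = orthonormalize(SQ)
    return Q


def ssa_reconstruct(series, L, r):
    """Reconstruct the series from its r leading SSA components."""
    cols = trajectory_columns(series, L)
    basis = leading_subspace(cols, r)
    n = len(series)
    sums, counts = [0.0] * n, [0] * n
    for j, c in enumerate(cols):
        proj = [0.0] * L
        for q in basis:
            coef = sum(x * y for x, y in zip(q, c))
            for i in range(L):
                proj[i] += coef * q[i]
        # Diagonal averaging: entry (i, j) contributes to time point i + j.
        for i in range(L):
            sums[i + j] += proj[i]
            counts[i + j] += 1
    return [s / k for s, k in zip(sums, counts)]


# Synthetic monthly series (hypothetical): linear trend plus a 12-month cycle.
series = [100 + 0.5 * t + 10 * math.sin(2 * math.pi * t / 12) for t in range(120)]
fitted = ssa_reconstruct(series, L=12, r=4)
print(max(abs(a - b) for a, b in zip(series, fitted)))
```

The synthetic series has exactly four non-trivial components (two for the trend, two for the cycle), so r = 4 recovers it almost exactly. Real price data would also contain noise, which is what the choice of r is meant to filter out.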
Forecasting Commodity Price Index of Food and Beverages in Kenya Using Seasonal Autoregressive Integrated Moving Average (SARIMA) Models
Price stability is the primary monetary policy objective in any economy since it protects the interests of both consumers and producers. As a result, forecasting is a common practice and a vital aspect of monetary policymaking. Future predictions guide the monetary and fiscal policy tools that can be used to stabilize commodity prices, so developing an accurate and precise forecasting model is critical. The current study fitted and forecasted the food and beverages price index (FBPI) in Kenya using seasonal autoregressive integrated moving average (SARIMA) models. Unlike other ARIMA-type models such as the autoregressive (AR), moving average (MA), and non-seasonal ARMA models, the SARIMA model accounts for the seasonal component in a time series, giving better forecasts. The study relied on secondary data obtained from the KNBS website on the monthly food and beverage price index in Kenya from January 1991 to February 2020. R statistical software was used to analyze the data. Parameters were estimated by maximum likelihood. Competing SARIMA models were compared using the Mean Absolute Error (MAE), Mean Absolute Scaled Error (MASE), and Mean Absolute Percentage Error (MAPE). A first-order differenced SARIMA(1,1,1)(0,1,1)_12 minimized these model evaluation criteria (AIC = 1818.15, BIC = 1833.40). The forecasting ability evaluation statistics were MAE = 2.00%, MAPE = 1.62%, and MASE = 0.87%. The 24-step-ahead forecasts showed that the FBPI is unstable with an overall increasing trend. Therefore, the monetary policy committee ought to control inflation through monetary or fiscal policy, strengthening food security and trade liberalization.
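The three forecast-accuracy measures named in the abstract above can be computed as in the short Python sketch below. The abstract does not state how MASE was scaled, so the sketch assumes the usual scaling by the in-sample mean absolute error of a naive forecast at lag m. All function names and numbers are hypothetical.

```python
def mae(actual, forecast):
    """Mean absolute error, in the units of the series."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error; assumes no actual value is zero."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def mase(actual, forecast, train, m=1):
    """MAE scaled by the in-sample MAE of a lag-m naive forecast.

    m=1 is the ordinary naive forecast; m=12 would be a seasonal naive
    forecast for monthly data (an assumption, not the paper's stated choice).
    """
    scale = sum(abs(train[t] - train[t - m]) for t in range(m, len(train))) / (len(train) - m)
    return mae(actual, forecast) / scale


# Hypothetical index values, for illustration only.
train = [98.0, 101.0, 103.0, 102.0, 106.0, 109.0]
actual = [111.0, 112.5]
forecast = [110.0, 114.0]
print(mae(actual, forecast), mape(actual, forecast), mase(actual, forecast, train))
```

A MASE below 1 means the model's forecasts have a smaller average error than the naive benchmark had on the training data.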
Evaluating the Predictive Ability of Seasonal Autoregressive Integrated Moving Average (SARIMA) Models using Food and Beverages Price Index in Kenya
Price instability has been a major concern in most economies. Kenya's commodity markets have been characterized by high price volatility, which affects investment and consumer behaviour because of uncertainty about future prices. Precise forecasting models can therefore help consumers plan their expenditure and help government policymakers formulate price control measures. Given the seasonality of Kenya's food and beverage price indices, the current study postulates that the Seasonal Autoregressive Integrated Moving Average (SARIMA) model is the best-fitting model for the data. The study used secondary data on Kenya's monthly food and beverage price index from January 1991 to February 2020 to examine the predictive ability of candidate SARIMA models based on the minimisation of the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). A first-order differenced SARIMA(1,1,1)(0,1,1)_12 minimized these model evaluation criteria (AIC = 1818.15, BIC = 1833.40). The cross-validation results for 6, 12, 18, 24, 30, and 36 step-ahead forecasts demonstrated that SARIMA models are unstable for forecasting over long horizons, with prediction errors tending to increase as the forecast period lengthens. It is anticipated that the findings of the current study will provide valuable information to policymakers and stakeholders for understanding future trends in commodity price
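For reference, the selected SARIMA(1,1,1)(0,1,1) model with seasonal period 12 and the two selection criteria can be written out as below. This is a sketch under stated assumptions: B denotes the backshift operator, y_t the price index, and epsilon_t white noise. The coefficient symbols are generic, and the plus-sign convention for the moving-average terms (the one used by R's arima) is an assumption, since the abstract does not give the fitted equation. In the criteria, \hat{L} is the maximized likelihood, k the number of estimated parameters, and n the number of observations.

```latex
\begin{align*}
(1 - \phi_1 B)(1 - B)(1 - B^{12})\, y_t
  &= (1 + \theta_1 B)(1 + \Theta_1 B^{12})\, \varepsilon_t, \\
\mathrm{AIC} &= -2 \ln \hat{L} + 2k, \\
\mathrm{BIC} &= -2 \ln \hat{L} + k \ln n.
\end{align*}
```

Because ln n exceeds 2 once n is larger than about 7, BIC penalizes each extra parameter more heavily than AIC for a series of this length.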
PLS Generalized Linear Regression and Kernel Multilogit Algorithm (KMA) for Microarray Data Classification Problem
This study implements extensions of partial least squares generalized linear regression (PLSGLR) by combining it with logistic regression and linear discriminant analysis, to get a partial least squares generalized linear regression-logistic regression model (PLSGLR-log) and a partial least squares generalized linear regression-linear discriminant analysis model (PLSGLRDA). The resulting classifiers are then compared with classical methods such as k-nearest neighbours (KNN), linear discriminant analysis (LDA), partial least squares discriminant analysis (PLSDA), ridge partial least squares (RPLS), and support vector machines (SVM). Furthermore, a new methodology known as the kernel multilogit algorithm (KMA) is also implemented and its performance compared with that of the other classifiers. The KMA emerged as the best classifier, with the lowest classification error rates, when applied to both types of data considered: unpreprocessed and preprocessed.
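To make the comparison criterion concrete, the sketch below estimates a misclassification error rate for the simplest of the classifiers listed, k-nearest neighbours, using leave-one-out evaluation. It is a toy illustration only. The study's actual validation scheme is not described in the abstract, and the function names and the two-feature data set are hypothetical stand-ins for gene-expression profiles.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority label among the k training points closest to x."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_error_rate(X, y, k=3):
    """Leave-one-out misclassification rate: hold out each sample in turn."""
    errors = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) != y[i]:
            errors += 1
    return errors / len(X)


# Hypothetical samples: two well-separated classes in two dimensions.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
y = ["healthy", "healthy", "healthy", "disease", "disease", "disease"]
print(loo_error_rate(X, y, k=3))
```

The same held-out error rate can be computed for any of the other classifiers by swapping out `knn_predict`, which is what makes it usable as a common yardstick.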
Regresión lineal generalizada por MCP y algoritmo kernel multilogit para la clasificación de datos de microarreglos
This study implements extensions of partial least squares generalized linear regression (PLSGLR) by combining it with logistic regression and linear discriminant analysis, to get a partial least squares generalized linear regression-logistic regression model (PLSGLR-log) and a partial least squares generalized linear regression-linear discriminant analysis model (PLSGLRDA). The resulting classifiers are then compared with classical methods such as k-nearest neighbours (KNN), linear discriminant analysis (LDA), partial least squares discriminant analysis (PLSDA), ridge partial least squares (RPLS), and support vector machines (SVM). Furthermore, a new methodology known as the kernel multilogit algorithm (KMA) is also implemented and its performance compared with that of the other classifiers. The KMA emerged as the best classifier, with the lowest classification error rates, when applied to both types of data considered: unpreprocessed and preprocessed.
