49 research outputs found

Not Available

Author: Prabina Kumar Meher
Subhrajit Satpathy
Publication venue: Springer Science and Business Media LLC
Publication date: 31/10/2021
Field of study

Identification of splice sites is an important aspect of gene structure prediction. In most existing splice site prediction studies, machine learning algorithms coupled with sequence-derived features have been successfully employed for splice site recognition. However, splice site identification that incorporates secondary structure information is lacking, particularly in plant species. Thus, we made an attempt in this study to evaluate the effect of structural features on splice site prediction accuracy in Arabidopsis thaliana. Prediction accuracies were evaluated with the sequence-derived features alone as well as with the structural features incorporated into the sequence-derived features, where support vector machine (SVM) was employed as the prediction algorithm. Both short (40 base pairs) and long (105 base pairs) sequence datasets were considered for evaluation. After incorporating the secondary structure features, improvements in accuracy were observed only for the longer sequence dataset, and the improvement was higher with the sequence-derived features that accounted for nucleotide dependencies. On the other hand, little or no improvement in accuracy was found for the short sequence dataset. The performance of SVM was further compared with that of the LogitBoost, Random Forest (RF), AdaBoost and XGBoost machine learning methods. The prediction accuracies of SVM, AdaBoost and XGBoost were observed to be on par and higher than those of the RF and LogitBoost algorithms. When prediction was performed using all the sequence-derived features along with the structural features, a small improvement in accuracy was found compared to the combination of individual sequence-based features and structural features.
To the best of our knowledge, this is the first attempt at computational prediction of splice sites using machine learning methods that incorporates secondary structure information into the sequence-derived features. All the source code is available at https://github.com/meher861982/SSFeature.

PubMed Central

KRISHI Publications and Data Repository

Improved recognition of splice sites in A. thaliana by incorporating secondary structure information into sequence-derived features: a computational study

Author: Prabina Kumar Meher
Subhrajit Satpathy
Publication venue: Springer Science and Business Media LLC
Publication date: 31/10/2021
Field of study

Crossref

Not Available

Author: Atmakuri Ramakrishna Rao
Prabina Kumar Meher
Subhrajit Satpathy
Publication venue: Springer Science and Business Media LLC
Publication date: 03/09/2020
Field of study

MicroRNAs (miRNAs), a kind of non-coding RNA, play a vital role in regulating several physiological and developmental processes. Subcellular localization of miRNAs and their abundance in the native cell are central for maintaining physiological homeostasis. Besides, the RNA silencing activity of miRNAs is also influenced by their localization and stability. Thus, development of a computational method for subcellular localization prediction of miRNAs is desired. In this work, we have proposed a computational method for predicting subcellular localizations of miRNAs based on principal component scores of thermodynamic, structural properties and pseudo compositions of di-nucleotides. Prediction accuracy was analyzed following fivefold cross validation, where ~63-71% of AUC-ROC and ~69-76% of AUC-PR were observed. When evaluated with an independent test set, >50% of localizations were found to be correctly predicted. Besides, the developed computational model achieved higher accuracy than the existing methods. A user-friendly prediction server "miRNALoc" is freely accessible at https://cabgrid.res.in:8080/mirnaloc/, by which the user can predict localizations of miRNAs.

Crossref

KRISHI Publications and Data Repository

Publisher Correction: miRNALoc: predicting miRNA subcellular localizations based on principal component scores of physico-chemical properties and pseudo compositions of di-nucleotides

Author: Atmakuri Ramakrishna Rao
Prabina Kumar Meher
Subhrajit Satpathy
Publication venue: Springer Science and Business Media LLC
Publication date: 02/02/2021
Field of study

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

Crossref

Not Available

Author: Prabina Kumar Meher
Sagarika Dash
Subhrajit Satpathy
Sukanta Kumar Pradhan
Tanmaya Kumar Sahu
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2022
Field of study

In plants, the GIGANTEA (GI) protein performs various biological functions, including carbon and sucrose metabolism, cell wall deposition, transpiration and hypocotyl elongation. This suggests that GI is an important class of proteins. So far, resource-intensive experimental methods have mostly been used for identification of GI proteins. Thus, we made an attempt in this study to develop a computational model for fast and accurate prediction of GI proteins. Ten different supervised learning algorithms, i.e., SVM, RF, JRIP, J48, LMT, IBK, NB, PART, BAGG and LGB, were employed for prediction, where the amino acid composition (AAC), FASGAI features and physico-chemical (PHYC) properties were used as numerical inputs for the learning algorithms. Higher accuracies, i.e., 96.75% of AUC-ROC and 86.7% of AUC-PR, were observed for SVM coupled with the AAC + PHYC feature combination when evaluated with five-fold cross validation. With leave-one-out cross validation, 97.29% of AUC-ROC and 87.89% of AUC-PR were respectively achieved. When the performance of the model was evaluated with an independent dataset of 18 GI sequences, 17 were correctly predicted. We have also performed proteome-wide identification of GI proteins in wheat, followed by functional annotation using Gene Ontology terms. A prediction server “GIpred” is freely accessible at http://cabgrid.res.in:8080/gipred/ for proteome-wide recognition of GI proteins.
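The amino acid composition (AAC) feature named in the abstract above is conventionally the fraction of each of the 20 standard residues in a protein sequence. A minimal sketch under that assumption follows; the function name `amino_acid_composition`, the residue ordering and the toy peptide are illustrative choices, not taken from GIpred itself.

```python
from collections import Counter

# The 20 standard amino acids as one-letter codes (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Non-standard symbols (e.g. X, B, Z) are ignored rather than counted.
    """
    seq = sequence.upper()
    counts = Counter(res for res in seq if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # One value per residue, in the fixed order of AMINO_ACIDS.
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    toy_peptide = "MKVLAAGIV"  # made-up sequence, for illustration only
    for aa, frac in zip(AMINO_ACIDS, amino_acid_composition(toy_peptide)):
        if frac > 0:
            print(aa, round(frac, 3))
```

The resulting 20-element vector has a fixed length regardless of protein length, which is what allows it to serve as numerical input to a classifier such as SVM.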

PubMed Central

KRISHI Publications and Data Repository

Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition

Author: Atmakuri Ramakrishna Rao
Prabina Kumar Meher
Shachi Gahoi
Subhrajit Satpathy
Tanmaya Kumar Sahu
Publication venue: Elsevier BV
Publication date: 01/07/2019
Field of study

Crossref

GIpred: a computational tool for prediction of GIGANTEA proteins using machine learning algorithm

Author: Prabina Kumar Meher
Sagarika Dash
Subhrajit Satpathy
Sukanta Kumar Pradhan
Tanmaya Kumar Sahu
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2022
Field of study

Crossref

Not Available

Author: Anil Rai
Anuj Sharma
Isha Saini
Prabina Kumar Meher
Subhrajit Satpathy
Sukanta Kumar Pradhan
Publication venue: Springer Science and Business Media LLC
Publication date: 26/04/2021
Field of study

Circadian rhythms regulate several physiological and developmental processes of plants. Hence, the identification of genes with the underlying circadian rhythmic features is pivotal. Though computational methods have been developed for the identification of circadian genes, all these methods are based on gene expression datasets. In other words, we could not find any sequence-based model, and that motivated us to develop the present computational method to identify the proteins encoded by the circadian genes. Support vector machine (SVM) with seven kernels, i.e., linear, polynomial, radial, sigmoid, hyperbolic, Bessel and Laplace, was utilized for prediction by employing compositional, transitional and physico-chemical features. A higher accuracy of 62.48% was achieved with the Laplace kernel, following the fivefold cross-validation approach. The developed model further secured 62.96% accuracy with an independent dataset. The SVM also outperformed other state-of-the-art machine learning algorithms, i.e., Random Forest, Bagging, AdaBoost, XGBoost and LASSO. We also performed proteome-wide identification of circadian proteins in two cereal crops, namely Oryza sativa and Sorghum bicolor, followed by the functional annotation of the predicted circadian proteins with Gene Ontology (GO) terms. To the best of our knowledge, this is the first computational method to identify the circadian genes with the sequence data. Based on the proposed method, we have developed an R-package PredCRG (https://cran.r-project.org/web/packages/PredCRG/index.html) for the scientific community for proteome-wide identification of circadian genes. The present study supplements the existing computational methods as well as wet-lab experiments for the recognition of circadian genes.
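The abstract above does not define the Laplace kernel. One common definition is k(x, y) = exp(-sigma * ||x - y||) with the Euclidean norm, and the sketch below assumes that form. The parameter name `sigma`, its default value and the helper `gram_matrix` are illustrative assumptions; some libraries define the kernel with the L1 norm instead, so this is not necessarily the exact kernel used in PredCRG.

```python
import math


def laplace_kernel(x, y, sigma=1.0):
    """Laplace kernel k(x, y) = exp(-sigma * ||x - y||), Euclidean norm assumed."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-sigma * distance)


def gram_matrix(vectors, sigma=1.0):
    """Pairwise kernel values for a list of equal-length feature vectors."""
    return [[laplace_kernel(u, v, sigma) for v in vectors] for u in vectors]


if __name__ == "__main__":
    # Made-up feature vectors, for illustration only.
    features = [[0.1, 0.4], [0.2, 0.1], [0.9, 0.8]]
    for row in gram_matrix(features, sigma=0.5):
        print([round(value, 4) for value in row])
```

Unlike the radial (Gaussian) kernel, the distance is not squared here, so the kernel value decays more slowly for distant points.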

KRISHI Publications and Data Repository

Not Available

Author: Atmakuri Ramakrishna Rao
Prabina Kumar Meher
Shachi Gahoi
Subhrajit Satpathy
Tanmaya Kumar Sahu
Publication venue: Elsevier BV
Publication date: 01/07/2019
Field of study

Identification of splice sites is imperative for prediction of gene structure. Machine learning-based approaches (MLAs) have been reported to be more successful than the rule-based methods for identification of splice sites. However, the strings of letters must be transformed into numeric features through sequence encoding before being used as input to MLAs. In this study, we evaluated the performances of 8 different sequence encoding schemes, i.e., Bayes kernel, density and sparse (DS), distribution of tri-nucleotide and 1st order Markov model (DM), frequency difference distance measure (FDDM), paired-nucleotide frequency difference between true and false sites (FDTF), 1st order Markov model (MM1), combination of both 1st and 2nd order Markov model (MM1 + MM2) and 2nd order Markov model (MM2), with respect to predicting donor and acceptor splice sites using 5 supervised learning methods (ANN, Bagging, Boosting, RF and SVM). The encoding schemes and machine learning methods were first evaluated in 4 species, i.e., A. thaliana, C. elegans, D. melanogaster and H. sapiens, and then performances were validated with another four species, i.e., Ciona intestinalis, Dictyostelium discoideum, Phaeodactylum tricornutum and Trypanosoma brucei. In terms of ROC (receiver-operating-characteristics) and PR (precision-recall) curves, the FDTF encoding approach achieved higher accuracy, followed by either MM2 or FDDM. Further, SVM was found to achieve higher accuracy (in terms of ROC and PR curves), followed by RF, across encoding schemes and species. In terms of prediction accuracy across species, the SVM-FDTF combination performed better than other combinations of classifiers and encoding schemes. Further, splice site prediction accuracies were observed to be higher for the species with low intron density. To our limited knowledge, this is the first attempt at a comprehensive evaluation of sequence encoding schemes for prediction of splice sites.
We have also developed an R-package EncDNA (https://cran.r-project.org/web/packages/EncDNA/index.html) for encoding of splice site motifs with different encoding schemes, which is expected to supplement the existing nucleotide sequence encoding approaches. This study is believed to be useful for computational biologists predicting different functional elements on the genomic DNA.
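To illustrate what "sequence encoding" means in the abstract above, the sketch below shows plain sparse (one-hot) encoding, in which each nucleotide becomes a 4-element indicator vector. This is only an assumed, simplified stand-in for the sparse part of the DS scheme: the density component is omitted, the A/C/G/T column order is an arbitrary choice, and the function name `sparse_encode` is hypothetical rather than part of the EncDNA package.

```python
# Indicator vector for each nucleotide (column order A, C, G, T is an assumption).
SPARSE_CODES = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def sparse_encode(sequence: str) -> list[int]:
    """Concatenate the 4-element indicator vectors of every base in `sequence`."""
    encoded = []
    for base in sequence.upper():
        if base not in SPARSE_CODES:
            raise ValueError(f"unsupported nucleotide: {base!r}")
        encoded.extend(SPARSE_CODES[base])
    return encoded


if __name__ == "__main__":
    motif = "AGGTAAGT"  # made-up motif window, for illustration only
    vector = sparse_encode(motif)
    print(len(vector), vector)
```

A window of n nucleotides yields a vector of length 4n, so all splice site windows of equal length map to feature vectors of equal length.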

KRISHI Publications and Data Repository

PredCRG: A computational method for recognition of plant circadian genes by employing support vector machine with Laplace kernel

Author: Anil Rai
Ansuman Mohapatra
Anuj Sharma
Isha Saini
Prabina Kumar Meher
Subhrajit Satpathy
Sukanta Kumar Pradhan
Publication venue: Springer Science and Business Media LLC
Publication date: 26/04/2021
Field of study

Background: Circadian rhythms regulate several physiological and developmental processes of plants. Hence, the identification of genes with the underlying circadian rhythmic features is pivotal. Though computational methods have been developed for the identification of circadian genes, all these methods are based on gene expression datasets. In other words, we could not find any sequence-based model, and that motivated us to develop the present computational method to identify the proteins encoded by the circadian genes.
Results: Support vector machine (SVM) with seven kernels, i.e., linear, polynomial, radial, sigmoid, hyperbolic, Bessel and Laplace, was utilized for prediction by employing compositional, transitional and physico-chemical features. A higher accuracy of 62.48% was achieved with the Laplace kernel, following the fivefold cross-validation approach. The developed model further secured 62.96% accuracy with an independent dataset. The SVM also outperformed other state-of-the-art machine learning algorithms, i.e., Random Forest, Bagging, AdaBoost, XGBoost and LASSO. We also performed proteome-wide identification of circadian proteins in two cereal crops, namely Oryza sativa and Sorghum bicolor, followed by the functional annotation of the predicted circadian proteins with Gene Ontology (GO) terms.
Conclusions: To the best of our knowledge, this is the first computational method to identify the circadian genes with the sequence data. Based on the proposed method, we have developed an R-package PredCRG (https://cran.r-project.org/web/packages/PredCRG/index.html) for the scientific community for proteome-wide identification of circadian genes. The present study supplements the existing computational methods as well as wet-lab experiments for the recognition of circadian genes.

Crossref