54 research outputs found
Simultaneous Quantitation of Amino Acid Mixtures using Clustering Agents
A method that uses the abundances of large clusters formed in electrospray ionization to determine the solution-phase molar fractions of amino acids in multi-component mixtures is demonstrated. For solutions containing either four or 10 amino acids, the relative abundances of protonated molecules differed from their solution-phase molar fractions by up to 30-fold and 100-fold, respectively. For the four-component mixtures, the molar fractions determined from the abundances of larger clusters consisting of 19 or more molecules were within 25% of the solution-phase molar fractions, indicating that the abundances and compositions of these clusters reflect the relative concentrations of these amino acids in solution, and that ionization and detection biases are significantly reduced. Lower accuracy was obtained for the 10-component mixtures, where values determined from the cluster abundances were typically within a factor of three of their solution molar fractions. The lower accuracy of this method with the more complex mixtures may be due to specific clustering effects arising from heterogeneity among components with significantly different physical properties, or it may result from lower S/N for the more heterogeneous clusters and from excluding the low-abundance, more highly heterogeneous clusters from this analysis. Although not as accurate as using traditional standards, this clustering method may find applications when suitable standards are not readily available.
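One way to picture how cluster abundances and compositions can yield molar fractions is the sketch below. It assumes an abundance-weighted count of each amino acid across clusters above a size cutoff. That weighting scheme, the function name and the input layout are illustrative assumptions, not the procedure reported in the paper.

```python
from collections import defaultdict


def molar_fractions_from_clusters(clusters, min_size=19):
    """Estimate molar fractions from cluster compositions (hypothetical helper).

    clusters: iterable of (composition, abundance) pairs, where composition
              maps an amino acid name to its count in that cluster.
    min_size: smallest cluster size (total molecules) to include; 19 follows
              the cutoff mentioned in the abstract.
    """
    weighted = defaultdict(float)
    for composition, abundance in clusters:
        size = sum(composition.values())
        if size < min_size:
            continue  # skip small clusters and protonated monomers
        for name, count in composition.items():
            # Assumption: each cluster contributes count * abundance.
            weighted[name] += abundance * count
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("no clusters at or above min_size")
    return {name: value / total for name, value in weighted.items()}


# Made-up example data: two 20-mers and one intense monomer signal.
example = [
    ({"Ser": 10, "Thr": 10}, 100.0),
    ({"Ser": 15, "Thr": 5}, 100.0),
    ({"Ser": 1}, 5000.0),
]
fractions = molar_fractions_from_clusters(example)
```

In this made-up example the intense monomer signal is ignored, so the estimate rests only on the two 20-mers.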
The APEX Quantitative Proteomics Tool: Generating protein quantitation estimates from LC-MS/MS proteomics results
Mass spectrometry (MS) based label-free protein quantitation has mainly focused on analysis of ion peak heights and peptide spectral counts. Most analyses of tandem mass spectrometry (MS/MS) data begin with an enzymatic digestion of a complex protein mixture to generate smaller peptides that can be separated and identified by an MS/MS instrument. Peptide spectral counting techniques attempt to quantify protein abundance by counting the number of detected tryptic peptides and their corresponding MS spectra. However, spectral counting is confounded by the fact that peptide physicochemical properties severely affect MS detection resulting in each peptide having a different detection probability. Lu et al. (2007) described a modified spectral counting technique, Absolute Protein Expression (APEX), which improves on basic spectral counting methods by including a correction factor for each protein (called O(i) value) that accounts for variable peptide detection by MS techniques. The technique uses machine learning classification to derive peptide detection probabilities that are used to predict the number of tryptic peptides expected to be detected for one molecule of a particular protein (O(i)). This predicted spectral count is compared to the protein's observed MS total spectral count during APEX computation of protein abundances. Results: The APEX Quantitative Proteomics Tool, introduced here, is a free open source Java application that supports the APEX protein quantitation technique. The APEX tool uses data from standard tandem mass spectrometry proteomics experiments and provides computational support for APEX protein abundance quantitation through a set of graphical user interfaces that partition the parameter controls for the various processing tasks.
The tool also provides a Z-score analysis for identification of significant differential protein expression, a utility to assess APEX classifier performance via cross validation, and a utility to merge multiple APEX results into a standardized format in preparation for further statistical analysis. Conclusion: The APEX Quantitative Proteomics Tool provides a simple means to quickly derive hundreds to thousands of protein abundance values from standard liquid chromatography-tandem mass spectrometry proteomics datasets. The APEX tool provides a straightforward intuitive interface design overlaying a highly customizable computational workflow to produce protein abundance values from LC-MS/MS datasets. Funding: National Institute of Allergy and Infectious Diseases (NIAID) N01-AI15447; National Institutes of Health; National Science Foundation, the Welsh and Packard Foundations; International Human Frontier Science Program; Center for Systems and Synthetic Biolog
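The comparison of observed and expected spectral counts can be sketched as follows. The sketch assumes the normalized form in which each protein's observed count, weighted by an identification probability, is divided by its O(i) value and then scaled so that the values sum to a chosen total. The function name, argument layout and that exact normalization are assumptions made for illustration. This is not the APEX tool's Java API.

```python
def apex_abundances(observed_counts, expected_counts,
                    id_probabilities=None, total_molecules=1.0):
    """Hypothetical helper: APEX-style abundance estimates.

    observed_counts:  protein id -> observed total spectral count
    expected_counts:  protein id -> O(i), the expected count for one molecule
    id_probabilities: protein id -> identification probability (defaults to 1)
    total_molecules:  scaling constant; 1.0 yields relative abundances
    """
    if id_probabilities is None:
        id_probabilities = {pid: 1.0 for pid in observed_counts}
    corrected = {}
    for pid, n_obs in observed_counts.items():
        o_i = expected_counts[pid]
        if o_i <= 0:
            raise ValueError(f"O(i) must be positive for {pid}")
        # Correct the raw count for how detectable the protein's peptides are.
        corrected[pid] = n_obs * id_probabilities[pid] / o_i
    total = sum(corrected.values())
    if total == 0:
        raise ValueError("all corrected counts are zero")
    return {pid: total_molecules * value / total
            for pid, value in corrected.items()}


# Two proteins with equal raw counts but different detectability.
abundances = apex_abundances({"A": 30, "B": 30}, {"A": 3.0, "B": 1.0})
```

Under these assumptions, two proteins with identical raw spectral counts receive different abundance estimates, because the protein whose peptides are easier to detect (larger O(i)) is corrected downward.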
An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++
Many mass spectrometry-based studies, as well as other biological experiments produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm overfitting the training data and producing overoptimistic estimated error rates and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject replicate sample set, reducing the dataset size and incurring loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were found to be more severely affected by the reduction in sample size which led to poorer classification and variable selection accuracy. Perhaps most importantly our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data. 
Source code and stand-alone compiled versions of command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux as well as a user manual (Supplementary File S2) are available for download at: http://sourceforge.org/projects/rfpp/ under the GNU public license.
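The idea behind subject-level bootstrapping can be sketched as below: subjects, not individual replicate rows, are drawn with replacement, and all replicates of a drawn subject enter the in-bag sample together. This is a minimal illustration of the general resampling idea only. The function name and data layout are assumptions, and the sketch is not taken from the RF++ source code.

```python
import random
from collections import defaultdict


def subject_level_bootstrap(subject_ids, rng=None):
    """Hypothetical helper: draw one bootstrap sample at the subject level.

    subject_ids: sequence giving the subject label of each data row.
    Returns (in_bag_rows, out_of_bag_rows) as lists of row indices.
    """
    rng = rng or random.Random()
    rows_by_subject = defaultdict(list)
    for row, subject in enumerate(subject_ids):
        rows_by_subject[subject].append(row)
    subjects = list(rows_by_subject)
    # Sample as many subjects as exist, with replacement.
    drawn = [rng.choice(subjects) for _ in subjects]
    # Every replicate of a drawn subject goes in-bag (repeats are kept).
    in_bag = [row for s in drawn for row in rows_by_subject[s]]
    drawn_set = set(drawn)
    # Subjects never drawn contribute all of their replicates out-of-bag.
    out_of_bag = [row for s in subjects if s not in drawn_set
                  for row in rows_by_subject[s]]
    return in_bag, out_of_bag


# Three subjects with two replicate measurements each.
labels = ["s1", "s1", "s2", "s2", "s3", "s3"]
in_bag, out_of_bag = subject_level_bootstrap(labels, random.Random(0))
```

Because no subject is split between the in-bag and out-of-bag sets, an out-of-bag error estimate is not computed on replicates of subjects the tree has already seen. That kind of leakage would explain the underestimated run-time error rate reported for URF.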
Accurate peak list extraction from proteomic mass spectra for identification and profiling studies
Background: Mass spectrometry is an essential technique in proteomics both to identify the proteins of a biological sample and to compare proteomic profiles of different samples. In both cases, the main phase of the data analysis is the procedure to extract the significant features from a mass spectrum. Its final output is the so-called peak list, which contains the mass, the charge and the intensity of every detected biomolecule. The main steps of the peak list extraction procedure are usually preprocessing, peak detection, peak selection, charge determination and monoisotoping. Results: This paper describes an original algorithm for peak list extraction from low and high resolution mass spectra. It has been developed principally to improve the precision of peak extraction in comparison to other reference algorithms. It contains many innovative features, including a sophisticated method for managing overlapping isotopic distributions. Conclusions: The performance of the basic version of the algorithm and of its optional functionalities has been evaluated in this paper on SELDI-TOF, MALDI-TOF and ESI-FTICR ECD mass spectra. Executable files of MassSpec, a MATLAB implementation of the peak list extraction procedure for Windows and Linux systems, can be downloaded free of charge for nonprofit institutions from the following web site: http://aimed11.unipv.it/MassSpec
Recipes for sparse LDA of horizontal data
Many important modern applications require analyzing data with more variables than observations, called horizontal data for short. In such situations the classical Fisher’s linear discriminant analysis (LDA) has no solution because the within-group scatter matrix is singular. Moreover, the number of variables is usually huge, and the classical type of solutions (discriminant functions) are difficult to interpret as they involve all available variables. Nowadays, the aim is to develop fast and reliable algorithms for sparse LDA of horizontal data. The resulting discriminant functions depend on very few original variables, which facilitates their interpretation. The main theoretical and numerical challenge is how to cope with the singularity of the within-group scatter matrix. This work aims at classifying the existing approaches according to the way they tackle this singularity issue, and at suggesting new ones.
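The singularity follows from a simple rank count. The notation here is introduced only for illustration and is not taken from the paper: n observations x_i in p variables, split into g groups G_k with group means \bar{x}_k, and between-group scatter matrix S_B. The within-group scatter matrix and its rank bound are then:

```latex
S_W = \sum_{k=1}^{g} \sum_{i \in G_k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^{\top},
\qquad
\operatorname{rank}(S_W) \le n - g < p \quad \text{whenever } p > n .
```

Since S_W is a p-by-p matrix of rank below p, it cannot be inverted, so the classical discriminant directions, obtained as eigenvectors of S_W^{-1} S_B, are not defined for horizontal data.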
The influence of cultivation methods on Shewanella oneidensis physiology and proteome expression
High-throughput analyses that are central to microbial systems biology and ecophysiology research benefit from highly homogeneous and physiologically well-defined cell cultures. While attention has focused on the technical variation associated with high-throughput technologies, biological variation introduced as a function of cell cultivation methods has been largely overlooked. This study evaluated the impact of cultivation methods, controlled batch or continuous culture in bioreactors versus shake flasks, on the reproducibility of global proteome measurements in Shewanella oneidensis MR-1. Variability in dissolved oxygen concentration and consumption rate, metabolite profiles, and proteome was greater in shake flask than in controlled batch or chemostat cultures. Proteins indicative of suboxic and anaerobic growth (e.g., fumarate reductase and decaheme c-type cytochromes) were more abundant in cells from shake flasks compared to bioreactor cultures, a finding consistent with data demonstrating that “aerobic” flask cultures were O2 deficient due to poor mass transfer kinetics. The work described herein establishes the necessity of controlled cultivation for ensuring highly reproducible and homogeneous microbial cultures. By decreasing cell-to-cell variability, higher quality samples will allow for the interpretive accuracy necessary for drawing conclusions relevant to microbial systems biology research.
Differential Proteomic Analysis of Mammalian Tissues Using SILAM
Differential expression of proteins between tissues underlies organ-specific functions. Under certain pathological conditions, this may also lead to tissue vulnerability. Furthermore, post-translational modifications exist between different cell types and pathological conditions. We employed SILAM (Stable Isotope Labeling in Mammals) combined with mass spectrometry to quantify the proteome between mammalian tissues. Using 15N labeled rat tissue, we quantified 3742 phosphorylated peptides in nuclear extracts from liver and brain tissue. Analysis of the phosphorylation sites revealed tissue-specific kinase motifs. Although these tissues are quite different in their composition and function, more than 500 protein identifications were common to both tissues. Specifically, we identified an up-regulation in the brain of the phosphoprotein, ZFHX1B, in which a genetic deletion causes the neurological disorder Mowat–Wilson syndrome. Finally, pathway analysis revealed distinct nuclear pathways enriched in each tissue. Our findings provide a valuable resource as a starting point for further understanding of tissue-specific gene regulation and demonstrate SILAM as a useful strategy for the differential proteomic analysis of mammalian tissues.
