Delete or merge regressors for linear model selection
We consider a problem of linear model selection in the presence of both
continuous and categorical predictors. Feasible models consist of subsets of
numerical variables and partitions of levels of factors. A new algorithm called
delete or merge regressors (DMR) is presented which is a stepwise backward
procedure involving ranking the predictors according to squared t-statistics
and choosing the final model minimizing BIC. In the article we prove
consistency of DMR when the number of predictors tends to infinity with the
sample size and describe a simulation study using the accompanying R package. The
results indicate a significant advantage in time complexity and selection
accuracy of our algorithm over Lasso-based methods described in the literature.
Moreover, a version of DMR for generalized linear models is proposed.
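The backward procedure described above can be illustrated with a minimal sketch. It is not the authors' R implementation: it covers only the "delete" step for continuous predictors (merging of factor levels is omitted), and the function name `dmr_delete_only` is hypothetical. Predictors are ranked by squared t-statistics from the full least-squares fit, a nested family of models is built by dropping them in that order, and the member with the smallest BIC is returned.

```python
import numpy as np

def dmr_delete_only(X, y):
    """Illustrative delete-only selection: rank by squared t-statistics, pick min BIC.

    Assumes n > p + 1 and a full-rank design; the intercept is always kept.
    """
    n, p = X.shape
    Xi = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    resid = y - Xi @ beta
    sigma2 = resid @ resid / (n - p - 1)           # unbiased noise variance estimate
    cov = sigma2 * np.linalg.inv(Xi.T @ Xi)        # covariance of the OLS estimator
    t2 = beta[1:] ** 2 / np.diag(cov)[1:]          # squared t-statistics (no intercept)
    order = np.argsort(t2)                         # least significant predictor first

    def bic(cols):
        # Gaussian BIC up to an additive constant: n*log(RSS/n) + k*log(n)
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        b, *_ = np.linalg.lstsq(Z, y, rcond=None)
        rss = np.sum((y - Z @ b) ** 2)
        return n * np.log(rss / n) + Z.shape[1] * np.log(n)

    # Nested family: drop the k least significant predictors, k = 0..p
    candidates = [sorted(order[k:].tolist()) for k in range(p + 1)]
    return min(candidates, key=bic)
```

Calling `dmr_delete_only(X, y)` returns the column indices of the selected predictors. Because only p + 1 nested models are compared rather than all subsets, the cost stays low, which is in line with the time-complexity advantage the abstract reports.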
A systematic analysis of scoring functions in rigid-body protein docking: The delicate balance between the predictive rate improvement and the risk of overtraining
Protein-protein interactions play fundamental roles in biological processes including signaling, metabolism, and trafficking. While the structure of a protein complex reveals crucial details about the interaction, it is often difficult to acquire this information experimentally. As the number of interactions discovered increases faster than they can be characterized, protein-protein docking calculations may be able to reduce this disparity by providing models of the interacting proteins. Rigid-body docking is a widely used docking approach, and is often capable of generating a pool of models within which a near-native structure can be found. These models need to be scored in order to select the acceptable ones from the set of poses. Recently, more than 100 scoring functions from the CCharPPI server were evaluated for this task using decoy structures generated with SwarmDock. Here, we extend this analysis to identify the predictive success rates of the scoring functions on decoys from three rigid-body docking programs, ZDOCK, FTDock, and SDOCK, allowing us to assess the transferability of the functions. We also apply a set-theoretic measure to test whether the scoring functions are capable of identifying near-native poses within different subsets of the benchmark. This information can guide the choice of the most efficient scoring function for each docking method and inform future scoring function development.
Grant sponsor: MINECO BIO2013-48213-R; Grant sponsor: CONACyT (D.B.-B.); Grant sponsor: EC FP7-PEOPLE (I.H.M.) PIEF-GA-2012-327899; Grant sponsor: BSRC (I.H.M.) BB/N011600/1.
AAindex: amino acid index database, progress report 2008
AAindex is a database of numerical indices representing various physicochemical and biochemical properties of amino acids and pairs of amino acids. We have added a collection of protein contact potentials to the AAindex as a new section. Accordingly, AAindex now consists of three sections: AAindex1 for the amino acid index of 20 numerical values, AAindex2 for the amino acid substitution matrix and AAindex3 for the statistical protein contact potentials. All data are derived from published literature. The database can be accessed through the DBGET/LinkDB system at GenomeNet (http://www.genome.jp/dbget-bin/www_bfind?aaindex) or downloaded by anonymous FTP (ftp://ftp.genome.jp/pub/db/community/aaindex/).
Inferring Ideal Amino Acid Interaction Forms From Statistical Protein Contact Potentials
We have analyzed 29 different published matrices of protein pairwise contact potentials (CPs) between amino acids derived from different sets of proteins, either crystallographic structures taken from the Protein Data Bank (PDB) or computer-generated decoys. Each of the CPs is similar to 1 of the 2 matrices derived in the work of Miyazawa and Jernigan (Proteins 1999;34:49–68). The CP matrices of the first class can be approximated with a correlation of order 0.9 by the formula $e_{ij} = h_i + h_j$, $1 \le i, j \le 20$, where the residue-type dependent factor $h$ is highly correlated with the frequency of occurrence of a given amino acid type inside proteins. Electrostatic interactions for the potentials of this class are almost negligible. In the potentials belonging to this class, the major contribution to the potentials is the one-body transfer energy of the amino acid from water to the protein environment. Potentials belonging to the second class can be approximated with a correlation of 0.9 by the formula $e_{ij} = c_0 - h_i h_j + q_i q_j$, where $c_0$ is a constant, $h$ is highly correlated with the Kyte–Doolittle hydrophobicity scale, and a new, less dominant, residue-type dependent factor $q$ is correlated (~0.9) with amino acid isoelectric points pI. Including electrostatic interactions significantly improves the approximation for this class of potentials. While the high correlation between potentials of the first class and the hydrophobic transfer energies is well known, the fact that this approximation can work well also for the second class of potentials is a new finding. We interpret potentials of this class as representing energies of contact of amino acid pairs within an average protein environment.
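As an illustration of the first-class approximation e_ij = h_i + h_j, the sketch below fits the factors h to a symmetric contact-potential matrix by least squares and reports the correlation between the matrix and its approximation. This is a hedged example, not the authors' fitting procedure: the function name `fit_additive` is hypothetical, the fit uses all matrix entries including the diagonal, and it relies on the closed-form least-squares solution for a symmetric matrix (row mean minus half the grand mean).

```python
import numpy as np

def fit_additive(E):
    """Least-squares fit of e_ij ~ h_i + h_j for a symmetric matrix E.

    Returns the fitted factors h and the Pearson correlation between
    E and the additive approximation, taken over all entries.
    """
    E = np.asarray(E, dtype=float)
    # Closed-form minimizer of sum_ij (e_ij - h_i - h_j)^2 when E is symmetric
    h = E.mean(axis=1) - E.mean() / 2.0
    approx = h[:, None] + h[None, :]
    r = np.corrcoef(E.ravel(), approx.ravel())[0, 1]
    return h, r
```

For a 20 x 20 contact-potential matrix, a correlation `r` close to 0.9 would correspond to the quality of approximation the abstract reports for first-class potentials.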
Distance Matrix-Based Approach to Protein Structure Prediction
Much structural information is encoded in the internal distances; a distance matrix-based approach can be used to predict protein structure and dynamics, and for structural refinement. Our approach is based on the square distance matrix $D = [r_{ij}^2]$ containing all square distances between residues in proteins. This distance matrix contains more information than the contact matrix $C$, which has elements of either 0 or 1 depending on whether the distance $r_{ij}$ is greater or less than a cutoff value $r_{\mathrm{cutoff}}$. We have performed spectral decomposition of the distance matrices, $D = \sum_k \lambda_k v_k v_k^T$, in terms of eigenvalues $\lambda_k$ and the corresponding eigenvectors $v_k$, and found that it contains at most 5 nonzero terms. A dominant eigenvector is proportional to $r^2$, the square distance of points from the center of mass, with the next three being the principal components of the system of points. By knowing $r^2$ we can approximate a distance matrix of a protein with an expected RMSD value of about 4.5 Å. We can also explain the role of hydrophobic interactions for the protein structure, because $r$ is highly correlated with the hydrophobic profile of the sequence. Moreover, $r$ is highly correlated with several sequence profiles which are useful in protein structure prediction, such as contact number, the residue-wise contact order (RWCO) or mean square fluctuations (i.e. crystallographic temperature factors). We have also shown that the next three components are related to spatial directionality of the secondary structure elements, and they may be also predicted from the sequence, improving overall structure prediction. We have also shown that the large number of available HIV-1 protease structures provides a remarkable sampling of conformations, which can be viewed as direct structural information about the dynamics. After structure matching, we apply principal component analysis (PCA) to obtain the important apparent motions for both bound and unbound structures.
There are significant similarities between the first few key motions and the first few low-frequency normal modes calculated from a static representative structure with an elastic network model (ENM) that is based on the contact matrix $C$ (related to $D$), strongly suggesting that the variations among the observed structures and the corresponding conformational changes are facilitated by the low-frequency, global motions intrinsic to the structure. Similarities are also found when the approach is applied to an NMR ensemble, as well as to atomic molecular dynamics (MD) trajectories. Thus, a sufficiently large number of experimental structures can directly provide important information about protein dynamics, but ENM can also provide a similar sampling of conformations. Finally, we use distance constraints from databases of known protein structures for structure refinement. We use the distributions of distances of various types in known protein structures to obtain the most probable ranges or the mean-force potentials for the distances. We then impose these constraints on structures to be refined or include the mean-force potentials directly in the energy minimization so that more plausible structural models can be built. We successfully used this approach in 2006 in the CASPR structure refinement (http://predictioncenter.org/caspR).
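The statement that the spectral decomposition of the square distance matrix has at most 5 nonzero terms can be checked numerically. The sketch below is an illustration on synthetic 3D points rather than on protein coordinates; the function name `count_nonzero_eigenvalues` and the relative tolerance are assumptions made for this example. The bound follows because each entry r_ij^2 = |x_i|^2 + |x_j|^2 - 2 x_i · x_j is a sum of two rank-one terms and a Gram matrix of rank at most 3.

```python
import numpy as np

def count_nonzero_eigenvalues(coords, rel_tol=1e-8):
    """Count eigenvalues of the squared-distance matrix that are numerically nonzero."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    D = np.sum(diff ** 2, axis=-1)                 # D[i, j] = squared distance r_ij^2
    eigvals = np.linalg.eigvalsh(D)                # D is symmetric, so eigvalsh applies
    threshold = rel_tol * np.max(np.abs(eigvals))
    return int(np.sum(np.abs(eigvals) > threshold))
```

Passing an array of shape (N, 3), for example C-alpha coordinates, should give a count of no more than 5 regardless of N.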
On consensus biomarker selection
Background: Recent development of mass spectrometry technology enabled the analysis of complex peptide mixtures. A lot of effort is currently devoted to the identification of biomarkers in human body fluids like serum or plasma, based on which new diagnostic tests for different diseases could be constructed. Various biomarker selection procedures have been exploited in recent studies. It has been noted that they often lead to different biomarker lists and as a consequence, the patient classification may also vary. Results: Here we propose a new approach to the biomarker selection problem: to apply several competing feature ranking procedures and compute a consensus list of features based on their outcomes. We validate our methods on two proteomic datasets for the diagnosis of ovarian and prostate cancer. Conclusion: The proposed methodology can improve the classification results and at the same time provide a unified biomarker list for further biological examinations and interpretation.
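One simple way to combine several feature rankings into a consensus list is rank-sum (Borda-style) aggregation, sketched below. The abstract does not specify which aggregation rule the authors use, so this is an assumed stand-in for illustration only; the function name `consensus_ranking` is hypothetical, and every ranking is assumed to order the same set of features, best first.

```python
def consensus_ranking(rankings):
    """Rank-sum aggregation of several rankings of the same features.

    Each ranking lists feature names from best to worst. Features are
    ordered by their summed positions; ties are broken by name.
    """
    features = rankings[0]
    totals = {f: 0 for f in features}
    for ranking in rankings:
        for position, f in enumerate(ranking):
            totals[f] += position
    return sorted(features, key=lambda f: (totals[f], f))
```

For example, three rankings `["a", "b", "c"]`, `["b", "a", "c"]` and `["a", "c", "b"]` give position sums of 1, 3 and 5 for `a`, `b` and `c`, so the consensus order is `["a", "b", "c"]`.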
