Torus principal component analysis with applications to RNA structure
There are several cutting edge applications needing PCA methods for data on tori, and we propose a novel torus-PCA method that adaptively favors low-dimensional representations while preventing overfitting by a new test—both of which can be generally applied and address shortcomings in two previously proposed PCA methods. Unlike tangent space PCA, our torus-PCA features structure fidelity by honoring the cyclic topology of the data space and, unlike geodesic PCA, produces nonwinding, nondense descriptors. These features are achieved by deforming tori into spheres with self-gluing and then using a variant of the recently developed principal nested spheres analysis. This PCA analysis involves a step of subsphere fitting, and we provide a new test to avoid overfitting. We validate our torus-PCA by application to an RNA benchmark data set. Further, using a larger RNA data set, torus-PCA recovers previously found structure, now globally at the one-dimensional representation, which is not accessible via tangent space PCA
Distribution of Hβ hyperfine couplings in a tyrosyl radical revealed by 263 GHz ENDOR spectroscopy
1H ENDOR spectra of tyrosyl radicals (Y∙) have been the subject of numerous EPR spectroscopic studies due to their importance in biology. Nevertheless, assignment of all internal 1H hyperfine couplings has been challenging because of substantial spectral overlap. Recently, using 263 GHz ENDOR in conjunction with statistical analysis, we could identify the signature of the Hβ2 coupling in the essential Y122 radical of Escherichia coli ribonucleotide reductase, and modeled it with a distribution of radical conformations. Here, we demonstrate that this analysis can be extended to the full-width 1H ENDOR spectra that contain the larger Hβ1 coupling. The Hβ2 and Hβ1 couplings are related to each other through the ring dihedral and report on the amino acid conformation. The 263 GHz ENDOR data, acquired in batches instead of averaging, and data processing by a new “drift model” allow reconstructing the ENDOR spectra with statistically meaningful confidence intervals and separating them from baseline distortions. Spectral simulations using a distribution of ring dihedral angles confirm the presence of a conformational distribution, consistent with the previous analysis of the Hβ2 coupling. The analysis was corroborated by 94 GHz 2H ENDOR of deuterated Y∙122. These studies provide a starting point to investigate low populated states of tyrosyl radicals in greater detail
Drift Models on Complex Projective Space for Electron-Nuclear Double Resonance
ENDOR spectroscopy is an important tool to determine the complicated three-dimensional structure of biomolecules and in particular enables measurements of intramolecular distances. Usually, spectra are determined by averaging the data matrix, which does not take into account the significant thermal drifts that occur in the measurement process. In contrast, we present an asymptotic analysis for the homoscedastic drift model, a pioneering parametric model that achieves striking model fits in practice and allows both hypothesis testing and confidence intervals for spectra. The ENDOR spectrum and an orthogonal component are modeled as an element of complex projective space, and formulated in the framework of generalized Fréchet means. To this end, two general formulations of strong consistency for set-valued Fréchet means are extended and subsequently applied to the homoscedastic drift model to prove strong consistency. Building on this, central limit theorems for the ENDOR spectrum are shown. Furthermore, we extend applicability by taking into account a phase noise contribution leading to the heteroscedastic drift model. Both drift models offer improved signal-to-noise ratio over pre-existing models
Statistical analysis of ENDOR spectra
Electron–nuclear double resonance (ENDOR) measures the hyperfine interaction of magnetic nuclei with paramagnetic centers and is hence a powerful tool for spectroscopic investigations extending from biophysics to material science. Progress in microwave technology and the recent availability of commercial electron paramagnetic resonance (EPR) spectrometers up to an electron Larmor frequency of 263 GHz now open the opportunity for a more quantitative spectral analysis. Using representative spectra of a prototype amino acid radical in a biologically relevant enzyme, the Y∙122 in Escherichia coli ribonucleotide reductase, we developed a statistical model for ENDOR data and conducted statistical inference on the spectra including uncertainty estimation and hypothesis testing. Our approach in conjunction with 1H/2H isotopic labeling of Y∙122 in the protein unambiguously established new unexpected spectral contributions. Density functional theory (DFT) calculations and ENDOR spectral simulations indicated that these features result from the beta-methylene hyperfine coupling and are caused by a distribution of molecular conformations, likely important for the biological function of this essential radical. The results demonstrate that model-based statistical analysis in combination with state-of-the-art spectroscopy accesses information hitherto beyond standard approaches
Bayesian optimization to estimate hyperfine couplings from 19F ENDOR spectra
ENDOR spectroscopy is a fundamental method to detect nuclear spins in the vicinity of paramagnetic centers and their mutual hyperfine interaction. Recently, site-selective introduction of 19F as nuclear labels has been proposed as a tool for ENDOR-based distance determination in biomolecules, complementing pulsed dipolar spectroscopy in the range of angstrom to nanometer. Nevertheless, one main challenge of ENDOR still consists of its spectral analysis, which is aggravated by a large parameter space and broad resonances from hyperfine interactions. Additionally, at high EPR frequencies and fields (⩾94 GHz/3.4 Tesla), chemical shift anisotropy might contribute to broadening and asymmetry in the spectra. Here, we use two nitroxide-fluorine model systems to examine a statistical approach to finding the best parameter fit to experimental 263 GHz 19F ENDOR spectra. We propose Bayesian optimization for a rapid, global parameter search with little prior knowledge, followed by a refinement by more standard gradient-based fitting procedures. Indeed, the latter suffer from finding local rather than global minima of a suitably defined loss function. Using a new and accelerated simulation procedure, results for the semi-rigid nitroxide-fluorine two and three spin systems lead to physically reasonable solutions, if minima of similar loss can be distinguished by DFT predictions. The approach also delivers the stochastic error of the obtained parameter estimates. Future developments and perspectives are discussed
Adsorption and deuterium NMR of ammonia, pyridine, dimethylamine, and benzene on rutile and anatase
Learning torus PCA based classification for multiscale RNA correction with application to SARS-CoV-2
Three-dimensional RNA structures frequently contain atomic clashes. Usually, corrections approximate the biophysical chemistry, which is computationally intensive and often does not correct all clashes. We propose fast, data-driven reconstructions from clash-free benchmark data with two-scale shape analysis: microscopic (suites) dihedral backbone angles, mesoscopic sugar ring centre landmarks. Our analysis relates concentrated mesoscopic scale neighbourhoods to microscopic scale clusters, correcting within-suite-backbone-to-backbone clashes exploiting angular shape and size-and-shape Fréchet means. Validation shows that learned classes highly correspond with literature clusters and reconstructions are well within physical resolution. We illustrate the power of our method using cutting-edge SARS-CoV-2 RNA.
Diffusion means in geometric spaces
We introduce a location statistic for distributions on non-linear geometric spaces, the diffusion mean, serving as an extension and an alternative to the Fréchet mean. The diffusion mean arises as the generalization of Gaussian maximum likelihood analysis to non-linear spaces by maximizing the likelihood of a Brownian motion. The diffusion mean depends on a time parameter t, which admits the interpretation of the allowed variance of the diffusion. The diffusion t-mean of a distribution X is the most likely origin of a Brownian motion at time t, given the end-point distribution X. We give a detailed description of the asymptotic behavior of the diffusion estimator and provide sufficient conditions for the diffusion estimator to be strongly consistent. Particularly, we present a smeary central limit theorem for diffusion means and we show that joint estimation of the mean and diffusion variance rules out smeariness in all directions simultaneously in general situations. Furthermore, we investigate properties of the diffusion mean for distributions on the sphere S^m. Experimentally, we consider simulated data and data from magnetic pole reversals, all indicating similar or improved convergence rate compared to the Fréchet mean. Here, we additionally estimate t and consider its effects on smeariness and uniqueness of the diffusion mean for distributions on the sphere.
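The definition of the diffusion t-mean as the most likely origin of a Brownian motion can be made concrete in the simplest case, the circle S^1. The following is only an illustrative sketch, not the authors' estimator: it assumes the convention that Brownian motion at time t has variance t (so the heat kernel on the circle is a wrapped normal density), truncates the wrapping sum, and uses a plain grid search. The function names `circle_heat_kernel` and `diffusion_mean_circle` are hypothetical.

```python
import numpy as np

def circle_heat_kernel(diff, t, n_wraps=10):
    """Heat kernel on the circle, evaluated at angular differences `diff`.

    Assumption: Brownian motion with variance t at time t, so the kernel
    is a wrapped normal density; the infinite sum is truncated at n_wraps.
    """
    k = np.arange(-n_wraps, n_wraps + 1)
    shifted = diff[..., None] + 2.0 * np.pi * k
    return np.exp(-shifted**2 / (2.0 * t)).sum(axis=-1) / np.sqrt(2.0 * np.pi * t)

def diffusion_mean_circle(angles, t, grid_size=3600):
    """Grid-search the origin that maximizes the Brownian-motion likelihood."""
    angles = np.asarray(angles, dtype=float)
    candidates = np.linspace(-np.pi, np.pi, grid_size, endpoint=False)
    # Rows index candidate origins, columns index observed end points.
    diffs = angles[None, :] - candidates[:, None]
    log_lik = np.log(circle_heat_kernel(diffs, t)).sum(axis=1)
    return candidates[np.argmax(log_lik)]

# Toy sample (made up for illustration), symmetric around 0.5 rad.
sample = 0.5 + np.array([-0.2, -0.1, 0.0, 0.1, 0.2])
estimate = diffusion_mean_circle(sample, t=0.5)
```

For a concentrated, symmetric sample like this one the estimate lands next to the centre of symmetry; the interesting behaviour described in the abstract (dependence on t, smeariness, non-uniqueness) only shows up for spread-out samples and is not explored here.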
Principal component analysis and clustering on manifolds
Big data, high dimensional data, sparse data, large scale data, and imaging data are all becoming new frontiers of statistics. Changing technologies have created this flood and have led to a real hunger for new modeling strategies and data analysis by scientists. In many cases data are not Euclidean; for example, in molecular biology, the data sit on manifolds. Even on a simple non-Euclidean manifold such as the circle, summarizing angles by their arithmetic average does not make sense, so more care is needed. Thus non-Euclidean settings throw up many major challenges, both mathematical and statistical. This paper focuses on PCA and clustering methods for some manifolds, methods that are among the core topics of multivariate analysis.
We deal with two key manifolds from a practical point of view, namely spheres and tori. Dimension reduction on non-Euclidean manifolds with PCA-like methods has long been a challenging task, but recently there have been some breakthroughs. One is the idea of nested spheres; another is to transform a torus into a sphere and subsequently apply nested spheres PCA. We also provide a new clustering method for multivariate analysis with a fundamental property required in molecular biology: it penalizes wrong assignments in order to avoid chemically no-go areas. We give various examples to illustrate these methods. One of the important examples deals with COVID-19 data.
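The point made above about averaging angles on the circle can be checked with a two-angle example. The sketch below is an illustration only and is not taken from the paper: it compares the arithmetic average with the standard circular (extrinsic) mean, obtained by averaging the unit vectors and taking the direction of the result. The function names and the sample angles are chosen here for illustration.

```python
import math

def arithmetic_mean(angles_deg):
    """Naive average that ignores the wrap-around at 360 degrees."""
    return sum(angles_deg) / len(angles_deg)

def circular_mean(angles_deg):
    """Direction of the summed unit vectors, reported in [0, 360)."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

# Two directions that are only 30 degrees apart, on either side of 0.
angles = [350.0, 20.0]
naive = arithmetic_mean(angles)   # 185 degrees, pointing the opposite way
circ = circular_mean(angles)      # about 5 degrees, between the two angles
```

The naive average points almost exactly away from both observations, while the circular mean lies between them. The circular mean used here is the extrinsic mean; an intrinsic Fréchet mean would minimize summed squared arc lengths instead and gives the same answer for this sample.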
