Minimax rates of convergence for nonparametric location-scale models
This paper studies minimax rates of convergence for nonparametric
location-scale models, which include mean, quantile and expectile regression
settings. Under Hellinger differentiability on the error distribution and other
mild conditions, we show that the minimax rate of convergence for estimating
the regression function under the squared loss is determined by the
metric entropy of the nonparametric function class. Different error
distributions, including asymmetric Laplace distribution, asymmetric connected
double truncated gamma distribution, connected normal-Laplace distribution,
Cauchy distribution and asymmetric normal distribution are studied as examples.
Applications to low order interaction models and multiple index models are also
given.
High accuracy momentum compaction measurement for the APS storage ring with undulator radiation
ProtSolM: Protein Solubility Prediction with Multi-modal Features
Understanding protein solubility is essential for the functional
applications of proteins. Computational methods for predicting protein solubility are
crucial for reducing experimental costs and enhancing the efficiency and
success rates of protein engineering. Existing methods either construct a
supervised learning scheme on small-scale datasets with manually processed
physicochemical properties, or blindly apply pre-trained protein language
models to extract amino acid interaction information. The scale and quality of
available training datasets leave significant room for improvement in terms of
accuracy and generalization. To address these research gaps, we propose ProtSolM, a
novel deep learning method that combines pre-training and fine-tuning schemes
for protein solubility prediction. ProtSolM integrates information from
multiple dimensions, including physicochemical properties, amino acid
sequences, and protein backbone structures. Our model is trained using PDBSol,
the largest solubility dataset that we have constructed. PDBSol includes over
protein sequences and structures. We provide a comprehensive
leaderboard of existing statistical learning and deep learning methods on
independent datasets with computational and experimental labels. ProtSolM
achieved state-of-the-art performance across various evaluation metrics,
demonstrating its potential to significantly advance the accuracy of protein
solubility prediction.
Comment: 10 pages, 7 figures, 9 tables
CNETML: Maximum likelihood inference of phylogeny from copy number profiles of spatio-temporal samples
Phylogenetic trees based on copy number alterations (CNAs) for multi-region samples of a single cancer patient help in understanding the spatio-temporal evolution of cancers, especially in tumours driven by chromosomal instability. Due to the high cost of deep sequencing data, low-coverage data are more accessible in practice, but they only allow the calling of (relative) total copy numbers because of their lower resolution. However, methods to reconstruct sample phylogenies from CNAs often use allele-specific copy numbers, and those using total copy numbers are mostly distance-matrix or maximum parsimony methods, which do not handle temporal data or estimate mutation rates. In this work, we developed a new maximum likelihood method based on a novel evolutionary model of CNAs, CNETML, to infer phylogenies from spatio-temporal samples taken within a single patient. CNETML is the first program to jointly infer the tree topology, node ages, and mutation rates from total copy numbers when samples were taken at different time points. Our extensive simulations suggest CNETML performed well even on relative copy numbers with subclonal whole genome doubling events and under slight violation of model assumptions. The application of CNETML to real data from Barrett’s esophagus patients also generated results consistent with previous discoveries, along with novel early CNAs for further investigation.
Protein Representation Learning with Sequence Information Embedding: Does it Always Lead to a Better Performance?
Deep learning has become a crucial tool in studying proteins. While the
significance of modeling protein structure has been discussed extensively in
the literature, amino acid types are typically included in the input as a
default operation for many inference tasks. This study demonstrates with a
structure alignment task that embedding amino acid types may not, in some cases,
help a deep learning model learn a better representation. To this end, we propose
ProtLOCA, a local geometry alignment method based solely on amino acid
structure representation. The effectiveness of ProtLOCA is examined by a global
structure-matching task on protein pairs with an independent test dataset based
on CATH labels. Our method outperforms existing sequence- and structure-based
representation learning methods by more quickly and accurately matching
structurally consistent protein domains. Furthermore, in local structure
pairing tasks, ProtLOCA for the first time provides a valid solution to
highlight common local structures among proteins with different overall
structures but the same function. This suggests a new possibility for using
deep learning methods to analyze protein structure to infer function.
Comment: 8 pages, 4 figures
