Zero-Truncated Poisson Tensor Factorization for Massive Binary Tensors
We present a scalable Bayesian model for low-rank factorization of massive
tensors with binary observations. The proposed model has the following key
properties: (1) in contrast to the models based on the logistic or probit
likelihood, using a zero-truncated Poisson likelihood for binary data allows
our model to scale up in the number of \emph{ones} in the tensor, which is
especially appealing for massive but sparse binary tensors; (2)
side-information in the form of binary pairwise relationships (e.g., an adjacency
network) between objects in any tensor mode can also be leveraged, which can be
especially useful in "cold-start" settings; and (3) the model admits simple
Bayesian inference via batch, as well as \emph{online} MCMC; the latter allows
scaling up even for \emph{dense} binary data (i.e., when the number of ones in
the tensor/network is also massive). In addition, non-negative factor matrices
in our model provide easy interpretability, and the tensor rank can be inferred
from the data. We evaluate our model on several large-scale real-world binary
tensors, achieving excellent computational scalability, and also demonstrate
its usefulness in leveraging side-information provided in the form of
mode-network(s).
Comment: UAI (Uncertainty in Artificial Intelligence) 201
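The scaling claim in property (1) can be made concrete with a minimal sketch. Under a zero-truncated-Poisson-style link, P(y = 1) = 1 - exp(-λ), with λ given by a non-negative CP factorization, the log-likelihood term over the zeros collapses to a closed-form sum, so only the coordinates of the ones need to be enumerated. The function name and parameterization below are illustrative, not the paper's exact model:

```python
import numpy as np

def binary_cp_loglik(ones_idx, U, V, W):
    """Log-likelihood of a binary tensor under P(y=1) = 1 - exp(-lambda),
    lambda_ijk = sum_r U[i,r] V[j,r] W[k,r], with non-negative factors.
    ones_idx: (N, 3) int array of the coordinates where y = 1."""
    i, j, k = ones_idx.T
    # Rates only at the ones -- cost scales with the number of ones:
    lam_ones = np.einsum('nr,nr,nr->n', U[i], V[j], W[k])
    # Sum of lambda over *all* I*J*K entries, in O(rank*(I+J+K)) time:
    lam_total = np.dot(U.sum(0) * V.sum(0), W.sum(0))
    # log P(y=1) at the ones; each zero contributes log P(y=0) = -lambda:
    return np.sum(np.log1p(-np.exp(-lam_ones))) - (lam_total - lam_ones.sum())
```

The key point is the second line: because the zeros' contribution is linear in λ, summing λ over the full tensor factorizes into per-mode column sums, which is what lets the method ignore the (massive) set of zeros entirely.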
Mammalian DNA2 helicase/nuclease cleaves G-quadruplex DNA and is required for telomere integrity
Efficient and faithful replication of telomeric DNA is critical for maintaining genome integrity. The G-quadruplex (G4) structure arising in the repetitive TTAGGG sequence is thought to stall replication forks, impairing efficient telomere replication and leading to telomere instabilities. However, pathways modulating telomeric G4 are poorly understood, and it is unclear whether defects in these pathways contribute to genome instabilities in vivo. Here, we report that mammalian DNA2 helicase/nuclease recognizes and cleaves telomeric G4 in vitro. Consistent with DNA2’s role in removing G4, DNA2 deficiency in mouse cells leads to telomere replication defects, elevating the levels of fragile telomeres (FTs) and sister telomere associations (STAs). Such telomere defects are enhanced by stabilizers of G4. Moreover, DNA2 deficiency induces telomere DNA damage and chromosome segregation errors, resulting in tetraploidy and aneuploidy. Consequently, DNA2-deficient mice develop aneuploidy-associated cancers containing dysfunctional telomeres. Collectively, our genetic, cytological, and biochemical results suggest that mammalian DNA2 reduces replication stress at telomeres, thereby preserving genome stability and suppressing cancer development, and that this may involve, at least in part, nucleolytic processing of telomeric G4.
Topic-Based Embeddings for Learning from Large Knowledge Graphs
We present a scalable probabilistic framework for learning from multi-relational data, given in the form of entity-relation-entity triplets, with a potentially massive number of entities and relations (e.g., in multi-relational networks, knowledge bases, etc.). We define each triplet via a relation-specific bilinear function of the embeddings of the entities associated with it (these embeddings correspond to "topics"). To handle a massive number of relations and the data-sparsity problem (very few observations per relation), we also extend this model to allow sharing of parameters across relations, which leads to a substantial reduction in the number of parameters to be learned. In addition to yielding excellent predictive performance (e.g., for knowledge-base completion tasks), the interpretability of our topic-based embedding framework enables easy qualitative analyses. The computational cost of our models scales with the number of positive triplets, which makes it easy to scale to massive real-world multi-relational data sets, which are usually extremely sparse. We develop simple-to-implement batch as well as online Gibbs sampling algorithms and demonstrate the effectiveness of our models on tasks such as multi-relational link prediction and learning from large knowledge bases.
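The two modeling ideas above, a relation-specific bilinear score over entity embeddings and parameter sharing across relations, can be sketched in a few lines. All names, shapes, and the mixture-of-bases form of the sharing are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_relations, dim = 5, 3, 4
A = rng.random((n_entities, dim))          # entity ("topic") embeddings
Wr = rng.random((n_relations, dim, dim))   # one bilinear matrix per relation

def triplet_score(s, r, o):
    """Bilinear score of the triplet (subject s, relation r, object o)."""
    return A[s] @ Wr[r] @ A[o]

# One way to share parameters across relations (to fight data sparsity) is to
# express each relation's matrix as a weighted mixture of a few shared bases:
n_bases = 2
B = rng.random((n_bases, dim, dim))                        # shared base matrices
alpha = rng.dirichlet(np.ones(n_bases), size=n_relations)  # per-relation weights
Wr_shared = np.einsum('rb,bij->rij', alpha, B)
```

With sharing, the per-relation cost drops from one dim×dim matrix to a handful of mixture weights, which is what makes very sparse relations learnable.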
Non-negative Matrix Factorization for Discrete Data with Hierarchical Side-Information
We present a probabilistic framework for efficient non-negative matrix factorization of discrete (count/binary) data with side-information. The side-information is given as a multi-level structure, taxonomy, or ontology, with nodes at each level being categorical-valued observations. For example, when modeling documents with two-level side-information (documents being at level-zero), level-one may represent the (one or more) authors associated with each document and level-two may represent the affiliations of each author. The model easily generalizes to more than two levels (or a taxonomy/ontology of arbitrary depth). Our model can learn embeddings of the entities present at each level in the data/side-information hierarchy (e.g., documents, authors, and affiliations in the previous example), with appropriate sharing of information across levels. The model also enjoys full local conjugacy, facilitating efficient Gibbs sampling for model inference. The inference cost scales with the number of non-zero entries in the data matrix, which is especially appealing for real-world massive but sparse matrices. We demonstrate the effectiveness of the model on several real-world data sets.
