Exponentially Twisted Sampling: a Unified Approach for Centrality Analysis in Attributed Networks
In our recent works, we developed a probabilistic framework for structural
analysis in undirected and directed networks. The key idea of that framework is
to sample a network by a symmetric or an asymmetric bivariate distribution and
then use that bivariate distribution to formally define various notions,
including centrality, relative centrality, community, and modularity. The main
objective of this paper is to extend the probabilistic framework to attributed
networks, where the bivariate distributions are obtained by exponentially
twisted sampling. Our main finding is a method for sampling attributed
networks, including signed networks. Using this sampling method, we define
various centralities in attributed networks. The influence centralities and
trust centralities show how to identify central nodes in signed networks, and
the advertisement-specific influence centralities do the same for attributed
networks with node attributes. Experimental results on a real-world dataset
demonstrate how the centralities change with the temperature, and further
experiments are conducted to gain a deeper understanding of the role of the
temperature.
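As a concrete illustration of the idea, the sketch below forms an exponentially
twisted bivariate distribution over the edges of a small signed network and
reads the node marginal off as a centrality vector at several temperatures. The
uniform-over-edges base distribution, the use of edge signs as the twisting
attribute, and all names are assumptions made for this sketch, not the paper's
exact construction.

```python
import numpy as np

def twisted_edge_sampling(adj, attr, theta):
    """Exponentially twist a base bivariate edge distribution p(u, v) by an
    edge attribute f(u, v) with temperature theta:
        p_theta(u, v) ~ p(u, v) * exp(theta * f(u, v)).
    Returns the twisted bivariate distribution and its node marginal, which
    is read here as a centrality vector."""
    base = adj / adj.sum()                    # uniform-over-edges base distribution
    weights = base * np.exp(theta * attr)     # exponential twisting by the attribute
    p_theta = weights / weights.sum()         # renormalize to a distribution
    centrality = p_theta.sum(axis=1)          # marginal over the second node
    return p_theta, centrality

# Toy signed triangle: attr holds the edge signs (+1 friend, -1 enemy).
adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
attr = np.array([[0, 1, -1], [1, 0, 1], [-1, 1, 0]], dtype=float)
for theta in (0.0, 1.0, 5.0):                 # sweep the temperature
    _, c = twisted_edge_sampling(adj, attr, theta)
    print(f"theta={theta}: centrality={np.round(c, 3)}")
```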
A Unified Framework for Sampling, Clustering and Embedding Data Points in Semi-Metric Spaces
In this paper, we propose a unified framework for sampling, clustering and
embedding data points in semi-metric spaces. For a set of data points in a
semi-metric space, we consider a complete graph with one node (and a self edge)
for each data point, and map each data point to its node with the edge weight
between two nodes being the distance between the corresponding two points. By
doing so, several well-known sampling techniques can be applied for clustering
data points in a semi-metric space. One particularly interesting sampling
technique is exponentially twisted sampling, in which one can specify the
desired average distance of the sampling distribution to detect clusters at
various resolutions.
We also propose a softmax clustering algorithm that can both perform clustering
and embed data points in a semi-metric space into a low-dimensional Euclidean
space. Our experimental results show that after a certain number of iterations
of "training", our softmax algorithm can reveal the "topology" of data drawn
from a high-dimensional Euclidean space. We also show that the
eigendecomposition of a covariance matrix is equivalent to principal component
analysis (PCA).
To deal with the hierarchical structure of clusters, our softmax clustering
algorithm can also be used with a hierarchical clustering algorithm. For this,
we propose a partitional-hierarchical algorithm, called PHD, in this paper.
Our experimental results show that algorithms based on the maximization of
normalized modularity tend to balance the sizes of the detected clusters and
thus do not perform well when the ground-truth clusters differ in size. Also,
using a metric is better than using a semi-metric: as the triangular inequality
need not hold for a semi-metric, clustering is more prone to errors.
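The short sketch below illustrates the resolution knob mentioned above: it
builds a twisted distribution p_theta(i, j) proportional to exp(-theta * d(i, j))
over node pairs of the complete graph induced by a distance matrix, and bisects
on theta until the average distance under that distribution hits a specified
target. The proportionality form, the bisection search, and all names are
illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def twisted_distribution(D, theta):
    """Bivariate distribution over node pairs, twisted by the (negative)
    distance: p_theta(i, j) ~ exp(-theta * D[i, j])."""
    w = np.exp(-theta * D)
    return w / w.sum()

def average_distance(D, theta):
    return float((twisted_distribution(D, theta) * D).sum())

def find_theta(D, target, lo=0.0, hi=50.0, iters=60):
    """Bisection on theta so the average sampled distance matches `target`
    (the average distance decreases as theta grows)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if average_distance(D, mid) > target:
            lo = mid          # need a larger theta to shrink the average
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
theta = find_theta(D, target=1.0)
print("theta for average distance 1.0:", round(theta, 3))
```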
A Quasi-random Algorithm for Anonymous Rendezvous in Heterogeneous Cognitive Radio Networks
The multichannel rendezvous problem that asks two secondary users to
rendezvous on a common available channel in a cognitive radio network (CRN) has
received a lot of attention lately. Most rendezvous algorithms in the
literature focused on constructing channel hopping (CH) sequences that
guarantee finite maximum time-to-rendezvous (MTTR). However, these algorithms
perform rather poorly in terms of the expected time-to-rendezvous (ETTR) even
when compared to the simple random algorithm. In this paper, we propose the
quasi-random (QR) CH algorithm that has a comparable ETTR to the random
algorithm and a comparable MTTR to the best bound in the literature. Our QR
algorithm does not require the unique identifier (ID) assumption and it is very
simple to implement in the symmetric, asynchronous, and heterogeneous setting
with multiple radios. In a CRN with commonly labelled channels, the MTTR of
the QR algorithm is bounded above by a number of time slots that depends on the
numbers of channels available to the two users and on the numbers of radios
they have; such a bound is only slightly larger than the best bound in the
literature. When each SU has a single radio, the ETTR is bounded above by a
quantity that depends on the number of channels common to the two users. By
conducting extensive simulations, we show that for both the MTTR and the ETTR,
our algorithm is comparable to the simple random algorithm and it outperforms
several existing algorithms in the literature.
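Since the abstract benchmarks against the simple random algorithm, the sketch
below simulates the ETTR of that baseline in a heterogeneous, multi-radio
setting: in every slot each user tunes its radios to uniformly random channels
from its own available set, and rendezvous occurs when the chosen channels
overlap. This only illustrates the baseline and the ETTR metric; the QR
construction itself is not reproduced here.

```python
import random

def random_hop_ttr(channels1, channels2, radios1=1, radios2=1, rng=random):
    """One run of the simple random channel-hopping baseline: in every slot
    each user tunes its radios to channels drawn uniformly at random from its
    own available set; rendezvous happens when the chosen sets overlap."""
    t = 0
    while True:
        t += 1
        c1 = {rng.choice(channels1) for _ in range(radios1)}
        c2 = {rng.choice(channels2) for _ in range(radios2)}
        if c1 & c2:
            return t

def ettr(channels1, channels2, radios1=1, radios2=1, runs=10000):
    """Estimate the expected time-to-rendezvous by Monte Carlo averaging."""
    return sum(random_hop_ttr(channels1, channels2, radios1, radios2)
               for _ in range(runs)) / runs

# Heterogeneous example: user 1 sees channels 0..9, user 2 sees channels 5..14.
print(ettr(list(range(10)), list(range(5, 15))))
```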
Temporal Matrix Factorization for Tracking Concept Drift in Individual User Preferences
The matrix factorization (MF) technique has been widely adopted for solving
the rating prediction problem in recommender systems. The MF technique utilizes
the latent factor model to obtain static user preferences (user latent vectors)
and item characteristics (item latent vectors) based on historical rating data.
However, in the real world, user preferences are not static but dynamic. Though
several previous works have addressed this time-varying nature of user
preferences, to the best of our knowledge none of them is specifically designed
for tracking concept drift in individual
user preferences. Motivated by this, we develop a Temporal Matrix Factorization
approach (TMF) for tracking concept drift in each individual user latent
vector. There are two key innovative steps in our approach: (i) we develop a
modified stochastic gradient descent method to learn an individual user latent
vector at each time step, and (ii) we use Lasso regression to learn a linear
model for the transition of the individual user latent vectors. We test our
method on a synthetic dataset and several real datasets. In comparison with the
original MF, our experimental results show that our temporal method is able to
achieve lower root mean square errors (RMSE) for both the synthetic and real
datasets. One interesting finding is that the performance gain in RMSE is
mostly from those users who indeed have concept drift in their user latent
vectors at the time of prediction. In particular, for the synthetic dataset and
the Ciao dataset, there are quite a few users with that property and the
performance gains for these two datasets are roughly 20% and 5%, respectively.
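A minimal sketch of the two steps named above is given below: (i) plain SGD
that fits one user latent vector per time window with the item vectors held
fixed, and (ii) scikit-learn's Lasso fitted on consecutive latent vectors to
obtain a sparse linear transition matrix. The latent dimension, learning rate,
regularization weights, and the use of scikit-learn are assumptions for
illustration, not the paper's exact TMF formulation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sgd_user_vector(ratings, item_vecs, dim=8, lr=0.01, reg=0.1, epochs=50, u0=None):
    """Step (i): learn one user's latent vector from that user's ratings in a
    single time window, holding item vectors fixed (plain SGD on squared error)."""
    u = np.zeros(dim) if u0 is None else u0.copy()
    for _ in range(epochs):
        for item, r in ratings:
            err = r - u @ item_vecs[item]
            u += lr * (err * item_vecs[item] - reg * u)
    return u

def fit_transition(user_vecs_over_time, alpha=0.05):
    """Step (ii): Lasso regression from u_t to u_{t+1}, one sparse linear model
    per latent dimension, giving a transition matrix W with u_{t+1} ~ W u_t."""
    X = np.array(user_vecs_over_time[:-1])
    Y = np.array(user_vecs_over_time[1:])
    W = np.vstack([Lasso(alpha=alpha).fit(X, Y[:, d]).coef_ for d in range(Y.shape[1])])
    return W

# Toy usage with random item vectors and a short rating history per window.
rng = np.random.default_rng(1)
item_vecs = rng.normal(size=(20, 8))
history = [[(i, float(rng.normal())) for i in range(20)] for _ in range(6)]
user_traj = [sgd_user_vector(r, item_vecs) for r in history]
print(fit_transition(user_traj).shape)   # (8, 8) transition matrix
```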
K-sets+: a Linear-time Clustering Algorithm for Data Points with a Sparse Similarity Measure
In this paper, we first propose a new iterative algorithm, called the K-sets+
algorithm for clustering data points in a semi-metric space, where the distance
measure does not necessarily satisfy the triangular inequality. We show that
the K-sets+ algorithm converges in a finite number of iterations and it retains
the same performance guarantee as the K-sets algorithm for clustering data
points in a metric space. We then extend the applicability of the K-sets+
algorithm from data points in a semi-metric space to data points that only have
a symmetric similarity measure. Such an extension leads to great reduction of
computational complexity. In particular, for an n × n similarity matrix with m
nonzero elements in the matrix, the computational complexity of the K-sets+
algorithm is O((Kn + m)I), where I is the number of iterations. The memory
complexity to achieve that computational complexity is O(Kn + m). As such, both
the computational complexity and the memory complexity are linear in n when the
n × n similarity matrix is sparse, i.e., m = O(n). We also conduct various
experiments to show the effectiveness of the K-sets+ algorithm by using a
synthetic dataset from the stochastic block model and a real network from the
WonderNetwork website.
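The sketch below is only meant to illustrate why sparsity gives the stated
linear cost: one assignment pass over a CSR similarity matrix touches every
nonzero entry once (O(m)) and does an O(K) comparison per point (O(Kn)), for
O(Kn + m) work per iteration. The scoring rule used here (average similarity to
a cluster) is an illustrative stand-in, not the K-sets+ objective.

```python
import numpy as np
from scipy.sparse import csr_matrix

def assignment_pass(S, labels, K):
    """One pass over a sparse n x n similarity matrix S (CSR): for every point,
    accumulate its similarity to each of the K current clusters by touching only
    the nonzero entries, then pick the cluster with the best average score."""
    n = S.shape[0]
    sizes = np.bincount(labels, minlength=K).astype(float)
    new_labels = labels.copy()
    for i in range(n):
        start, end = S.indptr[i], S.indptr[i + 1]
        scores = np.zeros(K)
        np.add.at(scores, labels[S.indices[start:end]], S.data[start:end])
        new_labels[i] = int(np.argmax(scores / np.maximum(sizes, 1)))
    return new_labels

# Toy sparse similarity matrix with two planted blocks of 100 nodes each.
rng = np.random.default_rng(2)
n = 200
rows, cols = np.triu_indices(n, k=1)
mask = (rows // 100 == cols // 100) & (rng.random(rows.size) < 0.1)
S = csr_matrix((np.ones(mask.sum()), (rows[mask], cols[mask])), shape=(n, n))
S = (S + S.T).tocsr()
labels = rng.integers(0, 2, n)
for _ in range(10):
    labels = assignment_pass(S, labels, K=2)
print(np.bincount(labels))
```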
ODSQA: Open-domain Spoken Question Answering Dataset
Reading comprehension by machine has been widely studied, but machine
comprehension of spoken content is still a less investigated problem. In this
paper, we release Open-Domain Spoken Question Answering Dataset (ODSQA) with
more than three thousand questions. To the best of our knowledge, this is the
largest real SQA dataset. On this dataset, we found that ASR errors have a
catastrophic impact on SQA. To mitigate the effect of ASR errors, we incorporate
subword units, which bring consistent improvements over all the models. We
further found that data augmentation on text-based QA training examples can
improve SQA.
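One plausible form of the subword idea mentioned above is sketched below: words
are broken into overlapping character n-grams so that a partially misrecognized
word still shares many units with the reference text. This is an illustrative
toy, not the paper's models or its exact subword scheme.

```python
def char_ngrams(word, n=3):
    """Break a word into overlapping character n-grams ("subword units")."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def subword_overlap(a, b, n=3):
    """Jaccard overlap of subword sets; more robust to partial ASR corruption
    than exact word matching."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

# A mild ASR corruption still shares most subword units with the reference word,
# while an unrelated word shares almost none.
print(subword_overlap("comprehension", "comprehension"))   # 1.0
print(subword_overlap("comprehension", "comprehention"))   # still high
print(subword_overlap("comprehension", "banana"))          # near 0
```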
Community Detection in Signed Networks: an Error-Correcting Code Approach
In this paper, we consider the community detection problem in signed
networks, where there are two types of edges: positive edges (friends) and
negative edges (enemies). One renowned theorem of signed networks, known as
Harary's theorem, states that structurally balanced signed networks are
clusterable. By viewing each cycle in a signed network as a parity-check
constraint, we show that the community detection problem in a signed network
with two communities is equivalent to the decoding problem for a parity-check
code. We also show how one can use two renowned decoding algorithms for
error-correcting codes for community detection in signed networks: the
bit-flipping algorithm and the belief propagation algorithm. In addition to these two
algorithms, we also propose a new community detection algorithm, called the
Hamming distance algorithm, that performs community detection by finding a
codeword that minimizes the Hamming distance. We compare the performance of
these three algorithms by conducting various experiments with known ground
truth. Our experimental results show that our Hamming distance algorithm
outperforms the other two.
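The sketch below gives a bit-flipping style heuristic in the spirit described
above, with edge-level sign constraints for the two-community case: a positive
edge is satisfied when its endpoints share a label, a negative edge when they
differ, and the node violating the most of its incident constraints is flipped
repeatedly. The edge-level (rather than cycle-level) constraints and all names
are simplifying assumptions, not the paper's exact decoder.

```python
import random

def bit_flipping(edges, n, max_iters=100, rng=random):
    """Bit-flipping style decoder for two communities in a signed graph.
    Labels are +1/-1.  A positive edge (sign +1) is satisfied when its
    endpoints share a label, a negative edge (sign -1) when they differ,
    i.e. the constraint is sign * x[u] * x[v] == 1."""
    x = [rng.choice((-1, 1)) for _ in range(n)]
    for _ in range(max_iters):
        violations = [0] * n
        for u, v, sign in edges:
            if sign * x[u] * x[v] < 0:         # unsatisfied constraint
                violations[u] += 1
                violations[v] += 1
        worst = max(range(n), key=lambda i: violations[i])
        if violations[worst] == 0:
            break                              # all constraints satisfied
        x[worst] = -x[worst]                   # flip the most conflicted node
    return x

# Two friendly triangles joined by enemy edges.
edges = [(0, 1, 1), (1, 2, 1), (0, 2, 1),
         (3, 4, 1), (4, 5, 1), (3, 5, 1),
         (2, 3, -1), (0, 5, -1)]
print(bit_flipping(edges, 6))
```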
Constructions of Optical Queues With a Limited Number of Recirculations--Part I: Greedy Constructions
In this two-part paper, we consider SDL constructions of optical queues with
a limited number of recirculations through the optical switches and the fiber
delay lines. We show that the constructions of certain types of optical queues,
including linear compressors, linear decompressors, and 2-to-1 FIFO
multiplexers, under a simple packet routing scheme and under the constraint of
a limited number of recirculations can be transformed into equivalent integer
representation problems under a corresponding constraint. Given $M$ and $k$,
the problem of finding an \emph{optimal} construction, in the sense of
maximizing the maximum delay (resp., buffer size), among our constructions of
linear compressors/decompressors (resp., 2-to-1 FIFO multiplexers) is
equivalent to the problem of finding an optimal sequence $\{\mathbf{d}^*\}_1^M$ in
$\mathcal{A}_M$ (resp., $\mathcal{B}_M$) such that
$B(\{\mathbf{d}^*\}_1^M;k)=\max_{\mathbf{d}_1^M\in \mathcal{A}_M}B(\mathbf{d}_1^M;k)$
(resp., $B(\{\mathbf{d}^*\}_1^M;k)=\max_{\mathbf{d}_1^M\in \mathcal{B}_M}B(\mathbf{d}_1^M;k)$),
where $\mathcal{A}_M$ (resp., $\mathcal{B}_M$) is the set of all sequences of
fiber delays allowed in our constructions of linear compressors/decompressors
(resp., 2-to-1 FIFO multiplexers). In Part I, we propose a class of
\emph{greedy} constructions of linear compressors/decompressors and 2-to-1 FIFO
multiplexers by specifying a class $\mathcal{G}_{M,k}$ of sequences such that
$\mathcal{G}_{M,k}\subseteq \mathcal{B}_M\subseteq \mathcal{A}_M$ and each
sequence in $\mathcal{G}_{M,k}$ is obtained recursively in a greedy
manner. We then show that every optimal construction must be a greedy
construction. In Part II, we further show that there are at most two optimal
constructions and give a simple algorithm to obtain the optimal
construction(s).

Comment: 59 pages; 1 figure. This paper was presented in part at the IEEE
International Conference on Computer Communications (INFOCOM'08), Phoenix,
AZ, USA, April 13--18, 2008. This paper has been submitted to the IEEE
Transactions on Information Theory for possible publication.
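To make the integer representation view concrete, the sketch below computes,
for a candidate delay sequence and a recirculation limit k, the largest B such
that every integer from 0 to B is a sum of at most k distinct delays, and
compares two sequences. This reading of the representation constraint (subset
sums of at most k delays) is an assumption for illustration only; the paper's
exact definition of $B(\mathbf{d}_1^M;k)$ and of the allowed sequences may differ.

```python
from itertools import combinations

def representable(delays, k):
    """All delays achievable as a sum of at most k of the given fiber delays,
    each used at most once (an assumed reading of the representation constraint)."""
    reachable = {0}
    for r in range(1, k + 1):
        for combo in combinations(delays, r):
            reachable.add(sum(combo))
    return reachable

def B(delays, k):
    """Largest B such that every integer 0..B is representable -- a stand-in
    for the objective maximized over the allowed delay sequences."""
    reachable = representable(delays, k)
    b = 0
    while b + 1 in reachable:
        b += 1
    return b

# Compare two length-4 delay sequences under a limit of k = 2 recirculations.
print(B([1, 2, 4, 8], k=2))   # the representable run stops at 6
print(B([1, 2, 3, 6], k=2))   # a greedier choice covers every integer up to 9
```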
A Mathematical Theory for Clustering in Metric Spaces
Clustering is one of the most fundamental problems in data analysis and it
has been studied extensively in the literature. Though many clustering
algorithms have been proposed, clustering theories that justify the use of
these clustering algorithms are still unsatisfactory. In particular, one of the
fundamental challenges is to address the following question:
What is a cluster in a set of data points?
In this paper, we make an attempt to address such a question by considering a
set of data points associated with a distance measure (metric). We first
propose a new cohesion measure in terms of the distance measure. Using the
cohesion measure, we define a cluster as a set of points that are cohesive to
themselves. For such a definition, we show that there are various equivalent
statements that have intuitive explanations. We then consider the second
question:
How do we find clusters and good partitions of clusters under such a
definition?
For such a question, we propose a hierarchical agglomerative algorithm and a
partitional algorithm. Unlike standard hierarchical agglomerative algorithms,
our hierarchical agglomerative algorithm has a specific stopping criterion and
it stops with a partition of clusters. Our partitional algorithm, called the
K-sets algorithm in the paper, appears to be a new iterative algorithm. Unlike
the Lloyd iteration that needs two-step minimization, our K-sets algorithm only
takes one-step minimization.
One of the most interesting findings of our paper is the duality result
between a distance measure and a cohesion measure. Such a duality result leads
to a dual K-sets algorithm for clustering a set of data points with a cohesion
measure. The dual K-sets algorithm converges in the same way as a sequential
version of the classical kernel K-means algorithm. The key difference is that a
cohesion measure does not need to be positive semi-definite.
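The sketch below illustrates the one-step minimization flavor of the K-sets
algorithm: each point is reassigned to the set with the smallest point-to-set
"triangular distance", taken here to be twice the average distance to the set
minus the average pairwise distance within the set. That specific formula, the
random initialization, and the empty-set handling are assumptions for
illustration; the paper should be consulted for the exact definitions and
guarantees.

```python
import numpy as np

def triangular_distance(D, x, members):
    """Point-to-set distance used for the one-step assignment: twice the
    average distance from x to the set minus the average pairwise distance
    inside the set (an assumed form; the paper's definition may differ)."""
    m = list(members)
    to_set = D[x, m].mean()
    within = D[np.ix_(m, m)].mean()
    return 2.0 * to_set - within

def k_sets_like(D, K, iters=20, seed=0):
    """One-step minimization: reassign every point to the set with the
    smallest triangular distance, then repeat."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, D.shape[0])
    for _ in range(iters):
        sets = [np.where(labels == k)[0] for k in range(K)]
        # Reseed any emptied set with a random singleton to keep K sets.
        sets = [s if len(s) else np.array([rng.integers(0, D.shape[0])]) for s in sets]
        labels = np.array([int(np.argmin([triangular_distance(D, x, s) for s in sets]))
                           for x in range(D.shape[0])])
    return labels

rng = np.random.default_rng(3)
pts = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
print(np.bincount(k_sets_like(D, K=2)))
```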
Percolation Threshold for Competitive Influence in Random Networks
In this paper, we propose a new averaging model for modeling the competitive
influence of candidates among voters in an election process. For such
an influence propagation model, we address the question of how many seeded
voters a candidate needs to place among undecided voters in order to win an
election. We show that for a random network generated from the stochastic block
model, there exists a percolation threshold such that a candidate wins the
election if the number of seeded voters placed by that candidate exceeds the
threshold.
By conducting extensive experiments, we show that our theoretical percolation
thresholds are very close to those obtained from simulations for random
networks, and the errors remain small for a real-world network.

Comment: 11 pages, 9 figures. This article is the complete version (with proofs)
of the IEEE Global Communications Conference 2019 review paper.
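A toy simulation in the spirit of the averaging model is sketched below: seeded
voters hold fixed opinions of +1 (candidate A) or -1 (candidate B) on a
stochastic block model graph generated with networkx, undecided voters
repeatedly adopt the average opinion of their neighbors, and the outcome is
read from the signs of the final opinions. The graph parameters, update rule,
and tie-breaking are assumptions for illustration, not the paper's exact model
or thresholds.

```python
import numpy as np
import networkx as nx

def election_winner(block_sizes, p_in, p_out, seeds_a, seeds_b, rounds=200, seed=0):
    """Averaging model on a stochastic block model graph: seeded voters keep a
    fixed opinion (+1 for candidate A, -1 for B); undecided voters start at 0
    and repeatedly adopt the average opinion of their neighbors."""
    rng = np.random.default_rng(seed)
    probs = [[p_in if i == j else p_out for j in range(len(block_sizes))]
             for i in range(len(block_sizes))]
    G = nx.stochastic_block_model(block_sizes, probs, seed=seed)
    n = G.number_of_nodes()
    nodes = rng.permutation(n)
    fixed_a, fixed_b = set(nodes[:seeds_a]), set(nodes[seeds_a:seeds_a + seeds_b])
    x = np.zeros(n)
    x[list(fixed_a)], x[list(fixed_b)] = 1.0, -1.0
    for _ in range(rounds):
        new_x = x.copy()
        for v in range(n):
            if v in fixed_a or v in fixed_b:
                continue                       # seeded voters never change
            nbrs = list(G.neighbors(v))
            if nbrs:
                new_x[v] = x[nbrs].mean()
        x = new_x
    votes_a = int((x > 0).sum())
    return ("A" if votes_a > n - votes_a else "B"), votes_a

# Sweep the number of seeds for candidate A against 30 fixed seeds for B.
for k in (10, 30, 60):
    print(k, election_winner([100, 100], p_in=0.1, p_out=0.01, seeds_a=k, seeds_b=30))
```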
