
    Exponentially Twisted Sampling: a Unified Approach for Centrality Analysis in Attributed Networks

    In our recent works, we developed a probabilistic framework for structural analysis of undirected and directed networks. The key idea of that framework is to sample a network by a symmetric or an asymmetric bivariate distribution and then use that bivariate distribution to formally define various notions, including centrality, relative centrality, community, and modularity. The main objective of this paper is to extend these probabilistic definitions to attributed networks, where the bivariate distributions are obtained by exponentially twisted sampling. Our main contribution is a method for sampling attributed networks, including signed networks. Using this sampling method, we define various centralities in attributed networks. The influence centralities and trust centralities show how to identify central nodes in signed networks, and the advertisement-specific influence centralities define centralities in attributed networks with node attributes. Experimental results on real-world datasets demonstrate how the various centralities change with the temperature, and further experiments are conducted to gain a deeper understanding of the role of the temperature.
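The abstract's key operation, tilting a sampling distribution by an exponential of an edge attribute, can be sketched as follows. This is a minimal illustration under assumed details (a uniform base distribution over edges and a signed edge attribute; the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def twisted_centrality(edges, signs, n, theta):
    """Twist the uniform edge distribution by exp(theta * sign) and
    return the marginal node distribution as a centrality vector.
    theta plays the role of the (inverse) temperature."""
    w = np.exp(theta * np.asarray(signs, dtype=float))
    p = w / w.sum()                      # exponentially twisted bivariate distribution
    c = np.zeros(n)
    for (u, v), pw in zip(edges, p):
        c[u] += pw / 2                   # split each edge's mass between endpoints
        c[v] += pw / 2
    return c

# toy signed network: positive edges (friends) and negative edges (enemies)
edges = [(0, 1), (1, 2), (2, 0), (0, 3)]
signs = [+1, +1, -1, -1]
c0 = twisted_centrality(edges, signs, 4, theta=0.0)   # untwisted: uniform sampling
c2 = twisted_centrality(edges, signs, 4, theta=2.0)   # tilts mass toward positive edges
```

At `theta=0` the sampling is uniform; increasing `theta` concentrates mass on positive edges, which raises the centrality of nodes incident only to positive edges.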

    A Unified Framework for Sampling, Clustering and Embedding Data Points in Semi-Metric Spaces

    In this paper, we propose a unified framework for sampling, clustering and embedding data points in semi-metric spaces. For a set of data points $\Omega=\{x_1, x_2, \ldots, x_n\}$ in a semi-metric space, we consider a complete graph with $n$ nodes and $n$ self edges and map each data point in $\Omega$ to a node in the graph, with the edge weight between two nodes being the distance between the corresponding two points in $\Omega$. By doing so, several well-known sampling techniques can be applied for clustering data points in a semi-metric space. One particularly interesting sampling technique is exponentially twisted sampling, in which one can specify the desired average distance of the sampling distribution to detect clusters at various resolutions. We also propose a softmax clustering algorithm that can perform clustering and embed data points in a semi-metric space into a low-dimensional Euclidean space. Our experimental results show that after a certain number of iterations of "training", our softmax algorithm can reveal the "topology" of the data from a high-dimensional Euclidean space. We also show that the eigendecomposition of a covariance matrix is equivalent to principal component analysis (PCA). To deal with the hierarchical structure of clusters, our softmax clustering algorithm can also be used with a hierarchical clustering algorithm. For this, we propose a partitional-hierarchical algorithm, called $i$PHD, in this paper. Our experimental results show that algorithms based on the maximization of normalized modularity tend to balance the sizes of detected clusters and thus do not perform well when the ground-truth clusters differ in size. Also, using a metric is better than using a semi-metric: since the triangle inequality need not hold for a semi-metric, clustering with a semi-metric is more prone to errors.
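The softmax idea in the abstract can be sketched as soft cluster assignment via a softmax over negative distances followed by a weighted center update. This is only a sketch in the spirit of the abstract, not the paper's algorithm; the inverse-temperature `beta` and the farthest-point initialization are assumptions for the demo:

```python
import numpy as np

def softmax_cluster(X, K, beta=5.0, iters=50):
    # farthest-point initialization so the demo is deterministic
    idx = [0]
    for _ in range(K - 1):
        d = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(-1).min(axis=1)
        idx.append(int(np.argmax(d)))
    C = X[idx].astype(float)
    for _ in range(iters):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        logits = -beta * d2
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)             # soft memberships via softmax
        C = (P.T @ X) / P.sum(axis=0)[:, None]        # membership-weighted centers
    return P.argmax(axis=1)

# two well-separated Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
labels = softmax_cluster(X, K=2)
```

Raising `beta` hardens the assignments toward K-means; lowering it merges nearby clusters, which is one way to read the abstract's "various resolutions".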

    A Quasi-random Algorithm for Anonymous Rendezvous in Heterogeneous Cognitive Radio Networks

    The multichannel rendezvous problem, which asks two secondary users to rendezvous on a common available channel in a cognitive radio network (CRN), has received a lot of attention lately. Most rendezvous algorithms in the literature focus on constructing channel hopping (CH) sequences that guarantee a finite maximum time-to-rendezvous (MTTR). However, these algorithms perform rather poorly in terms of the expected time-to-rendezvous (ETTR), even when compared to the simple random algorithm. In this paper, we propose the quasi-random (QR) CH algorithm, which has an ETTR comparable to the random algorithm and an MTTR comparable to the best bound in the literature. Our QR algorithm does not require the unique identifier (ID) assumption and is very simple to implement in the symmetric, asynchronous, and heterogeneous setting with multiple radios. In a CRN with $N$ commonly labelled channels, the MTTR of the QR algorithm is bounded above by $9M \lceil n_1/m_1 \rceil \cdot \lceil n_2/m_2 \rceil$ time slots, where $n_1$ (resp. $n_2$) is the number of available channels to user 1 (resp. 2), $m_1$ (resp. $m_2$) is the number of radios for user 1 (resp. 2), and $M=\lceil \lceil \log_2 N \rceil /4 \rceil \cdot 5+6$. Such a bound is only slightly larger than the best $O((\log \log N) \frac{n_1 n_2}{m_1 m_2})$ bound in the literature. When each SU has a single radio, the ETTR is bounded above by $\frac{n_1 n_2}{G}+9Mn_1n_2 \cdot (1-\frac{G}{n_1 n_2})^M$, where $G$ is the number of common channels between the two users. By conducting extensive simulations, we show that for both the MTTR and the ETTR, our algorithm is comparable to the simple random algorithm and outperforms several existing algorithms in the literature.
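The two bounds stated in the abstract are easy to evaluate numerically. The sketch below just transcribes the formulas (variable names follow the abstract):

```python
import math

def rendezvous_M(N):
    """M = ceil(ceil(log2 N) / 4) * 5 + 6, as in the abstract."""
    return math.ceil(math.ceil(math.log2(N)) / 4) * 5 + 6

def mttr_bound(N, n1, m1, n2, m2):
    """MTTR upper bound: 9 M ceil(n1/m1) ceil(n2/m2) time slots."""
    return 9 * rendezvous_M(N) * math.ceil(n1 / m1) * math.ceil(n2 / m2)

def ettr_bound(N, n1, n2, G):
    """ETTR upper bound for single-radio users:
    n1 n2 / G + 9 M n1 n2 (1 - G/(n1 n2))^M."""
    M = rendezvous_M(N)
    return n1 * n2 / G + 9 * M * n1 * n2 * (1 - G / (n1 * n2)) ** M

# example: N = 16 channels gives M = ceil(4/4)*5 + 6 = 11
slots = mttr_bound(16, n1=4, m1=1, n2=4, m2=1)   # 9 * 11 * 4 * 4 = 1584
```

Note the first ETTR term, $n_1 n_2 / G$, is exactly the expected rendezvous time of the simple random algorithm; the second term vanishes geometrically in $M$.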

    Temporal Matrix Factorization for Tracking Concept Drift in Individual User Preferences

    The matrix factorization (MF) technique has been widely adopted for solving the rating prediction problem in recommender systems. The MF technique utilizes the latent factor model to obtain static user preferences (user latent vectors) and item characteristics (item latent vectors) based on historical rating data. However, in the real world, user preferences are not static but full of dynamics. Though several previous works have addressed this time-varying issue of user preferences, to the best of our knowledge none of them is specifically designed for tracking concept drift in individual user preferences. Motivated by this, we develop a Temporal Matrix Factorization (TMF) approach for tracking concept drift in each individual user latent vector. There are two key innovative steps in our approach: (i) we develop a modified stochastic gradient descent method to learn an individual user latent vector at each time step, and (ii) we use Lasso regression to learn a linear model for the transition of the individual user latent vectors. We test our method on a synthetic dataset and several real datasets. In comparison with the original MF, our experimental results show that our temporal method achieves lower root mean square errors (RMSE) for both the synthetic and real datasets. One interesting finding is that the performance gain in RMSE comes mostly from those users who indeed have concept drift in their user latent vectors at the time of prediction. In particular, for the synthetic dataset and the Ciao dataset, there are quite a few users with that property, and the performance gains for these two datasets are roughly 20% and 5%, respectively.
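Step (i), learning one user's latent vector from that user's ratings by stochastic gradient descent against fixed item factors, can be sketched as below. This is a plain SGD sketch, not the paper's modified variant, and the learning rate, regularizer, and synthetic data are assumptions (the paper's step (ii) then fits a Lasso transition between consecutive latent vectors, which is omitted here):

```python
import numpy as np

def sgd_user_update(u, ratings, V, lr=0.05, reg=0.01, epochs=50):
    """ratings: list of (item_index, rating); V: fixed item latent matrix.
    Minimizes sum_j (r_j - u . V[j])^2 + reg * |u|^2 by SGD."""
    u = u.copy()
    for _ in range(epochs):
        for j, r in ratings:
            err = r - u @ V[j]
            u += lr * (err * V[j] - reg * u)   # gradient step on one rating
    return u

# synthetic user whose true latent vector generates the ratings exactly
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 3))                    # 5 items, rank-3 factors
u_true = np.array([1.0, -0.5, 0.2])
ratings = [(j, float(u_true @ V[j])) for j in range(5)]
u_hat = sgd_user_update(np.zeros(3), ratings, V)
```

Re-running this update at each time step yields the sequence of per-step user latent vectors whose drift the transition model is meant to track.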

    K-sets+: a Linear-time Clustering Algorithm for Data Points with a Sparse Similarity Measure

    In this paper, we first propose a new iterative algorithm, called the K-sets+ algorithm, for clustering data points in a semi-metric space, where the distance measure does not necessarily satisfy the triangle inequality. We show that the K-sets+ algorithm converges in a finite number of iterations and retains the same performance guarantee as the K-sets algorithm for clustering data points in a metric space. We then extend the applicability of the K-sets+ algorithm from data points in a semi-metric space to data points that only have a symmetric similarity measure. Such an extension leads to a great reduction in computational complexity. In particular, for an $n \times n$ similarity matrix with $m$ nonzero elements, the computational complexity of the K-sets+ algorithm is $O((Kn + m)I)$, where $I$ is the number of iterations. The memory complexity to achieve that computational complexity is $O(Kn + m)$. As such, both the computational complexity and the memory complexity are linear in $n$ when the $n \times n$ similarity matrix is sparse, i.e., $m = O(n)$. We also conduct various experiments to show the effectiveness of the K-sets+ algorithm by using a synthetic dataset from the stochastic block model and a real network from the WonderNetwork website.
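The $O(Kn + m)$ per-iteration cost can be illustrated with the core aggregation step: accumulating each point's total similarity to each of the $K$ clusters touches only the $m$ nonzero entries plus a $Kn$-sized table. This is an illustrative sketch of the complexity argument, not the K-sets+ algorithm itself:

```python
def cluster_similarities(sim, assign, K):
    """sim: sparse similarity as dict {i: {j: s_ij}} with m nonzeros;
    assign: cluster index of each point.
    Returns S with S[i][k] = sum of s_ij over points j in cluster k,
    computed in O(Kn + m) time and memory."""
    n = len(assign)
    S = [[0.0] * K for _ in range(n)]    # O(Kn) table
    for i, row in sim.items():
        for j, s in row.items():         # m nonzero entries in total
            S[i][assign[j]] += s
    return S

# toy symmetric sparse similarity over 4 points, two clusters {0,1} and {2,3}
sim = {0: {1: 0.9}, 1: {0: 0.9, 2: 0.2}, 2: {1: 0.2, 3: 0.8}, 3: {2: 0.8}}
assign = [0, 0, 1, 1]
S = cluster_similarities(sim, assign, K=2)
```

A dense similarity matrix would force $O(n^2)$ work here; sparsity with $m = O(n)$ makes the whole pass linear in $n$.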

    ODSQA: Open-domain Spoken Question Answering Dataset

    Reading comprehension by machine has been widely studied, but machine comprehension of spoken content is still a less investigated problem. In this paper, we release the Open-Domain Spoken Question Answering Dataset (ODSQA) with more than three thousand questions. To the best of our knowledge, this is the largest real SQA dataset. On this dataset, we found that ASR errors have a catastrophic impact on SQA. To mitigate the effect of ASR errors, we incorporate subword units, which brings consistent improvements over all the models. We further found that data augmentation on text-based QA training examples can improve SQA.

    Community Detection in Signed Networks: an Error-Correcting Code Approach

    In this paper, we consider the community detection problem in signed networks, where there are two types of edges: positive edges (friends) and negative edges (enemies). One renowned theorem on signed networks, known as Harary's theorem, states that structurally balanced signed networks are clusterable. By viewing each cycle in a signed network as a parity-check constraint, we show that the community detection problem in a signed network with two communities is equivalent to the decoding problem for a parity-check code. We also show how one can use two renowned decoding algorithms from error-correcting codes for community detection in signed networks: the bit-flipping algorithm and the belief propagation algorithm. In addition to these two algorithms, we propose a new community detection algorithm, called the Hamming distance algorithm, that performs community detection by finding a codeword that minimizes the Hamming distance. We compare the performance of these three algorithms by conducting various experiments with known ground truth. Our experimental results show that our Hamming distance algorithm outperforms the other two.
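The bit-flipping idea can be sketched directly on a signed network: treat each signed edge $(u, v, \sigma)$ as a parity check $s_u s_v = \sigma$ on community labels $s \in \{+1, -1\}$, and greedily flip any node whose flip reduces its number of violated checks. This is an illustrative greedy sketch in the spirit of the decoding view, not the paper's exact algorithm:

```python
def bit_flip(n, edges, max_rounds=100):
    """edges: list of (u, v, sign) with sign in {+1, -1}.
    Returns community labels s in {+1, -1}."""
    s = [1] * n                           # start with everyone in one community

    def violations(node):
        # number of parity checks s_u * s_v = sign violated at this node
        return sum(1 for (u, v, sign) in edges
                   if node in (u, v) and s[u] * s[v] != sign)

    for _ in range(max_rounds):
        flipped = False
        for i in range(n):
            before = violations(i)
            s[i] = -s[i]
            if violations(i) < before:
                flipped = True            # keep the flip: fewer violated checks
            else:
                s[i] = -s[i]              # revert
        if not flipped:
            break                         # local optimum reached
    return s

# structurally balanced toy network: {0,1} friends, {2,3} friends, enmity across
edges = [(0, 1, 1), (2, 3, 1), (0, 2, -1), (0, 3, -1), (1, 2, -1), (1, 3, -1)]
s = bit_flip(4, edges)
```

On a structurally balanced network all checks can be satisfied simultaneously (Harary's theorem), so the greedy flips terminate at the two ground-truth communities here.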

    Constructions of Optical Queues With a Limited Number of Recirculations--Part I: Greedy Constructions

    In this two-part paper, we consider SDL constructions of optical queues with a limited number of recirculations through the optical switches and the fiber delay lines. We show that the constructions of certain types of optical queues, including linear compressors, linear decompressors, and 2-to-1 FIFO multiplexers, under a simple packet routing scheme and under the constraint of a limited number of recirculations, can be transformed into equivalent integer representation problems under a corresponding constraint. Given $M$ and $k$, the problem of finding an \emph{optimal} construction, in the sense of maximizing the maximum delay (resp. buffer size), among our constructions of linear compressors/decompressors (resp. 2-to-1 FIFO multiplexers) is equivalent to the problem of finding an optimal sequence ${\mathbf{d}^*}_1^M$ in $\mathcal{A}_M$ (resp. $\mathcal{B}_M$) such that $B({\mathbf{d}^*}_1^M;k)=\max_{\mathbf{d}_1^M\in \mathcal{A}_M}B(\mathbf{d}_1^M;k)$ (resp. $B({\mathbf{d}^*}_1^M;k)=\max_{\mathbf{d}_1^M\in \mathcal{B}_M}B(\mathbf{d}_1^M;k)$), where $\mathcal{A}_M$ (resp. $\mathcal{B}_M$) is the set of all sequences of fiber delays allowed in our constructions of linear compressors/decompressors (resp. 2-to-1 FIFO multiplexers). In Part I, we propose a class of \emph{greedy} constructions of linear compressors/decompressors and 2-to-1 FIFO multiplexers by specifying a class $\mathcal{G}_{M,k}$ of sequences such that $\mathcal{G}_{M,k}\subseteq \mathcal{B}_M\subseteq \mathcal{A}_M$ and each sequence in $\mathcal{G}_{M,k}$ is obtained recursively in a greedy manner. We then show that every optimal construction must be a greedy construction. In Part II, we further show that there are at most two optimal constructions and give a simple algorithm to obtain the optimal construction(s). Comment: 59 pages; 1 figure. This paper was presented in part at the IEEE International Conference on Computer Communications (INFOCOM'08), Phoenix, AZ, USA, April 13--18, 2008, and has been submitted to IEEE Transactions on Information Theory for possible publication.

    A Mathematical Theory for Clustering in Metric Spaces

    Clustering is one of the most fundamental problems in data analysis and it has been studied extensively in the literature. Though many clustering algorithms have been proposed, clustering theories that justify the use of these clustering algorithms are still unsatisfactory. In particular, one of the fundamental challenges is to address the following question: What is a cluster in a set of data points? In this paper, we make an attempt to address such a question by considering a set of data points associated with a distance measure (metric). We first propose a new cohesion measure in terms of the distance measure. Using the cohesion measure, we define a cluster as a set of points that are cohesive to themselves. For such a definition, we show that there are various equivalent statements that have intuitive explanations. We then consider the second question: How do we find clusters and good partitions of clusters under such a definition? For such a question, we propose a hierarchical agglomerative algorithm and a partitional algorithm. Unlike standard hierarchical agglomerative algorithms, our hierarchical agglomerative algorithm has a specific stopping criterion and it stops with a partition of clusters. Our partitional algorithm, called the K-sets algorithm in the paper, appears to be a new iterative algorithm. Unlike the Lloyd iteration that needs two-step minimization, our K-sets algorithm only takes one-step minimization. One of the most interesting findings of our paper is the duality result between a distance measure and a cohesion measure. Such a duality result leads to a dual K-sets algorithm for clustering a set of data points with a cohesion measure. The dual K-sets algorithm converges in the same way as a sequential version of the classical kernel K-means algorithm. The key difference is that a cohesion measure does not need to be positive semi-definite.
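One natural way to build a cohesion measure from a distance measure, consistent with the distance/cohesion duality described in the abstract (though the paper's exact definition may differ), is to center the distance: $\gamma(x,y) = \bar d(x) + \bar d(y) - d(x,y) - \bar{\bar d}$, where $\bar d(x)$ is the average distance from $x$ and $\bar{\bar d}$ is the average over all pairs. Two points are then "cohesive" when $\gamma > 0$, and each row of $\gamma$ sums to zero by construction:

```python
import numpy as np

def cohesion_from_distance(D):
    """Turn a symmetric distance matrix D into a centered cohesion matrix.
    gamma(x, y) = avg_z d(x, z) + avg_z d(z, y) - d(x, y) - avg_{z,w} d(z, w)."""
    row = D.mean(axis=1)                 # average distance from each point
    total = D.mean()                     # average over all pairs
    return row[:, None] + row[None, :] - D - total

# cohesion matrix for a few random points in the plane
rng = np.random.default_rng(0)
P = rng.normal(size=(6, 2))
D = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
G = cohesion_from_distance(D)
```

This centering is the same double-centering used in classical multidimensional scaling, which is one way to see why such a cohesion matrix behaves like a (possibly indefinite) kernel in the dual K-sets iteration.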

    Percolation Threshold for Competitive Influence in Random Networks

    In this paper, we propose a new averaging model for modeling the competitive influence of $K$ candidates among $n$ voters in an election process. For such an influence propagation model, we address the question of how many seeded voters a candidate needs to place among undecided voters in order to win an election. We show that for a random network generated from the stochastic block model, there exists a percolation threshold such that a candidate wins the election if the number of seeded voters placed by the candidate exceeds the threshold. By conducting extensive experiments, we show that our theoretical percolation thresholds are very close to those obtained from simulations for random networks, and the errors are within 10% for a real-world network. Comment: 11 pages, 9 figures. This article is the complete version (with proofs) of the IEEE Global Communications Conference 2019 review paper.
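An averaging-style competitive influence dynamic of the kind the abstract describes can be sketched as follows: seeded voters hold a fixed preference for one of $K$ candidates, and undecided voters repeatedly average their neighbors' preference vectors. This is an illustration of the setting only; the paper's exact model and its threshold analysis are not reproduced here:

```python
import numpy as np

def run_averaging(A, seeds, K, iters=200):
    """A: symmetric adjacency matrix; seeds: list of K lists of seeded voters.
    Returns the final vote share of each candidate."""
    n = len(A)
    x = np.full((n, K), 1.0 / K)         # undecided voters start uniform
    fixed = np.zeros(n, dtype=bool)
    for cand, nodes in enumerate(seeds):
        for v in nodes:
            x[v] = np.eye(K)[cand]       # seeded voters never change
            fixed[v] = True
    deg = A.sum(axis=1, keepdims=True)
    for _ in range(iters):
        y = A @ x / np.maximum(deg, 1)   # average of neighbors' preferences
        x = np.where(fixed[:, None], x, y)
    return x.mean(axis=0)                # vote share per candidate

# toy graph: path 0-1-2-3-4; candidate 0 seeds node 0, candidate 1 seeds node 4
A = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[u, v] = A[v, u] = 1
share = run_averaging(A, seeds=[[0], [4]], K=2)
```

With symmetric seeding the election is a tie; the percolation question is how many extra seeds one candidate must add for its share to exceed the others'.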