Effective Blog Pages Extractor for Better UGC Accessing
Blogs are becoming an increasingly popular medium for publishing information.
Besides the main content, most blog pages nowadays also contain noisy
information such as advertisements. Removing these unrelated elements not only
improves the user experience but also helps adapt the content to various
devices such as mobile phones. Although template-based extractors are highly
accurate, they can be expensive, since a large number of templates must be
developed, and they fail once a template is updated. To address
these issues, we present a novel template-independent content extractor for
blog pages. First, we convert a blog page into a DOM tree, in which every
element of the page, including the title and body blocks, corresponds to a
subtree. Then we construct subtree candidate sets for the title and body
blocks respectively, and extract both spatial and content features for the
elements contained in each subtree. SVM classifiers for the title and body
blocks are trained on these features. Finally, the classifiers are used to
extract the main content
from blog pages. We test our extractor on 2,250 blog pages crawled from nine
blog sites with clearly different styles and templates. Experimental results
verify the effectiveness of our extractor.
Comment: 2016 3rd International Conference on Information Science and Control Engineering (ICISCE)
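The candidate-generation step described in this abstract can be sketched in plain Python. The sketch below walks a DOM built with the standard-library `HTMLParser`, credits text and link text to every enclosing subtree, and picks a body candidate by text length and link density; the hand-set `max_link_ratio` threshold is an illustrative stand-in for the trained SVM classifier, and all names here are hypothetical, not from the paper.

```python
# Sketch: score each DOM subtree by text length and link density and
# keep the best "body" candidate. A fixed threshold replaces the SVM.
from html.parser import HTMLParser

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        self.text, self.link_text = "", ""

class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        if self.cur.parent:
            self.cur = self.cur.parent

    def handle_data(self, data):
        n = self.cur
        while n:  # credit the text to every enclosing subtree
            n.text += data
            if self.in_link:
                n.link_text += data
            n = n.parent

def best_body(root, max_link_ratio=0.3):
    """Return the subtree with the most text whose link density is low."""
    best, stack = None, [root]
    while stack:
        n = stack.pop()
        stack.extend(n.children)
        ratio = len(n.link_text) / max(len(n.text), 1)
        if n.tag in ("div", "article") and ratio <= max_link_ratio:
            if best is None or len(n.text) > len(best.text):
                best = n
    return best

page = """<html><body>
<div id="nav"><a href="/">home</a><a href="/a">archive</a></div>
<div id="post">A long blog post body with enough words to dominate
the text-length feature used for ranking candidates.</div>
</body></html>"""
b = DomBuilder()
b.feed(page)
body = best_body(b.root)
```

On this toy page, the navigation `div` is rejected because nearly all of its text sits inside hyperlinks, while the post `div` wins on raw text length, mirroring the spatial/content-feature intuition in the abstract.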
Low-Rank Modeling and Its Applications in Image Analysis
Low-rank modeling generally refers to a class of methods that solve problems
by representing variables of interest as low-rank matrices. It has achieved
great success in various fields including computer vision, data mining, signal
processing and bioinformatics. Recently, much progress has been made in
theories, algorithms and applications of low-rank modeling, such as exact
low-rank matrix recovery via convex programming and matrix completion applied
to collaborative filtering. These advances have drawn more and more
attention to this topic. In this paper, we review recent advances in
low-rank modeling, the state-of-the-art algorithms, and related applications in
image analysis. We first give an overview of the concept of low-rank modeling
and challenging problems in this area. Then, we summarize the models and
algorithms for low-rank matrix recovery and illustrate their advantages and
limitations with numerical experiments. Next, we introduce a few applications
of low-rank modeling in the context of image analysis. Finally, we conclude
this paper with some discussion.
Comment: To appear in ACM Computing Surveys
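Two of the convex programs this abstract mentions have standard compact forms in the low-rank modeling literature, shown here for concreteness (notation is the conventional one, not taken verbatim from the survey):

```latex
% Exact low-rank recovery (Robust PCA): split the data matrix D into a
% low-rank part A and a sparse error part E, using the nuclear norm
% \|A\|_* as a convex surrogate for rank(A) and the entrywise l1 norm
% \|E\|_1 as a surrogate for sparsity.
\min_{A,E} \; \|A\|_* + \lambda \|E\|_1
\quad \text{s.t.} \quad D = A + E

% Matrix completion (e.g. collaborative filtering): recover a matrix M
% observed only on the index set \Omega.
\min_{X} \; \|X\|_*
\quad \text{s.t.} \quad X_{ij} = M_{ij}, \;\; (i,j) \in \Omega
```

Both are convex, so they can be solved to global optimality, which is what makes the exact-recovery guarantees referenced in the abstract possible.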
Applying the current exponential de Finetti theorem to realistic quantum key distribution
In realistic quantum key distribution (QKD), Alice and Bob each receive a
quantum state from an unknown channel, whose dimension may also be unknown.
However, when discussing security we sometimes need to know the exact
dimension, since the current exponential de Finetti theorem, which is crucial
to the information-theoretic security proof, depends strongly on the dimension
and can only be applied to the finite-dimensional case. Here we address this
problem in detail. We show that if the POVM elements corresponding to Alice's
and Bob's measurement results can be well described in a finite-dimensional
subspace with sufficiently small error, then the dimensions of Alice's and
Bob's states can be regarded as effectively finite. Since security is defined
through the smooth entropy, which is continuous in the density matrix, a small
error in the state implies only a small change in security. The security of
the unknown-dimensional system can then be established. Finally, we prove
that for heterodyne-detection continuous-variable QKD and differential phase
shift QKD, the collective attack is optimal in the infinite-key-size limit.
Comment: 11 pages, 2 figures, detailed version, applications added
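The continuity argument in this abstract can be sketched as follows; the notation is a standard form from the smooth-entropy literature and is an illustrative reconstruction, not taken verbatim from the paper:

```latex
% Suppose the true state \rho is \delta-close in trace distance to a
% state \tilde\rho supported on a finite-dimensional subspace:
\tfrac{1}{2}\,\|\rho - \tilde\rho\|_1 \;\le\; \delta .
% By the triangle inequality, every state \epsilon-close to \tilde\rho
% is (\epsilon+\delta)-close to \rho, so the smooth min-entropies
% (which quantify the extractable secure key) satisfy
H_{\min}^{\,\epsilon+\delta}(\rho) \;\ge\; H_{\min}^{\,\epsilon}(\tilde\rho) .
% Hence a small error in approximating the state by a finite-dimensional
% one translates into only a small change in the security quantity.
```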
Navigation Objects Extraction for Better Content Structure Understanding
Existing works for extracting navigation objects from webpages focus on
navigation menus, so as to reveal the information architecture of the site.
However, web 2.0 sites such as social networks and e-commerce portals are
making the content structure of a web site increasingly difficult to
understand. Dynamic and personalized elements, such as top stories and
recommended lists in a webpage, are vital to understanding the dynamic nature of web
2.0 sites. To better understand the content structure in web 2.0 sites, in this
paper we propose a new extraction method for navigation objects in a webpage.
Our method extracts not only the static navigation menus, but also the
dynamic and personalized page-specific navigation lists. Since the navigation
objects in a webpage naturally come in blocks, we first cluster hyperlinks into
different blocks by exploiting spatial locations of hyperlinks, the
hierarchical structure of the DOM-tree and the hyperlink density. Then we
identify navigation objects from those blocks using the SVM classifier with
novel features such as anchor text length. Experiments on real-world data
sets with webpages from various domains and styles verify the effectiveness
of our method.
Comment: 2017 IEEE/WIC/ACM International Conference on Web Intelligence (WI)
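The block-clustering step this abstract describes can be sketched with the standard-library `HTMLParser`: group hyperlinks by their nearest block-level DOM container, then flag containers holding several short anchors as navigation candidates. The fixed thresholds below are illustrative stand-ins for the SVM classifier, and all names are hypothetical, not from the paper.

```python
# Sketch: cluster hyperlinks by nearest block-level container, then
# flag high-density blocks with short anchor texts as navigation.
from html.parser import HTMLParser
from collections import defaultdict

BLOCK_TAGS = {"ul", "ol", "nav", "div", "td", "body"}

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []                  # open (tag, id) pairs
        self.count = 0
        self.links = defaultdict(list)   # block id -> anchor texts
        self.in_a = False
        self.anchor = ""

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_a, self.anchor = True, ""
        else:
            self.count += 1
            self.stack.append((tag, self.count))

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_a = False
            # attach the link to its nearest block-level container
            for t, i in reversed(self.stack):
                if t in BLOCK_TAGS:
                    self.links[(t, i)].append(self.anchor.strip())
                    break
        elif self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.in_a:
            self.anchor += data

def navigation_blocks(links, min_links=3, max_anchor_len=20):
    """A block is navigation-like if it holds several short anchors."""
    return {path: texts for path, texts in links.items()
            if len(texts) >= min_links
            and sum(map(len, texts)) / len(texts) <= max_anchor_len}

page = """<html><body>
<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li>
<li><a href="/about">About</a></li></ul>
<p>Read our <a href="/r">latest in-depth report on web content
structure understanding</a> for details.</p>
</body></html>"""
c = LinkCollector()
c.feed(page)
nav = navigation_blocks(c.links)
```

On this toy page, the `ul` block is flagged as navigation (three short anchors), while the lone long-anchored link inside the paragraph is not, matching the hyperlink-density and anchor-length features named in the abstract.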
