Effective Blog Pages Extractor for Better UGC Accessing
Blogs are becoming an increasingly popular medium for publishing information.
Besides the main content, most blog pages nowadays also contain noisy
information such as advertisements. Removing these unrelated elements not only
improves the user experience but also helps adapt the content to various
devices such as mobile phones. Although template-based extractors are highly
accurate, they can be expensive, since a large number of templates must be
developed, and they fail once a template is updated. To address
these issues, we present a novel template-independent content extractor for
blog pages. First, we convert a blog page into a DOM tree, in which every
element of the page, including the title and body blocks, corresponds to a
subtree. Then we construct subtree candidate sets for the title and body
blocks respectively, and extract both spatial and content features for the
elements contained in each subtree. SVM classifiers for the title and body
blocks are trained on these features. Finally, the classifiers are used to
extract the main content
from blog pages. We test our extractor on 2,250 blog pages crawled from nine
blog sites with clearly different styles and templates. Experimental results
verify the effectiveness of our extractor.
Comment: 2016 3rd International Conference on Information Science and Control Engineering (ICISCE)
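The candidate-generation step described in this abstract can be sketched in plain Python. The sketch below walks a DOM built with the standard-library `HTMLParser`, credits text and link text to every enclosing subtree, and picks a body candidate by text length and link density; the hand-set `max_link_ratio` threshold is an illustrative stand-in for the trained SVM classifier, and all names here are hypothetical, not from the paper.

```python
# Sketch: score each DOM subtree by text length and link density and
# keep the best "body" candidate. A fixed threshold replaces the SVM.
from html.parser import HTMLParser

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        self.text, self.link_text = "", ""

class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        if self.cur.parent:
            self.cur = self.cur.parent

    def handle_data(self, data):
        n = self.cur
        while n:  # credit the text to every enclosing subtree
            n.text += data
            if self.in_link:
                n.link_text += data
            n = n.parent

def best_body(root, max_link_ratio=0.3):
    """Return the subtree with the most text whose link density is low."""
    best, stack = None, [root]
    while stack:
        n = stack.pop()
        stack.extend(n.children)
        ratio = len(n.link_text) / max(len(n.text), 1)
        if n.tag in ("div", "article") and ratio <= max_link_ratio:
            if best is None or len(n.text) > len(best.text):
                best = n
    return best

page = """<html><body>
<div id="nav"><a href="/">home</a><a href="/a">archive</a></div>
<div id="post">A long blog post body with enough words to dominate
the text-length feature used for ranking candidates.</div>
</body></html>"""
b = DomBuilder()
b.feed(page)
body = best_body(b.root)
```

On this toy page, the navigation `div` is rejected because nearly all of its text sits inside hyperlinks, while the post `div` wins on raw text length, mirroring the spatial/content-feature intuition in the abstract.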
Low-Rank Modeling and Its Applications in Image Analysis
Low-rank modeling generally refers to a class of methods that solve problems
by representing variables of interest as low-rank matrices. It has achieved
great success in various fields including computer vision, data mining, signal
processing and bioinformatics. Recently, much progress has been made in
theories, algorithms and applications of low-rank modeling, such as exact
low-rank matrix recovery via convex programming and matrix completion applied
to collaborative filtering. These advances have drawn more and more
attention to this topic. In this paper, we review recent advances in
low-rank modeling, the state-of-the-art algorithms, and related applications in
image analysis. We first give an overview of the concept of low-rank modeling
and challenging problems in this area. Then, we summarize the models and
algorithms for low-rank matrix recovery and illustrate their advantages and
limitations with numerical experiments. Next, we introduce a few applications
of low-rank modeling in the context of image analysis. Finally, we conclude
this paper with some discussion.
Comment: To appear in ACM Computing Surveys
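Two of the convex programs this abstract mentions have standard compact forms in the low-rank modeling literature, shown here for concreteness (notation is the conventional one, not taken verbatim from the survey):

```latex
% Exact low-rank recovery (Robust PCA): split the data matrix D into a
% low-rank part A and a sparse error part E, using the nuclear norm
% \|A\|_* as a convex surrogate for rank(A) and the entrywise l1 norm
% \|E\|_1 as a surrogate for sparsity.
\min_{A,E} \; \|A\|_* + \lambda \|E\|_1
\quad \text{s.t.} \quad D = A + E

% Matrix completion (e.g. collaborative filtering): recover a matrix M
% observed only on the index set \Omega.
\min_{X} \; \|X\|_*
\quad \text{s.t.} \quad X_{ij} = M_{ij}, \;\; (i,j) \in \Omega
```

Both are convex, so they can be solved to global optimality, which is what makes the exact-recovery guarantees referenced in the abstract possible.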
Applying the current exponential de Finetti theorem to realistic quantum key distribution
In realistic quantum key distribution (QKD), Alice and Bob each receive a
quantum state from an unknown channel, whose dimension may also be unknown.
However, when discussing security we sometimes need to know the exact
dimension, since the current exponential de Finetti theorem, which is crucial
to the information-theoretic security proof, depends strongly on the dimension
and can only be applied to the finite-dimensional case. Here we address this
problem in detail. We show that if the POVM elements corresponding to Alice's
and Bob's measurement results can be well described in a finite-dimensional
subspace with sufficiently small error, then the dimensions of Alice's and
Bob's states can be regarded as effectively finite. Since security is defined
through the smooth entropy, which is continuous in the density matrix, a small
error in the state implies only a small change in security. The security of
the unknown-dimensional system can then be established. Finally, we prove
that for heterodyne-detection continuous-variable QKD and differential phase
shift QKD, the collective attack is optimal in the infinite-key-size limit.
Comment: 11 pages, 2 figures, detailed version, applications added
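The continuity argument in this abstract can be sketched as follows; the notation is a standard form from the smooth-entropy literature and is an illustrative reconstruction, not taken verbatim from the paper:

```latex
% Suppose the true state \rho is \delta-close in trace distance to a
% state \tilde\rho supported on a finite-dimensional subspace:
\tfrac{1}{2}\,\|\rho - \tilde\rho\|_1 \;\le\; \delta .
% By the triangle inequality, every state \epsilon-close to \tilde\rho
% is (\epsilon+\delta)-close to \rho, so the smooth min-entropies
% (which quantify the extractable secure key) satisfy
H_{\min}^{\,\epsilon+\delta}(\rho) \;\ge\; H_{\min}^{\,\epsilon}(\tilde\rho) .
% Hence a small error in approximating the state by a finite-dimensional
% one translates into only a small change in the security quantity.
```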
Navigation Objects Extraction for Better Content Structure Understanding
Existing works for extracting navigation objects from webpages focus on
navigation menus, so as to reveal the information architecture of the site.
However, web 2.0 sites such as social networks and e-commerce portals are
making the content structure of a web site increasingly difficult to
understand. Dynamic and personalized elements, such as top stories and
recommended lists in a webpage, are vital to understanding the dynamic nature of web
2.0 sites. To better understand the content structure in web 2.0 sites, in this
paper we propose a new extraction method for navigation objects in a webpage.
Our method extracts not only the static navigation menus, but also the
dynamic and personalized page-specific navigation lists. Since the navigation
objects in a webpage naturally come in blocks, we first cluster hyperlinks into
different blocks by exploiting spatial locations of hyperlinks, the
hierarchical structure of the DOM-tree and the hyperlink density. Then we
identify navigation objects from those blocks using the SVM classifier with
novel features such as anchor text length. Experiments on real-world data
sets with webpages from various domains and styles verify the effectiveness
of our method.
Comment: 2017 IEEE/WIC/ACM International Conference on Web Intelligence (WI)
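The block-clustering step this abstract describes can be sketched with the standard-library `HTMLParser`: group hyperlinks by their nearest block-level DOM container, then flag containers holding several short anchors as navigation candidates. The fixed thresholds below are illustrative stand-ins for the SVM classifier, and all names are hypothetical, not from the paper.

```python
# Sketch: cluster hyperlinks by nearest block-level container, then
# flag high-density blocks with short anchor texts as navigation.
from html.parser import HTMLParser
from collections import defaultdict

BLOCK_TAGS = {"ul", "ol", "nav", "div", "td", "body"}

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []                  # open (tag, id) pairs
        self.count = 0
        self.links = defaultdict(list)   # block id -> anchor texts
        self.in_a = False
        self.anchor = ""

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_a, self.anchor = True, ""
        else:
            self.count += 1
            self.stack.append((tag, self.count))

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_a = False
            # attach the link to its nearest block-level container
            for t, i in reversed(self.stack):
                if t in BLOCK_TAGS:
                    self.links[(t, i)].append(self.anchor.strip())
                    break
        elif self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.in_a:
            self.anchor += data

def navigation_blocks(links, min_links=3, max_anchor_len=20):
    """A block is navigation-like if it holds several short anchors."""
    return {path: texts for path, texts in links.items()
            if len(texts) >= min_links
            and sum(map(len, texts)) / len(texts) <= max_anchor_len}

page = """<html><body>
<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li>
<li><a href="/about">About</a></li></ul>
<p>Read our <a href="/r">latest in-depth report on web content
structure understanding</a> for details.</p>
</body></html>"""
c = LinkCollector()
c.feed(page)
nav = navigation_blocks(c.links)
```

On this toy page, the `ul` block is flagged as navigation (three short anchors), while the lone long-anchored link inside the paragraph is not, matching the hyperlink-density and anchor-length features named in the abstract.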
