Second-order Temporal Pooling for Action Recognition
Deep learning models for video-based action recognition usually generate
features for short clips (consisting of a few frames); such clip-level features
are aggregated to video-level representations by computing statistics on these
features. Typically, zeroth-order (max) or first-order (average) statistics are
used. In this paper, we explore the benefits of using second-order statistics.
Specifically, we propose a novel end-to-end learnable feature aggregation
scheme, dubbed temporal correlation pooling, that generates an action descriptor
for a video sequence by capturing the similarities between the temporal
evolution of clip-level CNN features computed across the video. Such a
descriptor, while being computationally cheap, also naturally encodes the
co-activations of multiple CNN features, thereby providing a richer
characterization of actions than their first-order counterparts. We also
propose higher-order extensions of this scheme by computing correlations after
embedding the CNN features in a reproducing kernel Hilbert space. We provide
experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained
datasets such as MPII Cooking activities and JHMDB, as well as the recent
Kinetics-600. Our results demonstrate the advantages of higher-order pooling
schemes that, when combined with hand-crafted features (as is standard practice),
achieve state-of-the-art accuracy.
Comment: Accepted in the International Journal of Computer Vision (IJCV)
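As a rough illustration of the difference between first-order and second-order temporal pooling described in this abstract, the sketch below pools a handful of clip-level feature vectors in two ways. It is not the authors' implementation: the function names, the use of Pearson correlation as the similarity measure, and the plain-list data layout are assumptions made here for illustration, and the end-to-end learning and kernelized extensions are not shown.

```python
import math


def average_pooling(clips):
    """First-order pooling: mean of each feature over the T clips."""
    n_clips = len(clips)
    dim = len(clips[0])
    return [sum(clip[k] for clip in clips) / n_clips for k in range(dim)]


def temporal_correlation_pooling(clips):
    """Second-order pooling (illustrative): a d x d matrix whose (i, j)
    entry is the Pearson correlation between the temporal trajectories
    of features i and j across the T clips."""
    n_clips = len(clips)
    dim = len(clips[0])
    # trajectories[k] is the evolution of feature k over time
    trajectories = [[clip[k] for clip in clips] for k in range(dim)]
    centered = []
    for traj in trajectories:
        mean = sum(traj) / n_clips
        centered.append([v - mean for v in traj])
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]

    pooled = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        for j in range(dim):
            if norms[i] == 0.0 or norms[j] == 0.0:
                continue  # constant trajectory: correlation undefined, keep 0
            dot = sum(a * b for a, b in zip(centered[i], centered[j]))
            pooled[i][j] = dot / (norms[i] * norms[j])
    return pooled
```

For three clips with features `[1, 2, 3]`, `[2, 4, 2]` and `[3, 6, 1]`, the average-pooled vector only records the mean activations, whereas the correlation matrix also records that the first two features rise together over time while the third falls.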
Unsupervised Human Action Detection by Action Matching
We propose a new task of unsupervised action detection by action matching.
Given two long videos, the objective is to temporally detect all pairs of
matching video segments. A pair of video segments is matched if the two segments
share the same human action. The task is category independent (it does not matter
what action is being performed), and no supervision is used to discover such video
segments. Unsupervised action detection by action matching allows us to align
videos in a meaningful manner. As such, it can be used to discover new action
categories or as an action proposal technique within, say, an action detection
pipeline. Moreover, it is a useful pre-processing step for generating video
highlights, e.g., from sports videos.
We present an effective and efficient method for unsupervised action
detection. We use an unsupervised temporal encoding method and exploit the
temporal consistency in human actions to obtain candidate action segments. We
evaluate our method on this challenging task using three activity recognition
benchmarks, namely, the MPII Cooking activities dataset, the THUMOS15 action
detection benchmark and a new dataset called the IKEA dataset. On the MPII
Cooking dataset we detect action segments with a precision of 21.6% and recall
of 11.7% over 946 long video pairs and over 5000 ground truth action segments.
Similarly, on the THUMOS dataset we obtain 18.4% precision and 25.1% recall over
5094 ground truth action segment pairs.
Comment: IEEE International Conference on Computer Vision and Pattern
Recognition CVPR 2017 Workshop
The Allure of Celebrities: Unpacking Their Polysemic Consumer Appeal
The file attached to this record is the author's final peer reviewed version.
To explain their deep resonance with consumers, this paper unpacks the individual constituents of a celebrity’s polysemic appeal. While celebrities are traditionally theorised as unidimensional ‘semiotic receptacles of cultural meaning’, we conceptualise them here instead as human beings/performers with a multi-constitutional, polysemic consumer appeal.
Supporting evidence is drawn from autoethnographic data collected over a total period of 25 months and structured through a hermeneutic analysis.
In ‘rehumanising’ the celebrity, the study finds that each celebrity offers the individual consumer a unique and very personal parasocial appeal as a) the performer, b) the ‘private’ person behind the public performer, c) the tangible manifestation of either through products, and d) the social link to other consumers. The stronger these constituents, individually or symbiotically, appeal to the consumer’s personal desires the more s/he feels emotionally attached to this particular celebrity.
Although using autoethnography means that the breadth of collected data is limited, the depth of insight this approach garners sufficiently unpacks the polysemic appeal of celebrities to consumers.
The findings encourage talent agents, publicists and marketing managers to reconsider underlying assumptions in their talent management and/or celebrity endorsement practices. While prior research on celebrity appeal has tended to enshrine celebrities in a “dehumanised” structuralist semiosis, which erases the very idea of individualised consumer meanings, this paper reveals the multi-constitutional polysemy of any particular celebrity’s personal appeal as a performer and human being to any particular consumer.
Generalized Rank Pooling for Activity Recognition
Most popular deep models for action recognition split video sequences into
short sub-sequences consisting of a few frames; frame-based features are then
pooled for recognizing the activity. Usually, this pooling step discards the
temporal order of the frames, which could otherwise be used for better
recognition. To this end, we propose a novel pooling method, generalized
rank pooling (GRP), that takes as input features from the intermediate layers
of a CNN that is trained on tiny sub-sequences, and produces as output the
parameters of a subspace which (i) provides a low-rank approximation to the
features and (ii) preserves their temporal order. We propose to use these
parameters as a compact representation for the video sequence, which is then
used in a classification setup. We formulate an objective for computing this
subspace as a Riemannian optimization problem on the Grassmann manifold, and
propose an efficient conjugate gradient scheme for solving it. Experiments on
several activity recognition datasets show that our scheme leads to
state-of-the-art performance.
Comment: Accepted at IEEE International Conference on Computer Vision and
Pattern Recognition (CVPR), 201
DeepPermNet: Visual Permutation Learning
We present a principled approach to uncover the structure of visual data by
solving a novel deep learning task coined visual permutation learning. The goal
of this task is to find the permutation that recovers the structure of data
from shuffled versions of it. In the case of natural images, this task boils
down to recovering the original image from patches shuffled by an unknown
permutation matrix. Unfortunately, permutation matrices are discrete, thereby
posing difficulties for gradient-based methods. To this end, we resort to a
continuous approximation of these matrices using doubly-stochastic matrices
which we generate from standard CNN predictions using Sinkhorn iterations.
Unrolling these iterations in a Sinkhorn network layer, we propose DeepPermNet,
an end-to-end CNN model for this task. The utility of DeepPermNet is
demonstrated on two challenging computer vision problems, namely, (i) relative
attributes learning and (ii) self-supervised representation learning. Our
results show state-of-the-art performance on the Public Figures and OSR
benchmarks for (i) and on the classification and segmentation tasks on the
PASCAL VOC dataset for (ii).
Comment: Accepted in IEEE International Conference on Computer Vision and
Pattern Recognition CVPR 201
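The Sinkhorn iterations mentioned in this abstract can be sketched as alternating row and column normalization of a positive square matrix, which drives it towards a doubly-stochastic matrix (a continuous relaxation of a permutation matrix). The snippet below is a minimal stand-alone sketch, not the DeepPermNet layer itself: the function name, the exponential used to make raw scores positive, and the fixed iteration count are assumptions chosen here for illustration, and no gradients or network are involved.

```python
import math


def sinkhorn(scores, n_iters=50):
    """Map a square matrix of real-valued scores to an approximately
    doubly-stochastic matrix by alternating row/column normalization.

    Assumption for illustration: scores are made strictly positive
    with exp(); the source does not specify this step.
    """
    n = len(scores)
    m = [[math.exp(v) for v in row] for row in scores]
    for _ in range(n_iters):
        # Row normalization: every row sums to 1
        m = [[v / sum(row) for v in row] for row in m]
        # Column normalization: every column sums to 1
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m
```

Calling `sinkhorn` on a small score matrix, for example one predicted for three shuffled patches, returns a matrix with positive entries whose rows and columns each sum to approximately one. Because every step is a differentiable division, the same iterations can in principle be unrolled inside a network, which is the idea the abstract describes.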
