2,691 research outputs found

    Second-order Temporal Pooling for Action Recognition

    Full text link
    Deep learning models for video-based action recognition usually generate features for short clips (consisting of a few frames); such clip-level features are aggregated to video-level representations by computing statistics on these features. Typically zero-th (max) or the first-order (average) statistics are used. In this paper, we explore the benefits of using second-order statistics. Specifically, we propose a novel end-to-end learnable feature aggregation scheme, dubbed temporal correlation pooling that generates an action descriptor for a video sequence by capturing the similarities between the temporal evolution of clip-level CNN features computed across the video. Such a descriptor, while being computationally cheap, also naturally encodes the co-activations of multiple CNN features, thereby providing a richer characterization of actions than their first-order counterparts. We also propose higher-order extensions of this scheme by computing correlations after embedding the CNN features in a reproducing kernel Hilbert space. We provide experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained datasets such as MPII Cooking activities and JHMDB, as well as the recent Kinetics-600. Our results demonstrate the advantages of higher-order pooling schemes that when combined with hand-crafted features (as is standard practice) achieves state-of-the-art accuracy.Comment: Accepted in the International Journal of Computer Vision (IJCV

    Unsupervised Human Action Detection by Action Matching

    Full text link
    We propose a new task of unsupervised action detection by action matching. Given two long videos, the objective is to temporally detect all pairs of matching video segments. A pair of video segments are matched if they share the same human action. The task is category independent---it does not matter what action is being performed---and no supervision is used to discover such video segments. Unsupervised action detection by action matching allows us to align videos in a meaningful manner. As such, it can be used to discover new action categories or as an action proposal technique within, say, an action detection pipeline. Moreover, it is a useful pre-processing step for generating video highlights, e.g., from sports videos. We present an effective and efficient method for unsupervised action detection. We use an unsupervised temporal encoding method and exploit the temporal consistency in human actions to obtain candidate action segments. We evaluate our method on this challenging task using three activity recognition benchmarks, namely, the MPII Cooking activities dataset, the THUMOS15 action detection benchmark and a new dataset called the IKEA dataset. On the MPII Cooking dataset we detect action segments with a precision of 21.6% and recall of 11.7% over 946 long video pairs and over 5000 ground truth action segments. Similarly, on THUMOS dataset we obtain 18.4% precision and 25.1% recall over 5094 ground truth action segment pairs.Comment: IEEE International Conference on Computer Vision and Pattern Recognition CVPR 2017 Workshop

    The Allure of Celebrities: Unpacking Their Polysemic Consumer Appeal

    Get PDF
    The file attached to this record is the author's final peer reviewed version.To explain their deep resonance with consumers this paper unpacks the individual constituents of a celebrity’s polysemic appeal. While celebrities are traditionally theorised as unidimensional ‘semiotic receptacles of cultural meaning’, we conceptualise them here instead as human beings/performers with a multi-constitutional, polysemic consumer appeal. Supporting evidence is drawn from autoethnographic data collected over a total period of 25 months and structured through a hermeneutic analysis. In ‘rehumanising’ the celebrity, the study finds that each celebrity offers the individual consumer a unique and very personal parasocial appeal as a) the performer, b) the ‘private’ person behind the public performer, c) the tangible manifestation of either through products, and d) the social link to other consumers. The stronger these constituents, individually or symbiotically, appeal to the consumer’s personal desires the more s/he feels emotionally attached to this particular celebrity. Although using autoethnography means that the breadth of collected data is limited, the depth of insight this approach garners sufficiently unpacks the polysemic appeal of celebrities to consumers. The findings encourage talent agents, publicists and marketing managers to reconsider underlying assumptions in their talent management and/or celebrity endorsement practices. While prior research on celebrity appeal has tended to enshrine celebrities in a “dehumanised” structuralist semiosis, which erases the very idea of individualised consumer meanings, this paper reveals the multi-constitutional polysemy of any particular celebrity’s personal appeal as a performer and human being to any particular consumer

    Generalized Rank Pooling for Activity Recognition

    Full text link
    Most popular deep models for action recognition split video sequences into short sub-sequences consisting of a few frames; frame-based features are then pooled for recognizing the activity. Usually, this pooling step discards the temporal order of the frames, which could otherwise be used for better recognition. Towards this end, we propose a novel pooling method, generalized rank pooling (GRP), that takes as input, features from the intermediate layers of a CNN that is trained on tiny sub-sequences, and produces as output the parameters of a subspace which (i) provides a low-rank approximation to the features and (ii) preserves their temporal order. We propose to use these parameters as a compact representation for the video sequence, which is then used in a classification setup. We formulate an objective for computing this subspace as a Riemannian optimization problem on the Grassmann manifold, and propose an efficient conjugate gradient scheme for solving it. Experiments on several activity recognition datasets show that our scheme leads to state-of-the-art performance.Comment: Accepted at IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 201

    DeepPermNet: Visual Permutation Learning

    Full text link
    We present a principled approach to uncover the structure of visual data by solving a novel deep learning task coined visual permutation learning. The goal of this task is to find the permutation that recovers the structure of data from shuffled versions of it. In the case of natural images, this task boils down to recovering the original image from patches shuffled by an unknown permutation matrix. Unfortunately, permutation matrices are discrete, thereby posing difficulties for gradient-based methods. To this end, we resort to a continuous approximation of these matrices using doubly-stochastic matrices which we generate from standard CNN predictions using Sinkhorn iterations. Unrolling these iterations in a Sinkhorn network layer, we propose DeepPermNet, an end-to-end CNN model for this task. The utility of DeepPermNet is demonstrated on two challenging computer vision problems, namely, (i) relative attributes learning and (ii) self-supervised representation learning. Our results show state-of-the-art performance on the Public Figures and OSR benchmarks for (i) and on the classification and segmentation tasks on the PASCAL VOC dataset for (ii).Comment: Accepted in IEEE International Conference on Computer Vision and Pattern Recognition CVPR 201
    corecore