    SIMCO: SIMilarity-based object COunting

    We present SIMCO, the first agnostic multi-class object counting approach. SIMCO starts by detecting foreground objects through a novel Mask R-CNN-based architecture, trained beforehand (just once) on a brand-new synthetic 2D-shape dataset, InShape; the idea is to highlight every object resembling a primitive 2D shape (circle, square, rectangle, etc.). Each detected object is described by a low-dimensional embedding obtained from a novel similarity-based head branch; the latter implements a triplet loss, encouraging similar objects (same 2D shape, color and scale) to map close together. Subsequently, SIMCO clusters these embeddings so that different types of objects can emerge and be counted, making SIMCO the very first multi-class unsupervised counter. Experiments show that SIMCO achieves state-of-the-art scores on counting benchmarks and can also help in many challenging image-understanding tasks.
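
    The pipeline the abstract describes has two learnable pieces: a triplet loss that pulls together detections sharing shape, color and scale, and an unsupervised clustering step that turns the resulting embeddings into per-class counts. Below is a minimal sketch of those two pieces in PyTorch; the similarity head, its dimensions and the toy tensors are hypothetical stand-ins, not the authors' implementation.

    ```python
    import torch
    import torch.nn as nn
    from sklearn.cluster import DBSCAN

    class SimilarityHead(nn.Module):
        """Maps per-detection features to a low-dimensional embedding."""
        def __init__(self, in_dim=256, emb_dim=32):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, emb_dim))

        def forward(self, x):
            return nn.functional.normalize(self.net(x), dim=-1)

    head = SimilarityHead()
    loss_fn = nn.TripletMarginLoss(margin=0.2)

    # Anchor/positive share 2D shape + color + scale; negative differs.
    anchor, positive, negative = (torch.randn(16, 256) for _ in range(3))
    loss = loss_fn(head(anchor), head(positive), head(negative))
    loss.backward()

    # At inference: embed all detections, cluster, count per cluster.
    with torch.no_grad():
        emb = head(torch.randn(100, 256)).numpy()
    labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(emb)
    counts = {k: int((labels == k).sum()) for k in set(labels) if k != -1}
    print(counts)  # one count per emergent object class
    ```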

    The pictures we like are our image: continuous mapping of favorite pictures into self-assessed and attributed personality traits

    Flickr allows its users to tag the pictures they like as “favorite”. As a result, many users of the popular photo-sharing platform produce galleries of favorite pictures. This article proposes new approaches, based on Computational Aesthetics, capable of inferring the personality traits of Flickr users from such galleries. In particular, the approaches map low-level features extracted from the pictures into numerical scores corresponding to the Big-Five traits, both self-assessed and attributed. The experiments were performed over 60,000 pictures tagged as favorite by 300 users (the PsychoFlickr Corpus). The results show that it is possible to predict both self-assessed and attributed traits beyond chance. In line with the state of the art in Personality Computing, the latter are predicted with higher effectiveness (correlation up to 0.68 between actual and predicted traits).
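
    The core mapping is a regression from per-user aggregated picture features to a trait score, evaluated by the correlation between actual and predicted traits. A minimal sketch, assuming ridge regression and synthetic features in place of the real Computational Aesthetics descriptors:

    ```python
    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 17))                       # per-user features (toy)
    y = X @ rng.normal(size=17) + rng.normal(size=300)   # one trait score (toy)

    model = Ridge(alpha=1.0).fit(X[:200], y[:200])       # train split
    r, _ = pearsonr(y[200:], model.predict(X[200:]))     # held-out correlation
    print(f"correlation between actual and predicted trait: {r:.2f}")
    ```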

    Audio-visual foreground extraction for event characterization

    This paper presents a new method for integrating audio and visual information for scene analysis in a typical surveillance scenario, using only one camera and one monaural microphone. Visual information is analyzed by a standard visual background/foreground (BG/FG) modelling module, enhanced with a novelty detection stage and coupled with an audio BG/FG modelling scheme. The audio-visual association is performed on-line by exploiting the concept of synchrony. Experimental tests on classification and clustering of events show the potential of the proposed approach, also in comparison with the results obtained using the single modalities.
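
    Synchrony here amounts to measuring how the foreground activity of the two modalities co-varies over time. A minimal sketch of one plausible realization, a windowed Pearson correlation between the per-frame visual foreground fraction and the audio foreground energy (the signals below are toy arrays, not the paper's features):

    ```python
    import numpy as np

    def synchrony(visual_fg, audio_fg, win=25):
        """Windowed Pearson correlation between two FG activity signals."""
        scores = []
        for t in range(len(visual_fg) - win):
            v, a = visual_fg[t:t + win], audio_fg[t:t + win]
            denom = v.std() * a.std()
            scores.append(((v - v.mean()) * (a - a.mean())).mean() / denom
                          if denom > 0 else 0.0)
        return np.array(scores)

    rng = np.random.default_rng(1)
    visual = rng.random(500)                      # FG pixel fraction per frame
    audio = 0.8 * visual + 0.2 * rng.random(500)  # correlated audio FG energy
    sync = synchrony(visual, audio)
    events = np.where(sync > 0.7)[0]  # windows where the modalities co-vary
    print(len(events), "synchronous windows")
    ```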

    F-formation Detection: Individuating Free-standing Conversational Groups in Images

    Detection of groups of interacting people is a very interesting and useful task in many modern technologies, with application fields spanning from video surveillance to social robotics. In this paper we first provide a rigorous definition of group, grounded in the social sciences: this allows us to specify many kinds of groups so far neglected in the Computer Vision literature. On top of this taxonomy, we present a detailed review of the state of the art in group detection algorithms. Then, as our main contribution, we present a brand-new method for the automatic detection of groups in still images, based on a graph-cuts framework for clustering individuals; in particular, we codify in a computational sense the sociological definition of F-formation, which is very useful for encoding a group using only proxemic information: the position and orientation of people. We call the proposed method Graph-Cuts for F-formation (GCFF). We show that GCFF clearly outperforms all state-of-the-art methods in terms of different accuracy measures (some of them brand new), also demonstrating strong robustness to noise and versatility in recognizing groups of various cardinality.

    Comment: 32 pages, submitted to PLOS ONE
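
    The F-formation notion is concrete enough to sketch: each person "votes" for an o-space centre located a fixed stride along their gaze direction, and people whose candidate centres coincide form a group. GCFF optimizes this assignment with graph cuts; in the toy sketch below a simple density-based clustering of the candidate centres stands in for that step, and the stride value and coordinates are assumptions.

    ```python
    import numpy as np
    from sklearn.cluster import DBSCAN

    stride = 0.7  # metres from a person to the o-space centre (assumed)
    # (x, y, orientation theta) for five people, toy values in metres
    people = np.array([[0.0, 0.0, 0.0],
                       [1.4, 0.0, np.pi],        # facing the first person
                       [5.0, 5.0, np.pi / 2],
                       [5.0, 6.4, -np.pi / 2],   # facing each other
                       [9.0, 0.0, 0.0]])         # standing alone

    # Candidate o-space centre: position + stride along the gaze direction.
    centres = people[:, :2] + stride * np.stack(
        [np.cos(people[:, 2]), np.sin(people[:, 2])], axis=1)
    labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(centres)
    print(labels)  # [0 0 1 1 -1]: two F-formations plus a singleton
    ```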

    The Visual Social Distancing Problem

    One of the main and most effective measures to contain the recent viral outbreak is the maintenance of so-called Social Distancing (SD). To comply with this constraint, workplaces, public institutions, transports and schools will likely adopt restrictions on the minimum inter-personal distance between people. Given this scenario, it is crucial to massively measure compliance with this physical constraint in everyday life, in order to understand the reasons behind possible breaches of the distance limitations and whether they imply a threat given the scene context, all while complying with privacy policies and keeping the measurement acceptable. To this end, we introduce the Visual Social Distancing (VSD) problem, defined as the automatic estimation of the inter-personal distance from an image, together with the characterization of the related people aggregations. VSD is pivotal for a non-invasive analysis of whether people comply with the SD restriction, and for providing statistics about the level of safety of specific areas whenever this constraint is violated. We then discuss how VSD relates to previous literature in Social Signal Processing and indicate which existing Computer Vision methods can be used to tackle this problem. We conclude with future challenges related to the effectiveness of VSD systems, ethical implications and future application scenarios.

    Comment: 9 pages, 5 figures. All the authors contributed equally to this manuscript and are listed in alphabetical order. Under submission
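
    A standard way to realize the core VSD measurement is to map people's feet positions from image coordinates to the ground plane with a homography and then compare pairwise distances against the SD threshold. A minimal sketch under that assumption (the homography, detections and threshold below are placeholders):

    ```python
    import numpy as np

    H = np.eye(3)  # image-to-ground homography (from camera calibration);
                   # identity here, so the toy coordinates are already metres
    feet = np.array([[1.0, 2.0], [1.5, 2.2], [6.0, 2.0]])  # detected feet points

    def to_ground(points, H):
        """Apply a homography to 2D points (homogeneous coordinates)."""
        p = np.hstack([points, np.ones((len(points), 1))]) @ H.T
        return p[:, :2] / p[:, 2:3]

    ground = to_ground(feet, H)
    thresh = 1.0  # minimum inter-personal distance in metres (assumed)
    for i in range(len(ground)):
        for j in range(i + 1, len(ground)):
            d = np.linalg.norm(ground[i] - ground[j])
            if d < thresh:
                print(f"persons {i} and {j} violate SD: {d:.2f} m")
    ```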

    Looking Beyond Appearances: Synthetic Training Data for Deep CNNs in Re-identification

    Re-identification is generally carried out by encoding the appearance of a subject in terms of outfit, suggesting scenarios where people do not change their attire. In this paper we overcome this restriction by proposing a framework based on a deep convolutional neural network, SOMAnet, that additionally models other discriminative aspects, namely structural attributes of the human figure (e.g. height, obesity, gender). Our method is unique in many respects. First, SOMAnet is based on the Inception architecture, departing from the usual siamese framework. This spares expensive data preparation (pairing images across cameras) and allows an understanding of what the network has learned. Second, and most notably, the training data consist of a synthetic 100K-instance dataset, SOMAset, created by photorealistic human-body generation software. Synthetic data represent a good compromise between realistic imagery, usually not required in re-identification since surveillance cameras capture low-resolution silhouettes, and complete control of the samples, which is useful for customizing the data w.r.t. the surveillance scenario at hand, e.g. ethnicity. SOMAnet, trained on SOMAset and fine-tuned on recent re-identification benchmarks, outperforms all competitors, matching subjects even with different apparel. The combination of synthetic data with Inception architectures opens up new research avenues in re-identification.

    Comment: 14 pages
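
    The training recipe the abstract contrasts with siamese networks can be sketched as plain identity classification on an Inception backbone, with the penultimate features reused for matching at test time. The snippet below illustrates that pattern with torchvision's Inception-v3; the number of identities, input tensors and similarity measure are assumptions, not SOMAnet itself.

    ```python
    import torch
    import torch.nn as nn
    from torchvision.models import inception_v3

    num_ids = 100  # identities in the synthetic training set (toy value)
    net = inception_v3(weights=None, aux_logits=False)
    net.fc = nn.Linear(net.fc.in_features, num_ids)  # identity classifier

    # Training step: single images, no pair/triplet mining required.
    x = torch.randn(2, 3, 299, 299)
    y = torch.randint(0, num_ids, (2,))
    loss = nn.functional.cross_entropy(net(x), y)
    loss.backward()

    # Re-identification: drop the classifier, compare penultimate features.
    net.fc = nn.Identity()
    net.eval()
    with torch.no_grad():
        query = net(torch.randn(1, 3, 299, 299))
        gallery = net(torch.randn(1, 3, 299, 299))
    score = nn.functional.cosine_similarity(query, gallery)
    ```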

    Transformer Networks for Trajectory Forecasting

    Most recent successes in forecasting people's motion are based on LSTM models, and most recent progress has been achieved by modelling the social interaction among people and the interaction of people with the scene. We question the use of LSTM models and propose the novel use of Transformer Networks for trajectory forecasting. This is a fundamental switch from the sequential step-by-step processing of LSTMs to the attention-only memory mechanisms of Transformers. In particular, we consider both the original Transformer Network (TF) and the larger Bidirectional Transformer (BERT), state-of-the-art on all natural language processing tasks. Our proposed Transformers predict the trajectories of the individual people in the scene. These are "simple" models because each person is modelled separately, without any complex human-human or scene interaction terms. In particular, the TF model without bells and whistles yields the best score on TrajNet, the largest and most challenging trajectory forecasting benchmark. Additionally, its extension that predicts multiple plausible future trajectories performs on par with more engineered techniques on the 5 datasets of ETH + UCY. Finally, we show that Transformers may deal with missing observations, as may be the case with real sensor data. Code is available at https://github.com/FGiuliari/Trajectory-Transformer.

    Comment: 18 pages, 3 figures
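
    A per-person, attention-only forecaster of this kind is easy to sketch with PyTorch's vanilla nn.Transformer: embed the observed (x, y) steps, decode the future steps with a causal mask, and regress coordinates. The dimensions, horizon lengths and toy data below are assumptions (positional encodings are omitted for brevity); the authors' actual code is at the repository linked above.

    ```python
    import torch
    import torch.nn as nn

    class TrajectoryTransformer(nn.Module):
        def __init__(self, d_model=64):
            super().__init__()
            self.embed_in = nn.Linear(2, d_model)    # (x, y) -> token
            self.embed_out = nn.Linear(2, d_model)
            self.tf = nn.Transformer(d_model=d_model, nhead=4,
                                     num_encoder_layers=2,
                                     num_decoder_layers=2, batch_first=True)
            self.head = nn.Linear(d_model, 2)        # token -> (x, y)

        def forward(self, obs, future_in):
            # Causal mask: the decoder cannot peek at later future steps.
            mask = self.tf.generate_square_subsequent_mask(future_in.size(1))
            h = self.tf(self.embed_in(obs), self.embed_out(future_in),
                        tgt_mask=mask)
            return self.head(h)

    model = TrajectoryTransformer()
    obs = torch.randn(32, 8, 2)       # 8 observed steps per person
    future = torch.randn(32, 12, 2)   # 12 steps to predict
    # Teacher forcing: decoder input is the ground truth shifted by one.
    future_in = torch.cat([obs[:, -1:], future[:, :-1]], dim=1)
    loss = nn.functional.mse_loss(model(obs, future_in), future)
    loss.backward()
    ```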