
    From Indoor To Outdoor: Unsupervised Domain Adaptive Gait Recognition

    Gait recognition is an important AI task that has progressed rapidly with the development of deep learning. However, existing learning-based gait recognition methods mainly focus on a single domain, especially the constrained laboratory environment. In this paper, we study a new problem of unsupervised domain adaptive gait recognition (UDA-GR), which learns a gait identifier with supervised labels from indoor scenes (source domain) and applies it to outdoor wild scenes (target domain). For this purpose, we develop an uncertainty estimation and regularization based UDA-GR method. Specifically, we investigate the characteristics of gaits in indoor and outdoor scenes to estimate gait sample uncertainty, which is used during unsupervised fine-tuning on the target domain to alleviate the noise of the pseudo labels. We also establish a new benchmark for the proposed problem, and experimental results on it show the effectiveness of the proposed method. We will release the benchmark and source code of this work to the public.
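    A minimal sketch of the kind of uncertainty-weighted fine-tuning the abstract describes: per-sample uncertainty down-weights the cross-entropy on noisy pseudo labels. The weighting rule and all names here (uncertainty_weighted_loss, the commented helpers) are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_loss(logits, pseudo_labels, uncertainty):
    """Cross-entropy on pseudo labels, down-weighted by per-sample uncertainty.

    uncertainty: tensor in [0, 1]; higher means the pseudo label is less
    trustworthy, so its gradient contribution is reduced.
    """
    per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")
    weights = 1.0 - uncertainty          # confident samples dominate the update
    return (weights * per_sample).sum() / weights.sum().clamp(min=1e-8)

# Hypothetical usage inside target-domain fine-tuning:
# logits = gait_model(target_clip)               # (B, num_ids)
# pseudo = cluster_assignments(target_features)  # noisy pseudo labels
# u = estimate_uncertainty(target_features)      # e.g. from indoor/outdoor gap
# loss = uncertainty_weighted_loss(logits, pseudo, u)
```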

    Unveiling the Power of Self-supervision for Multi-view Multi-human Association and Tracking

    Multi-view multi-human association and tracking (MvMHAT) is a new but important problem for multi-person scene video surveillance. It aims to track a group of people over time in each view and to identify the same person across different views at the same time, which differs from previous MOT and multi-camera MOT tasks that only consider over-time human tracking. As a result, videos for MvMHAT require more complex annotations while containing more information for self-learning. In this work, we tackle the problem with an end-to-end network built around self-supervised learning. Specifically, we propose to take advantage of the spatial-temporal self-consistency rationale by considering three properties: reflexivity, symmetry, and transitivity. Besides the reflexivity property, which naturally holds, we design self-supervised learning losses based on the symmetry and transitivity properties, for both appearance feature learning and assignment matrix optimization, to associate multiple humans over time and across views. Furthermore, to promote research on MvMHAT, we build two new large-scale benchmarks for training and testing different algorithms. Extensive experiments on the proposed benchmarks verify the effectiveness of our method. We have released the benchmark and code to the public.
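    The symmetry and transitivity properties lend themselves to simple consistency losses over soft assignment matrices between views. The mean-squared penalties below are one plausible instantiation under that assumption; the paper's exact loss forms may differ.

```python
import torch

def symmetry_loss(S_ab, S_ba):
    """Matching view a -> b should agree with matching b -> a transposed.

    S_ab: (Na, Nb) soft assignment matrix, S_ba: (Nb, Na).
    """
    return torch.mean((S_ab - S_ba.transpose(0, 1)) ** 2)

def transitivity_loss(S_ab, S_bc, S_ac):
    """Chaining a -> b -> c should be consistent with matching a -> c directly."""
    return torch.mean((S_ab @ S_bc - S_ac) ** 2)
```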

    Combining the Silhouette and Skeleton Data for Gait Recognition

    Gait recognition, a promising long-distance biometric technology, has aroused intense interest in computer vision. Existing works on gait recognition can be divided into appearance-based methods and model-based methods, which extract features from silhouettes and skeleton data, respectively. However, since appearance-based methods are greatly affected by clothing changes and carrying conditions, and model-based methods are limited by the accuracy of pose estimation approaches, gait recognition remains challenging in practical applications. To integrate the advantages of the two approaches, this paper proposes a two-branch neural network (NN): a CNN-based branch taking silhouettes as input and a GCN-based branch taking skeletons as input. In addition, two new modules are proposed in the GCN-based branch for better gait representation. First, we present a simple yet effective fully connected graph convolution operator to integrate multi-scale graph convolutions and alleviate the dependence on natural human joint connections. Second, we deploy a multi-dimension attention module named STC-Att to learn spatial, temporal, and channel-wise attention simultaneously. We evaluated the proposed two-branch neural network on the CASIA-B dataset. The experimental results show that our method achieves state-of-the-art performance under various conditions.
    Comment: The paper is under consideration at Computer Vision and Image Understanding.
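    A compact sketch of the two-branch layout: a CNN branch embeds silhouettes, a GCN branch embeds skeletons, and the two embeddings are fused. Concatenation followed by a linear head is an assumption here; the paper's fully connected graph operator and STC-Att module are omitted.

```python
import torch
import torch.nn as nn

class TwoBranchGait(nn.Module):
    """Fuse a silhouette (CNN) branch with a skeleton (GCN) branch."""
    def __init__(self, cnn_branch, gcn_branch, dim=256):
        super().__init__()
        self.cnn = cnn_branch    # silhouettes -> (B, dim)
        self.gcn = gcn_branch    # skeleton graphs -> (B, dim)
        self.head = nn.Linear(2 * dim, dim)

    def forward(self, silhouettes, skeletons):
        fused = torch.cat([self.cnn(silhouettes), self.gcn(skeletons)], dim=-1)
        return self.head(fused)  # joint gait embedding
```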

    ComCLIP: Training-Free Compositional Image and Text Matching

    Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for image-text matching because of its holistic use of natural language supervision that covers large-scale, open-world visual concepts. However, it is still challenging to adapt CLIP to compositional image and text matching -- a more demanding matching task that requires the model to understand compositional word concepts and visual components. Towards better compositional generalization in zero-shot image and text matching, in this paper we study the problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause matching failures. Therefore, we propose a novel training-free compositional CLIP model (ComCLIP). ComCLIP disentangles input images into subjects, objects, and action sub-images and composes CLIP's vision encoder and text encoder to perform evolving matching over the compositional text embedding and sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations introduced by pretrained CLIP models and dynamically assess the contribution of each entity when performing image and text matching. Experiments on compositional image-text matching on SVO and ComVG and on general image-text retrieval on Flickr8K demonstrate the effectiveness of our plug-and-play method, which boosts the zero-shot inference ability of CLIP even without further training or fine-tuning of CLIP.
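    One way to picture the composition step: score a sentence against the whole image and its subject/object/action crops, letting better-matching entities contribute more. The weighted-sum rule below is an assumed simplification of the paper's evolving matching; the embeddings are taken to be precomputed CLIP encoder outputs.

```python
import torch
import torch.nn.functional as F

def comclip_score(text_emb, whole_emb, sub_embs, weights=None):
    """Score an image-text pair from the whole image plus entity sub-images.

    text_emb:  (D,) CLIP text embedding of the full sentence
    whole_emb: (D,) CLIP image embedding of the full image
    sub_embs:  list of (D,) embeddings for subject/object/action crops
    """
    embs = torch.stack([whole_emb] + list(sub_embs))
    embs = F.normalize(embs, dim=-1)
    sims = embs @ F.normalize(text_emb, dim=-1)   # per-component similarities
    if weights is None:
        weights = torch.softmax(sims, dim=0)      # emphasize matching entities
    return (weights * sims).sum()
```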

    Nearest Neighbor Machine Translation is Meta-Optimizer on Output Projection Layer

    Nearest Neighbor Machine Translation (kNN-MT) has achieved great success in domain adaptation tasks by integrating pre-trained Neural Machine Translation (NMT) models with domain-specific token-level retrieval. However, the reasons underlying its success have not been thoroughly investigated. In this paper, we comprehensively analyze kNN-MT through theoretical and empirical studies. Initially, we provide new insights into the working mechanism of kNN-MT as an efficient technique that implicitly executes gradient descent on the output projection layer of NMT, indicating that it is a specific case of model fine-tuning. Subsequently, we conduct multi-domain experiments and word-level analysis to examine the differences in performance between kNN-MT and entire-model fine-tuning. Our findings suggest that: (1) incorporating kNN-MT with adapters yields translation performance comparable to fine-tuning on in-domain test sets, while achieving better performance on out-of-domain test sets; (2) fine-tuning significantly outperforms kNN-MT on the recall of in-domain low-frequency words, but this gap can be bridged by optimizing the context representations with additional adapter layers.
    Comment: Accepted by EMNLP 2023.
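    For reference, the standard kNN-MT decoding rule that the paper analyzes interpolates the NMT distribution with a retrieval distribution built from the k nearest datastore entries; the hyperparameter values below are placeholders.

```python
import numpy as np

def knn_mt_distribution(p_nmt, neighbor_tokens, neighbor_dists,
                        vocab_size, lam=0.5, temperature=10.0):
    """Standard kNN-MT interpolation.

    p_nmt:           (V,) NMT output distribution at the current step
    neighbor_tokens: (k,) target tokens of the retrieved datastore entries
    neighbor_dists:  (k,) L2 distances between the query and retrieved keys
    """
    w = np.exp(-np.asarray(neighbor_dists, dtype=float) / temperature)
    w /= w.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, np.asarray(neighbor_tokens), w)  # aggregate per token
    return lam * p_knn + (1.0 - lam) * p_nmt
```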

    A Benchmark of Video-Based Clothes-Changing Person Re-Identification

    Person re-identification (Re-ID) is a classical computer vision task that has achieved great progress so far. Recently, long-term Re-ID with clothes changing has attracted increasing attention. However, existing methods mainly focus on the image-based setting, where richer temporal information is overlooked. In this paper, we focus on the relatively new yet practical problem of clothes-changing video-based person re-identification (CCVReID), which is less studied. We systematically study this problem by simultaneously considering the clothes inconsistency issue and the temporal information contained in the video sequence. Based on this, we develop a two-branch confidence-aware re-ranking framework for handling the CCVReID problem. The framework integrates two branches that consider both classical appearance features and cloth-free gait features through a confidence-guided re-ranking strategy, and it serves as a baseline for further studies. We also build two new benchmark datasets for the CCVReID problem: a large-scale synthetic video dataset and a real-world one, both containing human sequences with various clothing changes. We will release the benchmark and code of this work to the public.
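    A sketch of what a confidence-guided blend of the two branches could look like at ranking time: when a clothes change is suspected, gait distances dominate. The linear blend and the names are assumptions, not the paper's exact re-ranking rule.

```python
import numpy as np

def confidence_rerank(d_app, d_gait, conf_app):
    """Blend appearance and gait distance matrices per query.

    d_app, d_gait: (Q, G) query-gallery distances from the two branches
    conf_app:      (Q,) confidence that appearance is reliable for each
                   query (low when a clothes change is suspected)
    """
    c = conf_app[:, None]
    return c * d_app + (1.0 - c) * d_gait  # rank galleries by this matrix
```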

    Decompressing Dilithium's Public Key with Fewer Signatures Using Side Channel Analysis

    The CRYSTALS-Dilithium digital signature scheme, selected by NIST as a post-quantum cryptography (PQC) standard under the name ML-DSA, employs a public key compression technique intended for performance optimization. Specifically, the module learning with errors instance $(\mathbf{A}, \mathbf{t})$ is compressed by omitting the low-order bits $\mathbf{t}_0$ of the vector $\mathbf{t}$. It was recently shown that knowledge of $\mathbf{t}_0$ enables more effective side-channel attacks on Dilithium implementations. Another recent work demonstrated a method for reconstructing $\mathbf{t}_0$ from multiple signatures. In this paper, we build on this method by applying profiled deep learning-assisted side-channel analysis to partially recover the least significant bit of $\mathbf{t}_0$ from power traces. As a result, the number of signatures required for the reconstruction of $\mathbf{t}_0$ can be reduced by roughly half. We demonstrate how the new $\mathbf{t}_0$ reconstruction method enhances the efficiency of recovering the secret key component $\mathbf{s}_1$, and thus facilitates digital signature forgery, on an ARM Cortex-M4 implementation of Dilithium.
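    For concreteness, the compression in question is Dilithium's Power2Round split with d = 13: each coefficient t of the public vector is written as t = t1 * 2^d + t0, and only t1 is published. A simplified scalar version:

```python
# Dilithium parameters: modulus q and the number of dropped low-order bits d.
Q = 8380417
D = 13

def power2round(t, d=D):
    """Split t mod Q into (t1, t0) with t = t1 * 2^d + t0 and centered t0."""
    t = t % Q
    t0 = t % (1 << d)
    if t0 > (1 << (d - 1)):          # center t0 in (-2^(d-1), 2^(d-1)]
        t0 -= (1 << d)
    t1 = (t - t0) >> d
    return t1, t0

t1, t0 = power2round(123456)
assert t1 * (1 << D) + t0 == 123456  # the dropped t0 is what the attack recovers
```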

    Side-Channel Analysis of Saber KEM Using Amplitude-Modulated EM Emanations

    In the ongoing last round of NIST's post-quantum cryptography standardization competition, side-channel analysis of the finalists is a main focus of attention. While their resistance to timing, power, and near-field electromagnetic (EM) side-channels has been thoroughly investigated, amplitude-modulated EM emanations have not been considered so far. Attacks based on amplitude-modulated EM emanations are stealthier because they exploit side-channels intertwined with the signal transmitted by an on-chip antenna, and they can therefore be mounted at a distance from the device under attack. In this paper, we present the first results of an amplitude-modulated EM side-channel analysis of one of the NIST PQ finalists, the Saber key encapsulation mechanism (KEM), implemented on the nRF52832 (ARM Cortex-M4) system-on-chip supporting Bluetooth 5. By capturing amplitude-modulated EM emanations during decapsulation, we can recover each bit of the session key with 0.91 probability on average.
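    Once a profiled model outputs a per-trace probability for each session-key bit, predictions can be aggregated across traces. Log-likelihood averaging, shown below, is an assumed aggregation rule for illustration, not necessarily the paper's procedure.

```python
import numpy as np

def recover_key_bits(bit_probs):
    """Aggregate per-trace predictions for each session-key bit.

    bit_probs: (num_traces, num_bits) estimated probability that a bit is 1,
    e.g. from a profiled classifier applied to captured EM traces.
    """
    eps = 1e-9
    llr = np.log(bit_probs + eps) - np.log(1.0 - bit_probs + eps)
    return (llr.sum(axis=0) > 0).astype(int)  # sign of the summed evidence
```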

    OmniDrones: An Efficient and Flexible Platform for Reinforcement Learning in Drone Control

    In this work, we introduce OmniDrones, an efficient and flexible platform tailored for reinforcement learning in drone control, built on NVIDIA's Omniverse Isaac Sim. It employs a bottom-up design approach that allows users to easily design and experiment with various application scenarios on top of GPU-parallelized simulations. It also offers a range of benchmark tasks, presenting challenges ranging from single-drone hovering to over-actuated system tracking. In summary, we propose an open-source drone simulation platform equipped with an extensive suite of tools for drone learning. It includes 4 drone models, 5 sensor modalities, 4 control modes, over 10 benchmark tasks, and a selection of widely used RL baselines. To showcase the capabilities of OmniDrones and to support future research, we also provide preliminary results on these benchmark tasks. We hope this platform will encourage further studies on applying RL to practical drone systems.
    Comment: Submitted to IEEE RA-L.
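    A hypothetical interaction loop in the style of GPU-parallelized RL simulators, to illustrate how such a platform is typically driven. The environment attributes and method signatures below are illustrative assumptions, not OmniDrones' actual API.

```python
import torch

def random_rollout(env, steps=100):
    """Drive a batch of simulated drones with uniformly random actions.

    Assumes a vectorized gym-style env exposing num_envs, action_dim,
    reset(), and step(); these names are placeholders.
    """
    obs = env.reset()
    for _ in range(steps):
        action = torch.rand(env.num_envs, env.action_dim) * 2 - 1  # in [-1, 1]
        obs, reward, done, info = env.step(action)
    return obs
```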

    Robust Collaborative Perception without External Localization and Clock Devices

    Consistent spatial-temporal coordination across multiple agents is fundamental for collaborative perception, which seeks to improve perception abilities through information exchange among agents. To achieve this spatial-temporal alignment, traditional methods depend on external devices to provide localization and clock signals. However, hardware-generated signals can be vulnerable to noise and potentially malicious attacks, jeopardizing the precision of spatial-temporal alignment. Rather than relying on external hardware, this work proposes a novel approach: aligning agents by recognizing the inherent geometric patterns within their perceptual data. Following this spirit, we propose a robust collaborative perception system that operates independently of external localization and clock devices. The key module of our system, FreeAlign, constructs a salient object graph for each agent based on its detected boxes and uses a graph neural network to identify common subgraphs between agents, leading to an accurate relative pose and time offset. We validate FreeAlign on both real-world and simulated datasets. The results show that the FreeAlign-empowered robust collaborative perception system performs comparably to systems relying on precise localization and clock devices.
    Comment: 6 pages, accepted to ICRA 2024.
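    After the common-subgraph search matches boxes between two agents, a relative pose can be recovered from the matched centers by a rigid alignment such as Kabsch. The sketch below covers only that last step, under the assumption of 2D box centers; FreeAlign's GNN-based graph matching is omitted.

```python
import numpy as np

def relative_pose_from_matches(pts_a, pts_b):
    """Relative pose from matched object centers of two agents (Kabsch).

    pts_a, pts_b: (N, 2) centers of boxes matched by common-subgraph search;
    returns R, t such that pts_b ~= pts_a @ R.T + t.
    """
    ca, cb = pts_a.mean(axis=0), pts_b.mean(axis=0)
    H = (pts_a - ca).T @ (pts_b - cb)    # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:             # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cb - R @ ca
    return R, t
```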