337 research outputs found
From Indoor To Outdoor: Unsupervised Domain Adaptive Gait Recognition
Gait recognition is an important AI task that has progressed rapidly
with the development of deep learning. However, existing learning-based gait
recognition methods mainly focus on a single domain, especially the
constrained laboratory environment. In this paper, we study the new problem of
unsupervised domain adaptive gait recognition (UDA-GR), in which a gait
identifier is learned with supervised labels from indoor scenes (source domain)
and applied to outdoor wild scenes (target domain). For this purpose, we
develop an uncertainty estimation and regularization based UDA-GR method.
Specifically, we investigate the characteristics of gaits in indoor and
outdoor scenes to estimate gait sample uncertainty, which is used during
unsupervised fine-tuning on the target domain to alleviate the noise of the
pseudo labels. We also establish a new benchmark for the proposed problem, and
experimental results on it show the effectiveness of the proposed method. We
will release the benchmark and source code of this work to the public.
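As a rough sketch of the idea (not the paper's actual estimator), the uncertainty scores can be used to down-weight noisy pseudo labels during target-domain fine-tuning; the model, pseudo labels, and uncertainty values below are placeholders:

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_loss(logits, pseudo_labels, uncertainty):
    """Cross-entropy on pseudo labels, down-weighted by per-sample uncertainty.

    logits:        (N, C) classifier outputs on target-domain gait samples
    pseudo_labels: (N,)   labels assigned by the source-trained model
    uncertainty:   (N,)   scores in [0, 1]; higher means a noisier pseudo label
    """
    per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")
    weights = 1.0 - uncertainty  # trust low-uncertainty samples more
    return (weights * per_sample).sum() / weights.sum().clamp(min=1e-8)
```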
Unveiling the Power of Self-supervision for Multi-view Multi-human Association and Tracking
Multi-view multi-human association and tracking (MvMHAT) is a new but
important problem for multi-person video surveillance. It aims to track a
group of people over time in each view and to identify the same person
across different views at the same time, which differs from previous MOT
and multi-camera MOT tasks that consider only over-time human tracking. As a
result, videos for MvMHAT require more complex annotations while containing
more information for self-learning. In this work, we tackle this problem with a
self-supervised-learning-aware end-to-end network. Specifically, we propose to
take advantage of the spatial-temporal self-consistency rationale by
considering three properties of reflexivity, symmetry and transitivity. Besides
the reflexivity property that naturally holds, we design the self-supervised
learning losses based on the properties of symmetry and transitivity, for both
appearance feature learning and assignment matrix optimization, to associate
the multiple humans over time and across views. Furthermore, to promote the
research on MvMHAT, we build two new large-scale benchmarks for the network
training and testing of different algorithms. Extensive experiments on the
proposed benchmarks verify the effectiveness of our method. We have released
the benchmark and code to the public.
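As an illustrative sketch (the actual loss design is in the paper), the symmetry and transitivity properties translate into consistency constraints on soft assignment matrices between views: matching view i to j should be the transpose of matching j to i, and composing i-to-j with j-to-k should agree with the direct i-to-k assignment. The matrices here are placeholders produced by any differentiable matcher:

```python
import torch

def symmetry_loss(A_ij, A_ji):
    # i->j and j->i assignments should be mutual transposes.
    return torch.mean((A_ij - A_ji.transpose(-1, -2)) ** 2)

def transitivity_loss(A_ij, A_jk, A_ik):
    # Going i->j->k should agree with the direct i->k assignment.
    return torch.mean((A_ij @ A_jk - A_ik) ** 2)
```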
Combining the Silhouette and Skeleton Data for Gait Recognition
Gait recognition, a promising long-distance biometric technology, has aroused
intense interest in computer vision. Existing works on gait recognition can be
divided into appearance-based methods and model-based methods, which extract
features from silhouettes and skeleton data, respectively. However, since
appearance-based methods are greatly affected by clothing changes and carrying
conditions, and model-based methods are limited by the accuracy of pose
estimation approaches, gait recognition remains challenging in practical
applications. In order to integrate the advantages of these two approaches, a
two-branch neural network (NN) is proposed in this paper. Our method contains
two branches, namely a CNN-based branch taking silhouettes as input and a
GCN-based branch taking skeletons as input. In addition, two new modules are
proposed in the GCN-based branch for better gait representation. First, we
present a simple yet effective fully connected graph convolution operator to
integrate the multi-scale graph convolutions and alleviate the dependence on
natural human joint connections. Second, we deploy a multi-dimension attention
module named STC-Att to learn spatial, temporal and channel-wise attention
simultaneously. We evaluated the proposed two-branch neural network on the
CASIA-B dataset. The experimental results show that our method achieves
state-of-the-art performance in various conditions.
Comment: The paper is under consideration at Computer Vision and Image Understanding.
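A minimal structural sketch of such a two-branch design follows; the encoder internals are placeholders and do not reproduce the paper's fully connected graph convolution or STC-Att modules:

```python
import torch
import torch.nn as nn

class TwoBranchGait(nn.Module):
    """Fuses a silhouette (CNN) branch and a skeleton branch into one embedding."""
    def __init__(self, sil_dim=256, skel_dim=256, num_joints=17):
        super().__init__()
        self.cnn_branch = nn.Sequential(            # stand-in silhouette encoder
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, sil_dim))
        self.skel_branch = nn.Linear(num_joints * 3, skel_dim)  # stand-in for the GCN

    def forward(self, silhouette, skeleton):
        # silhouette: (N, 1, H, W); skeleton: (N, num_joints, 3) as (x, y, conf)
        f_sil = self.cnn_branch(silhouette)
        f_skel = self.skel_branch(skeleton.flatten(1))
        return torch.cat([f_sil, f_skel], dim=1)    # fused gait representation
```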
ComCLIP: Training-Free Compositional Image and Text Matching
Contrastive Language-Image Pretraining (CLIP) has demonstrated great
zero-shot performance for image-text matching because of its holistic use of
natural language supervision that covers large-scale, open-world visual
concepts. However, it is still challenging to adapt CLIP to compositional image
and text matching -- a more challenging matching task that requires the model
to understand compositional word concepts and visual components.
Towards better compositional generalization in zero-shot image and text
matching, in this paper, we study the problem from a causal perspective: the
erroneous semantics of individual entities are essentially confounders that
cause the matching failure. Therefore, we propose a novel training-free
compositional CLIP model (ComCLIP). ComCLIP disentangles input images into
subjects, objects, and action sub-images and composes CLIP's vision encoder and
text encoder to perform evolving matching over compositional text embedding and
sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations
introduced by the pretrained CLIP models and dynamically assess the
contribution of each entity when performing image and text matching.
Experiments on compositional image-text matching on SVO and ComVG and general
image-text retrieval on Flickr8K demonstrate the effectiveness of our
plug-and-play method, which boosts the zero-shot inference ability of CLIP even
without further training or fine-tuning of CLIP.
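A hedged sketch of the scoring flow: encode the full image plus the subject/object/action sub-images, encode the caption, and aggregate cosine similarities. `encode_image` and `encode_text` are assumed CLIP-style callables, and the weighting below is a simplified stand-in for the paper's evolving matching; the disentangling step that produces the sub-images is not shown:

```python
import torch
import torch.nn.functional as F

def compositional_score(encode_image, encode_text, full_img, sub_imgs, caption):
    txt = F.normalize(encode_text(caption), dim=-1)    # (1, D) text embedding
    imgs = torch.stack([full_img, *sub_imgs])          # (1 + K, C, H, W)
    embs = F.normalize(encode_image(imgs), dim=-1)     # (1 + K, D)
    sims = (embs @ txt.t()).squeeze(-1)                # cosine similarities
    weights = torch.softmax(sims, dim=0)               # emphasize well-matching parts
    return (weights * sims).sum()                      # aggregated matching score
```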
Nearest Neighbor Machine Translation is Meta-Optimizer on Output Projection Layer
Nearest Neighbor Machine Translation (kNN-MT) has achieved great success in
domain adaptation tasks by integrating pre-trained Neural Machine Translation
(NMT) models with domain-specific token-level retrieval. However, the reasons
underlying its success have not been thoroughly investigated. In this paper, we
comprehensively analyze kNN-MT through theoretical and empirical studies.
Initially, we provide new insights into the working mechanism of kNN-MT as an
efficient technique to implicitly execute gradient descent on the output
projection layer of NMT, indicating that it is a specific case of model
fine-tuning. Subsequently, we conduct multi-domain experiments and word-level
analysis to examine the differences in performance between kNN-MT and
entire-model fine-tuning. Our findings suggest that: (1) Incorporating kNN-MT
with adapters yields comparable translation performance to fine-tuning on
in-domain test sets, while achieving better performance on out-of-domain test
sets; (2) Fine-tuning significantly outperforms kNN-MT on the recall of
in-domain low-frequency words, but this gap could be bridged by optimizing the
context representations with additional adapter layers.
Comment: Accepted by EMNLP 2023.
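For context, the standard kNN-MT decoding step interpolates a retrieval-based distribution with the NMT softmax; the paper's contribution is showing that this implicitly performs gradient descent on the output projection layer. A minimal sketch, with datastore retrieval (e.g., over decoder hidden states) abstracted into precomputed distances and token ids:

```python
import torch

def knn_mt_probs(nmt_probs, knn_dists, knn_tokens, vocab_size,
                 temperature=10.0, lam=0.5):
    # nmt_probs:  (V,)  base NMT distribution over the vocabulary
    # knn_dists:  (k,)  distances of the k retrieved datastore neighbors
    # knn_tokens: (k,)  long tensor of target-token ids stored with them
    knn_weights = torch.softmax(-knn_dists / temperature, dim=0)
    knn_probs = torch.zeros(vocab_size).scatter_add_(0, knn_tokens, knn_weights)
    return lam * knn_probs + (1.0 - lam) * nmt_probs   # interpolated distribution
```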
A Benchmark of Video-Based Clothes-Changing Person Re-Identification
Person re-identification (Re-ID) is a classical computer vision task and has
achieved great progress so far. Recently, long-term Re-ID with clothes-changing
has attracted increasing attention. However, existing methods mainly focus on
the image-based setting, where richer temporal information is overlooked. In this
paper, we focus on the relatively new yet practical problem of clothes-changing
video-based person re-identification (CCVReID), which is less studied. We
systematically study this problem by simultaneously considering both the
challenge of clothes inconsistency and the temporal information contained in
the video sequence. Based on this, we develop a
two-branch confidence-aware re-ranking framework for handling the CCVReID
problem. The proposed framework integrates two branches that consider both the
classical appearance features and cloth-free gait features through a
confidence-guided re-ranking strategy. This method provides a baseline for
further studies. Also, we build two new benchmark datasets for the CCVReID
problem, including a large-scale synthetic video dataset and a real-world one,
both containing human sequences with various clothing changes. We will release
the benchmark and code of this work to the public.
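A minimal sketch of the confidence-guided fusion idea (the confidence estimator itself is a placeholder, not the paper's): appearance distances dominate when appearance cues look reliable, and gait distances take over otherwise:

```python
import numpy as np

def fuse_distances(d_app, d_gait, app_conf):
    """d_app, d_gait: (Q, G) query-gallery distance matrices from the two
    branches; app_conf: (Q, G) in [0, 1], placeholder confidence that
    appearance cues are reliable (e.g., low under clothing changes)."""
    return app_conf * d_app + (1.0 - app_conf) * d_gait

# Re-ranking: sort gallery items by the fused distance for each query.
# ranks = np.argsort(fuse_distances(d_app, d_gait, conf), axis=1)
```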
Decompressing Dilithium's Public Key with Fewer Signatures Using Side Channel Analysis
The CRYSTALS-Dilithium digital signature scheme, selected by NIST as a post-quantum cryptography (PQC) standard under the name ML-DSA, employs a public key compression technique intended for performance optimization. Specifically, the module learning with errors instance is compressed by omitting the low-order bits t0 of the vector t. It was recently shown that knowledge of t0 enables more effective side-channel attacks on Dilithium implementations. Another recent work demonstrated a method for reconstructing t0 from multiple signatures. In this paper, we build on this method by applying profiled deep learning-assisted side-channel analysis to partially recover the least significant bit of t0 from power traces. As a result, the number of signatures required for the reconstruction of t0 can be reduced by roughly half. We demonstrate how the new reconstruction method enhances the efficiency of recovering the secret key component, and thus facilitates digital signature forgery, on an ARM Cortex-M4 implementation of Dilithium.
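For reference, the compression in question is the Power2Round step of the Dilithium/ML-DSA specification (d = 13): each coefficient of t is split into published high bits t1 and omitted low bits t0. A minimal sketch following the public specification (not the paper's attack code):

```python
Q = 8380417   # Dilithium modulus
D = 13        # number of dropped low-order bits

def power2round(r, d=D, q=Q):
    """Split r into (r1, r0) with r == r1 * 2**d + r0 (mod q),
    where r0 is the centered remainder in (-2**(d-1), 2**(d-1)]."""
    r = r % q
    r0 = r % (1 << d)
    if r0 > (1 << (d - 1)):
        r0 -= 1 << d
    r1 = (r - r0) >> d
    return r1, r0

# Sanity check on a few coefficients:
for r in (0, 1, 4095, 4097, Q - 1):
    r1, r0 = power2round(r)
    assert ((r1 << D) + r0) % Q == r % Q
```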
Side-Channel Analysis of Saber KEM Using Amplitude-Modulated EM Emanations
In the ongoing last round of NIST's post-quantum cryptography standardization competition, side-channel analysis of the finalists is a main focus of attention. While their resistance to timing, power, and near-field electromagnetic (EM) side channels has been thoroughly investigated, amplitude-modulated EM emanations have not been considered so far. Attacks based on amplitude-modulated EM emanations are more stealthy because they exploit side channels intertwined with the signal transmitted by an on-chip antenna, and thus they can be mounted at a distance from the device under attack. In this paper, we present the first results of an amplitude-modulated EM side-channel analysis of one of the NIST PQ finalists, the Saber key encapsulation mechanism (KEM), implemented on the nRF52832 (ARM Cortex-M4) system-on-chip supporting Bluetooth 5. By capturing amplitude-modulated EM emanations during decapsulation, we can recover each bit of the session key with 0.91 probability on average.
OmniDrones: An Efficient and Flexible Platform for Reinforcement Learning in Drone Control
In this work, we introduce OmniDrones, an efficient and flexible platform
tailored for reinforcement learning in drone control, built on Nvidia's
Omniverse Isaac Sim. It employs a bottom-up design approach that allows users
to easily design and experiment with various application scenarios on top of
GPU-parallelized simulations. It also offers a range of benchmark tasks,
presenting challenges ranging from single-drone hovering to over-actuated
system tracking. In summary, we propose an open-source drone simulation
platform, equipped with an extensive suite of tools for drone learning. It
includes 4 drone models, 5 sensor modalities, 4 control modes, over 10
benchmark tasks, and a selection of widely used RL baselines. To showcase the
capabilities of OmniDrones and to support future research, we also provide
preliminary results on these benchmark tasks. We hope this platform will
encourage further studies on applying RL to practical drone systems.
Comment: Submitted to IEEE RA-L.
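To illustrate the kind of workflow such a platform targets, below is a generic GPU-batched rollout loop; the environment's reset/step interface here is a hypothetical stand-in, not OmniDrones' actual API (see its repository for the real one):

```python
def rollout(env, policy, horizon=128):
    """Collect one batch of experience from N parallel drone environments.

    Hypothetical interface: env.reset() -> (N, obs_dim) observations;
    env.step(action) -> (obs, reward, done, info), all batched on the GPU.
    """
    obs = env.reset()
    trajectory = []
    for _ in range(horizon):
        action = policy(obs)                 # (N, act_dim), e.g. rotor commands
        obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward, done))
    return trajectory
```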
Robust Collaborative Perception without External Localization and Clock Devices
A consistent spatial-temporal coordination across multiple agents is
fundamental for collaborative perception, which seeks to improve perception
abilities through information exchange among agents. To achieve this
spatial-temporal alignment, traditional methods depend on external devices to
provide localization and clock signals. However, hardware-generated signals
could be vulnerable to noise and potential malicious attacks, jeopardizing the
precision of spatial-temporal alignment. Rather than relying on external
hardware, this work proposes a novel approach: aligning by recognizing the
inherent geometric patterns within the perceptual data of various agents.
Following this spirit, we propose a robust collaborative perception system that
operates independently of external localization and clock devices. The key
module of our system, FreeAlign, constructs a salient object graph for
each agent based on its detected boxes and uses a graph neural network to
identify common subgraphs between agents, leading to accurate relative pose and
time. We validate FreeAlign on both real-world and simulated datasets.
The results show that the FreeAlign-empowered robust collaborative
perception system performs comparably to systems relying on precise
localization and clock devices.
Comment: 6 pages, accepted to ICRA 2024.
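A simplified stand-in for the geometric intuition (the paper matches subgraphs with a GNN, not with this heuristic): pairwise distances between detected box centers are invariant to each agent's unknown pose, so objects can be tentatively matched across agents by comparing nearest-neighbor distance signatures, after which a relative pose can be estimated from the matched pairs:

```python
import numpy as np

def distance_signature(centers, k=3):
    """Per object: sorted distances to its k nearest neighbors (pose-invariant).
    centers: (N, 2) detected box centers, with N > k."""
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    return np.sort(d, axis=1)[:, 1:k + 1]   # drop the zero self-distance

def match_objects(centers_a, centers_b, k=3):
    """Greedy match: for each object of agent A, pick the agent-B object
    whose distance signature is closest."""
    sig_a = distance_signature(centers_a, k)
    sig_b = distance_signature(centers_b, k)
    cost = np.linalg.norm(sig_a[:, None] - sig_b[None, :], axis=-1)
    return cost.argmin(axis=1)
```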
