337 research outputs found
From Indoor To Outdoor: Unsupervised Domain Adaptive Gait Recognition
Gait recognition is an important AI task that has progressed rapidly
with the development of deep learning. However, existing learning-based gait
recognition methods mainly focus on a single domain, especially the
constrained laboratory environment. In this paper, we study the new problem of
unsupervised domain adaptive gait recognition (UDA-GR), in which a gait
identifier is learned with supervised labels from indoor scenes (source domain)
and applied to outdoor wild scenes (target domain). For this purpose, we
develop an uncertainty estimation and regularization based UDA-GR method.
Specifically, we investigate the characteristics of gaits in indoor and
outdoor scenes to estimate gait sample uncertainty, which is used during
unsupervised fine-tuning on the target domain to alleviate the noise of the
pseudo labels. We also establish a new benchmark for the proposed problem, and
experimental results on it show the effectiveness of the proposed method. We
will release the benchmark and source code of this work to the public.
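As a rough sketch of the idea (not the paper's actual estimator), the uncertainty scores can be used to down-weight noisy pseudo labels during target-domain fine-tuning; the model, pseudo labels, and uncertainty values below are placeholders:

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_loss(logits, pseudo_labels, uncertainty):
    """Cross-entropy on pseudo labels, down-weighted by per-sample uncertainty.

    logits:        (N, C) classifier outputs on target-domain gait samples
    pseudo_labels: (N,)   labels assigned by the source-trained model
    uncertainty:   (N,)   scores in [0, 1]; higher means a noisier pseudo label
    """
    per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")
    weights = 1.0 - uncertainty  # trust low-uncertainty samples more
    return (weights * per_sample).sum() / weights.sum().clamp(min=1e-8)
```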
Unveiling the Power of Self-supervision for Multi-view Multi-human Association and Tracking
Multi-view multi-human association and tracking (MvMHAT) is a new but
important problem for multi-person video surveillance. It aims to track a
group of people over time in each view and to identify the same person
across different views at the same time, which differs from previous MOT
and multi-camera MOT tasks that consider only over-time human tracking. As a
result, videos for MvMHAT require more complex annotations while containing
more information for self-learning. In this work, we tackle this problem with a
self-supervised-learning-aware end-to-end network. Specifically, we propose to
take advantage of the spatial-temporal self-consistency rationale by
considering three properties of reflexivity, symmetry and transitivity. Besides
the reflexivity property that naturally holds, we design the self-supervised
learning losses based on the properties of symmetry and transitivity, for both
appearance feature learning and assignment matrix optimization, to associate
the multiple humans over time and across views. Furthermore, to promote the
research on MvMHAT, we build two new large-scale benchmarks for the network
training and testing of different algorithms. Extensive experiments on the
proposed benchmarks verify the effectiveness of our method. We have released
the benchmark and code to the public.
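As an illustrative sketch (the actual loss design is in the paper), the symmetry and transitivity properties translate into consistency constraints on soft assignment matrices between views: matching view i to j should be the transpose of matching j to i, and composing i-to-j with j-to-k should agree with the direct i-to-k assignment. The matrices here are placeholders produced by any differentiable matcher:

```python
import torch

def symmetry_loss(A_ij, A_ji):
    # i->j and j->i assignments should be mutual transposes.
    return torch.mean((A_ij - A_ji.transpose(-1, -2)) ** 2)

def transitivity_loss(A_ij, A_jk, A_ik):
    # Going i->j->k should agree with the direct i->k assignment.
    return torch.mean((A_ij @ A_jk - A_ik) ** 2)
```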
Combining the Silhouette and Skeleton Data for Gait Recognition
Gait recognition, a promising long-distance biometric technology, has aroused
intense interest in computer vision. Existing works on gait recognition can be
divided into appearance-based methods and model-based methods, which extract
features from silhouettes and skeleton data, respectively. However, since
appearance-based methods are greatly affected by clothing changes and carrying
conditions, and model-based methods are limited by the accuracy of pose
estimation approaches, gait recognition remains challenging in practical
applications. In order to integrate the advantages of these two approaches, a
two-branch neural network (NN) is proposed in this paper. Our method contains
two branches, namely a CNN-based branch taking silhouettes as input and a
GCN-based branch taking skeletons as input. In addition, two new modules are
proposed in the GCN-based branch for better gait representation. First, we
present a simple yet effective fully connected graph convolution operator to
integrate the multi-scale graph convolutions and alleviate the dependence on
natural human joint connections. Second, we deploy a multi-dimension attention
module named STC-Att to learn spatial, temporal and channel-wise attention
simultaneously. We evaluated the proposed two-branch neural network on the
CASIA-B dataset. The experimental results show that our method achieves
state-of-the-art performance in various conditions.
Comment: The paper is under consideration at Computer Vision and Image Understanding.
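A minimal structural sketch of such a two-branch design follows; the encoder internals are placeholders and do not reproduce the paper's fully connected graph convolution or STC-Att modules:

```python
import torch
import torch.nn as nn

class TwoBranchGait(nn.Module):
    """Fuses a silhouette (CNN) branch and a skeleton branch into one embedding."""
    def __init__(self, sil_dim=256, skel_dim=256, num_joints=17):
        super().__init__()
        self.cnn_branch = nn.Sequential(            # stand-in silhouette encoder
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, sil_dim))
        self.skel_branch = nn.Linear(num_joints * 3, skel_dim)  # stand-in for the GCN

    def forward(self, silhouette, skeleton):
        # silhouette: (N, 1, H, W); skeleton: (N, num_joints, 3) as (x, y, conf)
        f_sil = self.cnn_branch(silhouette)
        f_skel = self.skel_branch(skeleton.flatten(1))
        return torch.cat([f_sil, f_skel], dim=1)    # fused gait representation
```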
ComCLIP: Training-Free Compositional Image and Text Matching
Contrastive Language-Image Pretraining (CLIP) has demonstrated great
zero-shot performance for image-text matching because of its holistic use of
natural language supervision that covers large-scale, open-world visual
concepts. However, it is still challenging to adapt CLIP to compositional image
and text matching -- a more challenging matching task that requires the model
to understand compositional word concepts and visual components.
Towards better compositional generalization in zero-shot image and text
matching, in this paper, we study the problem from a causal perspective: the
erroneous semantics of individual entities are essentially confounders that
cause the matching failure. Therefore, we propose a novel training-free
compositional CLIP model (ComCLIP). ComCLIP disentangles input images into
subjects, objects, and action sub-images and composes CLIP's vision encoder and
text encoder to perform evolving matching over compositional text embedding and
sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations
introduced by the pretrained CLIP models and dynamically assess the
contribution of each entity when performing image and text matching.
Experiments on compositional image-text matching on SVO and ComVG and general
image-text retrieval on Flickr8K demonstrate the effectiveness of our
plug-and-play method, which boosts the zero-shot inference ability of CLIP even
without further training or fine-tuning of CLIP.
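A hedged sketch of the scoring flow: encode the full image plus the subject/object/action sub-images, encode the caption, and aggregate cosine similarities. `encode_image` and `encode_text` are assumed CLIP-style callables, and the weighting below is a simplified stand-in for the paper's evolving matching; the disentangling step that produces the sub-images is not shown:

```python
import torch
import torch.nn.functional as F

def compositional_score(encode_image, encode_text, full_img, sub_imgs, caption):
    txt = F.normalize(encode_text(caption), dim=-1)    # (1, D) text embedding
    imgs = torch.stack([full_img, *sub_imgs])          # (1 + K, C, H, W)
    embs = F.normalize(encode_image(imgs), dim=-1)     # (1 + K, D)
    sims = (embs @ txt.t()).squeeze(-1)                # cosine similarities
    weights = torch.softmax(sims, dim=0)               # emphasize well-matching parts
    return (weights * sims).sum()                      # aggregated matching score
```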
Nearest Neighbor Machine Translation is Meta-Optimizer on Output Projection Layer
Nearest Neighbor Machine Translation (kNN-MT) has achieved great success in
domain adaptation tasks by integrating pre-trained Neural Machine Translation
(NMT) models with domain-specific token-level retrieval. However, the reasons
underlying its success have not been thoroughly investigated. In this paper, we
comprehensively analyze kNN-MT through theoretical and empirical studies.
Initially, we provide new insights into the working mechanism of kNN-MT as an
efficient technique to implicitly execute gradient descent on the output
projection layer of NMT, indicating that it is a specific case of model
fine-tuning. Subsequently, we conduct multi-domain experiments and word-level
analysis to examine the differences in performance between kNN-MT and
entire-model fine-tuning. Our findings suggest that: (1) Incorporating kNN-MT
with adapters yields comparable translation performance to fine-tuning on
in-domain test sets, while achieving better performance on out-of-domain test
sets; (2) Fine-tuning significantly outperforms kNN-MT on the recall of
in-domain low-frequency words, but this gap could be bridged by optimizing the
context representations with additional adapter layers.
Comment: Accepted by EMNLP 2023.
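For context, the standard kNN-MT decoding step interpolates a retrieval-based distribution with the NMT softmax; the paper's contribution is showing that this implicitly performs gradient descent on the output projection layer. A minimal sketch, with datastore retrieval (e.g., over decoder hidden states) abstracted into precomputed distances and token ids:

```python
import torch

def knn_mt_probs(nmt_probs, knn_dists, knn_tokens, vocab_size,
                 temperature=10.0, lam=0.5):
    # nmt_probs:  (V,)  base NMT distribution over the vocabulary
    # knn_dists:  (k,)  distances of the k retrieved datastore neighbors
    # knn_tokens: (k,)  long tensor of target-token ids stored with them
    knn_weights = torch.softmax(-knn_dists / temperature, dim=0)
    knn_probs = torch.zeros(vocab_size).scatter_add_(0, knn_tokens, knn_weights)
    return lam * knn_probs + (1.0 - lam) * nmt_probs   # interpolated distribution
```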
A Benchmark of Video-Based Clothes-Changing Person Re-Identification
Person re-identification (Re-ID) is a classical computer vision task and has
achieved great progress so far. Recently, long-term Re-ID with clothes-changing
has attracted increasing attention. However, existing methods mainly focus on
the image-based setting, where richer temporal information is overlooked. In this
paper, we focus on the relatively new yet practical problem of clothes-changing
video-based person re-identification (CCVReID), which is less studied. We
systematically study this problem by simultaneously considering both the
challenge of clothes inconsistency and the temporal information contained in
the video sequence. Based on this, we develop a
two-branch confidence-aware re-ranking framework for handling the CCVReID
problem. The proposed framework integrates two branches that consider both the
classical appearance features and cloth-free gait features through a
confidence-guided re-ranking strategy. This method provides a baseline for
further studies. Also, we build two new benchmark datasets for the CCVReID
problem, including a large-scale synthetic video dataset and a real-world one,
both containing human sequences with various clothing changes. We will release
the benchmark and code of this work to the public.
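A minimal sketch of the confidence-guided fusion idea (the confidence estimator itself is a placeholder, not the paper's): appearance distances dominate when appearance cues look reliable, and gait distances take over otherwise:

```python
import numpy as np

def fuse_distances(d_app, d_gait, app_conf):
    """d_app, d_gait: (Q, G) query-gallery distance matrices from the two
    branches; app_conf: (Q, G) in [0, 1], placeholder confidence that
    appearance cues are reliable (e.g., low under clothing changes)."""
    return app_conf * d_app + (1.0 - app_conf) * d_gait

# Re-ranking: sort gallery items by the fused distance for each query.
# ranks = np.argsort(fuse_distances(d_app, d_gait, conf), axis=1)
```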
Decompressing Dilithium's Public Key with Fewer Signatures Using Side Channel Analysis
The CRYSTALS-Dilithium digital signature scheme, selected by NIST as a post-quantum cryptography (PQC) standard under the name ML-DSA, employs a public key compression technique intended for performance optimization. Specifically, the module learning with errors instance is compressed by omitting the low-order bits t0 of the vector t. It was recently shown that knowledge of t0 enables more effective side-channel attacks on Dilithium implementations. Another recent work demonstrated a method for reconstructing t0 from multiple signatures. In this paper, we build on this method by applying profiled deep learning-assisted side-channel analysis to partially recover the least significant bit of t0 from power traces. As a result, the number of signatures required for the reconstruction of t0 can be reduced by roughly half. We demonstrate how the new reconstruction method enhances the efficiency of recovering the secret key component, and thus facilitates digital signature forgery, on an ARM Cortex-M4 implementation of Dilithium.
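For reference, the compression in question is the Power2Round step of the Dilithium/ML-DSA specification (d = 13): each coefficient of t is split into published high bits t1 and omitted low bits t0. A minimal sketch following the public specification (not the paper's attack code):

```python
Q = 8380417   # Dilithium modulus
D = 13        # number of dropped low-order bits

def power2round(r, d=D, q=Q):
    """Split r into (r1, r0) with r == r1 * 2**d + r0 (mod q),
    where r0 is the centered remainder in (-2**(d-1), 2**(d-1)]."""
    r = r % q
    r0 = r % (1 << d)
    if r0 > (1 << (d - 1)):
        r0 -= 1 << d
    r1 = (r - r0) >> d
    return r1, r0

# Sanity check on a few coefficients:
for r in (0, 1, 4095, 4097, Q - 1):
    r1, r0 = power2round(r)
    assert ((r1 << D) + r0) % Q == r % Q
```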
Side-Channel Analysis of Saber KEM Using Amplitude-Modulated EM Emanations
In the ongoing last round of NIST's post-quantum cryptography standardization competition, side-channel analysis of the finalists is a main focus of attention. While their resistance to timing, power, and near-field electromagnetic (EM) side channels has been thoroughly investigated, amplitude-modulated EM emanations have not been considered so far. Attacks based on amplitude-modulated EM emanations are more stealthy because they exploit side channels intertwined with the signal transmitted by an on-chip antenna, and thus they can be mounted at a distance from the device under attack. In this paper, we present the first results of an amplitude-modulated EM side-channel analysis of one of the NIST PQ finalists, the Saber key encapsulation mechanism (KEM), implemented on the nRF52832 (ARM Cortex-M4) system-on-chip supporting Bluetooth 5. By capturing amplitude-modulated EM emanations during decapsulation, we can recover each bit of the session key with 0.91 probability on average.
OmniDrones: An Efficient and Flexible Platform for Reinforcement Learning in Drone Control
In this work, we introduce OmniDrones, an efficient and flexible platform
tailored for reinforcement learning in drone control, built on Nvidia's
Omniverse Isaac Sim. It employs a bottom-up design approach that allows users
to easily design and experiment with various application scenarios on top of
GPU-parallelized simulations. It also offers a range of benchmark tasks,
presenting challenges ranging from single-drone hovering to over-actuated
system tracking. In summary, we propose an open-source drone simulation
platform, equipped with an extensive suite of tools for drone learning. It
includes 4 drone models, 5 sensor modalities, 4 control modes, over 10
benchmark tasks, and a selection of widely used RL baselines. To showcase the
capabilities of OmniDrones and to support future research, we also provide
preliminary results on these benchmark tasks. We hope this platform will
encourage further studies on applying RL to practical drone systems.
Comment: Submitted to IEEE RA-L.
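To illustrate the kind of workflow such a platform targets, below is a generic GPU-batched rollout loop; the environment's reset/step interface here is a hypothetical stand-in, not OmniDrones' actual API (see its repository for the real one):

```python
def rollout(env, policy, horizon=128):
    """Collect one batch of experience from N parallel drone environments.

    Hypothetical interface: env.reset() -> (N, obs_dim) observations;
    env.step(action) -> (obs, reward, done, info), all batched on the GPU.
    """
    obs = env.reset()
    trajectory = []
    for _ in range(horizon):
        action = policy(obs)                 # (N, act_dim), e.g. rotor commands
        obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward, done))
    return trajectory
```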
Robust Collaborative Perception without External Localization and Clock Devices
A consistent spatial-temporal coordination across multiple agents is
fundamental for collaborative perception, which seeks to improve perception
abilities through information exchange among agents. To achieve this
spatial-temporal alignment, traditional methods depend on external devices to
provide localization and clock signals. However, hardware-generated signals
could be vulnerable to noise and potential malicious attacks, jeopardizing the
precision of spatial-temporal alignment. Rather than relying on external
hardware, this work proposes a novel approach: aligning by recognizing the
inherent geometric patterns within the perceptual data of various agents.
Following this spirit, we propose a robust collaborative perception system that
operates independently of external localization and clock devices. The key
module of our system, FreeAlign, constructs a salient object graph for
each agent based on its detected boxes and uses a graph neural network to
identify common subgraphs between agents, leading to accurate relative pose and
time. We validate FreeAlign on both real-world and simulated datasets.
The results show that the FreeAlign-empowered robust collaborative
perception system performs comparably to systems relying on precise
localization and clock devices.
Comment: 6 pages, accepted to ICRA 2024.
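A simplified stand-in for the geometric intuition (the paper matches subgraphs with a GNN, not with this heuristic): pairwise distances between detected box centers are invariant to each agent's unknown pose, so objects can be tentatively matched across agents by comparing nearest-neighbor distance signatures, after which a relative pose can be estimated from the matched pairs:

```python
import numpy as np

def distance_signature(centers, k=3):
    """Per object: sorted distances to its k nearest neighbors (pose-invariant).
    centers: (N, 2) detected box centers, with N > k."""
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    return np.sort(d, axis=1)[:, 1:k + 1]   # drop the zero self-distance

def match_objects(centers_a, centers_b, k=3):
    """Greedy match: for each object of agent A, pick the agent-B object
    whose distance signature is closest."""
    sig_a = distance_signature(centers_a, k)
    sig_b = distance_signature(centers_b, k)
    cost = np.linalg.norm(sig_a[:, None] - sig_b[None, :], axis=-1)
    return cost.argmin(axis=1)
```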
