MegDet: A Large Mini-Batch Object Detector
The improvements in recent CNN-based object detection works, from R-CNN [11],
Fast/Faster R-CNN [10, 31] to the recent Mask R-CNN [14] and RetinaNet [24], mainly
come from new network architectures, new frameworks, or novel loss designs. But mini-batch
size, a key factor in training, has not been well studied. In this paper,
we propose a Large Mini-Batch Object Detector (MegDet) to enable training
with a much larger mini-batch size than before (e.g., from 16 to 256), so that we
can effectively utilize multiple GPUs (up to 128 in our experiments) to
significantly shorten the training time. Technically, we suggest a learning
rate policy and Cross-GPU Batch Normalization, which together allow us to
successfully train a large mini-batch detector in much less time (e.g., from 33
hours to 4 hours), and achieve even better accuracy. MegDet is the backbone
of our submission (mmAP 52.5%) to the COCO 2017 Challenge, where we won 1st
place in the Detection task.
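As a hedged illustration of the large mini-batch recipe sketched above, the Python snippet below implements the linear learning-rate scaling rule with a linear warmup ramp; the base learning rate, reference batch size, and warmup length are assumed values, not the paper's exact schedule.

    # Hypothetical sketch of linear learning-rate scaling with warmup for
    # large mini-batch training; the numbers below are illustrative assumptions.

    def scaled_lr(base_lr, base_batch, batch_size):
        """Linear scaling rule: the learning rate grows with the mini-batch size."""
        return base_lr * batch_size / base_batch

    def warmup_lr(target_lr, step, warmup_steps):
        """Ramp the learning rate linearly from ~0 to target_lr over warmup_steps."""
        if step >= warmup_steps:
            return target_lr
        return target_lr * (step + 1) / warmup_steps

    if __name__ == "__main__":
        base_lr, base_batch = 0.02, 16      # assumed reference setting
        big_batch = 256                     # large mini-batch spread over many GPUs
        target = scaled_lr(base_lr, base_batch, big_batch)   # -> 0.32
        for step in (0, 250, 499, 500, 2000):
            print(step, round(warmup_lr(target, step, warmup_steps=500), 4))

For the cross-GPU statistics, frameworks such as PyTorch provide torch.nn.SyncBatchNorm.convert_sync_batchnorm to synchronize batch-normalization statistics across devices, which serves a role similar to the Cross-GPU Batch Normalization described in the abstract.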
Boosting Jailbreak Attack with Momentum
Large Language Models (LLMs) have achieved remarkable success across diverse
tasks, yet they remain vulnerable to adversarial attacks, notably the
well-documented \textit{jailbreak} attack. Recently, the Greedy Coordinate
Gradient (GCG) attack has demonstrated efficacy in exploiting this
vulnerability by optimizing adversarial prompts through a combination of
gradient heuristics and greedy search. However, the efficiency of this attack
has become a bottleneck in the attacking process. To mitigate this limitation,
in this paper we rethink the generation of adversarial prompts through an
optimization lens, aiming to stabilize the optimization process and harness
more heuristic insights from previous iterations. Specifically, we introduce
the \textbf{M}omentum \textbf{A}ccelerated G\textbf{C}G (\textbf{MAC}) attack,
which incorporates a momentum term into the gradient heuristic. Experimental
results showcase the notable enhancement achieved by MAC in gradient-based
attacks on aligned language models. Our code is available at
https://github.com/weizeming/momentum-attack-llm.
Comment: ICLR 2024 Workshop on Reliable and Responsible Foundation Models
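A minimal sketch of the momentum idea, assuming the token-level gradient from each iteration is available as a vector over the vocabulary; the random toy gradients, function names, and decay factor are illustrative, not the released implementation.

    import numpy as np

    # Sketch of a momentum-accumulated gradient heuristic in the spirit of MAC:
    # gradients from successive iterations are blended with a decay factor
    # before the greedy top-k candidate selection step.

    def momentum_update(prev_momentum, grad, mu=0.9):
        """Accumulate gradients across iterations: m_t = mu * m_{t-1} + g_t."""
        return mu * prev_momentum + grad

    def top_k_candidates(momentum_grad, k=5):
        """Return the k token ids with the most negative accumulated gradient,
        i.e. the substitutions expected to decrease the loss the most."""
        return np.argsort(momentum_grad)[:k]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        vocab_size = 50
        m = np.zeros(vocab_size)
        for step in range(3):
            g = rng.normal(size=vocab_size)   # stand-in for the real token gradient
            m = momentum_update(m, g, mu=0.9)
            print("step", step, "candidates", top_k_candidates(m, k=5))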
Automata Extraction from Transformers
In modern machine learning (ML) systems, Transformer-based architectures have
achieved milestone success across a broad spectrum of tasks, yet understanding
their operational mechanisms remains an open problem. To improve the
transparency of ML systems, automata extraction methods, which interpret
stateful ML models as automata typically through formal languages, have proven
effective for explaining the mechanism of recurrent neural networks (RNNs).
However, few works have applied this paradigm to Transformer models. In
particular, their processing of formal languages and their limitations in this
area remain largely unexplored. In this paper, we propose an
automata extraction algorithm specifically designed for Transformer models.
Treating the Transformer model as a black-box system, we track the
transformation of its internal latent representations during operation, and
then use classical pedagogical approaches such as the L* algorithm to interpret
them as deterministic finite-state automata (DFA).
Overall, our study reveals how the Transformer model comprehends the structure
of formal languages, which not only enhances the interpretability of the
Transformer-based ML systems but also marks a crucial step toward a deeper
understanding of how ML systems process formal languages. Code and data are
available at https://github.com/Zhang-Yihao/Transfomer2DFA.
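A minimal sketch of the black-box query interface that L*-style extraction assumes: membership queries are answered by the model, and equivalence queries are approximated by random sampling. The parity-checking stub below stands in for a real Transformer and is purely illustrative, as are the alphabet and sampling budget.

    import random

    ALPHABET = "ab"

    def transformer_accepts(word):
        """Stand-in for the black-box model: accepts words with an even number of 'a'."""
        return word.count("a") % 2 == 0

    def membership_query(word):
        """L* membership query, answered by the black-box model."""
        return transformer_accepts(word)

    def equivalence_query(hypothesis_accepts, max_len=8, samples=2000):
        """Approximate equivalence query: sample random words for a counterexample."""
        for _ in range(samples):
            length = random.randint(0, max_len)
            word = "".join(random.choice(ALPHABET) for _ in range(length))
            if hypothesis_accepts(word) != membership_query(word):
                return word   # counterexample used to refine the hypothesis DFA
        return None           # no disagreement found; accept the hypothesis

    if __name__ == "__main__":
        # A two-state hypothesis DFA tracking the parity of 'a'.
        def hypothesis(word):
            state = 0
            for ch in word:
                state ^= (ch == "a")
            return state == 0
        print("counterexample:", equivalence_query(hypothesis))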
GPU-Accelerated Optimization-Based Collision Avoidance
This paper proposes a GPU-accelerated optimization framework for collision
avoidance problems where the controlled objects and the obstacles can be
modeled as the finite union of convex polyhedra. A novel collision avoidance
constraint is proposed based on scale-based collision detection and the strong
duality of convex optimization. Under this constraint, the high-dimensional
non-convex optimization problems of collision avoidance can be decomposed into
several low-dimensional quadratic programs (QPs) following the paradigm of the
alternating direction method of multipliers (ADMM). Furthermore, these
low-dimensional QPs can be solved in parallel on GPUs, significantly reducing
computational time. High-fidelity simulations are conducted to validate the
proposed method's effectiveness and practicality.
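As a hedged illustration of the ADMM-style decomposition, the sketch below solves a generic box-constrained least-squares problem (not the paper's collision-avoidance formulation): each iteration reduces to a small linear solve plus an elementwise projection, exactly the kind of low-dimensional subproblem that can be batched across GPU threads.

    import numpy as np

    # ADMM split for: minimize 0.5 * ||D x - y||^2  subject to  lo <= x <= hi,
    # written as f(x) + g(z) with the consensus constraint x = z.
    # Problem data are random toys; in the paper's setting each subproblem
    # would be one of the low-dimensional QPs solved in parallel.

    rng = np.random.default_rng(0)
    m, n, rho = 20, 5, 1.0
    D, y = rng.normal(size=(m, n)), rng.normal(size=m)
    lo, hi = -0.5 * np.ones(n), 0.5 * np.ones(n)

    x = z = dual = np.zeros(n)
    lhs = D.T @ D + rho * np.eye(n)                            # constant across iterations
    for _ in range(100):
        x = np.linalg.solve(lhs, D.T @ y + rho * (z - dual))   # x-update (QP / linear solve)
        z = np.clip(x + dual, lo, hi)                          # z-update (projection)
        dual = dual + x - z                                    # scaled dual update

    print("solution:", np.round(z, 3))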
Multi3DRefer: Grounding Text Description to Multiple 3D Objects
We introduce the task of localizing a flexible number of objects in
real-world 3D scenes using natural language descriptions. Existing 3D visual
grounding tasks focus on localizing a unique object given a text description.
However, such a strict setting is unnatural as localizing potentially multiple
objects is a common need in real-world scenarios and robotic tasks (e.g.,
visual navigation and object rearrangement). To address this setting, we propose
Multi3DRefer, which generalizes the ScanRefer dataset and task. Our dataset contains
61,926 descriptions of 11,609 objects, where zero, one, or multiple target
objects are referenced by each description. We also introduce a new evaluation
metric and benchmark methods from prior work to enable further investigation of
multi-modal 3D scene understanding. Furthermore, we develop a better baseline
leveraging 2D features from CLIP by rendering object proposals online with
contrastive learning, which outperforms the state of the art on the ScanRefer
benchmark.
Comment: ICCV 2023
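One way to score a variable number of referenced objects is an F1-style match between predicted and ground-truth boxes at an IoU threshold, which also covers the zero-target case; the axis-aligned 3D boxes and greedy matching below are illustrative assumptions, not the benchmark's official evaluator.

    import numpy as np

    # F1 at an IoU threshold for a flexible number of targets. Boxes are
    # axis-aligned: (xmin, ymin, zmin, xmax, ymax, zmax). Illustrative only.

    def iou_3d(a, b):
        low = np.maximum(a[:3], b[:3])
        high = np.minimum(a[3:], b[3:])
        inter = np.prod(np.clip(high - low, 0, None))
        vol_a = np.prod(a[3:] - a[:3])
        vol_b = np.prod(b[3:] - b[:3])
        return inter / (vol_a + vol_b - inter + 1e-9)

    def f1_at_iou(pred_boxes, gt_boxes, thr=0.5):
        if len(pred_boxes) == 0 and len(gt_boxes) == 0:
            return 1.0                      # correctly predicting "no target"
        matched, tp = set(), 0
        for p in pred_boxes:
            best, best_iou = None, thr
            for j, g in enumerate(gt_boxes):
                if j in matched:
                    continue
                iou = iou_3d(np.asarray(p, float), np.asarray(g, float))
                if iou >= best_iou:
                    best, best_iou = j, iou
            if best is not None:
                matched.add(best)
                tp += 1
        prec = tp / max(len(pred_boxes), 1)
        rec = tp / max(len(gt_boxes), 1)
        return 0.0 if tp == 0 else 2 * prec * rec / (prec + rec)

    if __name__ == "__main__":
        gt = [[0, 0, 0, 1, 1, 1], [2, 0, 0, 3, 1, 1]]
        pred = [[0.1, 0, 0, 1.1, 1, 1]]
        print(f1_at_iou(pred, gt))          # one of two targets found -> ~0.667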
MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception
This paper proposes an efficient multi-camera to Bird's-Eye-View (BEV) view
transformation method for 3D perception, dubbed MatrixVT. Existing view
transformers either suffer from poor transformation efficiency or rely on
device-specific operators, hindering the broad application of BEV models. In
contrast, our method generates BEV features efficiently with only convolutions
and matrix multiplications (MatMul). Specifically, we propose describing the
BEV feature as the MatMul of image feature and a sparse Feature Transporting
Matrix (FTM). A Prime Extraction module is then introduced to compress the
dimension of image features and reduce FTM's sparsity. Moreover, we propose the
Ring \& Ray Decomposition to replace the FTM with two matrices and reformulate
our pipeline to reduce calculation further. Compared to existing methods,
MatrixVT enjoys a faster speed and less memory footprint while remaining
deploy-friendly. Extensive experiments on the nuScenes benchmark demonstrate
that our method is highly efficient while achieving results on par with the SOTA
method on the object detection and map segmentation tasks.
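The core view-transformation step can be sketched as a single matrix multiplication between flattened image features and a Feature Transporting Matrix (sparse in practice); the shapes and random toy geometry below are assumptions for illustration, not the paper's FTM construction or its Ring & Ray decomposition.

    import numpy as np

    # BEV feature as one MatMul: the FTM (N_bev x N_img) maps flattened
    # image-feature positions to BEV grid cells. The FTM here is filled
    # with toy geometry rather than real camera parameters.

    C, H_img, W_img = 64, 16, 44          # channels, image-feature height/width
    H_bev, W_bev = 128, 128               # BEV grid resolution

    rng = np.random.default_rng(0)
    img_feat = rng.normal(size=(C, H_img * W_img))        # (C, N_img)
    ftm = np.zeros((H_bev * W_bev, H_img * W_img))        # (N_bev, N_img), sparse in practice
    for col in range(H_img * W_img):
        cells = rng.choice(H_bev * W_bev, size=4, replace=False)
        ftm[cells, col] = 0.25            # scatter each image column into a few BEV cells

    # A single MatMul realizes the view transformation.
    bev_feat = (ftm @ img_feat.T).T.reshape(C, H_bev, W_bev)
    print(bev_feat.shape)                 # (64, 128, 128)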
- …
