FedDisco: Federated Learning with Discrepancy-Aware Collaboration
This work considers the category distribution heterogeneity in federated
learning. This issue is due to biased labeling preferences at multiple clients
and is a typical setting of data heterogeneity. To alleviate this issue, most
previous works consider either regularizing local models or fine-tuning the
global model, while ignoring the adjustment of aggregation weights and
simply assigning weights based on dataset size. However, based on our
empirical observations and theoretical analysis, we find that the dataset size
is not optimal and the discrepancy between local and global category
distributions could be a beneficial and complementary indicator for determining
aggregation weights. We thus propose a novel aggregation method, Federated
Learning with Discrepancy-aware Collaboration (FedDisco), whose aggregation
weights not only involve both the dataset size and the discrepancy value, but
also contribute to a tighter theoretical upper bound of the optimization error.
FedDisco also promotes privacy-preservation, communication and computation
efficiency, as well as modularity. Extensive experiments show that our FedDisco
outperforms several state-of-the-art methods and can be easily incorporated
with many existing methods to further enhance the performance. Our code will be
available at https://github.com/MediaBrain-SJTU/FedDisco.
Comment: Accepted by International Conference on Machine Learning (ICML 2023).
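The weighting idea in the abstract can be sketched in a few lines. The combination rule below (normalized size minus a scaled discrepancy, clipped at zero and renormalized) is an illustrative guess at the mechanism, not the paper's exact formula:

```python
import numpy as np

def disco_weights(client_sizes, client_label_dists, global_label_dist, a=0.5, b=0.1):
    """Hypothetical discrepancy-aware aggregation weights.

    Combines each client's dataset size with the distance between its local
    and the global category distribution; the rule n_k - a*d_k + b is an
    illustrative stand-in for the paper's formula.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    sizes = sizes / sizes.sum()                      # normalized dataset sizes
    dists = np.asarray(client_label_dists, dtype=float)
    # discrepancy: L2 distance between local and global category distributions
    d = np.linalg.norm(dists - np.asarray(global_label_dist), axis=1)
    raw = np.clip(sizes - a * d + b, 0.0, None)      # size down-weighted by discrepancy
    return raw / raw.sum()                           # aggregation weights sum to 1
```

With two equally sized clients, the one whose label distribution matches the global one receives the larger aggregation weight.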
Learning to Estimate 6DoF Pose from Limited Data: A Few-Shot, Generalizable Approach using RGB Images
The accurate estimation of six degrees-of-freedom (6DoF) object poses is
essential for many applications in robotics and augmented reality. However,
existing methods for 6DoF pose estimation often depend on CAD templates or
dense support views, restricting their usefulness in real-world situations. In
this study, we present a new cascade framework named Cas6D for few-shot 6DoF
pose estimation that is generalizable and uses only RGB images. To address the
false positives of target object detection in the extreme few-shot setting, our
framework utilizes a self-supervised pre-trained ViT to learn robust feature
representations. Then, we initialize the nearest top-K pose candidates based on
similarity score and refine the initial poses using feature pyramids to
formulate and update the cascade warped feature volume, which encodes context
at increasingly finer scales. By discretizing the pose search range using
multiple pose bins and progressively narrowing the pose search range in each
stage using predictions from the previous stage, Cas6D can overcome the large
gap between pose candidates and ground truth poses, which is a common failure
mode in sparse-view scenarios. Experimental results on the LINEMOD and GenMOP
datasets demonstrate that Cas6D outperforms state-of-the-art methods, with
accuracy (Proj-5) gains of 9.2% and 3.8% over OnePose++ and Gen6D,
respectively, under the 32-shot setting.
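The coarse-to-fine bin narrowing described above can be illustrated with a toy 1-D search: discretize the range into bins, keep the best bin, and shrink the range around it at each stage. This is a stand-in for the full 6DoF search, and `score_fn` is a hypothetical pose-quality score:

```python
import numpy as np

def cascade_refine(score_fn, lo=0.0, hi=360.0, n_bins=16, stages=3):
    """Toy coarse-to-fine search over a 1-D pose parameter (e.g. an angle).

    Each stage evaluates n_bins candidates, keeps the best one, and
    narrows the search range around it, mirroring the cascade idea.
    """
    for _ in range(stages):
        centers = np.linspace(lo, hi, n_bins)
        best = centers[int(np.argmax([score_fn(c) for c in centers]))]
        half = (hi - lo) / n_bins          # shrink the range around the best bin
        lo, hi = best - half, best + half
    return best
```

Progressively narrowing the range lets a coarse initial grid recover a fine estimate without evaluating a dense grid over the whole range.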
Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction
Exploring spatial-temporal dependencies from observed motions is one of the
core challenges of human motion prediction. Previous methods mainly focus on
dedicated network structures to model the spatial and temporal dependencies.
This paper considers a new direction by introducing a model learning framework
with auxiliary tasks. In our auxiliary tasks, partial body joints' coordinates
are corrupted by either masking or adding noise, and the goal is to recover the
corrupted coordinates from the remaining ones. To work with auxiliary
tasks, we propose a novel auxiliary-adapted transformer, which can handle
incomplete, corrupted motion data and achieve coordinate recovery via capturing
spatial-temporal dependencies. Through auxiliary tasks, the auxiliary-adapted
transformer is promoted to capture more comprehensive spatial-temporal
dependencies among body joints' coordinates, leading to better feature
learning. Extensive experimental results have shown that our method outperforms
state-of-the-art methods by remarkable margins of 7.2%, 3.7%, and 9.4% in terms
of 3D mean per joint position error (MPJPE) on the Human3.6M, CMU Mocap, and
3DPW datasets, respectively. We also demonstrate that our method is more robust
under data missing cases and noisy data cases. Code is available at
https://github.com/MediaBrain-SJTU/AuxFormer.
Comment: Accepted to ICCV 2023.
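The corruption step of the auxiliary tasks is simple to sketch: mask a random subset of joints and add Gaussian noise to the rest. The ratios and noise scale below are illustrative, not the paper's settings:

```python
import numpy as np

def corrupt_joints(motion, mask_ratio=0.2, noise_std=0.05, rng=None):
    """Build an auxiliary-task input by corrupting joint coordinates.

    motion: array of shape (frames, joints, 3)
    returns: (corrupted motion, boolean mask of the zeroed joints)
    """
    rng = np.random.default_rng(rng)
    frames, joints, _ = motion.shape
    mask = rng.random((frames, joints)) < mask_ratio      # True = masked joint
    noisy = motion + rng.normal(0.0, noise_std, motion.shape)
    noisy[mask] = 0.0                                     # masked joints set to zero
    return noisy, mask
```

The recovery target is the original `motion`, so the model must infer the zeroed joints from the surviving spatial-temporal context.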
EqMotion: Equivariant Multi-agent Motion Prediction with Invariant Interaction Reasoning
Learning to predict agent motions with relationship reasoning is important
for many applications. In motion prediction tasks, maintaining motion
equivariance under Euclidean geometric transformations and invariance of agent
interaction is a critical and fundamental principle. However, such equivariance
and invariance properties are overlooked by most existing methods. To fill this
gap, we propose EqMotion, an efficient equivariant motion prediction model with
invariant interaction reasoning. To achieve motion equivariance, we propose an
equivariant geometric feature learning module to learn a Euclidean
transformable feature through dedicated designs of equivariant operations. To
reason about agents' interactions, we propose an invariant interaction reasoning
module that achieves more stable interaction modeling. To further obtain more
comprehensive motion features, we propose an invariant pattern feature learning
module to learn an invariant pattern feature, which cooperates with the
equivariant geometric feature to enhance network expressiveness. We conduct
experiments for the proposed model on four distinct scenarios: particle
dynamics, molecule dynamics, human skeleton motion prediction and pedestrian
trajectory prediction. Experimental results show that our method is not only
generally applicable, but also achieves state-of-the-art prediction
performance on all four tasks, improving by 24.0%, 30.1%, 8.6%, and 9.2%,
respectively. Code is available at https://github.com/MediaBrain-SJTU/EqMotion.
Comment: Accepted to CVPR 2023.
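The equivariance principle the abstract states can be demonstrated with a minimal operation: mix coordinate vectors across points with invariant coefficients, never touching the spatial axis. Because a rotation acts only on the last axis, the mixing commutes with it. This is a simplified illustration of the property, not EqMotion's actual layer:

```python
import numpy as np

def equivariant_mix(x, w):
    """Mix coordinate vectors with invariant coefficients.

    x: (n_points, 3) coordinates; w: (n_out, n_points) coefficients.
    Returns (n_out, 3): linear combinations of the coordinate vectors,
    satisfying equivariant_mix(x @ R, w) == equivariant_mix(x, w) @ R
    for any rotation R.
    """
    return w @ x
```

The identity holds because `w @ (x @ R) == (w @ x) @ R` by associativity of matrix multiplication, which is exactly the Euclidean equivariance the model is designed around.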
Decentralized and Lifelong-Adaptive Multi-Agent Collaborative Learning
Decentralized and lifelong-adaptive multi-agent collaborative learning aims
to enhance collaboration among multiple agents without a central server, with
each agent solving varied tasks over time. To achieve efficient collaboration,
agents should: i) autonomously identify beneficial collaborative relationships
in a decentralized manner; and ii) adapt to dynamically changing task
observations. In this paper, we propose DeLAMA, a decentralized multi-agent
lifelong collaborative learning algorithm with dynamic collaboration graphs. To
promote autonomous collaboration relationship learning, we propose a
decentralized graph structure learning algorithm, eliminating the need for
external priors. To facilitate adaptation to dynamic tasks, we design a memory
unit to capture the agents' accumulated learning history and knowledge, while
preserving finite storage consumption. To further augment the system's
expressive capabilities and computational efficiency, we apply algorithm
unrolling, leveraging the advantages of both mathematical optimization and
neural networks. This allows the agents to "learn to collaborate" through the
supervision of training tasks. Our theoretical analysis verifies that
inter-agent collaboration is communication efficient under a small number of
communication rounds. The experimental results verify its ability to facilitate
the discovery of collaboration strategies and adaptation to dynamic learning
scenarios, achieving a 98.80% reduction in MSE and a 188.87% improvement in
classification accuracy. We expect our work can serve as a foundational
technique to facilitate future works towards an intelligent, decentralized, and
dynamic multi-agent system. Code is available at
https://github.com/ShuoTang123/DeLAMA.
Comment: 23 pages, 15 figures.
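The two ingredients above, a learned collaboration graph and communication over it, can be sketched with a similarity heuristic in place of the unrolled optimization. The softmax-over-parameter-distance rule is an illustrative stand-in for DeLAMA's graph structure learning, not the algorithm itself:

```python
import numpy as np

def collaboration_weights(params, temperature=1.0):
    """Toy collaboration-graph estimate: each agent weights its peers by
    the similarity of their model parameters (a hypothetical stand-in
    for the learned graph-structure step).

    params: (n_agents, dim) per-agent model parameters
    returns: (n_agents, n_agents) row-stochastic collaboration matrix
    """
    diffs = params[:, None, :] - params[None, :, :]
    sim = -np.linalg.norm(diffs, axis=-1) / temperature   # closer models, higher weight
    sim = sim - sim.max(axis=1, keepdims=True)            # numerical stability
    w = np.exp(sim)
    return w / w.sum(axis=1, keepdims=True)

def gossip_step(params, w):
    """One decentralized collaboration round: each agent averages
    parameters over the collaboration graph, with no central server."""
    return w @ params
```

Each round is one exchange with neighbors, so agents with similar tasks are pulled together while dissimilar agents mostly keep their own parameters.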
Language-Driven Interactive Traffic Trajectory Generation
Realistic trajectory generation with natural language control is pivotal for
advancing autonomous vehicle technology. However, previous methods focus on
individual traffic participant trajectory generation, thus failing to account
for the complexity of interactive traffic dynamics. In this work, we propose
InteractTraj, the first language-driven traffic trajectory generator that can
generate interactive traffic trajectories. InteractTraj interprets abstract
trajectory descriptions into concrete formatted interaction-aware numerical
codes and learns a mapping between these formatted codes and the final
interactive trajectories. To interpret language descriptions, we propose a
language-to-code encoder with a novel interaction-aware encoding strategy. To
produce interactive traffic trajectories, we propose a code-to-trajectory
decoder with interaction-aware feature aggregation that synergizes vehicle
interactions with the environmental map and the vehicle moves. Extensive
experiments show that our method outperforms previous SoTA methods,
offering more realistic generation of interactive traffic
trajectories with high controllability via diverse natural language commands.
Our code is available at https://github.com/X1a-jk/InteractTraj.gi
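The intermediate "interaction-aware numerical code" can be pictured as a structured record between the language description and the trajectory decoder. The field names, integer encodings, and rule-based matcher below are all hypothetical; the real system uses a learned language-to-code encoder:

```python
from dataclasses import dataclass

@dataclass
class InteractionCode:
    actor: int           # index of the acting vehicle (hypothetical field)
    target: int          # index of the interacting vehicle
    relation: int        # e.g. 0 = follow, 1 = yield, 2 = overtake (illustrative)
    rel_distance: float  # desired gap in meters

def encode_description(desc: str):
    """Toy rule-based stand-in for the language-to-code encoder."""
    codes = []
    if "follows" in desc:
        codes.append(InteractionCode(actor=1, target=0, relation=0, rel_distance=10.0))
    if "overtakes" in desc:
        codes.append(InteractionCode(actor=1, target=0, relation=2, rel_distance=5.0))
    return codes
```

The point of such an intermediate format is that the decoder can condition on explicit pairwise relations rather than on free-form text.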
Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents
Scene simulation in autonomous driving has gained significant attention
because of its huge potential for generating customized data. However, existing
editable scene simulation approaches face limitations in terms of user
interaction efficiency, multi-camera photo-realistic rendering and external
digital assets integration. To address these challenges, this paper introduces
ChatSim, the first system that enables editable photo-realistic 3D driving
scene simulations via natural language commands with external digital assets.
To enable editing with high command flexibility, ChatSim leverages a large
language model (LLM) agent collaboration framework. To generate photo-realistic
outcomes, ChatSim employs a novel multi-camera neural radiance field method.
Furthermore, to unleash the potential of extensive high-quality digital assets,
ChatSim employs a novel multi-camera lighting estimation method to achieve
scene-consistent assets' rendering. Our experiments on Waymo Open Dataset
demonstrate that ChatSim can handle complex language commands and generate
corresponding photo-realistic scene videos.
Comment: CVPR 2024 (Highlight).
Knowledge-Based Approach to Assembly Sequence Planning for Wind-Driven Generator
Assembly sequence planning plays an essential role in the manufacturing industry. However, challenges remain in assembly planning research, one of which is the lack of an effective description of assembly knowledge and information. To reduce the computational burden, this paper presents a novel approach to the assembly sequence planning problem based on engineering assembly knowledge, and provides an appropriate way to express both geometric information and non-geometric knowledge. To increase sequence planning efficiency, the assembly connection graph is built according to knowledge from the engineering, design, and manufacturing fields. The product semantic information model can offer much useful information to help the designer complete the assembly (process) design and make the right decisions in that process. Complex and inefficient computation in the assembly design process can therefore be avoided. Finally, a product assembly planning example is presented to illustrate the effectiveness of the proposed approach. Initial experience with the approach indicates the potential to reduce lead times and can thereby help complete new product launch projects on time.
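Once precedence constraints have been extracted from the assembly connection graph, deriving a feasible sequence reduces to a topological sort. The sketch below assumes the constraints are already given; the knowledge-based contribution of the paper lies in producing them from engineering knowledge, which is not modeled here:

```python
from collections import defaultdict, deque

def assembly_sequence(parts, constraints):
    """Derive a feasible assembly sequence from precedence constraints.

    parts: list of part names
    constraints: list of (before, after) pairs, meaning `before` must be
    assembled before `after`. Any topological order of the resulting DAG
    is a feasible sequence.
    """
    indeg = {p: 0 for p in parts}
    succ = defaultdict(list)
    for before, after in constraints:
        succ[before].append(after)
        indeg[after] += 1
    queue = deque(p for p in parts if indeg[p] == 0)   # parts with no prerequisites
    order = []
    while queue:
        p = queue.popleft()
        order.append(p)
        for q in succ[p]:
            indeg[q] -= 1
            if indeg[q] == 0:
                queue.append(q)
    if len(order) != len(parts):
        raise ValueError("cyclic constraints: no feasible assembly sequence")
    return order
```

For example, with hypothetical generator parts constrained base → shaft → rotor → cover, the function returns that chain as the assembly order.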