FreeKD: Free-direction Knowledge Distillation for Graph Neural Networks
Knowledge distillation (KD) has demonstrated its effectiveness in boosting the performance of graph neural networks (GNNs), where the goal is to distill knowledge from a deeper teacher GNN into a shallower student GNN. However, it is often difficult to train a satisfactory teacher GNN due to the well-known over-parameterization and over-smoothing issues, leading to ineffective knowledge transfer in practice. In this paper, we propose the first Free-direction Knowledge Distillation framework via Reinforcement learning for GNNs, called FreeKD, which no longer requires a deeper, well-optimized teacher GNN. The core idea of our work is to collaboratively
build two shallower GNNs in an effort to exchange knowledge between them via
reinforcement learning in a hierarchical way. Observing that a typical GNN model often performs better at some nodes and worse at others during training, we devise a dynamic, free-direction knowledge transfer strategy consisting of two levels of actions: 1) a node-level action determines the direction of knowledge transfer between the corresponding nodes of the two networks; and 2) a structure-level action determines which of the local structures generated by the node-level actions should be propagated. In essence,
our FreeKD is a general and principled framework that is naturally compatible with GNNs of different architectures. Extensive experiments on five benchmark datasets demonstrate that FreeKD outperforms the two base GNNs by a large margin and is effective across various GNN architectures. More surprisingly, FreeKD achieves comparable or even better performance than traditional KD algorithms that distill knowledge from a deeper and stronger teacher GNN.

Comment: Accepted to KDD 202
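
The abstract does not spell out the transfer mechanism, but the node-level idea can be illustrated with a short sketch (PyTorch-style; the per-node confidence heuristic below is an assumed stand-in for the learned RL policy, and the function name is hypothetical):

    import torch
    import torch.nn.functional as F

    def node_level_distill_loss(logits_a, logits_b, labels, temperature=2.0):
        """Per-node KD loss whose direction depends on which GNN is better at each node."""
        # Per-node supervised losses decide which network acts as the temporary
        # teacher for each node (the paper learns this decision with RL).
        loss_a = F.cross_entropy(logits_a, labels, reduction="none")
        loss_b = F.cross_entropy(logits_b, labels, reduction="none")
        a_teaches = loss_a < loss_b  # node-level "action": direction of transfer

        def kd(student_logits, teacher_logits):
            return F.kl_div(
                F.log_softmax(student_logits / temperature, dim=-1),
                F.softmax(teacher_logits / temperature, dim=-1).detach(),
                reduction="none",
            ).sum(dim=-1)

        per_node = torch.where(a_teaches, kd(logits_b, logits_a), kd(logits_a, logits_b))
        return per_node.mean()  # added to each GNN's supervised loss during joint training

The structure-level action described in the abstract would additionally gate which local neighborhoods produced by these node-level decisions get propagated, rather than acting on individual nodes.
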
ANAct: Adaptive Normalization for Activation Functions
In this paper, we investigate the negative effect of activation functions on
forward and backward propagation and how to counteract it. First, we examine how activation functions affect the forward and backward propagation of neural networks and derive a general form for the gradient variance that extends previous work in this area. We use mini-batch statistics to dynamically update the normalization factor so that the normalization property holds throughout training, rather than only accounting for the state of the network after weight initialization. Second, we propose
ANAct, a method that normalizes activation functions to maintain consistent
gradient variance across layers and demonstrate its effectiveness through
experiments. We observe that the convergence rate is roughly correlated with the normalization property. We compare ANAct with several common activation functions on CNNs and residual networks and show that ANAct consistently improves their performance. For instance, normalized Swish achieves 1.4% higher top-1 accuracy than vanilla Swish on ResNet-50 with the Tiny ImageNet dataset and more than 1.2% higher on CIFAR-100.

Comment: 14 pages, 6 figures
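
As a rough illustration of the mechanism described above (not ANAct's exact normalization rule), one can wrap an activation and keep a running, mini-batch-driven scale so that the post-activation variance stays roughly constant across layers; the class name and update rule below are assumptions:

    import torch
    import torch.nn as nn

    class NormalizedActivation(nn.Module):
        """Activation wrapped with a data-driven normalization factor (illustrative only)."""
        def __init__(self, act=nn.SiLU(), momentum=0.1, eps=1e-5):
            super().__init__()
            self.act, self.momentum, self.eps = act, momentum, eps
            self.register_buffer("scale", torch.ones(1))

        def forward(self, x):
            y = self.act(x)
            if self.training:
                # Update the normalization factor from the current mini-batch so the
                # normalized output has roughly unit variance throughout training.
                batch_scale = y.detach().var().clamp_min(self.eps).rsqrt()
                self.scale.mul_(1 - self.momentum).add_(self.momentum * batch_scale)
            return y * self.scale

Using nn.SiLU (Swish) here mirrors the "normalized Swish" example mentioned in the abstract.
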
On the Road to Portability: Compressing End-to-End Motion Planner for Autonomous Driving
End-to-end motion planning models equipped with deep neural networks have
shown great potential for enabling fully autonomous driving. However, their oversized neural networks render them impractical for deployment on resource-constrained systems, as they unavoidably require more computation time and resources during inference. To handle this, knowledge distillation
offers a promising approach that compresses models by enabling a smaller
student model to learn from a larger teacher model. Nevertheless, how to apply
knowledge distillation to compress motion planners has not been explored so
far. In this paper, we propose PlanKD, the first knowledge distillation
framework tailored for compressing end-to-end motion planners. First,
considering that driving scenes are inherently complex, often containing
planning-irrelevant or even noisy information, transferring such information is
not beneficial for the student planner. Thus, we design an information-bottleneck-based strategy to distill only planning-relevant information, rather
than transfer all information indiscriminately. Second, different waypoints in
an output planned trajectory may hold varying degrees of importance for motion
planning, where a slight deviation in certain crucial waypoints might lead to a
collision. Therefore, we devise a safety-aware waypoint-attentive distillation
module that assigns adaptive weights to different waypoints based on their importance, to encourage the student to accurately mimic the more crucial waypoints, thereby improving overall safety. Experiments demonstrate that our PlanKD can boost the performance of smaller planners by a large margin and significantly reduce their inference time.

Comment: Accepted by CVPR 202
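
A minimal sketch of the waypoint-attentive part of the idea might look as follows (PyTorch-style; the importance scores are assumed to come from some learned scorer over teacher features, and PlanKD's actual losses, including the information-bottleneck term, are more involved):

    import torch
    import torch.nn.functional as F

    def waypoint_distill_loss(student_traj, teacher_traj, waypoint_scores):
        """student_traj, teacher_traj: (B, T, 2) waypoints; waypoint_scores: (B, T) importance."""
        # Adaptive per-waypoint weights: more safety-critical waypoints get more weight.
        weights = torch.softmax(waypoint_scores, dim=-1)
        # Per-waypoint squared error between student and (frozen) teacher trajectories.
        per_wp = F.mse_loss(student_traj, teacher_traj.detach(), reduction="none").sum(-1)
        return (weights * per_wp).sum(dim=-1).mean()
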
Learning to Generate Parameters of ConvNets for Unseen Image Data
Typical Convolutional Neural Networks (ConvNets) depend heavily on large
amounts of image data and resort to an iterative optimization algorithm (e.g.,
SGD or Adam) to learn network parameters, which makes training very time- and
resource-intensive. In this paper, we propose a new training paradigm and
formulate the parameter learning of ConvNets as a prediction task: given a ConvNet architecture, we observe that there exist correlations between image datasets and their corresponding optimal network parameters, and explore whether we can learn a hyper-mapping between them to capture these relations, so that we can directly predict the parameters of the network for an image dataset never seen during the training phase. To do this, we put forward a new hypernetwork-based model, called PudNet, which aims to learn a mapping between datasets
and their corresponding network parameters, and then predicts parameters for
unseen data with only a single forward propagation. Moreover, our model
benefits from a series of adaptive hyper recurrent units sharing weights to
capture the dependencies of parameters among different network layers.
Extensive experiments demonstrate that our proposed method achieves good efficacy on unseen image datasets in two settings: intra-dataset prediction and inter-dataset prediction. PudNet also scales well to large-scale datasets, e.g., ImageNet-1K. Training ResNet-18 from scratch on ImageNet-1K using GC takes 8,967 GPU seconds to obtain a top-5 accuracy of 44.65%. In contrast, PudNet needs only 3.89 GPU seconds to predict the network parameters of ResNet-18 while achieving comparable performance (44.92%), more than 2,300 times faster than the traditional training paradigm.
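
The following is a toy sketch of the prediction idea, assuming a tiny two-layer target ConvNet and a simple dataset encoder; it only loosely mirrors the shared hyper recurrent units and is not PudNet's actual architecture:

    import math
    import torch
    import torch.nn as nn

    class TinyHyperNet(nn.Module):
        """Maps a dataset summary to conv weights for a small target network (illustrative)."""
        def __init__(self, embed_dim=64, layer_shapes=((16, 3, 3, 3), (32, 16, 3, 3))):
            super().__init__()
            self.layer_shapes = layer_shapes
            self.encoder = nn.Sequential(nn.Conv2d(3, embed_dim, 3, padding=1),
                                         nn.AdaptiveAvgPool2d(1), nn.Flatten())
            # One GRU cell shared across layers, echoing the shared hyper recurrent units.
            self.cell = nn.GRUCell(embed_dim, embed_dim)
            self.heads = nn.ModuleList(
                [nn.Linear(embed_dim, math.prod(s)) for s in layer_shapes])

        def forward(self, support_images):
            # Dataset embedding = mean feature of a small support batch.
            h = self.encoder(support_images).mean(dim=0, keepdim=True)
            state, params = h, []
            for shape, head in zip(self.layer_shapes, self.heads):
                state = self.cell(h, state)
                params.append(head(state).view(*shape))  # predicted conv weights for this layer
            return params  # to be plugged into the target ConvNet's conv operations

A single forward pass through such a hypernetwork is what replaces the iterative SGD/Adam training loop in the abstract's comparison.
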
Robust Knowledge Adaptation for Dynamic Graph Neural Networks
Graph-structured data are often dynamic in nature, e.g., links and nodes are added over time, in many real-world applications. Recent years have witnessed increasing attention paid to dynamic graph neural networks for
modelling such graph data, where almost all the existing approaches assume that
when a new link is built, the embeddings of the neighbor nodes should be
updated by learning the temporal dynamics to propagate new information.
However, such approaches suffer from the limitation that if the node introduced
by a new connection contains noisy information, propagating its knowledge to
other nodes is unreliable and may even lead to the collapse of the model. In this paper, we propose AdaNet: a robust knowledge Adaptation framework via reinforcement learning for dynamic graph neural Networks. In contrast to previous approaches that immediately update the embeddings of the neighbor nodes once a new link is added, AdaNet adaptively determines which nodes should be updated in response to the new link. Considering that the decision of whether to update the embedding of one neighbor node greatly affects the other neighbor nodes, we formulate the selection of nodes to update as a sequential decision problem and address it via reinforcement learning. In this way, we can adaptively propagate knowledge to other nodes to learn robust node embedding representations. To the best of our
knowledge, our approach constitutes the first attempt to explore robust
knowledge adaptation via reinforcement learning for dynamic graph neural
networks. Extensive experiments on three benchmark datasets demonstrate that
AdaNet achieves state-of-the-art performance. In addition, we perform experiments with different degrees of noise added to the datasets, quantitatively and qualitatively illustrating the robustness of AdaNet.

Comment: 14 pages, 6 figures
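
A rough sketch of the selective-update idea, with a simple policy network standing in for AdaNet's RL agent (reward design and the sequential rollout are omitted, and all names are illustrative):

    import torch
    import torch.nn as nn

    class UpdateSelector(nn.Module):
        """Scores each neighbor of a newly linked node and samples update decisions."""
        def __init__(self, dim):
            super().__init__()
            self.policy = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

        def forward(self, new_node_emb, neighbor_embs):
            # Score each neighbor given the embedding of the newly connected node.
            pairs = torch.cat([neighbor_embs, new_node_emb.expand_as(neighbor_embs)], dim=-1)
            probs = torch.sigmoid(self.policy(pairs)).squeeze(-1)
            actions = torch.bernoulli(probs)  # 1 = update this neighbor, 0 = keep as-is
            return actions, probs

    def propagate(new_node_emb, neighbor_embs, actions, alpha=0.5):
        # Only the selected neighbors absorb information from the new connection,
        # so a noisy new node cannot corrupt every neighbor's embedding.
        mixed = (1 - alpha) * neighbor_embs + alpha * new_node_emb
        return torch.where(actions.unsqueeze(-1).bool(), mixed, neighbor_embs)
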
DREAM: Domain-free Reverse Engineering Attributes of Black-box Model
Deep learning models are usually black boxes when deployed on machine
learning platforms. Prior works have shown that the attributes (e.g., the number of convolutional layers) of a target black-box neural network can be exposed through a sequence of queries. However, these works have a crucial limitation: they assume the dataset used for training the target model is known beforehand and leverage this dataset for the model attribute attack. In reality, it is difficult to access the training dataset of the target black-box model, so whether the attributes of a target black-box model can still be revealed in this case is doubtful. In this paper, we investigate a new
problem of Domain-agnostic Reverse Engineering the Attributes of a black-box
target Model, called DREAM, without requiring the availability of the target
model's training dataset, and put forward a general and principled framework by
casting this problem as an out-of-distribution (OOD) generalization problem. In this way, we can learn a domain-agnostic model to inversely infer the attributes of a target black-box model with unknown training data. As a result, our method can gracefully apply to an arbitrary domain for model attribute reverse engineering with strong generalization ability. Extensive experimental studies are conducted, and the results validate the superiority of our proposed method over the baselines.
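
The query-based attribute inference setting itself can be sketched as below (assumed probe-set fingerprinting; DREAM's OOD-generalization training objective, which is what makes the predictor domain-agnostic, is not reproduced here):

    import torch
    import torch.nn as nn

    class AttributePredictor(nn.Module):
        """Predicts one attribute of a black-box model from its responses to fixed probes."""
        def __init__(self, num_probes, num_classes_per_probe, num_attribute_values):
            super().__init__()
            in_dim = num_probes * num_classes_per_probe
            self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, num_attribute_values))

        def forward(self, black_box, probe_inputs):
            with torch.no_grad():                    # the target model is only queried
                responses = black_box(probe_inputs)  # (num_probes, num_classes_per_probe)
            fingerprint = responses.flatten().unsqueeze(0)
            return self.net(fingerprint)             # logits over the attribute's values
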
In vitro cytotoxicity and induction of apoptosis by silica nanoparticles in human HepG2 hepatoma cells
Not All Weights Are Created Equal: Enhancing Energy Efficiency in On-Device Streaming Speech Recognition
Power consumption plays an important role in on-device streaming speech
recognition, as it has a direct impact on the user experience. This study
delves into how weight parameters in speech recognition models influence the
overall power consumption of these models. We discovered that the impact of
weight parameters on power consumption varies, influenced by factors including
how often they are invoked and their placement in memory. Armed with this
insight, we developed design guidelines aimed at optimizing on-device speech
recognition models. These guidelines focus on minimizing power use without
substantially affecting accuracy. Our method, which employs targeted
compression based on the varying sensitivities of weight parameters,
demonstrates superior performance compared to state-of-the-art compression
methods. It achieves a reduction in energy usage of up to 47% while maintaining
similar model accuracy and improving the real-time factor.
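
As a back-of-the-envelope illustration of targeted compression (not the paper's method), one could map a per-layer energy-cost/sensitivity score to a pruning ratio; the cost and sensitivity values are assumed to come from profiling how often weights are invoked and where they live in memory:

    def assign_pruning_ratios(layers, max_ratio=0.8, min_ratio=0.1):
        """layers: list of dicts with 'name', 'energy_cost', 'sensitivity' (both > 0)."""
        scores = [l["energy_cost"] / l["sensitivity"] for l in layers]
        lo, hi = min(scores), max(scores)
        ratios = {}
        for layer, s in zip(layers, scores):
            # Linearly map the score to a pruning ratio: higher score -> prune more.
            t = 0.0 if hi == lo else (s - lo) / (hi - lo)
            ratios[layer["name"]] = min_ratio + t * (max_ratio - min_ratio)
        return ratios

    # Example: an often-invoked, low-sensitivity layer gets pruned hardest.
    print(assign_pruning_ratios([
        {"name": "encoder.0", "energy_cost": 9.0, "sensitivity": 1.0},
        {"name": "decoder.0", "energy_cost": 2.0, "sensitivity": 3.0},
    ]))
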
