161 research outputs found
Learning to infer: RL-based search for DNN primitive selection on Heterogeneous Embedded Systems
Deep Learning is increasingly being adopted by industry for computer vision
applications running on embedded devices. While Convolutional Neural Networks'
accuracy has achieved a mature and remarkable state, inference latency and
throughput are a major concern especially when targeting low-cost and low-power
embedded platforms. CNNs' inference latency may become a bottleneck for Deep
Learning adoption by industry, as it is a crucial specification for many
real-time processes. Furthermore, deployment of CNNs across heterogeneous
platforms presents major compatibility issues due to vendor-specific technology
and acceleration libraries. In this work, we present QS-DNN, a fully automatic
search based on Reinforcement Learning which, combined with an inference engine
optimizer, efficiently explores the design space and empirically finds
the optimal combinations of libraries and primitives to speed up the inference
of CNNs on heterogeneous embedded devices. We show that an optimized
combination can achieve a 45x speedup in inference latency on CPU compared to a
dependency-free baseline and 2x on average on GPGPU compared to the best vendor
library. Further, we demonstrate that the quality of results and the
time-to-solution are much better than with Random Search, with up to 15x
better results for a short-time search.
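
The search described in this abstract can be illustrated with a toy sketch. It is not the QS-DNN implementation: the abstract only says the search is based on Reinforcement Learning, so tabular Q-learning is an assumption made here, and the latency table, the primitive names (`direct`, `gemm`, `winograd`) and the `SWITCH_PENALTY` for changing primitive between layers are all hypothetical. In a real setting each latency would come from measurements on the target device.

```python
import itertools
import random

# Hypothetical per-layer latencies in ms; real values would be measured on-device.
LAYER_COST = [
    {"direct": 9.0, "gemm": 4.0, "winograd": 6.0},
    {"direct": 7.0, "gemm": 5.0, "winograd": 2.0},
    {"direct": 3.0, "gemm": 4.0, "winograd": 5.0},
    {"direct": 8.0, "gemm": 3.0, "winograd": 3.5},
]
PRIMITIVES = ["direct", "gemm", "winograd"]
SWITCH_PENALTY = 1.5  # assumed cost of converting data layout between primitives


def total_latency(config):
    """Sum of layer costs plus a penalty for every change of primitive."""
    cost = sum(LAYER_COST[i][p] for i, p in enumerate(config))
    switches = sum(1 for a, b in zip(config, config[1:]) if a != b)
    return cost + SWITCH_PENALTY * switches


def q_learning_search(episodes=500, alpha=0.3, epsilon=0.2, seed=0):
    """Epsilon-greedy tabular Q-learning over per-layer primitive choices."""
    rng = random.Random(seed)
    q = {}  # maps ((layer, previous primitive), action) to an estimated return
    n = len(LAYER_COST)
    best_config, best_latency = None, float("inf")
    for _ in range(episodes):
        config, prev = [], None
        for layer in range(n):
            state = (layer, prev)
            if rng.random() < epsilon:
                action = rng.choice(PRIMITIVES)
            else:
                action = max(PRIMITIVES, key=lambda a: q.get((state, a), 0.0))
            # Reward is the negative latency contributed by this choice.
            reward = -LAYER_COST[layer][action]
            if prev is not None and action != prev:
                reward -= SWITCH_PENALTY
            next_state = (layer + 1, action)
            future = 0.0
            if layer + 1 < n:
                future = max(q.get((next_state, a), 0.0) for a in PRIMITIVES)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + future - old)
            config.append(action)
            prev = action
        latency = total_latency(config)
        if latency < best_latency:
            best_config, best_latency = config, latency
    return best_config, best_latency


def brute_force_optimum():
    """Exhaustive reference, feasible only because the toy space is tiny."""
    return min(total_latency(c)
               for c in itertools.product(PRIMITIVES, repeat=len(LAYER_COST)))
```

Calling `q_learning_search()` returns the best configuration seen during the search and its latency, which can be compared against `brute_force_optimum()` on this small example. On a real network the design space is too large to enumerate, which is the motivation for a learned search.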
Shortest Path Distance in Manhattan Poisson Line Cox Process
While the Euclidean distance characteristics of the Poisson line Cox process
(PLCP) have been investigated in the literature, the analytical
characterization of the path distances is still an open problem. In this paper,
we solve this problem for the stationary Manhattan Poisson line Cox process
(MPLCP), which is a variant of the PLCP. Specifically, we derive the exact
cumulative distribution function (CDF) for the length of the shortest path to
the nearest point of the MPLCP in the sense of path distance measured from two
reference points: (i) the typical intersection of the Manhattan Poisson line
process (MPLP), and (ii) the typical point of the MPLCP. We also discuss the
application of these results in infrastructure planning, wireless
communication, and transportation networks.
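
A small Monte Carlo sketch can make reference point (i) concrete. It does not reproduce the exact CDF derived in the paper. It assumes the usual construction in which horizontal and vertical line offsets are independent one-dimensional Poisson processes and the points on each line form a one-dimensional Poisson process; the rates, the window size and all function names are hypothetical. Because the typical intersection sits on both a horizontal and a vertical line, any point at (x, y) can be reached along the lines with length |x| + |y|, which is what the code minimizes.

```python
import random


def poisson_points_1d(rate, half_width, rng):
    """Points of a homogeneous 1D Poisson process on [-half_width, half_width]."""
    points, x = [], -half_width
    while True:
        x += rng.expovariate(rate)  # exponential gaps between consecutive points
        if x > half_width:
            return points
        points.append(x)


def nearest_path_distance(line_rate, point_rate, half_width, rng):
    """Path distance from the intersection at the origin to the nearest point."""
    # The two lines through the origin are added to the Poisson lines.
    horizontal = [0.0] + poisson_points_1d(line_rate, half_width, rng)  # y offsets
    vertical = [0.0] + poisson_points_1d(line_rate, half_width, rng)    # x offsets
    best = float("inf")
    for y0 in horizontal:
        for x in poisson_points_1d(point_rate, half_width, rng):
            best = min(best, abs(x) + abs(y0))
    for x0 in vertical:
        for y in poisson_points_1d(point_rate, half_width, rng):
            best = min(best, abs(x0) + abs(y))
    return best


def empirical_cdf(samples, t):
    """Fraction of samples that are at most t."""
    return sum(1 for s in samples if s <= t) / len(samples)


def simulate(trials=200, line_rate=1.0, point_rate=1.0, half_width=10.0, seed=0):
    """Independent realizations of the nearest-point path distance."""
    rng = random.Random(seed)
    return [nearest_path_distance(line_rate, point_rate, half_width, rng)
            for _ in range(trials)]
```

A value returned by `nearest_path_distance` is exact for the simulated realization whenever it does not exceed `half_width`, since every location within that path distance lies inside the window. Reference point (ii), the typical point of the MPLCP, does not sit at an intersection and is not covered by this sketch.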
Self-aligned insulated gate FET technology for InP : an interface engineering approach
Surface passivation -- Thermal S passivation of InP -- A universal model for the formation of sulfide layers on InP -- Fabrication technology for SAGFETs -- Chemical cleaning and etching -- The gate insulator -- Ion implantation -- Interface engineered MIS diodes -- Fabrication of interface engineered MIS capacitors -- The dielectric on S treated InP -- Dielectric leakage -- The InP/nitride interface -- Interface trap characteristics -- Temperature stability of the passivated capacitors -- Mask set design -- Criteria for diagnostic chip design -- Device fabrication -- DC electrical performance -- Performance of MISFETs -- HIGFETs
Wafer-Scale Fast Fourier Transforms
We have implemented fast Fourier transforms for one, two, and
three-dimensional arrays on the Cerebras CS-2, a system whose memory and
processing elements reside on a single silicon wafer. The wafer-scale engine
(WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements
(PEs) with fast local memory and equally fast nearest-neighbor
interconnections.
Our wafer-scale FFT (wsFFT) parallelizes an n^3 problem with up to n^2
PEs. At this point a PE processes only a single vector of the 3D domain (known
as a pencil) per superstep, where each of the three supersteps performs FFT
along one of the three axes of the input array. Between supersteps, wsFFT
redistributes (transposes) the data to bring all elements of each
one-dimensional pencil being transformed into the memory of a single PE. Each
redistribution causes an all-to-all communication along one of the mesh
dimensions. Given the level of parallelism, the size of the messages
transmitted between pairs of PEs can be as small as a single word. In theory, a
mesh is not ideal for all-to-all communication due to its limited bisection
bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely
on-wafer and achieves nearly peak bandwidth even with tiny messages.
This high efficiency on fine-grain communication allows wsFFT to achieve
unprecedented levels of parallelism and performance. We analyse in detail
computation and communication time, as well as the weak and strong scaling,
using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we
achieve 959 microseconds for 3D FFT of a 512^3 complex input array using a
512x512 subgrid of the on-wafer PEs. This is the largest ever parallelization
for this problem size and the first implementation that breaks the millisecond
barrier.
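
The superstep structure described in this abstract (1D FFTs along one axis, then a redistribution that brings the next axis into local memory) can be sketched serially. This is not the wsFFT code and it models neither the PE mesh nor its communication: each innermost list stands in for a pencil held by one PE, and `rotate_axes` stands in for the all-to-all transpose. All function names are hypothetical, the array is assumed cubic, and its side is assumed to be a power of two.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft1d(x[0::2]), fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def rotate_axes(a):
    """Redistribution stand-in: returns b with b[j][k][i] == a[i][j][k]."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    return [[[a[i][j][k] for i in range(n0)]
             for k in range(n2)]
            for j in range(n1)]


def fft3d_pencils(a):
    """Three supersteps: transform every innermost pencil, then rotate the axes."""
    for _ in range(3):
        a = [[fft1d(pencil) for pencil in plane] for plane in a]
        a = rotate_axes(a)
    return a  # three rotations restore the original axis order


def dft3d_naive(a):
    """Direct O(n^6) evaluation of the 3D DFT, used only as a reference."""
    n = len(a)

    def w(p):
        return cmath.exp(-2j * cmath.pi * p / n)

    return [[[sum(a[i][j][k] * w(i * u + j * v + k * s)
                  for i in range(n) for j in range(n) for k in range(n))
              for s in range(n)]
             for v in range(n)]
            for u in range(n)]
```

For a small cube such as n = 4, the output of `fft3d_pencils` can be compared entry by entry with `dft3d_naive`. In the arrangement the abstract describes, each call to `fft1d` within a superstep would run on a separate PE, so the redistribution between supersteps is where the mesh bandwidth matters.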