161 research outputs found

Learning to infer: RL-based search for DNN primitive selection on Heterogeneous Embedded Systems

Author: abadi
anderson
baker
chetlur
cortes
dong
he
hsu
kim
li
real
sutton
tan
watkins
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 18/11/2018

Deep Learning is increasingly being adopted by industry for computer vision applications running on embedded devices. While the accuracy of Convolutional Neural Networks (CNNs) has reached a mature and remarkable state, inference latency and throughput remain a major concern, especially when targeting low-cost and low-power embedded platforms. CNN inference latency may become a bottleneck for Deep Learning adoption by industry, as it is a crucial specification for many real-time processes. Furthermore, deployment of CNNs across heterogeneous platforms presents major compatibility issues due to vendor-specific technology and acceleration libraries. In this work, we present QS-DNN, a fully automatic search based on Reinforcement Learning which, combined with an inference engine optimizer, efficiently explores the design space and empirically finds the optimal combinations of libraries and primitives to speed up the inference of CNNs on heterogeneous embedded devices. We show that an optimized combination can achieve a 45x speedup in inference latency on CPU compared to a dependency-free baseline and 2x on average on GPGPU compared to the best vendor library. Further, we demonstrate that the quality of results and the time-to-solution are much better than with Random Search, with up to 15x better results for a short-time search.
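To make the idea of a learned search over per-layer primitives concrete, here is a minimal sketch, not the QS-DNN implementation. It assumes tabular Q-learning (the abstract says only "Reinforcement Learning"), and the primitive names, latency table, and conversion penalty are all hypothetical values made up for illustration. The state is the current layer plus the primitive chosen for the previous layer, and the reward is the negative latency of the chosen primitive, plus an assumed penalty when consecutive layers use different libraries.

```python
import itertools
import random

# Hypothetical primitives and per-layer latencies (microseconds); not measured data.
PRIMITIVES = ["libA_gemm", "libA_winograd", "libB_direct"]
LIBRARY = {"libA_gemm": "libA", "libA_winograd": "libA", "libB_direct": "libB"}
LATENCY_US = [
    [900, 400, 700],
    [300, 650, 200],
    [500, 450, 150],
    [800, 300, 600],
]
CONVERSION_US = 250  # assumed data-layout conversion cost when switching libraries


def step_cost(layer, prev, action):
    """Latency of running `layer` with primitive `action` after primitive `prev`."""
    cost = LATENCY_US[layer][action]
    if prev is not None and LIBRARY[PRIMITIVES[prev]] != LIBRARY[PRIMITIVES[action]]:
        cost += CONVERSION_US
    return cost


def total_cost(choice):
    """End-to-end latency of a full per-layer assignment."""
    prev, total = None, 0
    for layer, action in enumerate(choice):
        total += step_cost(layer, prev, action)
        prev = action
    return total


def q_learning_search(episodes=3000, seed=0):
    """Tabular Q-learning over (layer, previous primitive) states."""
    rng = random.Random(seed)
    n_layers, n_actions = len(LATENCY_US), len(PRIMITIVES)
    q = {}  # state -> list of Q-values, one per primitive

    def values(state):
        return q.setdefault(state, [0] * n_actions)

    for ep in range(episodes):
        # Epsilon decays linearly from 1.0 to a floor of 0.1.
        eps = max(0.1, 1.0 - ep / (0.5 * episodes))
        prev = None
        for layer in range(n_layers):
            qs = values((layer, prev))
            if rng.random() < eps:
                action = rng.randrange(n_actions)  # explore
            else:
                action = max(range(n_actions), key=qs.__getitem__)  # exploit
            reward = -step_cost(layer, prev, action)
            future = max(values((layer + 1, action))) if layer + 1 < n_layers else 0
            # Learning rate 1 and no discount: valid because costs are deterministic here.
            qs[action] = reward + future
            prev = action

    # Greedy rollout of the learned table gives the selected primitive per layer.
    choice, prev = [], None
    for layer in range(n_layers):
        qs = values((layer, prev))
        prev = max(range(n_actions), key=qs.__getitem__)
        choice.append(prev)
    return choice


best = q_learning_search()
brute_force = min(
    total_cost(c)
    for c in itertools.product(range(len(PRIMITIVES)), repeat=len(LATENCY_US))
)
print([PRIMITIVES[a] for a in best], total_cost(best), brute_force)
```

In this toy setting the design space has only 81 combinations, so exhaustive search is trivial and serves as a check. A learned search becomes interesting when the number of layers and primitives makes enumeration impractical and each latency must be measured on the target device.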

arXiv.org e-Print Archive

Crossref

Shortest Path Distance in Manhattan Poisson Line Cox Process

Author: Chetlur Vishnu Vardhan
Dettmann Carl P.
Dhillon Harpreet S.
Publication venue
Publication date: 05/06/2020

While the Euclidean distance characteristics of the Poisson line Cox process (PLCP) have been investigated in the literature, the analytical characterization of the path distances is still an open problem. In this paper, we solve this problem for the stationary Manhattan Poisson line Cox process (MPLCP), which is a variant of the PLCP. Specifically, we derive the exact cumulative distribution function (CDF) for the length of the shortest path to the nearest point of the MPLCP in the sense of path distance measured from two reference points: (i) the typical intersection of the Manhattan Poisson line process (MPLP), and (ii) the typical point of the MPLCP. We also discuss the application of these results in infrastructure planning, wireless communication, and transportation networks.

arXiv.org e-Print Archive

Explore Bristol Research

Self-aligned insulated gate FET technology for InP : an interface engineering approach

Author: Sundararaman Chetlur S.
Publication venue: École polytechnique de Montréal
Publication date: 01/01/1993

Surface passivation -- Thermal S passivation of InP -- A universal model for the formation of sulfide layers on InP -- Fabrication technology for SAG FETs -- Chemical cleaning and etching -- The gate insulator -- Ion implantation -- Interface engineered MIS diodes -- Fabrication of interface engineered MIS capacitors -- The dielectric on S treated InP -- Dielectric leakage -- The InP/nitride interface -- Interface trap characteristics -- Temperature stability of the passivated capacitors -- Mask set design -- Criteria for diagnostic chip design -- Device fabrication -- DC electrical performance -- Performance of MISFETs -- HIGFETs

PolyPublie

Wafer-Scale Fast Fourier Transforms

Author: Chetlur Sharan
Jacquelin Mathias
Orenes-Vera Marcelo
Schreiber Robert
Sharapov Ilya
Vandermersch Philippe
Publication venue
Publication date: 29/09/2022

We have implemented fast Fourier transforms for one, two, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections. Our wafer-scale FFT (wsFFT) parallelizes an n^3 problem with up to n^2 PEs. At this point a PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps performs FFT along one of the three axes of the input array. Between supersteps, wsFFT redistributes (transposes) the data to bring all elements of each one-dimensional pencil being transformed into the memory of a single PE. Each redistribution causes an all-to-all communication along one of the mesh dimensions. Given the level of parallelism, the size of the messages transmitted between pairs of PEs can be as small as a single word. In theory, a mesh is not ideal for all-to-all communication due to its limited bisection bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely on-wafer and achieves nearly peak bandwidth even with tiny messages. This high efficiency on fine-grain communication allows wsFFT to achieve unprecedented levels of parallelism and performance. We analyse in detail computation and communication time, as well as the weak and strong scaling, using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we achieve 959 microseconds for 3D FFT of a 512^3 complex input array using a 512x512 subgrid of the on-wafer PEs. This is the largest ever parallelization for this problem size and the first implementation that breaks the millisecond barrier.
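The pencil decomposition described in this abstract can be illustrated on a single machine. The sketch below is not the wsFFT code and uses no Cerebras API. It is a pure-Python model, with hypothetical function names, in which each of three supersteps applies a 1D FFT to every pencil along the last axis and a cyclic axis rotation stands in for the all-to-all redistribution between supersteps. In the wafer-scale setting each of the n^2 pencils would live on its own PE; here they are simply list rows.

```python
import cmath


def fft1d(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft3d_pencils(a):
    """3D FFT of an n x n x n nested list as three supersteps of 1D pencil FFTs."""
    n = len(a)
    a = [[[complex(v) for v in row] for row in plane] for plane in a]
    for _ in range(3):
        # Superstep: transform each of the n*n pencils along the last axis.
        a = [[fft1d(a[i][j]) for j in range(n)] for i in range(n)]
        # Redistribution: rotate axes so that new[k][i][j] = old[i][j][k].
        # After three rotations the axes are back in their original order.
        a = [[[a[i][j][k] for j in range(n)] for i in range(n)] for k in range(n)]
    return a


def dft3d_direct(a):
    """Reference O(n^6) 3D DFT computed straight from the definition."""
    n = len(a)
    w = [cmath.exp(-2j * cmath.pi * m / n) for m in range(n)]
    out = [[[0j] * n for _ in range(n)] for _ in range(n)]
    for p in range(n):
        for q in range(n):
            for r in range(n):
                s = 0j
                for i in range(n):
                    for j in range(n):
                        for k in range(n):
                            s += a[i][j][k] * w[(p * i + q * j + r * k) % n]
                out[p][q][r] = s
    return out


n = 4
data = [
    [[complex((7 * i + 3 * j + k) % 5, (i + 2 * j + 3 * k) % 4) for k in range(n)]
     for j in range(n)]
    for i in range(n)
]
fast = fft3d_pencils(data)
ref = dft3d_direct(data)
max_err = max(
    abs(fast[p][q][r] - ref[p][q][r])
    for p in range(n) for q in range(n) for r in range(n)
)
print("max abs difference vs direct DFT:", max_err)
```

The axis rotation is the step that becomes an all-to-all exchange on a mesh: every pencil must gather one element from each pencil along the axis being brought into local memory, which is why message sizes can shrink to a single word at full parallelism.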

arXiv.org e-Print Archive