3D finite difference computation on GPUs using CUDA
In this paper we describe a GPU parallelization of 3D finite difference computation using CUDA. Data access redundancy is used as the metric to determine the optimal implementation for both the stencil-only computation and the discretization of the wave equation, which is currently of great interest in seismic computing. For the larger stencils, the described approach achieves throughput between 2,400 and over 3,000 million output points per second on a single Tesla 10-series GPU. This is roughly an order of magnitude higher than a 4-core Harpertown CPU running similar code from the seismic industry. Multi-GPU parallelization is also described, achieving linear scaling with the number of GPUs by overlapping inter-GPU communication with computation.
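The wave-equation discretization the abstract refers to can be sketched as follows. This is a simplified NumPy illustration, not the paper's CUDA kernel: the function name, the second-order 7-point Laplacian, and the parameters are illustrative assumptions.

```python
import numpy as np

def wave_step(p_prev, p_curr, c2_dt2, inv_h2):
    """One time step of a second-order-in-time, second-order-in-space
    3D wave-equation update (illustrative sketch, not the paper's code).

    p_prev, p_curr : pressure fields at the two previous time levels
    c2_dt2         : c^2 * dt^2 (velocity squared times time step squared)
    inv_h2         : 1 / h^2 for a uniform grid spacing h
    """
    # 7-point stencil Laplacian on interior points only
    lap = (
        -6.0 * p_curr[1:-1, 1:-1, 1:-1]
        + p_curr[2:, 1:-1, 1:-1] + p_curr[:-2, 1:-1, 1:-1]
        + p_curr[1:-1, 2:, 1:-1] + p_curr[1:-1, :-2, 1:-1]
        + p_curr[1:-1, 1:-1, 2:] + p_curr[1:-1, 1:-1, :-2]
    ) * inv_h2
    # Leapfrog time update; boundary values are simply carried forward
    p_next = p_curr.copy()
    p_next[1:-1, 1:-1, 1:-1] = (
        2.0 * p_curr[1:-1, 1:-1, 1:-1]
        - p_prev[1:-1, 1:-1, 1:-1]
        + c2_dt2 * lap
    )
    return p_next
```

Each output point reads seven neighboring input points, which is the data-access redundancy the paper uses as its optimization metric; the GPU version tiles these reads through shared memory.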
MLPerf Inference Benchmark
Machine-learning (ML) hardware and software system demand is burgeoning.
Driven by ML applications, the number of different ML inference systems has
exploded. Over 100 organizations are building ML inference chips, and the
systems that incorporate existing models span at least three orders of
magnitude in power consumption and five orders of magnitude in performance;
they range from embedded devices to data-center solutions. Fueling the hardware
are a dozen or more software frameworks and libraries. The myriad combinations
of ML hardware and ML software make assessing ML-system performance in an
architecture-neutral, representative, and reproducible manner challenging.
There is a clear need for industry-wide standard ML benchmarking and evaluation
criteria. MLPerf Inference answers that call. In this paper, we present our
benchmarking method for evaluating ML inference systems. Driven by more than 30
organizations as well as more than 200 ML engineers and practitioners, MLPerf
prescribes a set of rules and best practices to ensure comparability across
systems with wildly differing architectures. The first call for submissions
garnered more than 600 reproducible inference-performance measurements from 14
organizations, representing over 30 systems that showcase a wide range of
capabilities. The submissions attest to the benchmark's flexibility and
adaptability. Comment: ISCA 2020
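The kind of reproducible latency measurement the benchmark standardizes can be sketched with a toy harness. This is a hypothetical illustration, not MLPerf's actual LoadGen; the function names and the nearest-rank percentile choice are assumptions.

```python
import math
import time

def measure_latencies(run_query, n_queries):
    """Time n_queries calls to a system-under-test; return latencies in seconds.
    (Toy harness for illustration only, not MLPerf LoadGen.)"""
    latencies = []
    for _ in range(n_queries):
        t0 = time.perf_counter()
        run_query()
        latencies.append(time.perf_counter() - t0)
    return latencies

def percentile(latencies, q):
    """Nearest-rank percentile for q in (0, 1], e.g. q=0.9 for p90."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]
```

Tail percentiles rather than means matter here because MLPerf's server-style scenarios constrain how slow the worst queries may be, not just the average.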
FP8 Formats for Deep Learning
FP8 is a natural progression for accelerating deep learning training and
inference beyond the 16-bit formats common in modern processors. In this paper
we propose an 8-bit floating point (FP8) binary interchange format consisting
of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit
exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for
representation of special values, E4M3's dynamic range is extended by not
representing infinities and having only one mantissa bit-pattern for NaNs. We
demonstrate the efficacy of the FP8 format on a variety of image and language
tasks, effectively matching the result quality achieved by 16-bit training
sessions. Our study covers the main modern neural network architectures - CNNs,
RNNs, and Transformer-based models, leaving all the hyperparameters unchanged
from the 16-bit baseline training sessions. Our training experiments include
large, up to 175B parameter, language models. We also examine FP8
post-training quantization of language models trained using 16-bit formats that
resisted fixed-point int8 quantization.
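The two encodings can be illustrated with a small decoder. This is a hypothetical sketch written from the format descriptions above, not code from the paper; it reproduces the stated conventions (E5M2 keeps IEEE-style infinities and NaNs; E4M3 drops infinities and reserves only the all-ones bit pattern for NaN).

```python
def decode_fp8(byte, exp_bits, man_bits, bias, ieee_specials):
    """Decode one FP8 byte. Use (4, 3, 7, False) for E4M3 and
    (5, 2, 15, True) for E5M2. Illustrative sketch only."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    max_exp = (1 << exp_bits) - 1
    if ieee_specials and exp == max_exp:
        # E5M2: IEEE 754 conventions -- infinity when mantissa is 0, else NaN
        return sign * float("inf") if man == 0 else float("nan")
    if not ieee_specials and exp == max_exp and man == (1 << man_bits) - 1:
        # E4M3: no infinities; only the all-ones pattern encodes NaN
        return float("nan")
    if exp == 0:
        # Subnormal: no implicit leading 1
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)
```

Dropping infinities is what extends E4M3's range: the top exponent is reclaimed for finite values, giving a maximum magnitude of (1 + 6/8) * 2^8 = 448.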
Microscaling Data Formats for Deep Learning
Narrow bit-width data formats are key to reducing the computational and
storage costs of modern deep learning applications. This paper evaluates
Microscaling (MX) data formats that combine a per-block scaling factor with
narrow floating-point and integer types for individual elements. MX formats
balance the competing needs of hardware efficiency, model accuracy, and user
friction. Empirical results on over two dozen benchmarks demonstrate the
practicality of MX data formats as a drop-in replacement for baseline FP32 for
AI inference and training with low user friction. We also show the first
instance of training generative language models at sub-8-bit weights,
activations, and gradients with minimal accuracy loss and no modifications to
the training recipe.
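The per-block scaling idea can be sketched in a few lines. This is a simplified illustration, not the MX specification: MX pairs a shared power-of-two scale with narrow FP or integer element types per fixed-size block, but the signed-integer element format, block handling, and function names here are assumptions.

```python
import math

def mx_quantize_block(block, elem_bits=8):
    """Simplified MX-style block quantization: one shared power-of-two
    scale per block, narrow signed-integer elements. Illustrative only."""
    amax = max(abs(x) for x in block)
    qmax = 2 ** (elem_bits - 1) - 1
    # Pick the smallest power-of-two scale under which the largest
    # element still fits the element type's range
    scale_exp = math.ceil(math.log2(amax / qmax)) if amax > 0 else 0
    scale = 2.0 ** scale_exp
    q = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return scale, q

def mx_dequantize_block(scale, q):
    """Reconstruct approximate values from the shared scale and elements."""
    return [scale * v for v in q]
```

Because only one scale is stored per block, the per-element storage stays at the narrow bit width while the shared exponent tracks each block's dynamic range.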
Exploring Topological Properties Of Nmr Graphs
In this paper we explore the topological properties of graphs derived from Nuclear Magnetic Resonance (NMR) restraint data. Only the distance-bound data is considered, connecting two nodes only if the file contains a bound for the corresponding inter-atomic distance. Since NMR provides bounds for spatially close atom pairs, the structure of the NMR graph depends heavily on the molecule's 3D shape. Therefore, understanding NMR graph topology is relevant to 3D structure prediction. We examine NMR graphs for nine molecules, with sizes ranging from 195 to 1178 nodes. Three categories of topological properties are studied: connected components, diameter, and node-degree sequence. Several interesting observations emerged when relating the diameter and maximum degree to the size of the graph. ©2007 IEEE
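The three categories of properties studied can be computed with standard BFS-based routines. A minimal sketch, assuming an undirected graph given as an edge list (the function names are illustrative, not the paper's code):

```python
from collections import deque

def graph_properties(edges):
    """Return (connected components, diameter of the largest component,
    node-degree sequence) for an undirected graph given as edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def bfs(src):
        # Unweighted shortest-path distances from src
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        return dist

    # Connected components: repeated BFS over unseen nodes
    seen, components = set(), []
    for node in adj:
        if node not in seen:
            comp = set(bfs(node))
            seen |= comp
            components.append(comp)

    # Diameter of the largest component: max eccentricity over its nodes
    largest = max(components, key=len)
    diameter = max(max(bfs(u).values()) for u in largest)

    # Node-degree sequence, conventionally listed in decreasing order
    degrees = sorted((len(adj[u]) for u in adj), reverse=True)
    return components, diameter, degrees
```

Running BFS from every node makes the diameter computation O(V·E), which is practical at the 195–1178-node scale the paper examines.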
