goSLP: Globally Optimized Superword Level Parallelism Framework
Modern microprocessors are equipped with single instruction multiple data
(SIMD) or vector instruction sets which allow compilers to exploit superword
level parallelism (SLP), a type of fine-grained parallelism. Current SLP
auto-vectorization techniques use heuristics to discover vectorization
opportunities in high-level language code. These heuristics are fragile, local
and typically only present one vectorization strategy that is either accepted
or rejected by a cost model. We present goSLP, a novel SLP auto-vectorization
framework which solves the statement packing problem in a pairwise optimal
manner. Using an integer linear programming (ILP) solver, goSLP searches the
entire space of statement packing opportunities for a whole function at a time,
while limiting total compilation time to a few minutes. Furthermore, goSLP
optimally solves the vector permutation selection problem using dynamic
programming. We implemented goSLP in the LLVM compiler infrastructure,
achieving a geometric mean speedup of 7.58% on SPEC2017fp, 2.42% on SPEC2006fp
and 4.07% on NAS benchmarks compared to LLVM's existing SLP auto-vectorizer.
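To make the statement-packing formulation concrete, here is a minimal sketch of pairwise packing posed as an ILP, written with PuLP. The candidate pairs, savings estimates, and the single "each statement in at most one pack" constraint are illustrative placeholders; goSLP's actual encoding inside LLVM models far more (dependences, permutation costs, whole functions).

```python
# A hedged sketch: pairwise statement packing as an ILP (PuLP).
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

statements = ["s0", "s1", "s2", "s3"]
# (i, j) -> estimated cycles saved if i and j are packed into one vector op
savings = {("s0", "s1"): 3, ("s2", "s3"): 2, ("s0", "s2"): 1}

prob = LpProblem("statement_packing", LpMaximize)
pack = {p: LpVariable(f"pack_{p[0]}_{p[1]}", cat=LpBinary) for p in savings}

# Objective: maximize the total estimated savings of the chosen packs.
prob += lpSum(savings[p] * pack[p] for p in savings)

# Each statement may appear in at most one pack.
for s in statements:
    prob += lpSum(pack[p] for p in savings if s in p) <= 1

prob.solve()
print([p for p in savings if pack[p].value() == 1])  # e.g. the two best pairs
```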
Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks
Predicting the number of clock cycles a processor takes to execute a block of
assembly instructions in steady state (the throughput) is important for both
compiler designers and performance engineers. Building an analytical model to
do so is especially complicated in modern x86-64 Complex Instruction Set
Computer (CISC) machines with sophisticated processor microarchitectures in
that it is tedious, error prone, and must be performed from scratch for each
processor generation. In this paper we present Ithemal, the first tool which
learns to predict the throughput of a set of instructions. Ithemal uses a
hierarchical LSTM-based approach to predict throughput based on the opcodes
and operands of instructions in a basic block. We show that Ithemal is more
accurate than state-of-the-art hand-written tools currently used in compiler
backends and static machine code analyzers. In particular, our model has less
than half the error of state-of-the-art analytical models (LLVM's llvm-mca and
Intel's IACA). Ithemal is also able to predict these throughput values just as
fast as the aforementioned tools, and is easily ported across a variety of
processor microarchitectures with minimal developer effort.Comment: Published at 36th International Conference on Machine Learning (ICML)
201
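A minimal PyTorch sketch of the hierarchical idea follows: one LSTM embeds the token sequence of each instruction, and a second LSTM runs over the resulting instruction embeddings to emit a single throughput estimate. The vocabulary size, dimensions, and token IDs are illustrative, not Ithemal's actual configuration.

```python
import torch
import torch.nn as nn

class HierarchicalThroughputModel(nn.Module):
    def __init__(self, vocab_size=1024, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.token_lstm = nn.LSTM(dim, dim, batch_first=True)  # per instruction
        self.instr_lstm = nn.LSTM(dim, dim, batch_first=True)  # per basic block
        self.head = nn.Linear(dim, 1)

    def forward(self, block):
        # block: list of 1-D LongTensors, one token sequence per instruction
        instr_vecs = []
        for tokens in block:
            emb = self.embed(tokens).unsqueeze(0)   # (1, n_tokens, dim)
            _, (h, _) = self.token_lstm(emb)        # final hidden state
            instr_vecs.append(h[-1])                # (1, dim)
        seq = torch.stack(instr_vecs, dim=1)        # (1, n_instr, dim)
        _, (h, _) = self.instr_lstm(seq)
        return self.head(h[-1]).squeeze()           # predicted cycles

model = HierarchicalThroughputModel()
block = [torch.tensor([3, 17, 42]), torch.tensor([3, 8])]  # toy token IDs
print(model(block))
```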
Dias: Dynamic Rewriting of Pandas Code
In recent years, dataframe libraries, such as pandas, have exploded in
popularity. Due to their flexibility, they are increasingly used in ad-hoc
exploratory data analysis (EDA) workloads. These workloads are diverse,
including custom functions which can span libraries or be written in pure
Python. The majority of systems available to accelerate EDA workloads focus on
bulk-parallel workloads, which contain vastly different computational patterns,
typically within a single library. As a result, they can introduce excessive
overheads for ad-hoc EDA workloads due to their expensive optimization
techniques. Instead, we identify program rewriting as a lightweight technique
which can offer substantial speedups while also avoiding slowdowns. We
implemented our techniques in Dias, which rewrites notebook cells to be more
efficient for ad-hoc EDA workloads. We develop techniques for efficient
rewrites in Dias, including dynamic checking of preconditions under which
rewrites are correct and just-in-time rewrites for notebook environments. We
show that Dias can rewrite individual cells to be 57× faster compared to
pandas and 1909× faster compared to optimized systems such as modin.
Furthermore, Dias can accelerate whole notebooks by up to 3.6× compared
to pandas and 26.4× compared to modin.
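The flavor of a dynamically checked rewrite can be sketched as below: a sort-then-head pattern is replaced by the asymptotically cheaper nsmallest, guarded by a runtime precondition check. The specific rule and its guard are an illustrative example in the spirit of Dias, not a rule copied from the system.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def rewritten_topk(df, col, k):
    # Precondition check: nsmallest only matches sort_values().head()
    # semantics for numeric columns, so fall back to the original code
    # otherwise. Dynamic checks like this keep rewrites safe.
    if is_numeric_dtype(df[col]):
        return df.nsmallest(k, col)           # O(n) selection
    return df.sort_values(by=col).head(k)     # original O(n log n) code

df = pd.DataFrame({"x": [5.0, 1.0, 3.0, 2.0, 4.0]})
print(rewritten_topk(df, "x", 2))
```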
ConstraintFlow: A DSL for Specification and Verification of Neural Network Analyses
The uninterpretability of DNNs hinders their deployment to safety-critical
applications. Recent works have shown that Abstract-Interpretation-based formal
certification techniques provide promising avenues for building trust in DNNs
to some extent. The intricate mathematical background of Abstract
Interpretation poses two challenges: (i) easily designing the algorithms that
capture the intricate DNN behavior while balancing the cost vs. precision tradeoff,
and (ii) maintaining the over-approximation-based soundness of these
certifiers.
General-purpose programming languages like C++ provide extensive
functionality; however, verifying the soundness of the algorithms written in
them can be impractical. The most commonly used DNN certification libraries
like auto_LiRPA and ERAN prove the correctness of their analyses. However, they
consist of only a few hard-coded abstract domains and abstract transformers (or
transfer functions) and do not allow the user to define new analyses. Further,
these libraries can handle only specific DNN architectures.
To address these issues, we develop a declarative DSL -- ConstraintFlow --
that can be used to specify Abstract Interpretation-based DNN certifiers. In
ConstraintFlow, programmers can easily define various existing and new abstract
domains and transformers, all within just a few tens of lines of code as
opposed to the thousands of LOC in existing libraries. We also provide
lightweight automatic
verification, which can be used to ensure the over-approximation-based
soundness of the certifier code written in ConstraintFlow for arbitrary (but
bounded) DNN architectures. Using this automated verification procedure, for
the first time, we can verify the soundness of state-of-the-art DNN certifiers
for arbitrary DNN architectures, all within a few minutes. We prove the
soundness of our verification procedure and the completeness of a subset of
ConstraintFlow.
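ConstraintFlow itself is a DSL, not Python, but the kind of analysis it specifies can be illustrated by hand: an interval-domain abstract transformer for an affine layer followed by ReLU. Soundness here means the output box over-approximates every concrete output reachable from the input box; the weights and bounds below are toy values.

```python
import numpy as np

def affine_interval(lo, hi, W, b):
    # Split weights by sign so each output bound uses the right input bound.
    Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def relu_interval(lo, hi):
    return np.maximum(lo, 0), np.maximum(hi, 0)

W = np.array([[1.0, -2.0], [0.5, 1.0]])
b = np.array([0.0, -1.0])
lo, hi = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
lo, hi = relu_interval(*affine_interval(lo, hi, W, b))
print(lo, hi)  # sound lower/upper bounds on the layer's outputs
```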
CoMEt: x86 Cost Model Explanation Framework
ML-based program cost models have been shown to yield highly accurate
predictions. They have the capability to replace heavily-engineered analytical
program cost models in mainstream compilers, but their black-box nature
discourages their adoption. In this work, we propose the first method for
obtaining faithful and intuitive explanations for the throughput predictions
made by ML-based cost models. We demonstrate our explanations for the
state-of-the-art ML-based cost model, Ithemal. We compare the explanations for
Ithemal with the explanations for a hand-crafted, accurate analytical model,
uiCA. Our empirical findings show that high similarity between explanations for
Ithemal and uiCA usually corresponds to high similarity between their
predictions.
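CoMEt's exact procedure is not reproduced here; as a generic illustration of explaining a black-box cost model, the sketch below attributes a block's predicted throughput to individual instructions by measuring how the prediction changes when each one is deleted. The predict function is a toy stand-in for any cost model (Ithemal, llvm-mca, uiCA) exposed as a callable.

```python
def explain_by_deletion(block, predict):
    base = predict(block)
    scores = {}
    for i, instr in enumerate(block):
        reduced = block[:i] + block[i + 1:]
        scores[instr] = base - predict(reduced)  # contribution estimate
    return scores

def predict(block):  # hypothetical stand-in cost model
    return 1.0 * len(block) + 2.0 * block.count("div")

print(explain_by_deletion(["add", "mul", "div"], predict))
```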
FLuRKA: Fast fused Low-Rank & Kernel Attention
Many efficient approximate self-attention techniques have become prevalent
since the inception of the transformer architecture. Two popular classes of
these techniques are low-rank and kernel methods. Each of these methods has its
own strengths. We observe these strengths synergistically complement each other
and exploit these synergies to fuse low-rank and kernel methods, producing a
new class of transformers: FLuRKA (Fast Low-Rank and Kernel Attention). FLuRKA
provide sizable performance gains over these approximate techniques and are of
high quality. We theoretically and empirically evaluate both the runtime
performance and quality of FLuRKA. Our runtime analysis posits a variety of
parameter configurations where FLuRKA exhibit speedups and our accuracy
analysis bounds the error of FLuRKA with respect to full-attention. We
instantiate three FLuRKA variants which experience empirical speedups of up to
3.3x and 1.7x over low-rank and kernel methods respectively. This translates to
speedups of up to 30x over models with full-attention. With respect to model
quality, FLuRKA can match the accuracy of low-rank and kernel methods on GLUE
after pre-training on wiki-text 103. When pre-training on a fixed time budget,
FLuRKA yield better perplexity scores than models with full-attention.
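One plausible way such a fusion composes, sketched below under stated assumptions: a Linformer-style learned projection E shrinks keys and values from n to r tokens, and a Performer-style feature map then makes attention linear in sequence length. The elu-based feature map and shapes are illustrative choices, not the paper's exact construction.

```python
import torch

def phi(x):  # a simple positive feature map (assumption, not FLuRKA's)
    return torch.nn.functional.elu(x) + 1

def fused_lowrank_kernel_attention(Q, K, V, E):
    # E: (r, n) learned projection; Q, K, V: (n, d)
    Kr, Vr = E @ K, E @ V                        # low-rank: (r, d)
    Qf, Kf = phi(Q), phi(Kr)                     # kernelized queries/keys
    KV = Kf.T @ Vr                               # (d, d): O(r d^2), not O(n^2)
    norm = Qf @ Kf.sum(dim=0, keepdim=True).T    # (n, 1) normalizer
    return (Qf @ KV) / norm

n, d, r = 512, 64, 32
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
E = torch.randn(r, n) / n**0.5
print(fused_lowrank_kernel_attention(Q, K, V, E).shape)  # (512, 64)
```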
Input-sensitive dense-sparse primitive compositions for GNN acceleration
Graph neural networks (GNN) have become an important class of neural network
models that have gained popularity in domains such as social and financial
network analysis. Different phases of GNN computations can be modeled using
both dense and sparse matrix operations. There have been many frameworks and
optimization techniques proposed in the literature to accelerate GNNs. However,
getting consistently high performance across many input graphs with different
sparsity patterns and GNN embedding sizes has remained difficult.
In this paper, we propose different algebraic reassociations of GNN
computations that lead to novel dense and sparse matrix primitive selections
and compositions. We show that the profitability of these compositions depends
on the input graph, embedding size, and the target hardware. We developed
SENSEi, a system that uses a data-driven adaptive strategy to select the best
composition given the input graph and GNN embedding sizes. Our evaluations on a
wide range of graphs and embedding sizes show that SENSEi achieves geomean
speedups on graph convolutional networks and on graph attention networks,
on both CPUs and GPUs, over the widely used
Deep Graph Library. Further, we show that the compositions yield notable
synergistic performance benefits on top of other established sparse
optimizations such as sparse matrix tiling by evaluating against a well-tuned
baseline.
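A concrete instance of such an algebraic reassociation for GCN aggregation: (A @ X) @ W and A @ (X @ W) compute the same layer, but which ordering is faster depends on the graph's sparsity and on whether W shrinks or grows the embedding. The flop-count heuristic below is a deliberately crude stand-in for SENSEi's learned, data-driven selector.

```python
import numpy as np
import scipy.sparse as sp

def gcn_layer(A, X, W):
    n, f_in = X.shape
    f_out = W.shape[1]
    nnz = A.nnz
    # Flop estimates for the two orderings of the same computation.
    cost_agg_first = nnz * f_in + n * f_in * f_out      # (A @ X) @ W
    cost_update_first = n * f_in * f_out + nnz * f_out  # A @ (X @ W)
    if cost_update_first < cost_agg_first:
        return A @ (X @ W)
    return (A @ X) @ W

A = sp.random(1000, 1000, density=0.01, format="csr")
X, W = np.random.rand(1000, 256), np.random.rand(256, 16)
print(gcn_layer(A, X, W).shape)  # W shrinks features, so X @ W goes first
```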
Learning Large Graph Property Prediction via Graph Segment Training
Learning to predict properties of large graphs is challenging because each
prediction requires the knowledge of an entire graph, while the amount of
memory available during training is bounded. Here we propose Graph Segment
Training (GST), a general framework that utilizes a divide-and-conquer approach
to allow learning large graph property prediction with a constant memory
footprint. GST first divides a large graph into segments and then
backpropagates through only a few segments sampled per training iteration. We
refine the GST paradigm by introducing a historical embedding table to
efficiently obtain embeddings for segments not sampled for backpropagation. To
mitigate the staleness of historical embeddings, we design two novel
techniques. First, we finetune the prediction head to fix the input
distribution shift. Second, we introduce Stale Embedding Dropout to drop some
stale embeddings during training to reduce bias. We evaluate our complete
method GST-EFD (with all the techniques together) on two large graph property
prediction benchmarks: MalNet and TpuGraphs. Our experiments show that GST-EFD
is both memory-efficient and fast, while offering a slight boost on test
accuracy over a typical full-graph training regime.
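The sampling idea can be sketched in a few lines of PyTorch: only a few segments are recomputed with gradients each step, the rest are read from a historical embedding table, and stale entries are randomly dropped. The encoder, table update rule, and segment counts are illustrative, not GST-EFD's exact recipe.

```python
import random
import torch

def gst_step(segments, encoder, head, table, k=2, stale_dropout=0.5):
    live = random.sample(range(len(segments)), k)  # segments to backprop
    embs = []
    for i, seg in enumerate(segments):
        if i in live:
            e = encoder(seg)              # fresh embedding, with gradients
            table[i] = e.detach()         # refresh historical entry
        else:
            if random.random() < stale_dropout:
                continue                  # Stale Embedding Dropout
            e = table[i]                  # cached embedding, no backprop
        embs.append(e)
    graph_emb = torch.stack(embs).mean(dim=0)
    return head(graph_emb)                # whole-graph property prediction

encoder = torch.nn.Linear(8, 16)          # toy per-segment encoder
head = torch.nn.Linear(16, 1)
segments = [torch.randn(8) for _ in range(6)]
table = {i: torch.zeros(16) for i in range(6)}
print(gst_step(segments, encoder, head, table))
```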
Automatically Harnessing Sparse Acceleration
Sparse linear algebra is central to many scientific programs, yet compilers
fail to optimize it well. High-performance libraries are available, but
adoption costs are significant. Moreover, libraries tie programs into
vendor-specific software and hardware ecosystems, creating non-portable code.
In this paper, we develop a new approach based on our specification Language
for implementers of Linear Algebra Computations (LiLAC). Rather than requiring
the application developer to (re)write every program for a given library, the
burden is shifted to a one-off description by the library implementer. The
LiLAC-enabled compiler uses this to insert appropriate library routines without
source code changes.
LiLAC provides automatic data marshaling, maintaining state between calls and
minimizing data transfers. Appropriate places for library insertion are
detected in compiler intermediate representation, independent of source
languages.
We evaluated on large-scale scientific applications written in FORTRAN;
standard C/C++ and FORTRAN benchmarks; and C++ graph analytics kernels. Across
heterogeneous platforms, applications and data sets we show speedups of
1.1× to over 10× without user intervention.
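LiLAC specifications target compiler IR, not Python, so the sketch below is conceptual only: a one-off description pairs a computation pattern (a hand-written CSR sparse matrix-vector loop) with an equivalent library routine, and a matcher would substitute the latter without touching application source. All names here are hypothetical.

```python
import numpy as np
import scipy.sparse as sp

def naive_spmv(vals, cols, row_ptr, x):   # the pattern a compiler detects
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[cols[k]]
    return y

def library_spmv(vals, cols, row_ptr, x):  # the routine that gets inserted
    n = len(row_ptr) - 1
    return sp.csr_matrix((vals, cols, row_ptr), shape=(n, n)) @ x

A = sp.random(100, 100, density=0.05, format="csr")
x = np.random.rand(100)
# Both compute the same result; the library version is what a
# LiLAC-enabled compiler would substitute at the IR level.
assert np.allclose(naive_spmv(A.data, A.indices, A.indptr, x),
                   library_spmv(A.data, A.indices, A.indptr, x))
```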
