20 research outputs found
Balancing Performance and Productivity for the Development of Dynamic Binary Instrumentation Tools - A Case Study on Arm Systems
Dynamic Binary Instrumentation (DBI) is a well-established approach for analysing the execution of applications at the level of machine code. DBI frameworks implement a runtime system capable of modifying running applications without access to their source code. These frameworks provide APIs used by DBI tools to plug in their specific analysis and instrumentation routines. However, the dynamic instrumentation needed by these DBI tools is either challenging to implement, and/or introduces a significant performance overhead.An added complexity beyond the well studied scenario of x86 and x86-64, is that state-of-the-art Arm systems (i.e. Arm v8) introduced a distinct 64-bit execution mode with a new redesigned instruction set. Thus, Arm v8 is a computer architecture which contains three instruction sets. This further complicates the development of DBI tools which can work for both 32-bit Arm (includes the A32 and T32 instruction sets), and 64-bit (the A64 instruction set).This paper presents the design of a novel DBI framework API that provides support both for portable (across A32, T32 and A64), and for native-code-level analysis and instrumentation, which can be intermixed freely. This API allows DBI tool developers to balance performance and productivity at a fine-grain level. The API is implemented on top of the MAMBO DBI system
Navigating the Landscape for Real-time Localisation and Mapping for Robotics, Virtual and Augmented Reality
Visual understanding of 3D environments in real-time, at low power, is a huge
computational challenge. Often referred to as SLAM (Simultaneous Localisation
and Mapping), it is central to applications spanning domestic and industrial
robotics, autonomous vehicles, virtual and augmented reality. This paper
describes the results of a major research effort to assemble the algorithms,
architectures, tools, and systems software needed to enable delivery of SLAM,
by supporting applications specialists in selecting and configuring the
appropriate algorithm and the appropriate hardware, and compilation pathway, to
meet their performance, accuracy, and energy consumption goals. The major
contributions we present are (1) tools and methodology for systematic
quantitative evaluation of SLAM algorithms, (2) automated,
machine-learning-guided exploration of the algorithmic and implementation
design space with respect to multiple objectives, (3) end-to-end simulation
tools to enable optimisation of heterogeneous, accelerated architectures for
the specific algorithmic requirements of the various SLAM algorithmic
approaches, and (4) tools for delivering, where appropriate, accelerated,
adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.Comment: Proceedings of the IEEE 201
Optimising Dynamic Binary Modification Across ARM Microarchitectures
Dynamic Binary Modification (DBM) is a technique for modifying applications transparently while they are executed, working at the level of native code. However, DBM introduces a performance overhead, which in some cases can dominate execution time, making many uses impractical. The ARM hardware ecosystem poses unique challenges for high performance DBM systems because of the large number and wide range of capabilities of the commercially available implementations: from single issue, in order cores up to 6-issue out-of-order cores and including less traditional implementations. These variations raise the question of whether it is possible to develop DBM optimisations which either improve or, at the very least, do not affect performance on all available systems and microarchitectures. To answer this question, the performance of three new optimisations for the MAMBO DBM system has been evaluated on five systems using different microarchitectures. For comparison, the overhead of DynamoRIO, a high performance DBM system which was recently ported to the ARM architecture, is also evaluated
Low overhead dynamic binary translation on ARM
The ARMv8 architecture introduced AArch64, a 64-bit execution mode with a new instruction set, while retaining binary compatibility with previous versions of the ARM architecture through AArch32, a 32-bit execution mode. Most hardware implementations of ARMv8 processors support both AArch32 and AArch64, which comes at a cost in hardware complexity.
We present MAMBO-X64, a dynamic binary translator for Linux which executes 32-bit ARM binaries using only the AArch64 instruction set. We have evaluated the performance of MAMBO-X64 on three existing ARMv8 processors which support both AArch32 and AArch64 instruction sets. The performance was measured by comparing the running time of 32-bit benchmarks running under MAMBO-X64 with the same benchmark running natively. On SPEC CPU2006, we achieve a geometric mean overhead of less than 7.5% on in-order Cortex-A53 processors and a performance improvement of 1% on out-of-order X-Gene 1 processors.
MAMBO-X64 achieves such low overhead by novel optimizations to map AArch32 floating-point registers to AArch64 registers dynamically, handle overflowing address calculations efficiently, generate traces that harness hardware return address prediction, and handle operating system signals accurately.</jats:p
MAMBO: A low overhead dynamic binary modification tool for ARM
As the ARM architecture expands beyond its traditional embedded domain, there is a growing interest in dynamic binary modification (DBM) tools for general-purpose multicore processors that are part of the ARM family. Existing DBM tools for ARM suffer from introducing large overheads in the execution of applications. The specific questions that this article addresses are (i) how to develop such DBM tools for the ARM architecture and (ii) whether new optimisations are plausible and needed.
We describe the general design of MAMBO, a new DBM tool for ARM, which we release together with this publication, and introduce novel optimisations to handle indirect branches. In addition, we explore scenarios in which it may be possible to relax the transparency offered by DBM tools to allow extra optimisations to be applied. These scenarios arise from analysing the most typical usages: for example, application binaries without handcrafted assembly.
The performance evaluation shows that MAMBO introduces small overheads for SPEC CPU2006 and PARSEC 3.0 when comparing with the execution times of the unmodified programs: a geometric mean overhead of 28% on a Cortex-A9 and of 34% on a Cortex-A15 for CPU2006, and between 27% and 32%, depending on the number of threads, for PARSEC on a Cortex-A15.</jats:p
Evaluating the Impact of Optimizations for Dynamic Binary Modification on 64-bit RISC-V
Dynamic Binary Modification (DBM) is an important technique used in computer architecture simulators, virtualization, and program analysis, to name a few examples. The software ecosystem of RISC-V is maturing at pace, but is still missing a high-performance, optimized DBM. Addressing this requirement is key to improving the overall software ecosystem.This paper presents a comprehensive performance evaluation study for a DBM (MAMBO) which has been ported and optimized for 64-bit RISC-V. The main optimizations for DBM on RISC architectures have been implemented and tuned for RISC-V to address specific architectural features. For example, jump trampolines have been specifically developed to address the short direct branch range specified by the RISC-V ISA.The evaluation shows that for SPEC CPU2006 the geometric mean overhead is of 14.5%, with SPECint having the largest contribution with a geometric mean of 28.5%, while SPECfp has only an overhead of 5.6%. Concretely, this results in a reduction in runtime for h264ref from over 75 hours using the baseline DBM, to 2.2 hours with optimizations applied
Optimizing Indirect Branches in Dynamic Binary Translators
Dynamic binary translation is a technology for transparently translating and modifying a program at the machine code level as it is running. A significant factor in the performance of a dynamic binary translator is its handling of indirect branches. Unlike direct branches, which have a known target at translation time, an indirect branch requires translating a source program counter address to a translated program counter address every time the branch is executed. This translation can impose a serious runtime penalty if it is not handled efficiently.
MAMBO-X64, a dynamic binary translator that translates 32-bit ARM (AArch32) code to 64-bit ARM (AArch64) code, uses three novel techniques to improve the performance of indirect branch translation. Together, these techniques allow MAMBO-X64 to achieve a very low performance overhead of only 10% on average compared to native execution of 32-bit programs.
Hardware-assisted function returns
use a software return address stack to predict the targets of function returns, making use of several novel optimizations while also exploiting hardware return address prediction. This technique has a significant impact on most benchmarks, reducing binary translation overhead compared to native execution by 40% on average and by 90% on some benchmarks.
Branch table inference
, an algorithm for detecting and translating branch tables, can reduce the overhead of translated code by up to 40% on some SPEC CPU2006 benchmarks. The remaining indirect branches are handled using a
fast atomic hash table
, which is optimized to work with multiple threads. This last technique translates indirect branches using a single shared hash table while avoiding expensive synchronization in performance-critical lookup code. This allows the performance to be on par with thread-private hash tables while having superior memory scalability.
</jats:p
