AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
CPU-FPGA heterogeneous architectures are attracting ever-increasing attention
in an attempt to advance computational capabilities and energy efficiency in
today's datacenters. These architectures provide programmers with the ability
to reprogram the FPGAs for flexible acceleration of many workloads.
Nonetheless, this advantage is often overshadowed by the poor programmability
of FPGAs, which conventionally requires register-transfer-level (RTL) design
expertise. Although recent advances in high-level synthesis (HLS)
significantly improve FPGA programmability, programmers are still left facing the challenge of
identifying the optimal design configuration in a tremendous design space.
This paper aims to address this challenge and pave the path from software
programs towards high-quality FPGA accelerators. Specifically, we first propose
the composable, parallel and pipeline (CPP) microarchitecture as a template of
accelerator designs. Such a well-defined template is able to support efficient
accelerator designs for a broad class of computation kernels, and more
importantly, drastically reduce the design space. Also, we introduce an
analytical model to capture the performance and resource trade-offs among
different design configurations of the CPP microarchitecture, which lays the
foundation for fast design space exploration. On top of the CPP
microarchitecture and its analytical model, we develop the AutoAccel framework
to make the entire accelerator generation automated. AutoAccel accepts a
software program as an input and performs a series of code transformations
based on the result of the analytical-model-based design space exploration to
construct the desired CPP microarchitecture. Our experiments show that the
AutoAccel-generated accelerators outperform their corresponding software
implementations by an average of 72x for a broad class of computation kernels.
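To make the flow concrete, here is a minimal sketch of analytical-model-based design space exploration in the style the abstract describes: enumerate candidate configurations, score each with a closed-form latency and resource model, and keep the fastest feasible one. The model, constants, and parameter names below are illustrative assumptions, not AutoAccel's actual model.

```python
from itertools import product

# A hypothetical analytical model for a CPP-style accelerator: latency shrinks
# as parallel processing elements (PEs) and unrolling increase, while resource
# usage grows. All constants are illustrative, not AutoAccel's actual model.
def latency(n_pe: int, unroll: int, trip_count: int = 4096) -> float:
    pipeline_fill = 10                       # cycles to fill the pipeline
    return pipeline_fill + trip_count / (n_pe * unroll)

def resources(n_pe: int, unroll: int) -> float:
    return n_pe * unroll * 0.02              # fraction of the FPGA fabric used

def explore(budget: float = 0.8):
    """Score every configuration with the analytical model; keep the best."""
    best = None
    for n_pe, unroll in product([1, 2, 4, 8, 16], [1, 2, 4, 8]):
        if resources(n_pe, unroll) > budget:
            continue                         # prune designs that do not fit
        lat = latency(n_pe, unroll)
        if best is None or lat < best[0]:
            best = (lat, n_pe, unroll)
    return best

print(explore())   # (138.0, 4, 8): best latency, PE count, unroll factor
```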
Placement-Driven Technology Mapping for LUT-Based FPGAs
In this paper, we study the problem of placement-driven technology mapping for table-lookup-based FPGA architectures to optimize circuit performance. Early work on technology mapping for FPGAs, such as Chortle-d [14] and Flowmap [3], aims to optimize the depth of the mapped solution without consideration of interconnect delay. Later works such as Flowmap-d [7], Bias-Clus [4] and EdgeMap consider interconnect delays during mapping, but do not take into consideration the effects of their mapping solution on the final placement. Our work focuses on the interaction between the mapping and placement stages. First, interconnect delay information is estimated from the placement and used during the labeling process. A placement-based mapping solution that considers both global and local cell congestion is then developed. Finally, a legalization step and detailed placement are performed to realize the design. We have implemented our algorithm in a LUT-based FPGA technology-mapping package named PDM (Placement-Driven Mapping) and tested the implementation on a set of MCNC benchmarks. We use the tool VPR [1][2] for placement and routing of the mapped netlist. Experimental results show that the longest path delay on a set of large MCNC benchmarks decreased by 12.3% on average.
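As a rough illustration of how placement-derived interconnect delays can enter a labeling pass, here is a simplified sketch. Real FlowMap-style labeling computes min-height K-feasible cuts per node; this toy recurrence only shows where a placement-estimated wire delay (here, a hypothetical Manhattan-distance model) is added to fanin labels.

```python
# A much-simplified flavor of delay-aware labeling: each node's arrival label
# is the max over fanins of (fanin label + interconnect delay estimated from
# a placement), plus one LUT level. This is not the PDM algorithm itself,
# only a sketch of where placement information enters the recurrence.
def manhattan_delay(placement, u, v, unit_delay=0.1):
    (x1, y1), (x2, y2) = placement[u], placement[v]
    return unit_delay * (abs(x1 - x2) + abs(y1 - y2))

def label_nodes(fanins, placement, topo_order):
    label = {}
    for v in topo_order:
        if not fanins[v]:                  # primary input
            label[v] = 0.0
        else:
            label[v] = 1.0 + max(          # one LUT level ...
                label[u] + manhattan_delay(placement, u, v)
                for u in fanins[v]         # ... plus placed-wire delay
            )
    return label

fanins = {"a": [], "b": [], "c": ["a", "b"], "d": ["c", "b"]}
placement = {"a": (0, 0), "b": (2, 0), "c": (1, 1), "d": (1, 3)}
print(label_nodes(fanins, placement, ["a", "b", "c", "d"]))
# {'a': 0.0, 'b': 0.0, 'c': 1.2, 'd': 2.4}
```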
Angular clustering of galaxies at 3.6 microns from the Spitzer Wide-area Infrared Extragalactic (SWIRE) Survey
We present the first analysis of large-scale clustering from the Spitzer Wide-area Infrared Extragalactic legacy survey (SWIRE). We compute the angular correlation function of galaxies selected to have 3.6 μm fluxes brighter than 32 μJy in three fields totaling 2 deg² in area. In each field we detect clustering with a high level of significance. The amplitude and slope of the correlation function are consistent between the three fields and are modeled as w(θ) = Aθ^(1−δ) with A = (0.6 ± 0.3) × 10⁻³ and δ = 2.03 ± 0.10. With a fixed slope of δ = 1.8, we obtain an amplitude of A = (1.7 ± 0.1) × 10⁻³. Assuming an equivalent depth of K ≈ 18.7 mag, we find that our errors are smaller but our results are consistent with existing clustering measurements in K-band surveys and with stable clustering models. We estimate our median redshift to be z ≈ 0.75, which allows us to obtain an estimate of the three-dimensional correlation function ξ(r), for which we find r₀ = 4.4 ± 0.1 h⁻¹ Mpc.
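For readers unfamiliar with how such angular correlation functions are measured, below is a minimal sketch of the standard Landy-Szalay estimator, w(θ) = (DD − 2DR + RR)/RR, applied to small mock catalogs. The catalogs, field size, and binning are invented for illustration; survey masks, weights, and the integral-constraint correction are all ignored.

```python
import numpy as np

# Brute-force pair counting for the Landy-Szalay estimator on mock catalogs.
def pair_counts(a, b, bins, cross=False):
    # Angular separations via the flat-sky approximation (small fields).
    d = np.hypot(a[:, None, 0] - b[None, :, 0], a[:, None, 1] - b[None, :, 1])
    if not cross:                      # exclude self-pairs and double counts
        d = d[np.triu_indices(len(a), k=1)]
    return np.histogram(d.ravel(), bins=bins)[0]

rng = np.random.default_rng(0)
data = rng.uniform(0, 1.4, size=(500, 2))      # mock field, ~2 deg^2
rand = rng.uniform(0, 1.4, size=(2000, 2))     # random comparison catalog
bins = np.logspace(-2, 0, 8)                   # angular bins in degrees

dd = pair_counts(data, data, bins) / (500 * 499 / 2)
rr = pair_counts(rand, rand, bins) / (2000 * 1999 / 2)
dr = pair_counts(data, rand, bins, cross=True) / (500 * 2000)
w = (dd - 2 * dr + rr) / rr                    # Landy-Szalay w(theta)
print(np.round(w, 3))                          # ~0 for this unclustered mock
```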
Staff Time and Motion Assessment for Administration of Erythropoiesis-Stimulating Agents: A Two-Phase Pilot Study in Clinical Oncology Practices
BACKGROUND: Erythropoiesis-stimulating agents (ESAs) are used for the management of anaemia in patients with non-myeloid malignancies where anaemia is due to the effect of concomitant myelosuppressive chemotherapy. Assessing the impact of different ESA dosing regimens on office staff time and projected labour costs is an important component of understanding the potential for optimization of oncology practice efficiencies. OBJECTIVES: A two-phase study was conducted to evaluate staff time and labour costs directly associated with ESA administration in real-world oncology practice settings among cancer patients undergoing chemotherapy. The objective of Phase 1 was to determine the mean staff time required for the process of ESA administration in patients with anaemia due to concomitantly administered chemotherapy. The objective of Phase 2 was to quantify and compare the mean staff time and mean labour costs of ESA administered once weekly (qw) with ESA administered once every 3 weeks (q3w) over an entire course of chemotherapy. METHODS: Phase 1 was a prospective, cross-sectional time and motion study conducted in six private oncology practices in the US, based on nine steps associated with ESA administration. Using findings from Phase 1, Phase 2 was conducted as a retrospective chart review to collect data on the number and types of visits in two private oncology practices for patients receiving a complete course of myelosuppressive chemotherapy. RESULTS: In Phase 1, the mean total time that clinic staff spent on ESA administration was 23.2 min for patient visits that included chemotherapy administration (n(chemo) = 37) and 21.5 min when only ESA was administered (n(ESAonly) = 36). In Phase 2, the mean duration of treatment was significantly longer for q3w than qw (53.84 days for qw vs. 113.38 days for q3w, p < 0.0001); thus, analyses were adjusted for episode duration using analysis of covariance (ANCOVA) for between-group comparisons. Following adjustment by ANCOVA, qw darbepoetin alfa (DA) patients (n(qw) = 83) required more staff time for ESA + chemotherapy visits and ESA-only visits than q3w patients (n(q3w) = 118) over a course of chemotherapy. Overall, mean total staff time expended per chemotherapy course was greater for patients receiving qw versus q3w DA. Weekly DA dosing was also associated with greater projected mean labour costs (US$31.20 [average for 2007–2010]). CONCLUSIONS: The results from this real-world study demonstrate that oncology practices can attain staff time and labour cost savings through the use of q3w ESA. The degree of savings depends on the individual oncology practice's staffing model and ESA administration processes, including those that allow for optimized synchronization of patient visits for ESA and chemotherapy administration. Additional research using standard ESA administration protocols over longer periods and with a larger number of oncology practices and patients should be conducted to confirm these findings.
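As an illustration of the kind of duration-adjusted comparison the study reports, here is a hedged sketch of an ANCOVA in Python using statsmodels. The data are entirely synthetic; the group effect, covariate relationship, and sample sizes are made up and do not reproduce the study's numbers.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic ANCOVA: compare staff time between qw and q3w dosing while
# adjusting for episode duration as a covariate. All numbers are invented.
rng = np.random.default_rng(1)
n = 200
duration = np.concatenate([rng.normal(54, 10, n // 2),    # qw episodes
                           rng.normal(113, 20, n // 2)])  # q3w episodes
group = np.repeat(["qw", "q3w"], n // 2)
staff_time = 2.0 * duration + np.where(group == "qw", 60, 0) \
             + rng.normal(0, 15, n)

df = pd.DataFrame({"staff_time": staff_time, "group": group,
                   "duration": duration})
fit = smf.ols("staff_time ~ C(group, Treatment('q3w')) + duration", df).fit()
print(fit.params)   # duration-adjusted qw vs. q3w staff-time difference
```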
Optimal Layout Synthesis for Quantum Computing
Recent years have witnessed the fast development of quantum computing.
Researchers around the world are eager to run larger and larger quantum
algorithms that promise speedups impossible to any classical algorithm.
However, the available quantum computers are still volatile and error-prone.
Thus, layout synthesis, which transforms quantum programs to meet these
hardware limitations, is a crucial step in the realization of quantum
computing. In this paper, we present two synthesizers, one optimal and one
approximate but nearly optimal. Although a few optimal approaches to this
problem have been published, our optimal synthesizer explores a larger solution
space and is thus optimal in a stronger sense. In addition, it reduces time and
space complexity exponentially compared to some leading optimal approaches. The
key to this success is a more efficient spacetime-based variable encoding of
the layout synthesis problem as a mathematical programming problem. By slightly
changing our formulation, we arrive at an approximate synthesizer that is even
more efficient and outperforms some leading heuristic approaches by up to 100%
in additional gate cost and by up to 10x in fidelity on a comprehensive set of
benchmark programs and architectures. For a specific
family of quantum programs named QAOA, which is deemed to be a promising
application for near-term quantum computers, we further adjust the approximate
synthesizer by taking commutation into consideration, achieving up to 75%
reduction in depth and up to 65% reduction in additional cost compared to the
tool used in a leading QAOA study.
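To give a flavor of encoding layout synthesis as a constraint problem, here is a toy SMT sketch with the Z3 solver: it finds an injective logical-to-physical mapping under which every two-qubit gate acts on coupled qubits. The device coupling graph and gate list are invented, and real synthesizers additionally model time steps and SWAP insertion.

```python
from z3 import Int, Solver, Distinct, Or, And, sat

# Toy flavor of SMT-based layout synthesis: choose an injective mapping of
# logical qubits to physical qubits so that every two-qubit gate acts on an
# adjacent physical pair. Device and circuit below are made up.
coupling = {(0, 1), (1, 2), (2, 3), (1, 3)}          # device edge list
adjacent = coupling | {(b, a) for a, b in coupling}
gates = [(0, 1), (1, 2), (0, 2)]                     # logical CX gates

pi = [Int(f"pi_{q}") for q in range(3)]              # logical -> physical
s = Solver()
s.add([And(0 <= p, p <= 3) for p in pi])             # 4 physical qubits
s.add(Distinct(pi))                                  # injective mapping
for a, b in gates:                                   # each gate needs adjacency
    s.add(Or([And(pi[a] == u, pi[b] == v) for u, v in adjacent]))

if s.check() == sat:
    m = s.model()
    print({q: m[pi[q]].as_long() for q in range(3)})  # e.g. {0: 1, 1: 2, 2: 3}
```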
Optimal Qubit Mapping with Simultaneous Gate Absorption
Before quantum error correction (QEC) is achieved, quantum computers focus on
noisy intermediate-scale quantum (NISQ) applications. Compared to the
well-known quantum algorithms requiring QEC, like Shor's or Grover's algorithm,
NISQ applications have different structures and properties to exploit in
compilation. A key step in compilation is mapping the qubits in the program to
physical qubits on a given quantum computer, which has been shown to be an
NP-hard problem. In this paper, we present OLSQ-GA, an optimal qubit mapper
with a key feature of simultaneous SWAP gate absorption during qubit mapping,
which we show to be a very effective optimization technique for NISQ
applications. For the class of quantum approximate optimization algorithm
(QAOA), an important NISQ application, OLSQ-GA reduces depth by up to 50.0% and
SWAP count by 100% compared to other state-of-the-art methods, which translates
to 55.9% fidelity improvement. The solution optimality of OLSQ-GA is achieved
by the exact SMT formulation. For better scalability, we augment our approach
with additional constraints in the form of initial mapping or alternating
matching, which speeds up OLSQ-GA by up to 272X with little or no loss of
optimality.
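The following toy sketch illustrates the intuition behind SWAP absorption: when a SWAP can be merged with an adjacent commuting interaction (as in QAOA layers), routing updates the qubit mapping without paying the usual three-CX cost of a standalone SWAP. The bookkeeping below is a simplification, not OLSQ-GA's SMT formulation.

```python
# Toy illustration of SWAP absorption: ZZ-interaction gates in a QAOA layer
# commute, so a SWAP needed for routing can be merged with an adjacent ZZ
# pair instead of costing three extra CX gates. Here we only track how an
# absorbed SWAP updates the logical-to-physical mapping for free.
def apply_swap(mapping, p, q, absorbed):
    """Swap physical qubits p and q in the mapping; absorbed SWAPs cost 0."""
    inv = {phys: log for log, phys in mapping.items()}
    mapping[inv[p]], mapping[inv[q]] = q, p
    return 0 if absorbed else 3             # 3 CX gates for a standalone SWAP

mapping = {0: 0, 1: 1, 2: 2}                # logical -> physical
cost = apply_swap(mapping, 1, 2, absorbed=True)
print(mapping, "extra CX:", cost)           # {0: 0, 1: 2, 2: 1} extra CX: 0
```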
Thermal-aware Cell and through-silicon-via Co-placement for 3D ICs
Existing thermal-aware 3D placement methods assume that the temperature of 3D ICs can be optimized by properly distributing the power dissipation, ignoring the heat conductivity of through-silicon vias (TSVs). However, our study indicates that this is not exactly correct. While considering the thermal effect of TSVs during placement appears to be quite complicated, we are able to prove that the peak temperature is minimized when the TSV area in each bin is proportional to the lumped power consumption of that bin together with the bins in all the tiers directly above it. Based on this criterion, we implement a thermal-aware 3D placement tool. Compared to methods that prefer a uniform power distribution, which achieve only an 8% peak-temperature reduction, our method reduces the peak temperature by 34% on average with even slightly less wirelength overhead. These results suggest that considering the thermal effects of TSVs during the placement stage is both necessary and effective. To the best of the authors' knowledge, this is the first thermal-aware 3D placement tool that directly takes into consideration the thermal and area impact of TSVs.
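A small sketch of the stated allocation criterion may help: for each bin, lump the power of that bin with the bins directly above it in upper tiers, then distribute a per-tier TSV area budget proportionally. The grid size, power values, and budget below are illustrative.

```python
import numpy as np

# Sketch of the allocation criterion: TSV area in each bin is proportional
# to the lumped power of that bin plus the bins directly above it in all
# upper tiers. Grid, powers, and the TSV area budget are illustrative.
power = np.array([                     # power[tier, y, x], tier 0 is bottom
    [[0.4, 0.1], [0.2, 0.3]],
    [[0.1, 0.5], [0.3, 0.2]],
    [[0.2, 0.2], [0.1, 0.4]],
])

# Cumulative power from the top tier down: bin (t, y, x) lumps every tier >= t.
lumped = np.cumsum(power[::-1], axis=0)[::-1]

total_tsv_area = 100.0                 # arbitrary area units per tier
for t in range(power.shape[0] - 1):    # the top tier needs no TSVs through it
    share = lumped[t] / lumped[t].sum()
    print(f"tier {t} TSV area:\n", np.round(share * total_tsv_area, 1))
```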
Automatic Hardware Pragma Insertion in High-Level Synthesis: A Non-Linear Programming Approach
High-Level Synthesis enables the rapid prototyping of hardware accelerators,
by combining a high-level description of the functional behavior of a kernel
with a set of micro-architecture optimizations as inputs. Such optimizations
can be described by inserting pragmas, e.g., for pipelining and replication of units,
or even higher level transformations for HLS such as automatic data caching
using the AMD/Xilinx Merlin compiler. Selecting the best combination of
pragmas, even within a restricted set, remains particularly challenging and the
typical state-of-practice uses design-space exploration to navigate this space.
But due to the highly irregular performance distribution of pragma
configurations, typical DSE approaches are either extremely time-consuming or
operate on a severely restricted search space. This work proposes a framework
to automatically insert HLS pragmas in regular loop-based programs, supporting
pipelining, unit replication, and data caching. We develop an analytical
performance and resource model as a function of the input program properties
and pragmas inserted, using non-linear constraints and objectives. We prove
this model provides a lower bound on the actual performance after HLS. We then
encode this model as a Non-Linear Program (NLP) by making the pragma
configuration the unknowns of the system; an optimal configuration is then
computed by solving this NLP. This
approach can also be used during DSE, to quickly prune points with a (possibly
partial) pragma configuration, driven by lower bounds on achievable latency. We
extensively evaluate our end-to-end, fully implemented system, showing it can
effectively manipulate spaces of billions of designs in seconds to minutes for
the kernels evaluated.
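As a sketch of how an analytical model can drive pruning during DSE, the toy example below scores unroll-factor configurations of a hypothetical two-loop kernel with a made-up non-linear latency model, and skips whole subtrees whose lower bound cannot beat the incumbent. None of the constants come from the paper.

```python
# Toy analytical latency/resource model in the spirit of the abstract:
# latency is a non-linear function of the unroll factors of a hypothetical
# two-deep loop nest, and a lower bound prunes partial configurations.
TRIP = (128, 64)                # trip counts of the two loops
DSP_BUDGET = 256
FACTORS = (1, 2, 4, 8, 16)

def latency(u1, u2):
    ii = max(1, (u1 * u2) // 32)                   # port-contention penalty
    return (TRIP[0] // u1) * (TRIP[1] // u2) * ii + 20

def dsps(u1, u2):
    return 4 * u1 * u2

def lower_bound(u1):
    # Best case for a partial configuration: assume an ideal inner unroll.
    return min(latency(u1, u2) for u2 in FACTORS)

best, best_cfg = float("inf"), None
for u1 in FACTORS:
    if lower_bound(u1) >= best:
        continue                                   # prune the whole subtree
    for u2 in FACTORS:
        if dsps(u1, u2) <= DSP_BUDGET and latency(u1, u2) < best:
            best, best_cfg = latency(u1, u2), (u1, u2)
print(best_cfg, best)                              # (2, 16) 276
```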
