
    AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

    CPU-FPGA heterogeneous architectures are attracting ever-increasing attention in an attempt to advance computational capabilities and energy efficiency in today's datacenters. These architectures provide programmers with the ability to reprogram the FPGAs for flexible acceleration of many workloads. Nonetheless, this advantage is often overshadowed by the poor programmability of FPGAs, whose programming has conventionally been an RTL design practice. Although recent advances in high-level synthesis (HLS) significantly improve FPGA programmability, programmers are still left facing the challenge of identifying the optimal design configuration in an enormous design space. This paper aims to address this challenge and pave the path from software programs to high-quality FPGA accelerators. Specifically, we first propose the composable, parallel and pipeline (CPP) microarchitecture as a template for accelerator designs. Such a well-defined template is able to support efficient accelerator designs for a broad class of computation kernels and, more importantly, drastically reduces the design space. We also introduce an analytical model to capture the performance and resource trade-offs among different design configurations of the CPP microarchitecture, which lays the foundation for fast design space exploration. On top of the CPP microarchitecture and its analytical model, we develop the AutoAccel framework to automate the entire accelerator generation process. AutoAccel accepts a software program as input and performs a series of code transformations, based on the results of analytical-model-driven design space exploration, to construct the desired CPP microarchitecture. Our experiments show that AutoAccel-generated accelerators outperform their corresponding software implementations by an average of 72x for a broad class of computation kernels.
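
    A minimal sketch of the kind of analytical-model-driven design space exploration the abstract describes, assuming a toy latency/resource model and an illustrative device budget; none of the names or numbers below come from AutoAccel itself:

        # Toy stand-in for analytical-model-based DSE over a CPP-style
        # template: enumerate pragma-like parameters, prune points that
        # exceed the resource budget, and keep the lowest-latency design.
        from itertools import product

        DSP_BUDGET, BRAM_BUDGET = 2800, 1824      # assumed device budget

        def latency_cycles(trip_count, unroll, ii):
            # pipelined loop: depth + (iterations - 1) * II; unrolling
            # shrinks the iteration count
            iters = (trip_count + unroll - 1) // unroll
            return 50 + (iters - 1) * ii

        def resources(unroll):
            # each replicated processing element costs DSPs and BRAMs
            return 40 * unroll, 16 * unroll

        def explore(trip_count=4096):
            best = None
            for unroll, ii in product([1, 2, 4, 8, 16, 32], [1, 2, 4]):
                dsp, bram = resources(unroll)
                if dsp > DSP_BUDGET or bram > BRAM_BUDGET:
                    continue                      # analytically infeasible
                lat = latency_cycles(trip_count, unroll, ii)
                if best is None or lat < best[0]:
                    best = (lat, unroll, ii)
            return best   # (cycles, unroll factor, initiation interval)

        print(explore())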

    Placement-Driven Technology Mapping for LUT-Based FPGAs

    In this paper, we study the problem of placement-driven technology mapping for table-lookup-based FPGA architectures with the goal of optimizing circuit performance. Early work on technology mapping for FPGAs, such as Chortle-d[14] and Flowmap[3], aims to optimize the depth of the mapped solution without considering interconnect delay. Later works such as Flowmap-d[7], Bias-Clus[4] and EdgeMap consider interconnect delays during mapping, but do not account for the effects of the mapping solution on the final placement. Our work focuses on the interaction between the mapping and placement stages. First, interconnect delay information is estimated from the placement and used during the labeling process. A placement-based mapping solution that considers both global and local cell congestion is then developed. Finally, a legalization step and detailed placement are performed to realize the design. We have implemented our algorithm in a LUT-based FPGA technology mapping package named PDM (Placement-Driven Mapping) and tested the implementation on a set of MCNC benchmarks, using the tool VPR[1][2] for placement and routing of the mapped netlist. Experimental results show that the longest path delay on a set of large MCNC benchmarks decreased by 12.3% on average.
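
    As a rough illustration of the labeling step, the sketch below computes arrival-time labels over a topologically sorted netlist while folding in placement-estimated interconnect delays; the netlist encoding and delay numbers are invented for the example and are not PDM's data structures:

        # Delay labeling with placement-estimated interconnect delays.
        def label_netlist(nodes, fanins, wire_delay, lut_delay=1.0):
            """nodes: ids in topological order; fanins[n]: predecessors of n;
            wire_delay[(u, n)]: delay estimate taken from the placement."""
            label = {}
            for n in nodes:
                if not fanins[n]:
                    label[n] = 0.0          # primary input
                else:
                    label[n] = lut_delay + max(
                        label[u] + wire_delay[(u, n)] for u in fanins[n])
            return label

        nodes = ["a", "b", "c", "d"]
        fanins = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
        wire = {("a", "c"): 0.4, ("b", "c"): 1.2, ("c", "d"): 0.3}
        print(label_netlist(nodes, fanins, wire))  # longest-path arrival times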

    Angular clustering of galaxies at 3.6 microns from the Spitzer Wide-area Infrared Extragalactic (SWIRE) Survey

    We present the first analysis of large-scale clustering from the Spitzer Wide-area Infrared Extragalactic legacy survey (SWIRE). We compute the angular correlation function of galaxies selected to have 3.6 μm fluxes brighter than 32 μJy in three fields totaling 2 deg² in area. In each field we detect clustering with a high level of significance. The amplitude and slope of the correlation function are consistent between the three fields; the function is modeled as w(θ) = Aθ^(1−γ) with A = (0.6 ± 0.3) × 10⁻³ and γ = 2.03 ± 0.10. With a fixed slope of γ = 1.8, we obtain an amplitude of A = (1.7 ± 0.1) × 10⁻³. Assuming an equivalent depth of K ≈ 18.7 mag, we find that our errors are smaller than, but our results consistent with, existing clustering measurements in K-band surveys and with stable clustering models. We estimate a median redshift z ≈ 0.75, which allows us to obtain an estimate of the three-dimensional correlation function ξ(r), for which we find r₀ = 4.4 ± 0.1 h⁻¹ Mpc.
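
    For reference, the standard power-law forms behind these numbers, together with the schematic Limber relation (constants omitted) that connects the measured angular amplitude A to the spatial correlation length r₀ given the survey's redshift distribution dN/dz:

        % power-law models used in the abstract
        w(\theta) = A\,\theta^{\,1-\gamma}, \qquad \xi(r) = (r/r_0)^{-\gamma}
        % Limber inversion (proportionality only): A and dN/dz fix r_0
        A \;\propto\; r_0^{\,\gamma}\,
          \frac{\int (dN/dz)^2\, \chi(z)^{\,1-\gamma}\, (dz/d\chi)\, dz}
               {\left[\int (dN/dz)\, dz\right]^{2}}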

    Staff Time and Motion Assessment for Administration of Erythropoiesis-Stimulating Agents: A Two-Phase Pilot Study in Clinical Oncology Practices

    BACKGROUND: Erythropoiesis-stimulating agents (ESAs) are used for the management of anaemia in patients with non-myeloid malignancies where anaemia is due to the effect of concomitant myelosuppressive chemotherapy. Assessing the impact of different ESA dosing regimens on office staff time and projected labour costs is an important component of understanding the potential for optimization of oncology practice efficiencies. OBJECTIVES: A two-phase study was conducted to evaluate staff time and labour costs directly associated with ESA administration in real-world oncology practice settings among cancer patients undergoing chemotherapy. The objective of Phase 1 was to determine the mean staff time required for the process of ESA administration in patients with anaemia due to concomitantly administered chemotherapy. The objective of Phase 2 was to quantify and compare the mean staff time and mean labour costs of ESA administered once weekly (qw) with ESA administered once every 3 weeks (q3w) over an entire course of chemotherapy. METHODS: Phase 1 was a prospective, cross-sectional time and motion study conducted in six private oncology practices in the US, based on nine steps associated with ESA administration. Using findings from Phase 1, Phase 2 was conducted as a retrospective chart review to collect data on the number and types of visits in two private oncology practices for patients receiving a complete course of myelosuppressive chemotherapy. RESULTS: In Phase 1, the mean total time that clinic staff spent on ESA administration was 23.2 min for patient visits that included chemotherapy administration (n(chemo) = 37) and 21.5 min when only ESA was administered (n(ESAonly) = 36). In Phase 2, the mean duration of treatment was significantly longer for q3w than qw (53.84 days for qw vs. 113.38 days for q3w, p < 0.0001); thus, between-group comparisons were adjusted for episode duration using analysis of covariance (ANCOVA). Following adjustment by ANCOVA, qw darbepoetin alfa (DA) patients (n(qw) = 83) required more staff time for ESA + chemotherapy visits and ESA-only visits than q3w patients (n(q3w) = 118) over a course of chemotherapy. Overall, mean total staff time expended per chemotherapy course was greater for patients receiving qw versus q3w DA. Weekly DA dosing was associated with greater projected mean labour costs (US$38.16 vs. US$31.20 [average for 2007–2010]). CONCLUSIONS: The results from this real-world study demonstrate that oncology practices can attain savings in staff time and labour costs through the use of q3w ESA. The degree of savings depends on the individual oncology practice's staffing model and ESA administration processes, including those that allow for optimized synchronization of patient visits for ESA and chemotherapy administration. These findings indicate that additional research using standard ESA administration protocols over longer periods of time, with a larger number of oncology practices and patients, should be conducted to confirm these findings.

    Optimal Layout Synthesis for Quantum Computing

    Recent years have witnessed the fast development of quantum computing. Researchers around the world are eager to run larger and larger quantum algorithms that promise speedups impossible for any classical algorithm. However, the available quantum computers are still volatile and error-prone. Thus, layout synthesis, which transforms quantum programs to meet these hardware limitations, is a crucial step in the realization of quantum computing. In this paper, we present two synthesizers: one optimal, and one approximate but nearly optimal. Although a few optimal approaches to this problem have been published, our optimal synthesizer explores a larger solution space and is thus optimal in a stronger sense. In addition, it reduces time and space complexity exponentially compared to some leading optimal approaches. The key to this success is a more efficient spacetime-based variable encoding of the layout synthesis problem as a mathematical programming problem. By slightly changing our formulation, we arrive at an approximate synthesizer that is even more efficient and outperforms some leading heuristic approaches by up to 100% in terms of additional gate cost, and by up to 10x in fidelity, on a comprehensive set of benchmark programs and architectures. For a specific family of quantum programs named QAOA, which is deemed a promising application for near-term quantum computers, we further adjust the approximate synthesizer by taking commutation into consideration, achieving up to 75% reduction in depth and up to 65% reduction in additional cost compared to the tool used in a leading QAOA study.
    Comment: to appear in ICCAD'2
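
    A minimal sketch of a spacetime variable encoding in the same spirit (not the paper's exact formulation): logical-to-physical mapping variables per time step, adjacency constraints when two-qubit gates fire, and mapping changes as a proxy SWAP cost. The 3-qubit line device and two-gate program are invented for the example:

        # Requires the z3-solver package. Toy instance: a 3-qubit line.
        from z3 import Int, Optimize, And, Or, If, Sum, sat

        T = 3                                 # time horizon (assumed)
        phys_edges = [(0, 1), (1, 2)]         # device coupling graph
        gates = [(0, 1), (0, 2)]              # logical 2-qubit gates, in order

        # pi[t][q] = physical qubit hosting logical qubit q at time t
        pi = [[Int(f"pi_{t}_{q}") for q in range(3)] for t in range(T)]
        g_time = [Int(f"g_{i}") for i in range(len(gates))]

        opt = Optimize()
        for t in range(T):
            for q in range(3):
                opt.add(pi[t][q] >= 0, pi[t][q] <= 2)
            for q1 in range(3):               # mapping is injective
                for q2 in range(q1 + 1, 3):
                    opt.add(pi[t][q1] != pi[t][q2])

        def adjacent(a, b):
            return Or(*[Or(And(a == u, b == v), And(a == v, b == u))
                        for (u, v) in phys_edges])

        for i, (q1, q2) in enumerate(gates):
            opt.add(g_time[i] >= 0, g_time[i] < T)
            if i > 0:
                opt.add(g_time[i] >= g_time[i - 1])   # respect gate order
            for t in range(T):   # qubits must be adjacent when gate i fires
                opt.add(If(g_time[i] == t, adjacent(pi[t][q1], pi[t][q2]), True))

        # count per-qubit mapping changes as a proxy for SWAP cost
        swaps = Sum([If(pi[t][q] != pi[t + 1][q], 1, 0)
                     for t in range(T - 1) for q in range(3)])
        opt.minimize(swaps)
        if opt.check() == sat:
            print(opt.model())

    A full encoding would additionally constrain consecutive mappings to be reachable by one layer of SWAPs; this sketch only minimizes mapping churn.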

    Optimal Qubit Mapping with Simultaneous Gate Absorption

    Before quantum error correction (QEC) is achieved, quantum computers focus on noisy intermediate-scale quantum (NISQ) applications. Compared to the well-known quantum algorithms requiring QEC, like Shor's or Grover's algorithm, NISQ applications have different structures and properties to exploit in compilation. A key step in compilation is mapping the qubits in the program to physical qubits on a given quantum computer, which has been shown to be an NP-hard problem. In this paper, we present OLSQ-GA, an optimal qubit mapper with a key feature of simultaneous SWAP gate absorption during qubit mapping, which we show to be a very effective optimization technique for NISQ applications. For the quantum approximate optimization algorithm (QAOA), an important class of NISQ applications, OLSQ-GA reduces depth by up to 50.0% and SWAP count by 100% compared to other state-of-the-art methods, which translates to a 55.9% fidelity improvement. The solution optimality of OLSQ-GA is achieved by an exact SMT formulation. For better scalability, we augment our approach with additional constraints in the form of an initial mapping or alternating matchings, which speeds up OLSQ-GA by up to 272x with little or no loss of optimality.
    Comment: 8 pages, 8 figures, to appear in ICCAD'2
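
    The algebra that makes SWAP absorption work for QAOA can be checked numerically: the ZZ-phase gate is symmetric under qubit exchange, so it commutes with SWAP and the two can be fused into a single block. A quick self-contained check (illustrative code, not OLSQ-GA's):

        import numpy as np

        def rzz(theta):
            # exp(-i*theta/2 * Z(x)Z): diagonal phases on basis 00, 01, 10, 11
            return np.diag(np.exp(-0.5j * theta * np.array([1, -1, -1, 1])))

        SWAP = np.array([[1, 0, 0, 0],
                         [0, 0, 1, 0],
                         [0, 1, 0, 0],
                         [0, 0, 0, 1]], dtype=complex)

        theta = 0.37
        print(np.allclose(SWAP @ rzz(theta), rzz(theta) @ SWAP))  # True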

    Thermal-aware Cell and through-silicon-via Co-placement for 3D ICs

    Existing thermal-aware 3D placement methods assume that the temperature of 3D ICs can be optimized by properly distributing the power dissipation, ignoring the heat conductivity of through-silicon-vias (TSVs). However, our study indicates that this is not exactly correct. While considering the thermal effect of TSVs during placement appears to be quite complicated, we are able to prove that the peak temperature is minimized when the TSV area in each bin is proportional to the lumped power consumption of that bin together with the bins in all the tiers directly above it. Based on this criterion, we implement a thermal-aware 3D placement tool. Compared to methods that prefer a uniform power distribution, which achieve only an 8% peak temperature reduction, our method reduces the peak temperature by 34% on average, with even slightly less wirelength overhead. These results suggest that considering the thermal effects of TSVs during the placement stage is both necessary and effective. To the best of the authors' knowledge, this is the first thermal-aware 3D placement tool that directly takes into consideration the thermal and area impact of TSVs.
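
    A small sketch of the stated criterion, allocating TSV area in proportion to each bin's lumped power (its own power plus that of the bins directly above it in higher tiers); the grid size, tier orientation, normalization, and numbers are illustrative assumptions:

        def tsv_area_budget(power, total_area):
            """power[t][i][j]: power of bin (i, j) on tier t, with tier t+1
            directly above tier t (farther from the heat sink)."""
            tiers, rows, cols = len(power), len(power[0]), len(power[0][0])
            # lumped power: a bin plus every bin stacked above it
            lumped = [[[sum(power[u][i][j] for u in range(t, tiers))
                        for j in range(cols)] for i in range(rows)]
                      for t in range(tiers)]
            total = sum(sum(map(sum, tier)) for tier in lumped)
            # proportional split of the available TSV area across bins
            return [[[total_area * lumped[t][i][j] / total
                      for j in range(cols)] for i in range(rows)]
                    for t in range(tiers)]

        power = [[[1.0, 2.0], [0.5, 1.5]],   # tier 0, next to the heat sink
                 [[0.2, 0.8], [0.4, 0.6]]]   # tier 1
        print(tsv_area_budget(power, total_area=100.0))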

    Automatic Hardware Pragma Insertion in High-Level Synthesis: A Non-Linear Programming Approach

    High-Level Synthesis (HLS) enables the rapid prototyping of hardware accelerators by combining a high-level description of a kernel's functional behavior with a set of micro-architecture optimizations as inputs. Such optimizations can be described by inserting pragmas, e.g., pipelining and replication of units, or by even higher-level transformations for HLS such as automatic data caching using the AMD/Xilinx Merlin compiler. Selecting the best combination of pragmas, even within a restricted set, remains particularly challenging, and the typical state of practice uses design-space exploration (DSE) to navigate this space. But due to the highly irregular performance distribution of pragma configurations, typical DSE approaches are either extremely time-consuming or operate on a severely restricted search space. This work proposes a framework to automatically insert HLS pragmas in regular loop-based programs, supporting pipelining, unit replication, and data caching. We develop an analytical performance and resource model as a function of the input program properties and the pragmas inserted, using non-linear constraints and objectives. We prove that this model provides a lower bound on the actual performance after HLS. We then encode this model as a Non-Linear Program (NLP) by making the pragma configuration the unknowns of the system, which is then solved optimally. This approach can also be used during DSE to quickly prune points with a (possibly partial) pragma configuration, driven by lower bounds on achievable latency. We extensively evaluate our end-to-end, fully implemented system, showing that it can effectively manipulate spaces of billions of designs in seconds to minutes for the kernels evaluated.
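
    A toy illustration of the pruning idea: a latency lower bound evaluated on a partial pragma configuration can discard whole subtrees of the search. The loop nest, cost model, and resource numbers below are stand-ins, not the paper's NLP formulation:

        TRIPS = (128, 64)                 # trip counts of a 2-deep loop nest
        UNROLLS = (1, 2, 4, 8, 16)
        DSP_BUDGET, DSP_PER_PE = 1024, 24

        def latency(u_outer, u_inner, ii):
            iters = (TRIPS[0] // u_outer) * (TRIPS[1] // u_inner)
            return 80 + (iters - 1) * ii  # pipelined nest with depth 80

        def lb_partial(u_outer):
            # best case for any completion of this partial configuration:
            # maximal inner unrolling and II = 1, valid by monotonicity
            return latency(u_outer, max(UNROLLS), 1)

        best = float("inf")
        for u_o in UNROLLS:
            if lb_partial(u_o) >= best:
                continue                  # prune the entire subtree
            for u_i in UNROLLS:
                if u_o * u_i * DSP_PER_PE > DSP_BUDGET:
                    continue              # resource-infeasible
                for ii in (1, 2, 4):
                    best = min(best, latency(u_o, u_i, ii))
        print(best)                       # best achievable latency in cycles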