    Performance Evaluation of Parallel Sparse Matrix–Vector Products on SGI Altix3700

    Abstract. This paper discusses scalable implementations of sparse matrix-vector products, which are crucial for the high-performance solution of large-scale linear equations, on the SGI Altix3700, a cc-NUMA machine. Three storage formats for sparse matrices are evaluated, and scalability is attained by implementations that take the page allocation mechanism of the NUMA machine into account. The influence of the cache and memory bus architectures on the optimum choice of storage format is examined, and scalable converters between storage formats are shown to facilitate the use of higher-performance formats.
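
As a point of reference for the operation being benchmarked, here is a minimal sequential sketch of a sparse matrix-vector product. The abstract does not name the three storage formats it evaluates, so the compressed sparse row (CSR) layout used here is an assumption chosen only because it is widely used; the function name `csr_spmv` and the sample matrix are hypothetical.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries, stored row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i owns values[row_ptr[i]:row_ptr[i + 1]]
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Rows are independent, so a parallel version could assign
        # contiguous row ranges to different threads.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Hypothetical 3x3 example:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

y = csr_spmv(values, col_idx, row_ptr, x)
print(y)
```

Because each row writes only its own output element, the outer loop is the natural unit for distributing work across processors; how the paper maps rows to NUMA pages is not stated in the abstract.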

    Linpack evaluation on a supercomputer with heterogeneous accelerators

    Abstract. We report Linpack benchmark results on the TSUBAME supercomputer, a large-scale heterogeneous system equipped with NVIDIA Tesla GPUs and ClearSpeed SIMD accelerators. Using all 10,480 Opteron cores, 640 Xeon cores, 648 ClearSpeed accelerators and 624 NVIDIA Tesla GPUs, we achieved 87.01 TFlops, which ranks third in the world among heterogeneous systems. This paper describes the careful tuning and load-balancing methods required to achieve this performance. However, against a peak speed of 163 TFlops, the efficiency is 53%, which is lower than that of other systems. This paper also analyzes this gap from the perspective of system architecture.
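
The efficiency figure quoted above is simply the ratio of achieved to peak performance. The short check below reproduces it from the two numbers given in the abstract; the variable names are illustrative.

```python
# Figures taken from the abstract (TFlops).
achieved_tflops = 87.01
peak_tflops = 163.0

# Linpack efficiency = achieved performance / theoretical peak.
efficiency = achieved_tflops / peak_tflops
print(f"{efficiency:.1%}")  # prints "53.4%"
```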

    Efficient high-precision integer multiplication on the GPU

    Dieguez AP, Amor M, Doallo R, Nukada A, Matsuoka S. Efficient high precision integer multiplication on the GPU. The International Journal of High Performance Computing Applications. 2022;36(3):356-369. © The Author(s) 2022. Publisher: SAGE Publications. https://doi.org/10.1177/10943420221077964
    Abstract. The multiplication of large integers, which has many applications in computer science, is an operation that can be expressed as a polynomial multiplication followed by a carry normalization. This work develops two approaches for efficient polynomial multiplication. The first tiles the classical convolution algorithm while taking advantage of new CUDA architectures, a novel approach that computes the multiplication with integer arithmetic and without loss of accuracy. The second is based on the Strassen algorithm, which multiplies large polynomials using the FFT, adapted to the fastest FFT libraries for current GPUs and working over the complex field. Previous studies reported that the Strassen algorithm is an effective implementation for “large enough” integers on GPUs. Additionally, most previous studies do not examine the implementation of the carry normalization, whereas this work describes a parallel implementation of this operation. Our results show the efficiency of our approaches for short, medium, and large sizes.
    The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work has been supported by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00), by the Galician Government and FEDER funds under the Consolidation Program of Competitive Reference Groups (UDC/GI-000265, ref. ED431C 2021/30), by the Consolidation Program of Competitive Research Units (ED431G2019/01), and by the FPU Program of the Ministry of Education of Spain (FPU14/02801). It is also partially supported by JST CREST [JPMJCR1303 and JPMJCR1687] and NVIDIA GPU Center of Excellence, and was conducted as part of the research activities of the AIST-TokyoTech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL).
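
The decomposition described in the abstract, a polynomial multiplication followed by a carry normalization, can be illustrated with a small sequential sketch. This is not the authors' GPU implementation: it uses the plain (untiled) classical convolution, runs on the CPU, and the helper names and the base of 10^4 are assumptions made for illustration.

```python
def to_digits(n, base):
    """Split a non-negative integer into little-endian digits."""
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(r)
    return digits or [0]


def poly_mul(a, b):
    """Classical convolution: coefficients of the product polynomial."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c


def carry_normalize(coeffs, base):
    """Propagate carries so every coefficient becomes a valid digit."""
    out, carry = [], 0
    for coeff in coeffs:
        carry, digit = divmod(coeff + carry, base)
        out.append(digit)
    while carry:
        carry, digit = divmod(carry, base)
        out.append(digit)
    return out


def from_digits(digits, base):
    """Rebuild the integer from little-endian digits."""
    n = 0
    for digit in reversed(digits):
        n = n * base + digit
    return n


def multiply(x, y, base=10**4):
    product = poly_mul(to_digits(x, base), to_digits(y, base))
    return from_digits(carry_normalize(product, base), base)


print(multiply(123456789, 987654321) == 123456789 * 987654321)
```

The convolution step has no dependencies between output coefficients, which is what makes it attractive for GPUs; the carry step is inherently sequential in this form, which is why a parallel formulation of it is a contribution in its own right.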

    Efficient Solving of Scan Primitive on Multi-GPU Systems

    This version of the article has been accepted for publication, after peer review. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The Version of Record is available online at: https://doi.org/10.1109/IPDPS.2018.00089
    Presented at: 32nd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018, Vancouver, 21-25 May 2018.
    Abstract. GPUs fulfill high computation demands, but code must be developed carefully, selecting algorithms well suited to the GPU architecture and applying different optimizations. This article presents a GPU-suitable algorithm and a tuning strategy for performing the scan primitive over large problem sizes in CUDA. This tuning strategy defines different performance premises to find the GPU execution parameters that maximize performance. Taking these premises into consideration, we easily develop the kernels using CUDA skeletons to ensure efficiency and portability. Based on this, we describe an optimal proposal analyzed over different multi-GPU environments, which is, to the best of our knowledge, the first multi-GPU batch scan proposal. In most cases, the resulting implementations outperform other well-known libraries such as CUDPP, ModernGPU, Thrust, CUB and LightScan.
    This work was cofunded by the Government of Galicia and ERDF funds from the EU, under the Consolidation Programme of Competitive Reference Groups [ED431C 2017/04] and Competitive Research Units [R2014/049 and R2016/037]; by the Ministry of Economy and Competitiveness of Spain and ERDF funds [TIN2016-75845-P]; and by the Ministry of Education of Spain (FPU14/02801).
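
For readers unfamiliar with the scan primitive, the sketch below shows an inclusive prefix sum computed block by block. The three-phase structure (scan each block, scan the block totals, add the offsets back) is a common way to decompose large scans and is an assumption here; the abstract does not describe the paper's actual algorithm, and the function name and block size are hypothetical.

```python
from itertools import accumulate


def blocked_inclusive_scan(data, block_size):
    """Inclusive prefix sum computed in three phases over fixed-size blocks."""
    # Phase 1: scan each block independently (parallelizable across blocks).
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    local = [list(accumulate(block)) for block in blocks]

    # Phase 2: exclusive scan of the per-block totals gives each block's offset.
    totals = [block[-1] for block in local]
    offsets = [0] + list(accumulate(totals))[:-1]

    # Phase 3: add each block's offset to its local results.
    return [value + offset
            for block, offset in zip(local, offsets)
            for value in block]


data = list(range(1, 11))
print(blocked_inclusive_scan(data, block_size=4))
```

Phases 1 and 3 touch each block independently, so only the short scan over block totals needs coordination between blocks, or between devices in a multi-GPU setting.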

    An adaptive SC line equalizer for four-wire full-duplex and multirate digital transmission

    Graduate School of Natural Science and Technology (Information Systems), Kanazawa University; Faculty of Engineering, Kanazawa University
    An adaptive switched-capacitor (SC) line equalizer system that can be applied to four-wire full-duplex and multirate digital transmission is described. Several kinds of noise can arise in programmable high-gain SC equalizers, such as switching noise, DC offset jumps, and transient responses. To avoid their effects, an adaptive SC filter is proposed. Two identical SC circuits are used in parallel. An input signal is fed into both SC circuits, and the output signals from the two circuits are sent alternately after the above noise has been eliminated. Furthermore, many different data rates can be handled by slightly modifying an equalizer circuit. A bridged-tap echo canceller and a DC offset canceller are modified so that they can be applied to the proposed adaptive SC filter. A line equalizer system that handles data rates ranging from 3.2 to 64 kb/s was designed. An LSI was fabricated using a 3-μm CMOS process. Experimental results show that noise effects are mostly eliminated, and that frequency responses and eye openings are close to the designed values.