28 research outputs found

    Accounting for Memory Bank Contention and Delay in High-Bandwidth Multiprocessors

    This paper considers issues of memory performance in shared memory multiprocessors that provide a high-bandwidth network and in which the memory banks are slower than the processors. We are concerned with the effects of memory bank contention, memory bank delay, and the bank expansion factor (the ratio of the number of banks to the number of processors) on performance, particularly for irregular memory access patterns. This work was motivated by observed discrepancies between predicted and actual performance in a number of irregular algorithms implemented for the Cray C90 when the memory contention at a particular location is high. We develop a formal framework for studying memory bank contention and delay, and show several results, both experimental and theoretical. We first show experimentally that our framework is a good predictor of performance on the Cray C90 and J90, providing a good accounting of bank contention and delay. Second, we show that it often improves performance to have additional..
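
    As a toy illustration of the kind of effect the framework accounts for (our sketch, not the paper's formal framework), consider the simplest serialization model: an access to address a goes to bank a mod B, and each bank services its queue one access every d cycles, so a round of accesses finishes when the most contended bank drains. The model and names below are illustrative assumptions.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical illustration: completion time of one round of memory
 * accesses under a simple bank-serialization model. The bank of address
 * a is a % num_banks; each bank serves one access per `delay` cycles,
 * so the round ends when the busiest bank drains. The paper's framework
 * models contention and delay in far more detail; this is only a toy. */
long round_time(const long *addrs, int n, int num_banks, int delay) {
    int *load = calloc(num_banks, sizeof *load);
    int max_load = 0;
    for (int i = 0; i < n; i++) {
        int b = (int)(addrs[i] % num_banks);
        if (++load[b] > max_load)
            max_load = load[b];
    }
    free(load);
    return (long)max_load * delay;
}

int main(void) {
    /* four accesses collide on bank 0 (addresses are multiples of 64) */
    long addrs[] = {0, 64, 128, 192, 5, 6, 7, 11};
    printf("%ld cycles\n", round_time(addrs, 8, 64, 4)); /* prints 16 */
    return 0;
}
```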

    Optimizing the NAS parallel BT application for the POWER CHALLENGEarray


    Radix Sort For Vector Multiprocessors

    We have designed a radix sort algorithm for vector multiprocessors and have implemented the algorithm on the CRAY Y-MP. On one processor of the Y-MP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided by CRAY Research. On eight processors we achieve an additional speedup of almost 5, yielding a routine over 25 times faster than the library sort. Using this multiprocessor version, we can sort at a rate of 15 million 64-bit keys per second. Our sorting algorithm is adapted from a data-parallel algorithm previously designed for a highly parallel Single Instruction Multiple Data (SIMD) computer, the Connection Machine CM-2. To develop our version we introduce three general techniques for mapping data-parallel algorithms onto vector multiprocessors. These techniques allow us to fully vectorize and parallelize the algorithm. The paper also derives equations that model the performance of our algorithm on the Y-MP. These equations are then used to..
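
    The data-parallel pattern being mapped is the classic counting pass of radix sort: histogram a digit, scan the histogram for bucket offsets, then scatter. A scalar C sketch of this pattern is below as a reference specification only; the paper's contribution is the fully vectorized, multiprocessor realization, and the function names here are our own.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative scalar sketch of one radix-sort pass over an 8-bit digit:
 * histogram the digit, take an exclusive prefix sum (a scan) to turn
 * counts into bucket offsets, then scatter stably. */
void radix_pass(const uint64_t *src, uint64_t *dst, size_t n, int shift) {
    size_t count[256] = {0};
    for (size_t i = 0; i < n; i++)            /* histogram */
        count[(src[i] >> shift) & 0xFF]++;
    size_t offset = 0;                        /* exclusive scan of counts */
    for (int b = 0; b < 256; b++) {
        size_t c = count[b];
        count[b] = offset;
        offset += c;
    }
    for (size_t i = 0; i < n; i++)            /* stable scatter */
        dst[count[(src[i] >> shift) & 0xFF]++] = src[i];
}

/* Sort 64-bit keys with eight passes, ping-ponging between buffers;
 * after an even number of passes the result is back in `a`. */
void radix_sort(uint64_t *a, uint64_t *tmp, size_t n) {
    for (int shift = 0; shift < 64; shift += 8) {
        radix_pass(a, tmp, n, shift);
        uint64_t *t = a; a = tmp; tmp = t;
    }
}
```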

    Performance Evaluation of a New Parallel Preconditioner

    The linear systems associated with large, sparse, symmetric, positive definite matrices are often solved iteratively using the preconditioned conjugate gradient method. We have developed a new class of preconditioners, support tree preconditioners, that are based on the connectivity of the graphs corresponding to the matrices and are well-structured for parallel implementation. In this paper, we evaluate the performance of support tree preconditioners by comparing them against two common types of preconditioners: diagonal scaling and incomplete Cholesky. Support tree preconditioners require less overall storage and less work per iteration than incomplete Cholesky preconditioners. In terms of total execution time, support tree preconditioners outperform both diagonal scaling and incomplete Cholesky preconditioners.
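
    For context, the preconditioner enters the conjugate gradient iteration only through a solve M z = r each step; a minimal PCG skeleton with a pluggable preconditioner is sketched below. This is our illustrative code with assumed names (`apply_fn`, `pcg`), not the paper's implementation; the choice of `precond` (diagonal scaling, incomplete Cholesky, or a support tree) is exactly what the paper compares.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

typedef void (*apply_fn)(const double *in, double *out, size_t n, void *ctx);

static double dot(const double *u, const double *v, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += u[i] * v[i];
    return s;
}

/* Sketch of preconditioned conjugate gradient: iterate until the
 * residual norm drops below tol, solving M z = r with `precond` each
 * iteration. Returns the number of iterations performed. */
size_t pcg(apply_fn matvec, apply_fn precond, void *ctx,
           const double *b, double *x, size_t n, double tol, size_t maxit) {
    double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
    double *p = malloc(n * sizeof *p), *Ap = malloc(n * sizeof *Ap);
    matvec(x, Ap, n, ctx);                       /* r = b - A x */
    for (size_t i = 0; i < n; i++) r[i] = b[i] - Ap[i];
    precond(r, z, n, ctx);                       /* z = M^{-1} r */
    memcpy(p, z, n * sizeof *p);
    double rz = dot(r, z, n);
    size_t k = 0;
    for (; k < maxit && sqrt(dot(r, r, n)) > tol; k++) {
        matvec(p, Ap, n, ctx);
        double alpha = rz / dot(p, Ap, n);
        for (size_t i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        precond(r, z, n, ctx);
        double rz_new = dot(r, z, n);
        double beta = rz_new / rz;
        rz = rz_new;
        for (size_t i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
    }
    free(r); free(z); free(p); free(Ap);
    return k;
}
```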

    Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors

    In this paper we present a new technique for sparse matrix multiplication on vector multiprocessors based on the efficient implementation of a segmented sum operation. We describe how the segmented sum can be implemented on vector multiprocessors such that it both fully vectorizes within each processor and parallelizes across processors. Because of our method's insensitivity to relative row size, it is better suited than the Ellpack/Itpack or the Jagged Diagonal algorithms for matrices which have a varying number of non-zero elements in each row. Furthermore, our approach requires less preprocessing (no more time than a single sparse matrix-vector multiplication), less auxiliary storage, and uses a more convenient data representation (an augmented form of the standard compressed sparse row format). We have implemented our algorithm (SEGMV) on the Cray Y-MP C90, and have compared its performance with other methods on a variety of sparse matrices from the Harwell-Boeing collection and in..
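
    The segmented-sum view of sparse matrix-vector multiplication can be stated scalar-fashion: the products a_ij * x_j form one long vector, and the row pointers of the compressed sparse row format delimit the segments whose sums become the output entries. The serial loop below only pins down this specification; the paper's SEGMV computes the same segmented sum in a form that vectorizes regardless of per-row lengths.

```c
#include <stddef.h>

/* Scalar specification of CSR matvec as a segmented sum: val[k] *
 * x[colidx[k]] over k forms one long product vector, and rowptr
 * delimits the segments whose sums become y[i]. */
void csr_matvec(const double *val, const int *colidx, const int *rowptr,
                const double *x, double *y, int nrows) {
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)  /* one segment */
            sum += val[k] * x[colidx[k]];
        y[i] = sum;
    }
}
```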

    New Parallel Preconditioner

    Solution of partial differential equations by either the finite element or the finite difference methods often requires the solution of large, sparse linear systems. When the coefficient matrices associated with these linear systems are symmetric and positive definite, the systems are often solved iteratively using the preconditioned conjugate gradient method. We have developed a new class of preconditioners, which we call support tree preconditioners, that are based on the connectivity of the graphs corresponding to the coefficient matrices of the linear systems. These new preconditioners have the advantage of being well-structured for parallel implementation, both in construction and in evaluation. In this paper, we evaluate the performance of support tree preconditioners by comparing them against two common types of preconditioners: those arising from diagonal scaling, and from the incomplete Cholesky decomposition. We solved linear systems corresponding to both regular and irregular..
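
    Of the two baseline preconditioners, diagonal scaling is the simplest to state: the preconditioner solve is just z_i = r_i / a_ii. A sketch follows, assuming the diagonal of the coefficient matrix has already been extracted into an array; the names are ours, and this could serve as the `precond` plugged into a PCG loop like the one sketched above.

```c
#include <stddef.h>

/* Diagonal-scaling (Jacobi) preconditioner solve: z_i = r_i / a_ii.
 * `diag` holds the extracted matrix diagonal. This is one of the two
 * baselines the support tree preconditioners are compared against. */
void diag_precond(const double *r, double *z, const double *diag, size_t n) {
    for (size_t i = 0; i < n; i++)
        z[i] = r[i] / diag[i];
}
```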

    Scan Primitives for Vector Computers

    The authors describe an optimized implementation of a set of scan (also called all-prefix-sums) primitives on a single processor of a CRAY Y-MP, and demonstrate that their use leads to greatly improved performance for several applications that cannot be vectorized with existing compiler technology. The algorithm used to implement the scans is based on an algorithm for parallel computers. A set of segmented versions of these scans is only marginally more expensive than the unsegmented versions. The authors describe a radix sorting routine based on the scans that is 13 times faster than a Fortran version and within 20% of a highly optimized library sort routine, three operations on trees that are between 10 and 20 times faster than the corresponding C versions, and a connectionist learning algorithm that is 10 times faster than the corresponding C version for sparse and irregular networks.
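
    As a reference for the semantics only (not the vectorized Y-MP implementation the paper describes), the serial specifications of an exclusive plus-scan and its segmented variant look like this. The flag convention below, where a set flag restarts the running sum at a segment boundary, is one common choice and is assumed here.

```c
#include <stddef.h>

/* Exclusive plus-scan: out[i] = in[0] + ... + in[i-1], with out[0] = 0. */
void scan(const double *in, double *out, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        out[i] = acc;
        acc += in[i];
    }
}

/* Segmented exclusive plus-scan: a nonzero flag marks the start of a
 * new segment and resets the running sum. As the abstract notes, the
 * segmented versions cost only marginally more than the unsegmented. */
void seg_scan(const double *in, const unsigned char *flags,
              double *out, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (flags[i]) acc = 0.0;   /* segment boundary restarts the scan */
        out[i] = acc;
        acc += in[i];
    }
}
```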

    NESL User's Manual (for NESL Version 3.1).
