127 research outputs found

Pessimistic Software Lock-Elision

Author: A. Adl-Tabatabai
C. Fetzer
D. Dice
H. Attiya
H. Attiya
I. Keidar
J. Mellor-Crummey
M. Kapalka
M. Spear
T. Harris
T. Riegel
T. Shpeisman
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

Read-write locks are one of the most prevalent lock forms in concurrent applications because they allow read accesses to locked code to proceed in parallel. However, they do not offer any parallelism between reads and writes. This paper introduces pessimistic lock-elision (PLE), a new approach for non-speculatively replacing read-write locks with pessimistic (i.e. non-aborting) software transactional code that allows read-write concurrency even for contended code and even if the code includes system calls. On systems with hardware transactional support, PLE will allow failed transactions, or ones that contain system calls, to preserve read-write concurrency. Our PLE algorithm is based on a novel encounter-order design of a fully pessimistic STM system that in a variety of benchmarks spanning from counters to trees, even when up to 40% of calls are mutating the locked structure, provides up to 5 times the performance of a state-of-the-art read-write lock. National Science Foundation (U.S.) (Grant 1217921)

CiteSeerX

DSpace@MIT

Crossref

Cache-Integrated Network Interfaces: Flexible On-Chip Communication and Synchronization for Large-Scale CMPs

Author: D. Lenoski
D. Wentzlaff
Dimitrios S. Nikolopoulos
H. Shan
I. Schoinas
J. Leverich
J.A. Kahle
J.M. Mellor-Crummey
K. Gharachorloo
M. Wen
M.M.K. Martin
Manolis Katevenis
Michail Zampetakis
P.S. Magnusson
S.L. Scott
S.P. Amarasinghe
S.W. Keckler
Stamatis Kavadias
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

Queen's University Belfast Research Portal

Crossref

Springer - Publisher Connector

LNCS

Author: A Sánchez
CS Ellis
D Dice
G Evangelidis
I Jaluta
JM Mellor-Crummey
M Desnoyers
M Herlihy
O Rodeh
R Bayer
V Srinivasan
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

Concurrent accesses to shared data structures must be synchronized to avoid data races. Coarse-grained synchronization, which locks the entire data structure, is easy to implement but does not scale. Fine-grained synchronization can scale well, but can be hard to reason about. Hand-over-hand locking, in which operations are pipelined as they traverse the data structure, combines fine-grained synchronization with ease of use. However, the traditional implementation suffers from inherent overheads. This paper introduces snapshot-based synchronization (SBS), a novel hand-over-hand locking mechanism. SBS decouples the synchronization state from the data, significantly improving cache utilization. Further, it relies on guarantees provided by pipelining to minimize synchronization that requires cross-thread communication. Snapshot-based synchronization thus scales much better than traditional hand-over-hand locking, while maintaining the same ease of use.
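
For readers unfamiliar with the baseline this abstract compares against, the following is a minimal sketch of traditional hand-over-hand locking (lock coupling) on a sorted linked list. It does not implement SBS. The names `Node`, `SortedList`, and `demo` are hypothetical and chosen only for illustration; none of them come from the paper.

```python
import threading


class Node:
    """List node carrying its own lock (the traditional, per-node design)."""

    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node
        self.lock = threading.Lock()


class SortedList:
    """Sorted set of keys protected by hand-over-hand locking."""

    def __init__(self):
        # A sentinel head means there is always a predecessor to lock.
        self.head = Node(float("-inf"))

    def insert(self, key):
        pred = self.head
        pred.lock.acquire()
        curr = pred.next
        while curr is not None:
            # Take the next lock BEFORE releasing the previous one,
            # so no other thread can overtake this traversal.
            curr.lock.acquire()
            if curr.key >= key:
                break
            pred.lock.release()
            pred, curr = curr, curr.next
        try:
            if curr is not None and curr.key == key:
                return False  # key already present
            pred.next = Node(key, curr)
            return True
        finally:
            if curr is not None:
                curr.lock.release()
            pred.lock.release()

    def to_list(self):
        # Unsynchronized read; only call once all writers have finished.
        out, node = [], self.head.next
        while node is not None:
            out.append(node.key)
            node = node.next
        return out


def demo():
    lst = SortedList()

    def worker(start):
        for k in range(start, start + 50):
            lst.insert(k % 60)

    threads = [threading.Thread(target=worker, args=(s,)) for s in (0, 10, 20, 30)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lst.to_list()
```

Each traversal holds at most two locks at a time, which is what lets operations pipeline behind one another. Because each lock lives inside its node, the synchronization state is stored with the data, which is the coupling the abstract says SBS removes.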

Crossref

IST Austria: PubRep (Institute of Science and Technology)

Eraser

Author: BERSHAD B. N.
DINNING A.
DINNING BERG
Greg Nelson
LEE E. K.
MELLOR-CRUMMEY
MELLOR-CRUMMEY
Michael Burrows
Patrick Sobalvarro
PERKOVIC D.
SCALES D. J.
SRIVASTAVA A.
Stefan Savage
Thomas Anderson
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

Efficient Data Race Detection for Async-Finish Parallelism

Author: C. Flanagan
C. Sadowski
D. Lea
D. Leijen
E.A. Lee
J. Mellor-Crummey
J.-D. Choi
J.K. Lee
M. Feng
R. Barik
R. Barik
R.D. Blumofe
S. Agarwal
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

A major productivity hurdle for parallel programming is the presence of data races. Data races can lead to all kinds of harmful program behaviors, including determinism violations and corrupted memory. However, runtime overheads of current dynamic data race detectors are still prohibitively large (often incurring slowdowns of 10× or larger) for use in mainstream software development. In this paper, we present an efficient dynamic race detector algorithm targeting the async-finish task-parallel programming model. The async and finish constructs are at the core of languages such as X10 and Habanero Java (HJ). These constructs generalize the spawn-sync constructs used in Cilk, while still ensuring that all computation graphs are deadlock-free. We have implemented our algorithm in a tool called TASKCHECKER and evaluated it on a suite of 12 benchmarks. To reduce overhead of the dynamic analysis, we have also implemented various static optimizations in the tool. Our experimental results indicate that our approach performs well in practice, incurring an average slowdown of 3.05× compared to a serial execution in the optimized case.
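
The async-finish model named in this abstract can be approximated in a few lines of Python. This is only a rough sketch of the programming model, assuming plain threads stand in for tasks; it is not the race detector and not X10 or HJ syntax. The `Finish` class, its `async_` method, and `run_demo` are hypothetical names introduced here.

```python
import threading


class Finish:
    """Hypothetical scope that waits for every task spawned inside it,
    including tasks spawned transitively by other tasks."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []

    def async_(self, fn, *args):
        # Register the task before starting it so the scope cannot miss it.
        t = threading.Thread(target=fn, args=args)
        with self._lock:
            self._pending.append(t)
        t.start()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # A task may spawn more tasks, so drain until nothing is pending.
        while True:
            with self._lock:
                if not self._pending:
                    break
                t = self._pending.pop()
            t.join()
        return False


def run_demo():
    results = []
    results_lock = threading.Lock()

    def leaf(i):
        with results_lock:  # without this lock the appends would race
            results.append(i)

    def parent(scope, i):
        leaf(i)
        scope.async_(leaf, i + 100)  # nested async inside the same finish

    with Finish() as scope:
        for i in range(5):
            scope.async_(parent, scope, i)
    # Every async, including the nested ones, has completed here.
    return sorted(results)
```

Unlike a spawn-sync join, which waits only for a task's direct children, the `finish` scope here also waits for tasks spawned by other tasks.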

CiteSeerX

Crossref

High-throughput sequence alignment using Graphics Processing Units

Author: AL Delcher
AL Delcher
Amitabh Varshney
Arthur L Delcher
C Shaffer
Cole Trapnell
D Gusfield
E Ukkonen
EW Myers
I Buck
J Mellor-Crummey
JD Owens
M Brudno
M Charalambous
M Hohl
M Pop
Michael C Schatz
MJ Harris
NK Govindaraju
nVidia
P Weiner
S Kurtz
S Kurtz
SF Atschul
W Liu
W Pearson
WJ Dally
Y Juekuan
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. Results: This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies. Conclusion: MUMmerGPU is a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU.

CiteSeerX

Crossref

Cold Spring Harbor Laboratory Institutional Repository

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Digital Repository at the University of Maryland

Disrupted nitric oxide signaling due to GUCY1A3 mutations increases risk for moyamoya disease, achalasia and hypertension

Author: Bamshad M
Berka V
Dauser R
Ganesan V
Guo D-C
Hanchard N
Marom R
Martin E
Mellor-Crummey L
Milewicz DM
Morris SA
Nickerson DA
Regalado E
Saunders D
Sharina I
Wallace S
Publication venue: WILEY-BLACKWELL
Publication date: 01/10/2016

Moyamoya disease (MMD) is a progressive vasculopathy characterized by occlusion of the terminal portion of the internal carotid arteries and their branches, and the formation of compensatory moyamoya collateral vessels. Homozygous mutations in GUCY1A3 have been reported as a cause of MMD and achalasia. Probands (n = 96) from unrelated families underwent sequencing of GUCY1A3. Functional studies were performed to confirm the pathogenicity of identified GUCY1A3 variants. Two affected individuals from the unrelated families were found to have compound heterozygous mutations in GUCY1A3. MM041 was diagnosed with achalasia at 4 years of age, hypertension and MMD at 18 years of age. MM149 was diagnosed with MMD and hypertension at the age of 20 months. Both individuals carry one allele that is predicted to lead to haploinsufficiency and a second allele that is predicted to produce a mutated protein. Biochemical studies of one of these alleles, GUCY1A3 Cys517Tyr, showed that the mutant protein (a subunit of soluble guanylate cyclase) has a significantly blunted signaling response with exposure to nitric oxide (NO). GUCY1A3 missense and haploinsufficiency mutations disrupt NO signaling leading to MMD and hypertension, with or without achalasia.

UCL Discovery

Fast, contention-free combining tree barriers for shared-memory multiprocessors

Author: D. Hensgen
E. D. Brooks III
G. Graunke
J. M. Mellor-Crummery
John M. Mellor-Crummey
M. Herlihy
Michael L. Scott
P. L. Lehman
P.-C Yew
R. Gupta
T. E. Anderson
Y. Sagiv
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Modeling the Office of Science Ten Year Facilities Plan: The PERI Architecture Tiger Team

Author: Alam Sadaf
Bailey David H.
Carrington Laura
Daley Chris
de Supinski Bronis R.
Dubey Anshu
Gamblin Todd
Gunter Dan
Hovland Paul D.
Jagode Heike
Karavanic Karen
Marin Gabriel
Mellor-Crummey John
Moore Shirley
Norris Boyana
Oliker Leonid
Olschanowsky Catherine
Roth Philip C.
Schulz Martin
Shende Sameer
Snavely Allan
Spear Wyatt
Tikir Mustafa
Vetter Jeff
Worley Pat
Wright Nicholas
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date: 15/06/2009

The Performance Engineering Institute (PERI) originally proposed a tiger team activity as a mechanism to target significant effort optimizing key Office of Science applications, a model that was successfully realized with the assistance of two JOULE metric teams. However, the Office of Science requested a new focus beginning in 2008: assistance in forming its ten year facilities plan. To meet this request, PERI formed the Architecture Tiger Team, which is modeling the performance of key science applications on future architectures, with S3D, FLASH and GTC chosen as the first application targets. In this activity, we have measured the performance of these applications on current systems in order to understand their baseline performance and to ensure that our modeling activity focuses on the right versions and inputs of the applications. We have applied a variety of modeling techniques to anticipate the performance of these applications on a range of anticipated systems. While our initial findings predict that Office of Science applications will continue to perform well on future machines from major hardware vendors, we have also encountered several areas in which we must extend our modeling techniques in order to fulfill our mission accurately and completely. In addition, we anticipate that models of a wider range of applications will reveal critical differences between expected future systems, thus providing guidance for future Office of Science procurement decisions, and will enable DOE applications to exploit machines in future facilities fully.

Crossref

UNT (University of North Texas) Digital Library

Active memory controller

Author: A Ailamaki
A Gottlieb
A Saulsbury
Ali Ibrahim
C Batten
C Cascaval
D Kim
D Patterson
DH Albonesi
DJ Sorin
DJ Sorin
DS Nikolopoulos
F Petrini
G Blelloch
G Marin
I Zotov
J Kuskin
J Laudon
J Torrellas
J Torrellas
JB Brockman
JH Ahn
JM Mellor-Crummey
John B. Carter
K Keeton
KM Chandy
L Zhang
L Zhang
L Zhao
LA Barroso
Lixin Zhang
M Garzaran
M Hall
M Hao
M Oskin
Michael A. Parker
P Kogge
PA Boncz
R Kalla
RE Kessler
S Chatterjee
S Kumar
S Scott
Sally A. McKee
T Anderson
T Eicken von
V Tipparaju
Xiaowei Jiang
Y Solihin
Y Solihin
Z Fang
Zhen Fang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of AMC, and the programming model of AMOs. We compare AMOs' performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50x faster barriers, 12x faster spinlocks, 8.5x-15x faster stream/array operations, and 3x faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation.

Crossref

Chalmers Research