Multi-GPU Acceleration of the iPIC3D Implicit Particle-in-Cell Code
iPIC3D is a widely used massively parallel Particle-in-Cell code for the
simulation of space plasmas. However, its current implementation does not
support execution on multiple GPUs. In this paper, we describe the porting of
the iPIC3D particle mover to GPUs and the optimization steps to increase the
performance and parallel scaling on multiple GPUs. We analyze the strong
scaling of the mover on two GPU clusters and evaluate its performance and
acceleration. The optimized GPU versions, which use pinned memory and
asynchronous data prefetching, outperform their corresponding CPU versions by
5-10x on two different systems equipped with NVIDIA K80 and V100 GPUs.
Comment: Accepted for publication in ICCS 201
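The asynchronous prefetching idea in the abstract above can be illustrated with a minimal double-buffering sketch. It is not iPIC3D code: `load_chunk` is a hypothetical stand-in for a host-to-device copy, `move_particles` is a hypothetical stand-in for the mover kernel (a simple explicit free-streaming push, not iPIC3D's implicit mover), and a worker thread plays the role of an asynchronous copy stream.

```python
from concurrent.futures import ThreadPoolExecutor


def load_chunk(chunks, i):
    """Hypothetical stand-in for copying particle chunk i to the device."""
    return list(chunks[i])


def move_particles(chunk, dt):
    """Hypothetical stand-in for the mover kernel: x += v * dt per particle."""
    return [(x + v * dt, v) for x, v in chunk]


def pipelined_mover(chunks, dt):
    """Process chunks in order while the next chunk is fetched in the background."""
    results = []
    if not chunks:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_chunk, chunks, 0)
        for i in range(len(chunks)):
            current = pending.result()  # wait until chunk i has arrived
            if i + 1 < len(chunks):
                # Start fetching chunk i + 1 before computing on chunk i,
                # so the transfer overlaps with the computation.
                pending = pool.submit(load_chunk, chunks, i + 1)
            results.append(move_particles(current, dt))
    return results


# Two chunks of (position, velocity) pairs, advanced by dt = 0.5.
particles = [[(0.0, 1.0)], [(1.0, 2.0)]]
moved = pipelined_mover(particles, 0.5)
```

On a GPU the same pattern would typically rely on page-locked (pinned) host buffers, since those are what allow copies to proceed concurrently with kernels.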
Where should MMS look for electron diffusion regions?
A great possible achievement for the MMS mission would be crossing electron
diffusion regions (EDR). EDR are regions in proximity of reconnection sites
where electrons decouple from field lines, breaking the frozen-in condition.
Decades of research on reconnection have produced a widely shared map of where
EDRs are. We expect reconnection to take place around a so-called x-point
formed by the intersection of the separatrices dividing inflowing from
outflowing plasma. The EDR forms around this x-point as a small electron scale
box nested inside a larger ion diffusion region. But this point of view is
based on a 2D mentality. We have recently proposed that once the problem is
considered in full 3D, secondary reconnection events can form [Lapenta et al.,
Nature Physics, 11, 690, 2015] in the outflow regions even far downstream from
the primary reconnection site. We revisit here this new idea confirming that
even using additional indicators of reconnection and even considering longer
periods and wider distances the conclusion remains true: secondary reconnection
sites form downstream of a reconnection outflow causing a sort of chain
reaction of cascading reconnection sites. If we are right, MMS will have an
interesting journey even when not necessarily crossing the primary site. The
chances are greatly increased that even if missing a primary site during an
orbit, MMS could stumble instead on one of these secondary sites.
Comment: submitted to the Astronum 2015 Conference Proceeding
TensorFlow Doing HPC
TensorFlow is a popular emerging open-source programming framework supporting
the execution of distributed applications on heterogeneous hardware. While
TensorFlow was initially designed for developing Machine Learning (ML)
applications, it in fact aims at supporting the development of a much
broader range of applications that lie outside the ML domain and can
possibly include HPC applications. However, very few experiments have been
conducted to evaluate TensorFlow performance when running HPC workloads on
supercomputers. This work addresses this gap by designing four traditional HPC
benchmark applications: STREAM, matrix-matrix multiply, Conjugate Gradient (CG)
solver and Fast Fourier Transform (FFT). We analyze their performance on two
supercomputers with accelerators and evaluate the potential of TensorFlow for
developing HPC applications. Our tests show that TensorFlow can fully take
advantage of high performance networks and accelerators on supercomputers.
Running our TensorFlow STREAM benchmark, we obtain over 50% of theoretical
communication bandwidth on our testing platform. We find an approximately 2x,
1.7x and 1.8x performance improvement when increasing the number of GPUs from
two to four in the matrix-matrix multiply, CG and FFT applications
respectively. All our performance results demonstrate that TensorFlow has high
potential to emerge also as an HPC programming framework for heterogeneous
supercomputers.
Comment: Accepted for publication at The Ninth International Workshop on
Accelerators and Hybrid Exascale Systems (AsHES'19
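The scaling figures quoted in this abstract can be turned into parallel efficiencies with a short calculation. The sketch below assumes the usual definition (measured speedup divided by the ideal speedup from the added GPUs); the function name is illustrative and only the three speedups come from the abstract.

```python
def parallel_efficiency(speedup, gpus_before, gpus_after):
    """Measured speedup divided by the ideal speedup for the added GPUs."""
    ideal = gpus_after / gpus_before
    return speedup / ideal


# Speedups reported in the abstract when going from two to four GPUs.
reported = {"matrix-matrix multiply": 2.0, "CG": 1.7, "FFT": 1.8}

efficiency = {
    name: parallel_efficiency(speedup, 2, 4)
    for name, speedup in reported.items()
}

for name, value in efficiency.items():
    print(f"{name}: {value:.0%} of ideal scaling")
```

Under this definition the matrix-matrix multiply scales ideally from two to four GPUs, while CG and FFT retain roughly 85% and 90% of the ideal speedup.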
Čerenkov emission of quasiparallel whistlers by fast electron phase-space holes during magnetic reconnection
Kinetic simulations of magnetotail reconnection have revealed electromagnetic whistlers originating near the exhaust boundary and propagating into the inflow region. The whistler production mechanism is not a linear instability, but rather Čerenkov emission of almost parallel whistlers from localized moving clumps of charge (finite-size quasiparticles) associated with nonlinear coherent electron phase-space holes. Whistlers are strongly excited by holes without ever growing exponentially. In the simulation the whistlers are emitted in the source region from holes that accelerate down the magnetic separatrix towards the x line. The phase velocity of the whistlers, $v_\phi$, in the source region is everywhere well matched to the hole velocity, $v_H$, as required by the Čerenkov condition. The simulation shows emission is most efficient near the theoretical maximum $v_\phi$ = half the electron Alfvén speed, consistent with the new theoretical prediction that faster holes radiate more efficiently. While transferring energy to whistlers the holes lose coherence and dissipate over a few local ion inertial lengths. The whistlers, however, propagate to the x line and out over many tens of ion inertial lengths into the inflow region of reconnection. As the whistlers pass near the x line they modulate the rate at which magnetic field lines reconnect.
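The Čerenkov matching condition and the quoted maximum phase speed can be checked numerically. The sketch below assumes the standard cold-plasma dispersion relation for parallel whistlers, omega = Omega_e (k d_e)^2 / (1 + (k d_e)^2), which the abstract does not spell out; it gives a phase speed of v_Ae * (k d_e) / (1 + (k d_e)^2), where d_e is the electron inertial length and v_Ae the electron Alfvén speed. All function names are illustrative.

```python
import math


def whistler_phase_speed(kde):
    """Phase speed in units of v_Ae for normalized wavenumber k * d_e."""
    return kde / (1.0 + kde * kde)


def fastest_whistler(samples=2000, kde_max=4.0):
    """Scan k * d_e on a uniform grid and return (k * d_e, v_phi / v_Ae) at the peak."""
    grid = [kde_max * (i + 1) / samples for i in range(samples)]
    best = max(grid, key=whistler_phase_speed)
    return best, whistler_phase_speed(best)


def cerenkov_wavenumbers(hole_speed):
    """Solve v_phi(k) = v_H for k * d_e, with v_H given in units of v_Ae.

    Returns an empty tuple when the hole outruns every whistler (v_H > 1/2).
    """
    if hole_speed <= 0.0 or hole_speed > 0.5:
        return ()
    root = math.sqrt(1.0 - 4.0 * hole_speed * hole_speed)
    return ((1.0 - root) / (2.0 * hole_speed), (1.0 + root) / (2.0 * hole_speed))


peak_kde, peak_speed = fastest_whistler()
matches = cerenkov_wavenumbers(0.4)
```

Under this assumed dispersion relation the phase speed peaks at k d_e = 1 with v_phi = v_Ae / 2, and a hole moving faster than that has no whistler to resonate with.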
NVIDIA Tensor Core Programmability, Performance & Precision
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called
"Tensor Core" that performs one matrix-multiply-and-accumulate on 4x4 matrices
per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta
microarchitecture, provides 640 Tensor Cores with a theoretical peak
performance of 125 Tflops/s in mixed precision. In this paper, we investigate
current approaches to program NVIDIA Tensor Cores, their performance and the
precision loss due to computation in mixed precision.
Currently, NVIDIA provides three different ways of programming
matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply
Accumulate (WMMA) API, CUTLASS, a templated library based on WMMA, and cuBLAS
GEMM. After experimenting with different approaches, we found that NVIDIA
Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100
GPU, seven and three times the performance in single and half precision
respectively. A WMMA implementation of batched GEMM reaches a performance of 4
Tflops/s. While precision loss due to matrix multiplication with half precision
input might be critical in many HPC applications, it can be considerably
reduced at the cost of increased computation. Our results indicate that HPC
applications using matrix multiplications can strongly benefit from using
NVIDIA Tensor Cores.
Comment: This paper has been accepted by the Eighth International Workshop on
Accelerators and Hybrid Exascale Systems (AsHES) 201
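The trade-off between precision and extra computation mentioned in this abstract can be emulated on a CPU. The abstract does not say how the precision loss is reduced; the sketch below assumes one possible approach, residual refinement, in which the half-precision rounding residuals of the inputs are multiplied separately and added back as a correction. Inputs are rounded to half precision and sums are accumulated in single precision to mimic mixed-precision matrix-multiply-and-accumulate. All names are illustrative and no GPU API is used.

```python
import random
import struct


def to_half(x):
    """Round a float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_single(x):
    """Round a float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def matmul_mixed(a, b):
    """Multiply half-precision inputs, accumulating in single precision."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = 0.0
            for k in range(m):
                acc = to_single(acc + a[i][k] * b[k][j])
            c[i][j] = acc
    return c


def matmul_reference(a, b):
    """Double-precision reference product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split(matrix):
    """Split a matrix into a half-precision part and a half-precision residual."""
    high = [[to_half(x) for x in row] for row in matrix]
    low = [[to_half(x - h) for x, h in zip(row, hrow)]
           for row, hrow in zip(matrix, high)]
    return high, low


def max_abs_error(c, ref):
    return max(abs(x - y) for row, rrow in zip(c, ref) for x, y in zip(row, rrow))


random.seed(0)
n = 16
a = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
b = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]

a_high, a_low = split(a)
b_high, b_low = split(b)

plain = matmul_mixed(a_high, b_high)
# Two extra products with the residuals: three multiplications instead of one.
fix_a = matmul_mixed(a_low, b_high)
fix_b = matmul_mixed(a_high, b_low)
refined = [[plain[i][j] + fix_a[i][j] + fix_b[i][j] for j in range(n)]
           for i in range(n)]

reference = matmul_reference(a, b)
plain_error = max_abs_error(plain, reference)
refined_error = max_abs_error(refined, reference)
```

In this emulation the refined product costs three multiplications instead of one, and its maximum error relative to the double-precision reference is expected to be far smaller than that of the plain half-precision product.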
Signatures of Secondary Collisionless Magnetic Reconnection Driven by Kink Instability of a Flux Rope
The kinetic features of secondary magnetic reconnection in a single flux rope
undergoing internal kink instability are studied by means of three-dimensional
Particle-in-Cell simulations. Several signatures of secondary magnetic
reconnection are identified in the plane perpendicular to the flux rope: a
quadrupolar electron and ion density structure and a bipolar Hall magnetic
field develop in proximity of the reconnection region. The most intense
electric fields form perpendicularly to the local magnetic field, and a
reconnection electric field is identified in the plane perpendicular to the
flux rope. An electron current develops along the reconnection line in the
opposite direction of the electron current supporting the flux rope magnetic
field structure. Along the reconnection line, several bipolar structures of the
electric field parallel to the magnetic field occur, making the magnetic
reconnection region turbulent. The reported signatures of secondary magnetic
reconnection can help to localize magnetic reconnection events in space,
astrophysical and fusion plasmas.
Nonlinear evolution of the magnetized Kelvin-Helmholtz instability: from fluid to kinetic modeling
The nonlinear evolution of collisionless plasmas is typically a multi-scale
process where the energy is injected at large, fluid scales and dissipated at
small, kinetic scales. Accurately modelling the global evolution requires
taking into account the main micro-scale physical processes of interest. This is
why comparison of different plasma models is today an imperative task aiming at
understanding cross-scale processes in plasmas. We report here the first
comparative study of the evolution of a magnetized shear flow, through a
variety of different plasma models by using magnetohydrodynamic, Hall-MHD,
two-fluid, hybrid kinetic and full kinetic codes. Kinetic relaxation effects
are discussed to emphasize the need for kinetic equilibria to study the
dynamics of collisionless plasmas in nontrivial configurations. Discrepancies
between models are studied both in the linear and in the nonlinear regime of
the magnetized Kelvin-Helmholtz instability, to highlight the effects of small
scale processes on the nonlinear evolution of collisionless plasmas. We
illustrate how the evolution of a magnetized shear flow depends on the relative
orientation of the fluid vorticity with respect to the magnetic field direction
during the linear evolution when kinetic effects are taken into account. Although
we find that small-scale processes differ between the different models, we
show that the feedback from small, kinetic scales to large, fluid scales is
negligible in the nonlinear regime. This study shows that the kinetic modeling
validates the use of a fluid approach at large scales, which encourages the
development and use of fluid codes to study the nonlinear evolution of
magnetized fluid flows, even in the collisionless regime.
A body at the edge of language: writing anorexia, bulimia and recovering
This practice-led life writing project explores this writer-scholar's experience of her eating disorder through a series of poetic essays developed from material and somatic writing methods including ink-and-paper, found text, and movement. Through these particular methods, and the episodic acts of the writing itself, this PhD discovers a form of somatic life writing that both demonstrates and analyses the lived experience of this psycho-somatic disorder. This research project responds to the challenges of writing anorexia, bulimia and recovering, by developing material writing methods to negotiate self-erasure, narrative authority and embodied memory on the page. The PhD examines the symbiotic relation between writing and (not) eating in ways that are analogous, metaphoric and mutually affective. It draws on a range of writers and feminist materialist scholars to propose that when the tensions of eating disorder are transposed to language and navigated on the page, moments can be found where bodies and writing are constituted and de-constituted. In locating their life-affirming entanglement, this writing practice counteracts the erasure and containment of the condition
Desynchronization and Wave Pattern Formation in MPI-Parallel and Hybrid Memory-Bound Programs
Analytic, first-principles performance modeling of distributed-memory
parallel codes is notoriously imprecise. Even for applications with extremely
regular and homogeneous compute-communicate phases, simply adding communication
time to computation time often does not yield a satisfactory prediction of
parallel runtime due to deviations from the expected simple lockstep pattern
caused by system noise, variations in communication time, and inherent load
imbalance. In this paper, we highlight the specific cases of provoked and
spontaneous desynchronization of memory-bound, bulk-synchronous pure MPI and
hybrid MPI+OpenMP programs. Using simple microbenchmarks we observe that
although desynchronization can introduce increased waiting time per process, it
does not necessarily cause lower resource utilization but can lead to an
increase in available bandwidth per core. In case of significant communication
overhead, even natural noise can shove the system into a state of automatic
overlap of communication and computation, improving the overall time to
solution. The saturation point, i.e., the number of processes per memory domain
required to achieve full memory bandwidth, is pivotal in the dynamics of this
process and the emerging stable wave pattern. We also demonstrate how hybrid
MPI+OpenMP programming can prevent desirable desynchronization by eliminating
the bandwidth bottleneck among processes. A Chebyshev filter diagonalization
application is used to demonstrate some of the observed effects in a realistic
setting.
Comment: 18 pages, 8 figures
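The role of the saturation point can be illustrated with a toy bandwidth model. The sketch below assumes that the bandwidth of a memory domain grows linearly with the number of concurrently streaming processes until it reaches a fixed saturated value; this linear-then-flat shape, the function names, and the numbers are all assumptions for illustration and are not taken from the paper.

```python
import math


def saturation_point(b_core, b_sat):
    """Smallest number of processes that reaches the saturated bandwidth."""
    return math.ceil(b_sat / b_core)


def bandwidth_per_core(active, b_core, b_sat):
    """Bandwidth seen by each of `active` concurrently streaming processes."""
    if active <= 0:
        raise ValueError("need at least one active process")
    return min(b_core, b_sat / active)


# Hypothetical numbers: 10 GB/s for a single core, 40 GB/s for the full domain.
B_CORE, B_SAT = 10.0, 40.0

n_sat = saturation_point(B_CORE, B_SAT)
in_lockstep = bandwidth_per_core(8, B_CORE, B_SAT)    # all 8 processes stream at once
desynchronized = bandwidth_per_core(3, B_CORE, B_SAT)  # 5 of 8 are communicating
```

In this toy model, eight processes streaming in lockstep share the saturated bandwidth, whereas three processes that stream while the others communicate each see the full single-core bandwidth, which is the per-core gain the abstract attributes to desynchronization.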
Kinetic simulations of magnetic reconnection in presence of a background O+ population
Particle-in-Cell simulations of magnetic reconnection with an H+ current
sheet and a mixed background plasma of H+ and O+ ions are completed using
physical mass ratios. Four main results are shown. First, the O+ presence
slightly decreases the reconnection rate and the magnetic reconnection
evolution depends mainly on the lighter H+ ion species in the presented
simulations. Second, the Hall magnetic field is characterized by a two-scale
structure in the presence of O+ ions: it reaches sharp peak values in a small area
in proximity of the neutral line, and then decreases slowly over a large
region. Third, the two background species initially separate in the outflow
region because H+ and O+ ions are accelerated by different mechanisms occurring
on different time scales and with different strengths. Fourth, the effect of a
guide field on the O+ dynamics is studied: the O+ presence does not change the
reconnected flux and all the characteristic features of guide field magnetic
reconnection are still present. Moreover, the guide field introduces an O+
circulation pattern between separatrices that enhances high O+ density areas
and depletes low O+ density regions in proximity of the reconnection fronts.
The importance and the validity of these results are finally discussed.
