Tuning between singlet, triplet, and mixed pairing states in an extended Hubbard chain
We study spin-half fermions in a one-dimensional extended Hubbard chain at
low filling. We identify three triplet and one singlet pairing channels in the
system, which are independently tunable as a function of nearest-neighbor
charge and spin interactions. In a large-size system with translational
invariance, we derive gap equations for the corresponding pairing gaps and
obtain a Bogoliubov-de Gennes Hamiltonian with its non-trivial topology
determined by the interplay of these gaps. In an open-end system with a fixed
number of particles, we compute the exact many-body ground state and identify
the dominant pairing revealed by the pair density matrix. Both cases show
competition between the four pairing states, resulting in broad regions for
each of them and relatively narrow regions for mixed-pairing states in the
parameter space. Our results make it possible to tune a nanowire
between singlet and triplet pairing states without breaking time-reversal or
SU(2) symmetry, accompanied by a change in the system's topology. Comment: 17 pages, 6 figures
Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that need computation and communication at scale. Most
commonly used distributed training approaches for TF can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters including the Piz
Daint system (#6 on Top500). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduce latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster. Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-review
The impact of foreign trading information on emerging futures markets: a study of Taiwan's unique data set
Using a unique dataset from the Taiwan Futures Exchange, this paper investigates whether trading imbalances by foreign investors affect the emerging Taiwan futures market in terms of returns and volatility. First, the evidence demonstrates a positive relation between contemporaneous futures returns and net purchases by foreign investors when other market factor effects are controlled. Second, the failure to detect price reversals is inconsistent with the price pressure hypothesis. Third, foreign investors do not exhibit positive feedback trading patterns. Fourth, a bi-directional Granger-causality relationship exists between futures volatility and foreign trading flows. As found for other stock or foreign exchange markets, our empirical results demonstrate that foreign trading flows affect the return and volatility of a developing futures market, suggesting that trading by foreign investors may enhance the information flow of the local futures market.
Engineering of many-body Majorana states in a topological insulator/s-wave superconductor heterostructure
We study a vortex chain in a thin film of a topological insulator with
proximity-induced superconductivity---a promising platform to realize Majorana
zero modes (MZMs)---by modeling it as a two-leg Majorana ladder. While each
pair of MZMs hybridizes through vortex tunneling, we show that MZMs can
be stabilized at the ends of the ladder in the presence of a tilted external
magnetic field and a four-Majorana interaction. Furthermore, a rich phase
diagram is obtained by controlling the direction of the magnetic field and the
thickness of the sample. We reveal many-body Majorana states and
interaction-induced topological phase transitions and also identify
trivial-superconducting and commensurate/incommensurate charge-density-wave
states in the phase diagram. Comment: 10 pages, 4 figures
AN OBSTACLE DETECTION SYSTEM USING DEPTH INFORMATION AND REGION GROWING FOR VISUALLY IMPAIRED PEOPLE
This study proposes an obstacle detection method based on depth information to help
visually impaired people avoid obstacles as they move through an unfamiliar
environment. First, we apply morphological dilation and erosion to remove
fragmented noise from the depth image, and use the Least
Squares Method (LSM) with a quadratic polynomial to approximate floor curves and
determine the floor height threshold in the V-disparity. Second, we search
for abrupt changes in depth value relative to the floor height threshold to find
candidate stair edge points. Third, we use the Hough Transform to
locate the drop line. To strengthen the characteristics of
different objects and overcome the drawbacks of the region growing method, we
apply edge detection to remove the edges. Fourth, we use the floor height
threshold and features of the ground to remove the ground plane. The system then
uses the region growing method to label the different objects and analyzes
each object to determine whether it is a stair. Fifth, if the object is neither an
up stair nor a down stair, we use the K-SVD algorithm to determine whether the object
is a person. Finally, the system helps users determine the stair direction
and obstacle distance through a voice prompt generated by Text To Speech (TTS). Experimental
results show that the proposed system is robust and convenient to use. Sponsorship: National Taipei University. Conference type: international. Conference dates: 2015-07-18 to 2015-07-19. Book type: electronic edition. Call for papers: yes. Conference location: Tokyo, Japan
Novel CMOS RFIC Layout Generation with Concurrent Device Placement and Fixed-Length Microstrip Routing
With advancing process technologies and booming IoT markets, millimeter-wave
CMOS RFICs have been widely developed in recent years. Since the performance
of CMOS RFICs is very sensitive to the precision of the layout, precise
placement of devices and precisely matched microstrip lengths to given values
have been a labor-intensive and time-consuming task, and have thus become a major
bottleneck for time to market. This paper introduces a progressive
integer-linear-programming-based method to generate high-quality RFIC layouts
satisfying very stringent routing requirements of microstrip lines, including
spacing/non-crossing rules, precise length, and bend number minimization,
within a given layout area. The resulting RFIC layouts excel in both
performance and area with far fewer bends compared with the simulation-tuning
based manual layout, while the layout generation time is significantly
reduced from weeks to half an hour. Comment: ACM/IEEE Design Automation Conference (DAC), 201
Magnetic field structure in the Flattened Envelope and Jet in the young protostellar system HH 211
HH 211 is a young Class 0 protostellar system, with a flattened envelope, a
possible rotating disk, and a collimated jet. We have mapped it with the
Submillimeter Array in 341.6 GHz continuum and SiO J=8-7 at ~0.6 arcsec resolution.
The continuum traces the thermal dust emission in the flattened envelope and
the possible disk. Linear polarization is detected in the continuum in the
flattened envelope. The field lines implied from the polarization have
different orientations, but they are not incompatible with current
gravitational collapse models, which predict different orientations depending on
the region/distance. Also, we might have detected for the first time polarized
SiO line emission in the jet due to the Goldreich-Kylafis effect. Observations
at higher sensitivity are needed to determine the field morphology in the jet. Comment: 5 pages, 2 figures
Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?
Dense Multi-GPU systems have recently gained a lot of attention in the HPC
arena. Traditionally, MPI runtimes have been primarily designed for clusters
with a large number of nodes. However, with the advent of MPI+CUDA applications
and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important
to address efficient communication schemes for such dense Multi-GPU nodes. This,
coupled with new application workloads brought forward by Deep Learning
frameworks like Caffe and Microsoft CNTK, poses additional design constraints due
to very large message communication of GPU buffers during the training phase.
In this context, special-purpose libraries like NVIDIA NCCL have been proposed
for GPU-based collective communication on dense GPU systems. In this paper, we
propose a pipelined chain (ring) design for the MPI_Bcast collective operation
along with an enhanced collective tuning framework in MVAPICH2-GDR that enables
efficient intra-/inter-node multi-GPU communication. We present an in-depth
performance landscape for the proposed MPI_Bcast schemes along with a
comparative analysis of NVIDIA NCCL Broadcast and NCCL-based MPI_Bcast. The
proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement,
compared to NCCL-based solutions, for intra- and inter-node broadcast latency,
respectively. In addition, the proposed designs provide up to 7% improvement
over NCCL-based solutions for data parallel training of the VGG network on 128
GPUs using Microsoft CNTK. Comment: 8 pages, 3 figures
