Sparse Communication for Distributed Gradient Descent
We make distributed stochastic gradient descent faster by exchanging sparse updates instead of dense updates. Gradient updates are positively skewed, as most updates are near zero, so we map the 99% smallest updates (by absolute value) to zero and then exchange sparse matrices. This method can be combined with quantization to further improve the compression. We explore different configurations and apply them to neural machine translation and MNIST image classification tasks. Most configurations work on MNIST, whereas different configurations reduce convergence rate on the more complex translation task. Our experiments show that we can achieve up to 49% speed-up on MNIST and 22% on NMT without damaging the final accuracy or BLEU.
Comment: EMNLP 2017
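As a rough illustration of the drop-the-smallest-99% idea described above, the NumPy sketch below zeroes all but the largest-magnitude 1% of a gradient tensor and returns the surviving entries in sparse (index, value) form; the function name and the drop-ratio parameter are illustrative, not the authors' implementation.

    import numpy as np

    def sparsify_gradient(grad, drop_ratio=0.99):
        """Zero the drop_ratio smallest entries of |grad| and return the rest
        as (indices, values), i.e. the sparse update a worker would exchange."""
        flat = grad.ravel()
        threshold = np.quantile(np.abs(flat), drop_ratio)   # cut-off magnitude
        keep = np.abs(flat) >= threshold                    # ~1% largest updates
        indices = np.nonzero(keep)[0]
        values = flat[indices]
        return indices, values

    # Toy example: roughly 1% of a random gradient survives sparsification.
    grad = np.random.randn(1000, 100).astype(np.float32)
    idx, vals = sparsify_gradient(grad)
    print(f"exchanging {idx.size} of {grad.size} values")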
Copied Monolingual Data Improves Low-Resource Neural Machine Translation
We train a neural machine translation (NMT) system to both translate source-language text and copy target-language text, thereby exploiting monolingual corpora in the target language. Specifically, we create a bitext from the monolingual text in the target language so that each source sentence is identical to the target sentence. This copied data is then mixed with the parallel corpus, and the NMT system is trained as usual, with no metadata to distinguish the two input languages. Our proposed method proves to be an effective way of incorporating monolingual data into low-resource NMT. On Turkish↔English and Romanian↔English translation tasks, we see gains of up to 1.2 BLEU over a strong baseline with back-translation. Further analysis shows that the linguistic phenomena behind these gains are different from and largely orthogonal to back-translation, with our copied corpus method improving accuracy on named entities and other words that should remain identical between the source and target languages.
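A minimal sketch of how such a copied bitext could be assembled before mixing it with the parallel corpus; the function and variable names, and the toy sentences, are illustrative.

    def build_copied_bitext(monolingual_target_sentences):
        """Pair each target-language sentence with an identical copy of itself,
        producing pseudo-parallel (source, target) training examples."""
        return [(sentence, sentence) for sentence in monolingual_target_sentences]

    # Mix the copied pairs with the real parallel data and train as usual,
    # with no metadata marking which pairs are copies.
    parallel = [("merhaba dünya", "hello world")]          # toy Turkish-English pair
    monolingual_en = ["the weather is nice today ."]
    training_data = parallel + build_copied_bitext(monolingual_en)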
The University of Edinburgh’s Neural MT Systems for WMT17
This paper describes the University of Edinburgh's submissions to the WMT17 shared news translation and biomedical translation tasks. We participated in 12 translation directions for news, translating between English and Czech, German, Latvian, Russian, Turkish and Chinese. For the biomedical task we submitted systems for English to Czech, German, Polish and Romanian. Our systems are neural machine translation systems trained with Nematus, an attentional encoder-decoder. We follow our setup from last year and build BPE-based models with parallel and back-translated monolingual training data. Novelties this year include the use of deep architectures, layer normalization, and more compact models due to weight tying and improvements in BPE segmentations. We perform extensive ablative experiments, reporting on the effectiveness of layer normalization, deep architectures, and different ensembling techniques.
Comment: WMT 2017 shared task track; for BibTeX, see http://homepages.inf.ed.ac.uk/rsennric/bib.html#uedin-nmt:201
Approaching Neural Chinese Word Segmentation as a Low-Resource Machine Translation Task
Supervised Chinese word segmentation has entered the deep learning era, which reduces the hassle of feature engineering. Recently, some researchers attempted to treat it as character-level translation, which further simplified model designing and building, but there is still a performance gap between the translation-based approach and other methods. In this work, we apply the best practices from low-resource neural machine translation to Chinese word segmentation. We build encoder-decoder models with attention, and examine a series of techniques including regularization, data augmentation, objective weighting, transfer learning and ensembling. Our method is generic for word segmentation, without the need for feature engineering or model implementation. In the closed test with constrained data, our method ties with the state of the art on the MSR dataset and is comparable to other methods on the PKU dataset.
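One way to frame segmentation as character-level translation is sketched below for illustration: the source side is the raw character stream, and the target side repeats the characters with an explicit boundary token between words. The boundary symbol and helper name are assumptions, not the paper's exact data format.

    def to_translation_pair(segmented_sentence, boundary="<sep>"):
        """Turn a gold-segmented sentence (words separated by spaces) into a
        character-level (source, target) pair for an encoder-decoder model."""
        words = segmented_sentence.split()
        source = " ".join(ch for word in words for ch in word)           # raw characters
        target = f" {boundary} ".join(" ".join(word) for word in words)  # characters plus word boundaries
        return source, target

    src, tgt = to_translation_pair("我 爱 北京")
    # src: "我 爱 北 京"
    # tgt: "我 <sep> 爱 <sep> 北 京"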
Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation
The attention mechanism is the crucial component of the transformer architecture. Recent research shows that most attention heads are not confident in their decisions and can be pruned. However, removing them before training a model results in lower quality. In this paper, we apply the lottery ticket hypothesis to prune heads in the early stages of training. Our experiments on machine translation show that it is possible to remove up to three-quarters of attention heads from transformer-big during early training with an average -0.1 change in BLEU for Turkish→English. The pruned model is 1.5 times as fast at inference, albeit at the cost of longer training. Our method is complementary to other approaches, such as teacher-student, with an English→German student model gaining an additional 10% speed-up with 75% of encoder attention removed and 0.2 BLEU loss.
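A rough sketch of head pruning by masking, assuming PyTorch: zeroing a head's output before the final projection removes its contribution entirely. The importance scoring and keep fraction below are generic placeholders rather than the paper's lottery-ticket procedure.

    import torch
    import torch.nn as nn

    class PrunableSelfAttention(nn.Module):
        """Multi-head self-attention in which individual heads can be switched
        off with a binary mask, removing their contribution entirely."""

        def __init__(self, d_model, n_heads):
            super().__init__()
            assert d_model % n_heads == 0
            self.n_heads, self.d_head = n_heads, d_model // n_heads
            self.qkv = nn.Linear(d_model, 3 * d_model)
            self.out = nn.Linear(d_model, d_model)
            self.register_buffer("head_mask", torch.ones(n_heads))  # 1 = keep, 0 = pruned

        def forward(self, x):                                  # x: (batch, seq, d_model)
            b, t, _ = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
            context = attn @ v                                 # (batch, heads, seq, d_head)
            context = context * self.head_mask.view(1, -1, 1, 1)  # drop pruned heads
            return self.out(context.transpose(1, 2).reshape(b, t, -1))

    def prune_heads(layer, importance, keep_fraction=0.25):
        """Keep only the top keep_fraction of heads by some importance score
        (e.g. accumulated gradient magnitude) and mask out the rest."""
        k = max(1, int(layer.n_heads * keep_fraction))
        mask = torch.zeros(layer.n_heads)
        mask[torch.topk(importance, k).indices] = 1.0
        layer.head_mask.copy_(mask)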
Topological and site disorder in boron nitride networks
Amorphous boron nitride (a-BN) is modelled over a wide range of densities using a relatively simple potential model augmented with site charges. The local topology (defined, for example, through the total nearest-neighbour coordination number) appears near-constant across a wide range of densities and site charges. Furthermore, total scattering and total pair distribution functions also show few changes as a function of either density or site charge. Variation of the site charges directly controls the level of site (rather than topological) disorder, meaning that although total pair functions may be near-constant, the underlying partial contributions may be very different. Direct contact is made with both experiment and (more recent) density-functional theory-based modelling work.
Approaching Neural Chinese Word Segmentation as a Low-Resource Machine Translation Task
Chinese word segmentation has entered the deep learning era, which greatly reduces the hassle of feature engineering. Recently, some researchers attempted to treat it as character-level translation, which further simplified model designing, but there is a performance gap between the translation-based approach and other methods. This motivates our work, in which we apply the best practices from low-resource neural machine translation to supervised Chinese segmentation. We examine a series of techniques including regularization, data augmentation, objective weighting, transfer learning, and ensembling. Compared to previous works, our low-resource translation-based method maintains the effortless model design, yet achieves the same result as the state of the art in the constrained evaluation without using additional data.
Making Asynchronous Stochastic Gradient Descent Work for Transformers
Asynchronous stochastic gradient descent (SGD) is attractive from a speed perspective because workers do not wait for synchronization. However, the Transformer model converges poorly with asynchronous SGD, resulting in substantially lower quality compared to synchronous SGD. To investigate why this is the case, we isolate differences between asynchronous and synchronous methods, examining batch size and staleness effects. We find that summing several asynchronous updates, rather than applying them immediately, restores convergence behavior. With this hybrid method, Transformer training for a neural machine translation task reaches a near-convergence level 1.36x faster in single-node multi-GPU training with no impact on model quality.
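A toy sketch of the summing idea, viewed as a simple parameter server with NumPy: gradients arriving asynchronously from workers are accumulated and only applied once a fixed number of them has been summed. Class and parameter names are illustrative, not the paper's implementation.

    import numpy as np

    class AccumulatingParameterServer:
        """Collects gradients that arrive asynchronously from workers and only
        applies them once `accumulate` updates have been summed, mimicking a
        larger synchronous batch while workers never wait for each other."""

        def __init__(self, params, lr=0.1, accumulate=4):
            self.params = params                  # dict of name -> np.ndarray
            self.lr = lr
            self.accumulate = accumulate
            self.buffer = {k: np.zeros_like(v) for k, v in params.items()}
            self.pending = 0

        def push_gradient(self, grads):
            """Called whenever any worker finishes a mini-batch."""
            for name, g in grads.items():
                self.buffer[name] += g
            self.pending += 1
            if self.pending == self.accumulate:   # apply the summed update once
                for name in self.params:
                    self.params[name] -= self.lr * self.buffer[name]
                    self.buffer[name].fill(0.0)
                self.pending = 0

    # Toy usage: gradients for one parameter arriving from workers in any order.
    server = AccumulatingParameterServer({"w": np.zeros(3)}, lr=0.5, accumulate=2)
    server.push_gradient({"w": np.array([1.0, 0.0, -1.0])})
    server.push_gradient({"w": np.array([0.0, 2.0, 0.0])})
    print(server.params["w"])   # [-0.5 -1.   0.5]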
