Sparse Communication for Distributed Gradient Descent
We make distributed stochastic gradient descent faster by exchanging sparse updates instead of dense updates. Gradient updates are positively skewed, as most updates are near zero, so we map the 99% smallest updates (by absolute value) to zero and then exchange sparse matrices. This method can be combined with quantization to further improve the compression. We explore different configurations and apply them to neural machine translation and MNIST image classification tasks. Most configurations work on MNIST, whereas different configurations reduce convergence rate on the more complex translation task. Our experiments show that we can achieve up to 49% speed-up on MNIST and 22% on NMT without damaging the final accuracy or BLEU.
Comment: EMNLP 2017
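As a rough illustration of the drop-the-smallest-99% idea described above, the NumPy sketch below zeroes all but the largest-magnitude 1% of a gradient tensor and returns the surviving entries in sparse (index, value) form; the function name and the drop-ratio parameter are illustrative, not the authors' implementation.

    import numpy as np

    def sparsify_gradient(grad, drop_ratio=0.99):
        """Zero the drop_ratio smallest entries of |grad| and return the rest
        as (indices, values), i.e. the sparse update a worker would exchange."""
        flat = grad.ravel()
        threshold = np.quantile(np.abs(flat), drop_ratio)   # cut-off magnitude
        keep = np.abs(flat) >= threshold                    # ~1% largest updates
        indices = np.nonzero(keep)[0]
        values = flat[indices]
        return indices, values

    # Toy example: roughly 1% of a random gradient survives sparsification.
    grad = np.random.randn(1000, 100).astype(np.float32)
    idx, vals = sparsify_gradient(grad)
    print(f"exchanging {idx.size} of {grad.size} values")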
Copied Monolingual Data Improves Low-Resource Neural Machine Translation
We train a neural machine translation (NMT) system to both translate source-language text and copy target-language text, thereby exploiting monolingual corpora in the target language. Specifically, we create a bitext from the monolingual text in the target language so that each source sentence is identical to the target sentence. This copied data is then mixed with the parallel corpus, and the NMT system is trained as usual, with no metadata to distinguish the two input languages. Our proposed method proves to be an effective way of incorporating monolingual data into low-resource NMT. On Turkish↔English and Romanian↔English translation tasks, we see gains of up to 1.2 BLEU over a strong baseline with back-translation. Further analysis shows that the linguistic phenomena behind these gains are different from and largely orthogonal to back-translation, with our copied corpus method improving accuracy on named entities and other words that should remain identical between the source and target languages.
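A minimal sketch of how such a copied bitext could be assembled before mixing it with the parallel corpus; the function and variable names, and the toy sentences, are illustrative.

    def build_copied_bitext(monolingual_target_sentences):
        """Pair each target-language sentence with an identical copy of itself,
        producing pseudo-parallel (source, target) training examples."""
        return [(sentence, sentence) for sentence in monolingual_target_sentences]

    # Mix the copied pairs with the real parallel data and train as usual,
    # with no metadata marking which pairs are copies.
    parallel = [("merhaba dünya", "hello world")]          # toy Turkish-English pair
    monolingual_en = ["the weather is nice today ."]
    training_data = parallel + build_copied_bitext(monolingual_en)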
The University of Edinburgh’s Neural MT Systems for WMT17
This paper describes the University of Edinburgh's submissions to the WMT17 shared news translation and biomedical translation tasks. We participated in 12 translation directions for news, translating between English and Czech, German, Latvian, Russian, Turkish and Chinese. For the biomedical task we submitted systems for English to Czech, German, Polish and Romanian. Our systems are neural machine translation systems trained with Nematus, an attentional encoder-decoder. We follow our setup from last year and build BPE-based models with parallel and back-translated monolingual training data. Novelties this year include the use of deep architectures, layer normalization, and more compact models due to weight tying and improvements in BPE segmentations. We perform extensive ablative experiments, reporting on the effectiveness of layer normalization, deep architectures, and different ensembling techniques.
Comment: WMT 2017 shared task track; for BibTeX, see http://homepages.inf.ed.ac.uk/rsennric/bib.html#uedin-nmt:201
Approaching Neural Chinese Word Segmentation as a Low-Resource Machine Translation Task
Supervised Chinese word segmentation has entered the deep learning era, which reduces the hassle of feature engineering. Recently, some researchers attempted to treat it as character-level translation, which further simplified model designing and building, but there is still a performance gap between the translation-based approach and other methods. In this work, we apply the best practices from low-resource neural machine translation to Chinese word segmentation. We build encoder-decoder models with attention, and examine a series of techniques including regularization, data augmentation, objective weighting, transfer learning and ensembling. Our method is generic for word segmentation, without the need for feature engineering or model implementation. In the closed test with constrained data, our method ties with the state of the art on the MSR dataset and is comparable to other methods on the PKU dataset.
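One way to frame segmentation as character-level translation is sketched below for illustration: the source side is the raw character stream, and the target side repeats the characters with an explicit boundary token between words. The boundary symbol and helper name are assumptions, not the paper's exact data format.

    def to_translation_pair(segmented_sentence, boundary="<sep>"):
        """Turn a gold-segmented sentence (words separated by spaces) into a
        character-level (source, target) pair for an encoder-decoder model."""
        words = segmented_sentence.split()
        source = " ".join(ch for word in words for ch in word)           # raw characters
        target = f" {boundary} ".join(" ".join(word) for word in words)  # characters plus word boundaries
        return source, target

    src, tgt = to_translation_pair("我 爱 北京")
    # src: "我 爱 北 京"
    # tgt: "我 <sep> 爱 <sep> 北 京"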
Losing Heads in the Lottery: Pruning Transformer Attention in Neural Machine Translation
The attention mechanism is the crucial component of the transformer architecture. Recent research shows that most attention heads are not confident in their decisions and can be pruned. However, removing them before training a model results in lower quality. In this paper, we apply the lottery ticket hypothesis to prune heads in the early stages of training. Our experiments on machine translation show that it is possible to remove up to three-quarters of attention heads from transformer-big during early training with an average -0.1 change in BLEU for Turkish→English. The pruned model is 1.5 times as fast at inference, albeit at the cost of longer training. Our method is complementary to other approaches, such as teacher-student, with an English→German student model gaining an additional 10% speed-up with 75% of encoder attention removed and 0.2 BLEU loss.
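A rough sketch of head pruning by masking, assuming PyTorch: zeroing a head's output before the final projection removes its contribution entirely. The importance scoring and keep fraction below are generic placeholders rather than the paper's lottery-ticket procedure.

    import torch
    import torch.nn as nn

    class PrunableSelfAttention(nn.Module):
        """Multi-head self-attention in which individual heads can be switched
        off with a binary mask, removing their contribution entirely."""

        def __init__(self, d_model, n_heads):
            super().__init__()
            assert d_model % n_heads == 0
            self.n_heads, self.d_head = n_heads, d_model // n_heads
            self.qkv = nn.Linear(d_model, 3 * d_model)
            self.out = nn.Linear(d_model, d_model)
            self.register_buffer("head_mask", torch.ones(n_heads))  # 1 = keep, 0 = pruned

        def forward(self, x):                                  # x: (batch, seq, d_model)
            b, t, _ = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
            context = attn @ v                                 # (batch, heads, seq, d_head)
            context = context * self.head_mask.view(1, -1, 1, 1)  # drop pruned heads
            return self.out(context.transpose(1, 2).reshape(b, t, -1))

    def prune_heads(layer, importance, keep_fraction=0.25):
        """Keep only the top keep_fraction of heads by some importance score
        (e.g. accumulated gradient magnitude) and mask out the rest."""
        k = max(1, int(layer.n_heads * keep_fraction))
        mask = torch.zeros(layer.n_heads)
        mask[torch.topk(importance, k).indices] = 1.0
        layer.head_mask.copy_(mask)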
Topological and site disorder in boron nitride networks
Amorphous boron nitride (a-BN) is modelled over a wide range of densities using a relatively simple potential model augmented with site charges. The local topology (defined, for example, through the total nearest-neighbour coordination number) appears near-constant across a wide range of densities and site charges. Furthermore, total scattering and total pair distribution functions also show few changes as a function of either density or site charge. Variation of the site charges directly controls the level of site (rather than topological) disorder, meaning that although total pair functions may be near-constant, the underlying partial contributions may be very different. Direct contact is made with both experiment and (more recent) density-functional theory-based modelling work.
Approaching Neural Chinese Word Segmentation as a Low-Resource Machine Translation Task
Chinese word segmentation has entered the deep learning era, which greatly reduces the hassle of feature engineering. Recently, some researchers attempted to treat it as character-level translation, which further simplified model designing, but there is a performance gap between the translation-based approach and other methods. This motivates our work, in which we apply the best practices from low-resource neural machine translation to supervised Chinese segmentation. We examine a series of techniques including regularization, data augmentation, objective weighting, transfer learning, and ensembling. Compared to previous works, our low-resource translation-based method maintains the effortless model design, yet achieves the same result as the state of the art in the constrained evaluation without using additional data.
Making Asynchronous Stochastic Gradient Descent Work for Transformers
Asynchronous stochastic gradient descent (SGD) is attractive from a speed perspective because workers do not wait for synchronization. However, the Transformer model converges poorly with asynchronous SGD, resulting in substantially lower quality compared to synchronous SGD. To investigate why this is the case, we isolate differences between asynchronous and synchronous methods, examining batch size and staleness effects. We find that summing several asynchronous updates, rather than applying them immediately, restores convergence behavior. With this hybrid method, Transformer training for a neural machine translation task reaches a near-convergence level 1.36x faster in single-node multi-GPU training with no impact on model quality.
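A toy sketch of the summing idea, viewed as a simple parameter server with NumPy: gradients arriving asynchronously from workers are accumulated and only applied once a fixed number of them has been summed. Class and parameter names are illustrative, not the paper's implementation.

    import numpy as np

    class AccumulatingParameterServer:
        """Collects gradients that arrive asynchronously from workers and only
        applies them once `accumulate` updates have been summed, mimicking a
        larger synchronous batch while workers never wait for each other."""

        def __init__(self, params, lr=0.1, accumulate=4):
            self.params = params                  # dict of name -> np.ndarray
            self.lr = lr
            self.accumulate = accumulate
            self.buffer = {k: np.zeros_like(v) for k, v in params.items()}
            self.pending = 0

        def push_gradient(self, grads):
            """Called whenever any worker finishes a mini-batch."""
            for name, g in grads.items():
                self.buffer[name] += g
            self.pending += 1
            if self.pending == self.accumulate:   # apply the summed update once
                for name in self.params:
                    self.params[name] -= self.lr * self.buffer[name]
                    self.buffer[name].fill(0.0)
                self.pending = 0

    # Toy usage: gradients for one parameter arriving from workers in any order.
    server = AccumulatingParameterServer({"w": np.zeros(3)}, lr=0.5, accumulate=2)
    server.push_gradient({"w": np.array([1.0, 0.0, -1.0])})
    server.push_gradient({"w": np.array([0.0, 2.0, 0.0])})
    print(server.params["w"])   # [-0.5 -1.   0.5]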
