
    Sparse Communication for Distributed Gradient Descent

    We make distributed stochastic gradient descent faster by exchanging sparse updates instead of dense updates. Gradient updates are positively skewed, as most updates are near zero, so we map the 99% smallest updates (by absolute value) to zero and then exchange sparse matrices. This method can be combined with quantization to further improve compression. We explore different configurations and apply them to neural machine translation and MNIST image classification tasks. Most configurations work on MNIST, whereas different configurations reduce the convergence rate on the more complex translation task. Our experiments show that we can achieve up to a 49% speed-up on MNIST and 22% on NMT without damaging the final accuracy or BLEU.
    Comment: EMNLP 2017
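    The following is a minimal sketch of the gradient-dropping idea summarised in this abstract, not the authors' implementation: per tensor, only the roughly 1% largest-magnitude gradient entries are kept and exchanged as a sparse update. The function name drop_gradient, the per-tensor selection, and the locally retained residual are illustrative assumptions.

        import numpy as np

        def drop_gradient(grad, drop_ratio=0.99):
            """Zero out the drop_ratio smallest-magnitude entries of grad.

            Returns (indices, values) of the surviving ~1% of entries, i.e. a
            sparse update suitable for exchange, plus the dropped residual that
            a worker might accumulate locally for later rounds.
            """
            flat = grad.ravel()
            k = max(1, int(round(flat.size * (1.0 - drop_ratio))))    # entries to keep
            threshold = np.partition(np.abs(flat), -k)[-k]            # k-th largest |g|
            mask = np.abs(flat) >= threshold
            indices = np.nonzero(mask)[0]
            values = flat[indices]
            residual = np.where(mask, 0.0, flat).reshape(grad.shape)  # dropped mass
            return indices, values, residual

        # Toy usage: exchange only ~1% of a gradient tensor.
        rng = np.random.default_rng(0)
        grad = rng.normal(scale=1e-3, size=1000)
        idx, vals, res = drop_gradient(grad)
        print(f"kept {idx.size} of {grad.size} entries")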

    Copied Monolingual Data Improves Low-Resource Neural Machine Translation

    We train a neural machine translation (NMT) system to both translate source-language text and copy target-language text, thereby exploiting monolingual corpora in the target language. Specifically, we create a bitext from the monolingual text in the target language so that each source sentence is identical to the target sentence. This copied data is then mixed with the parallel corpus and the NMT system is trained as normal, with no metadata to distinguish the two input languages. Our proposed method proves to be an effective way of incorporating monolingual data into low-resource NMT. On Turkish↔English and Romanian↔English translation tasks, we see gains of up to 1.2 BLEU over a strong baseline with back-translation. Further analysis shows that the linguistic phenomena behind these gains are different from and largely orthogonal to back-translation, with our copied corpus method improving accuracy on named entities and other words that should remain identical between the source and target languages.
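    A minimal sketch of the copied-corpus construction described in this abstract (not the authors' scripts): every monolingual target sentence becomes a training pair whose source side is an identical copy, and the copied pairs are then concatenated with the real parallel data with no distinguishing metadata. The function name, file handling, and plain 1:1 mixing are assumptions.

        def build_copied_bitext(monolingual_target_path, out_src_path, out_tgt_path):
            """Write a 'copied' bitext where source == target for each sentence."""
            with open(monolingual_target_path, encoding="utf-8") as mono, \
                 open(out_src_path, "w", encoding="utf-8") as src, \
                 open(out_tgt_path, "w", encoding="utf-8") as tgt:
                for line in mono:
                    sentence = line.strip()
                    if sentence:
                        src.write(sentence + "\n")   # source side = copy of the target sentence
                        tgt.write(sentence + "\n")   # target side = original sentence

        # The copied files are then simply appended to the parallel corpus before
        # normal NMT training, e.g.:
        #   cat parallel.src copied.src > train.src
        #   cat parallel.tgt copied.tgt > train.tgt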

    The University of Edinburgh’s Neural MT Systems for WMT17

    This paper describes the University of Edinburgh's submissions to the WMT17 shared news translation and biomedical translation tasks. We participated in 12 translation directions for news, translating between English and Czech, German, Latvian, Russian, Turkish and Chinese. For the biomedical task we submitted systems for English to Czech, German, Polish and Romanian. Our systems are neural machine translation systems trained with Nematus, an attentional encoder-decoder. We follow our setup from last year and build BPE-based models with parallel and back-translated monolingual training data. Novelties this year include the use of deep architectures, layer normalization, and more compact models due to weight tying and improvements in BPE segmentations. We perform extensive ablative experiments, reporting on the effectiveness of layer normalization, deep architectures, and different ensembling techniques.
    Comment: WMT 2017 shared task track; for BibTeX, see http://homepages.inf.ed.ac.uk/rsennric/bib.html#uedin-nmt:201

    Approaching Neural Chinese Word Segmentation as a Low-Resource Machine Translation Task

    Supervised Chinese word segmentation has entered the deep learning era, which reduces the hassle of feature engineering. Recently, some researchers attempted to treat it as character-level translation, which further simplified model designing and building, but there is still a performance gap between the translation-based approach and other methods. In this work, we apply the best practices from low-resource neural machine translation to Chinese word segmentation. We build encoder-decoder models with attention and examine a series of techniques including regularization, data augmentation, objective weighting, transfer learning and ensembling. Our method is generic for word segmentation, without the need for feature engineering or model implementation. In the closed test with constrained data, our method ties with the state of the art on the MSR dataset and is comparable to other methods on the PKU dataset.
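    As one concrete (assumed) illustration of treating segmentation as character-level translation, the sketch below turns a gold-segmented sentence into a source/target pair: the source is the raw character sequence and the target re-inserts word boundaries with a "|" marker. The exact target-side representation used by the authors may differ; to_translation_pair is an illustrative name.

        def to_translation_pair(words):
            """Convert a segmented sentence (list of words) into a character-level
            source/target pair for an encoder-decoder segmenter."""
            source = " ".join(c for word in words for c in word)   # unsegmented characters
            target = " | ".join(" ".join(word) for word in words)  # "|" marks word boundaries
            return source, target

        # Example:
        #   to_translation_pair(["我们", "爱", "北京"])
        #   -> ("我 们 爱 北 京", "我 们 | 爱 | 北 京")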

    Losing Heads in the Lottery: Pruning Transformer Attention Heads

    The attention mechanism is the crucial component of the transformer architecture. Recent research shows that most attention heads are not confident in their decisions and can be pruned. However, removing them before training a model results in lower quality. In this paper, we apply the lottery ticket hypothesis to prune heads in the early stages of training. Our experiments on machine translation show that it is possible to remove up to three-quarters of the attention heads from transformer-big during early training, with an average -0.1 change in BLEU for Turkish→English. The pruned model is 1.5 times as fast at inference, albeit at the cost of longer training. Our method is complementary to other approaches, such as teacher-student, with an English→German student model gaining an additional 10% speed-up when 75% of its encoder attention heads are removed at a cost of 0.2 BLEU.
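    A minimal sketch of early-training head pruning in the spirit of this abstract, not the authors' code: after a short warm-up, each attention head receives an importance score, the lowest-scoring 75% are masked for the rest of training, and the masked heads can be removed entirely at inference. The particular scoring function is left abstract here; prune_heads and keep_fraction are illustrative names.

        import torch

        def prune_heads(head_scores, keep_fraction=0.25):
            """Return a {0, 1} mask over heads, keeping the top keep_fraction by score.

            head_scores has shape (num_layers, num_heads); the mask is multiplied
            into each head's output so pruned heads contribute nothing.
            """
            flat = head_scores.flatten()
            k = max(1, int(flat.numel() * keep_fraction))
            threshold = flat.topk(k).values.min()
            return (head_scores >= threshold).float()

        # Toy usage with random scores for a 6-layer, 16-head encoder:
        mask = prune_heads(torch.rand(6, 16))
        print(int(mask.sum().item()), "of", mask.numel(), "heads kept")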

    Topological and site disorder in boron nitride networks

    Amorphous boron nitride (a-BN) is modelled over a wide range of densities using a relatively simple potential model augmented with site charges. The local topology (defined, for example, through the total nearest-neighbour coordination number) appears near-constant across a wide range of densities and site charges. Furthermore, total scattering and total pair distribution functions also show few changes as a function of either density or site charge. Variation of the site charges directly controls the level of site (rather than topological) disorder, meaning that although total pair functions may be near-constant, the underlying partial contributions may be very different. Direct contact is made with both experiment and (more recent) density-functional theory-based modelling work.

    Zero-Resource Neural Machine Translation with Monolingual Pivot Data


    Approaching Neural Chinese Word Segmentation as a Low-Resource Machine Translation Task

    Chinese word segmentation has entered the deep learning era, which greatly reduces the hassle of feature engineering. Recently, some researchers attempted to treat it as character-level translation, which further simplified model designing, but there is a performance gap between the translation-based approach and other methods. This motivates our work, in which we apply the best practices from low-resource neural machine translation to supervised Chinese segmentation. We examine a series of techniques including regularization, data augmentation, objective weighting, transfer learning, and ensembling. Compared to previous work, our low-resource translation-based method maintains the effortless model design, yet achieves the same result as the state of the art in the constrained evaluation without using additional data.

    Making Asynchronous Stochastic Gradient Descent Work for Transformers

    Asynchronous stochastic gradient descent (SGD) is attractive from a speed perspective because workers do not wait for synchronization. However, the Transformer model converges poorly with asynchronous SGD, resulting in substantially lower quality compared to synchronous SGD. To investigate why this is the case, we isolate differences between asynchronous and synchronous methods, examining the effects of batch size and staleness. We find that summing several asynchronous updates, rather than applying them immediately, restores convergence behavior. With this hybrid method, Transformer training for a neural machine translation task reaches a near-convergence level 1.36x faster in single-node multi-GPU training with no impact on model quality.
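    A minimal sketch of the hybrid scheme described in this abstract, not the authors' implementation: rather than applying each worker's (possibly stale) gradient immediately, the trainer sums a fixed number of asynchronous gradients and applies them as one larger update. The class name, the plain SGD update, and accumulate_steps are illustrative assumptions.

        import numpy as np

        class SummedAsyncSGD:
            """Accumulate asynchronous gradients and apply them as one update."""

            def __init__(self, params, lr, accumulate_steps):
                self.params = params                    # model parameters, updated in place
                self.lr = lr
                self.accumulate_steps = accumulate_steps
                self.buffer = np.zeros_like(params)
                self.count = 0

            def push_gradient(self, grad):
                """Called whenever any worker finishes computing a gradient."""
                self.buffer += grad
                self.count += 1
                if self.count == self.accumulate_steps:
                    # One summed update, akin to a larger synchronous batch,
                    # which is what restores convergence per the abstract above.
                    self.params -= self.lr * self.buffer
                    self.buffer[:] = 0.0
                    self.count = 0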