Stronger Baselines for Trustable Results in Neural Machine Translation
Interest in neural machine translation has grown rapidly as its effectiveness
has been demonstrated across language and data scenarios. New research
regularly introduces architectural and algorithmic improvements that lead to
significant gains over "vanilla" NMT implementations. However, these new
techniques are rarely evaluated in the context of previously published
techniques, specifically those that are widely used in state-of-the-art
production and shared-task systems. As a result, it is often difficult to
determine whether improvements from research will carry over to systems
deployed for real-world use. In this work, we recommend three specific methods
that are relatively easy to implement and result in much stronger experimental
systems. Beyond reporting significantly higher BLEU scores, we conduct an
in-depth analysis of where improvements originate and what inherent weaknesses
of basic NMT models are being addressed. We then compare the relative gains
afforded by several other techniques proposed in the literature when starting
with vanilla systems versus our stronger baselines, showing that experimental
conclusions may change depending on the baseline chosen. This indicates that
choosing a strong baseline is crucial for reporting reliable experimental
results.
Comment: To appear at the Workshop on Neural Machine Translation (WNMT)
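As an aside, the kind of corpus-level BLEU comparison between a vanilla system and a strengthened baseline described above can be sketched with standard tooling. The snippet below is only an illustration using sacrebleu, not the authors' evaluation scripts, and the file names are hypothetical placeholders.

```python
# Illustrative sketch: comparing a "vanilla" NMT system against a stronger
# baseline with corpus-level BLEU. File names are hypothetical placeholders.
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

refs = read_lines("test.ref")            # one reference translation per line
vanilla = read_lines("vanilla.hyp")      # output of the unmodified baseline
stronger = read_lines("stronger.hyp")    # output with the recommended methods

# sacrebleu expects a list of reference streams (here, a single reference set).
print("vanilla  BLEU:", round(sacrebleu.corpus_bleu(vanilla, [refs]).score, 2))
print("stronger BLEU:", round(sacrebleu.corpus_bleu(stronger, [refs]).score, 2))
```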
Evaluating MT systems with BEER
We present BEER, an open source implementation of a machine translation evaluation metric. BEER is a metric trained for high correlation with human ranking by using learning-to-rank training methods. For evaluation of lexical accuracy it uses sub-word units (character n-grams), while for measuring word order it uses hierarchical representations based on PETs (permutation trees). During the last WMT metrics tasks, BEER has shown high correlation with human judgments both on the sentence and the corpus levels. In this paper we show how BEER can be used for (i) full evaluation of MT output, (ii) isolated evaluation of word order and (iii) tuning MT systems.
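To make the sub-word idea concrete, a minimal sketch of character n-gram matching is shown below. It is only an approximation of the kind of lexical feature BEER combines via learning-to-rank, not BEER's actual implementation, and the n-gram order and beta value are arbitrary choices.

```python
# Illustrative sketch of character n-gram matching, the kind of sub-word
# feature BEER combines with word-order features. This is NOT the BEER
# implementation, just a minimal F-score over character n-gram counts.
from collections import Counter

def char_ngrams(text, n):
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_ngram_fscore(hypothesis, reference, n=4, beta=1.0):
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())          # clipped n-gram matches
    if not hyp or not ref or overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(char_ngram_fscore("the cat sat on the mat", "the cat is on the mat"))
```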
Taking MT evaluation metrics to extremes : beyond correlation with human judgments
Automatic Machine Translation (MT) evaluation is an active field of research, with a handful of new metrics devised every year. Evaluation metrics are generally benchmarked against manual assessment of translation quality, with performance measured in terms of overall correlation with human scores. Much work has been dedicated to the improvement of evaluation metrics to achieve a higher correlation with human judgments. However, little insight has been provided regarding the weaknesses and strengths of existing approaches and their behavior in different settings. In this work we conduct a broad meta-evaluation study of the performance of a wide range of evaluation metrics, focusing on three major aspects. First, we analyze the performance of the metrics when faced with different levels of translation quality, proposing a local dependency measure as an alternative to the standard, global correlation coefficient. We show that metric performance varies significantly across different levels of MT quality: metrics perform poorly when faced with low-quality translations and are not able to capture nuanced quality distinctions. Interestingly, we show that evaluating low-quality translations is also more challenging for humans. Second, we show that metrics are more reliable when evaluating neural MT than traditional statistical MT systems. Finally, we show that the difference in evaluation accuracy for different metrics is maintained even if the gold-standard scores are based on different criteria.
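For readers unfamiliar with this kind of meta-evaluation, the snippet below sketches the basic setup: segment-level metric scores are correlated with human scores globally, and then again within low/mid/high quality bands as a crude stand-in for the local analysis. It is an illustration under those assumptions, not the paper's actual local dependency measure.

```python
# Minimal meta-evaluation sketch: global Pearson correlation between metric
# scores and human scores, plus correlations within quality bands as a crude
# stand-in for a local analysis. Not the paper's local dependency measure.
import numpy as np
from scipy.stats import pearsonr

def global_correlation(metric_scores, human_scores):
    r, _ = pearsonr(metric_scores, human_scores)
    return r

def banded_correlations(metric_scores, human_scores, n_bands=3):
    """Correlation restricted to low/mid/high human-quality bands."""
    metric = np.asarray(metric_scores)
    human = np.asarray(human_scores)
    order = np.argsort(human)                      # sort segments by human score
    results = []
    for band in np.array_split(order, n_bands):    # each band needs >= 2 points
        r, _ = pearsonr(metric[band], human[band])
        results.append(r)
    return results                                 # e.g. [r_low, r_mid, r_high]
```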
Machine Translation for Human Translators
While machine translation is sometimes sufficient for conveying information across language barriers, many scenarios still require precise human-quality translation that MT is currently unable to deliver. Governments and international organizations such as the United Nations require accurate translations of content dealing with complex geopolitical issues. Community-driven projects such as Wikipedia rely on volunteer translators to bring accurate information to diverse language communities. As the amount of data requiring translation has continued to increase, the idea of using machine translation to improve the speed of human translation has gained significant traction. In the frequently employed practice of post-editing, an MT system outputs an initial translation and a human translator edits it for correctness, ideally saving time over translating from scratch. While general improvements in MT quality have led to productivity gains with this technique, the idea of designing translation systems specifically for post-editing has only recently caught on in research and commercial communities.
In this work, we present extensions to key components of statistical machine translation systems aimed directly at reducing the amount of work required from human translators. We cast MT for post-editing as an online learning task where new training instances are created as humans edit system output and introduce an adaptive MT system that immediately learns from this human feedback. New translation rules are learned from the data and both feature scores and weights are updated after each sentence is post-edited. An extended feature set allows making fine-grained distinctions between background and post-editing data on a per-translation basis. We describe a simulated post-editing paradigm wherein existing reference translations are used as a stand-in for human editing during system tuning, allowing our adaptive systems to be built and deployed without any seed post-editing data.
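A rough, hypothetical skeleton of the simulated post-editing loop described above is given below. The decoder, feature extractor, and update rule are illustrative placeholders (a simple perceptron-style step), not the actual adaptive system, grammar extraction, or tuning algorithm used in this work.

```python
# Hypothetical skeleton of adaptive MT with simulated post-editing: translate a
# sentence, then immediately learn from the reference (standing in for a human
# post-edit) before moving on. The decoder and feature extractor are injected
# placeholders; the weight update is a simple perceptron-style step.
from collections import defaultdict

class OnlineWeights:
    def __init__(self, learning_rate=0.1):
        self.w = defaultdict(float)
        self.learning_rate = learning_rate

    def score(self, features):
        return sum(self.w[name] * value for name, value in features.items())

    def update(self, output_features, postedit_features):
        # Move weights toward the correction and away from the system output.
        for name, value in postedit_features.items():
            self.w[name] += self.learning_rate * value
        for name, value in output_features.items():
            self.w[name] -= self.learning_rate * value

def simulated_post_editing(decode, featurize, weights, test_set):
    """References stand in for human post-edits during tuning."""
    outputs = []
    for source, reference in test_set:
        hypothesis = decode(source, weights)            # hypothetical decoder
        outputs.append(hypothesis)
        weights.update(featurize(source, hypothesis),   # learn immediately from
                       featurize(source, reference))    # the "post-edit"
    return outputs
```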
We present a highly tunable automatic evaluation metric that scores hypothesis-reference pairs according to several statistics that are directly interpretable as measures of post-editing effort. Once an adaptive system is deployed and sufficient post-editing data is collected, our metric can be tuned to fit editing effort for a specific translation task. This version of the metric can then be plugged back into the translation system for further optimization.
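As a hypothetical illustration of a metric built from interpretable edit statistics, the sketch below counts insertions, deletions, and substitutions from a token-level Levenshtein alignment and combines them with tunable per-operation weights; fitting those weights to collected post-editing effort is left abstract. It is not the actual metric or feature set developed in this work.

```python
# Hypothetical sketch: score a hypothesis against a reference as a tunable,
# weighted combination of edit statistics, each interpretable as post-editing
# effort. The weights could later be fit to collected post-editing data.
def edit_counts(hyp_tokens, ref_tokens):
    """Token-level Levenshtein alignment, returning operation counts."""
    m, n = len(hyp_tokens), len(ref_tokens)
    # dp[i][j] = (cost, insertions, deletions, substitutions) for aligning
    # the first i hypothesis tokens with the first j reference tokens.
    dp = [[None] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, m + 1):
        dp[i][0] = (i, 0, i, 0)      # delete extra hypothesis tokens
    for j in range(1, n + 1):
        dp[0][j] = (j, j, 0, 0)      # insert missing reference tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost, ins, dele, sub = dp[i - 1][j - 1]
            miss = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            best = (cost + miss, ins, dele, sub + miss)
            cost, ins, dele, sub = dp[i][j - 1]
            if cost + 1 < best[0]:
                best = (cost + 1, ins + 1, dele, sub)
            cost, ins, dele, sub = dp[i - 1][j]
            if cost + 1 < best[0]:
                best = (cost + 1, ins, dele + 1, sub)
            dp[i][j] = best
    return dp[m][n][1:]              # (insertions, deletions, substitutions)

def effort_score(hypothesis, reference, weights=(1.0, 1.0, 1.0)):
    """Lower is better: weighted edit operations per reference token."""
    ref_tokens = reference.split()
    ins, dele, sub = edit_counts(hypothesis.split(), ref_tokens)
    w_ins, w_del, w_sub = weights    # tunable per-operation weights
    return (w_ins * ins + w_del * dele + w_sub * sub) / max(len(ref_tokens), 1)
```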
To both evaluate the impact of our techniques and collect post-editing data to refine our systems, we present a web-based post-editing interface that connects human translators to our adaptive systems and automatically collects several types of highly accurate data while they work. In a series of simulated and live post-editing experiments, we show that while many of our presented techniques yield significant improvement on their own, the true potential of adaptive MT is realized when all techniques are combined. Translation systems that update both the translation grammar and weight vector after each sentence is post-edited yield super-additive gains over baseline systems across languages and domains, including low-resource scenarios. Optimizing systems toward custom, task-specific metrics further boosts performance. Compared to static baselines, our adaptive MT systems produce translations that require less mechanical effort to correct and are preferred by human translators. Every software component developed as part of this work is made publicly available under an open source license.
Challenges in predicting machine translation utility for human post-editors
As machine translation quality continues to improve, the idea of using MT to assist human translators becomes increasingly attractive. In this work, we discuss and provide empirical evidence of the challenges faced when adapting traditional MT systems to provide automatic translations for human post-editors to correct. We discuss the differences between this task and traditional adequacy-based tasks and the challenges that arise when using automatic metrics to predict the amount of effort required to post-edit translations. A series of experiments simulating a real-world localization scenario shows that current metrics under-perform on this task, even when tuned to maximize correlation with expert translator judgments, illustrating the need to rethink traditional MT pipelines when addressing the challenges of this translation task.
