120 research outputs found

    Comparison of Different Orthographies for Machine Translation of Under-Resourced Dravidian Languages

    Under-resourced languages are a significant challenge for statistical approaches to machine translation, and it has recently been shown that training data from closely related languages can improve machine translation quality for these languages. While languages within the same language family share many properties, many under-resourced languages are written in their own native script, which makes it difficult to take advantage of these similarities. In this paper, we propose to alleviate the problem of different scripts by transcribing the native script into a common representation, i.e., the Latin script or the International Phonetic Alphabet (IPA). In particular, we compare coarse-grained transliteration to the Latin script with fine-grained IPA transliteration. We performed experiments on the English-Tamil, English-Telugu, and English-Kannada translation tasks. Our results show improvements in BLEU, METEOR and chrF scores from transliteration, and we find that transliteration into the Latin script outperforms the fine-grained IPA transcription.

    Bilingual lexicon induction across orthographically-distinct under-resourced Dravidian languages

    Bilingual lexicons are a vital tool for under-resourced languages, and recent state-of-the-art approaches to inducing them leverage pretrained monolingual word embeddings using supervised or semi-supervised methods. However, these approaches require cross-lingual information such as seed dictionaries to train the model and find a linear transformation between the word embedding spaces. For low-resourced languages in particular, seed dictionaries are not readily available, and as a result these methods produce extremely weak results on these languages. In this work, we focus on the Dravidian languages, namely Tamil, Telugu, Kannada, and Malayalam, which are even more challenging because each is written in its own unique script. To take advantage of orthographic information and cognates in these languages, we bring the related languages into a single script. Previous approaches have used linguistically sub-optimal measures such as the Levenshtein edit distance to detect cognates, whereas we demonstrate that the longest common subsequence is linguistically more sound and improves the performance of bilingual lexicon induction. We show that our approach can increase the accuracy of bilingual lexicon induction methods on these languages many times over, making bilingual lexicon induction approaches feasible for such under-resourced languages.
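The abstract above contrasts Levenshtein edit distance with the longest common subsequence (LCS) as a cognate signal. The following is a minimal sketch of both measures on strings already brought into a single script. The function names and the normalization in `lcs_ratio` (dividing by the longer string's length) are assumptions made for illustration; the abstract does not state how the authors normalize or threshold the score.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching characters extend the best subsequence of the prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of dropping one character.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a: str, b: str) -> float:
    """LCS length scaled to [0, 1]; this normalization is an assumption."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute, each with cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

A candidate pair of transliterated words could then be scored with `lcs_ratio(word_a, word_b)` and compared against a cutoff chosen on held-out data; any such cutoff is a tuning choice, not something given in the abstract.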

    Substance use and dietary practices among students attending alternative high schools: results from a pilot study

    Background: Substance use and poor dietary practices are prevalent among adolescents. The purpose of this study was to examine frequency of substance use and associations between cigarette, alcohol and marijuana use and selected dietary practices, such as sugar-sweetened beverages, high-fat foods, fruits and vegetables, and frequency of fast food restaurant use among alternative high school students. Associations between multi-substance use and the same dietary practices were also examined.
    Methods: A convenience sample of adolescents (n = 145; 61% minority, 52% male) attending six alternative high schools in the St Paul/Minneapolis metropolitan area completed baseline surveys. Students were participants in the Team COOL (Controlling Overweight and Obesity for Life) pilot study, a group randomized obesity prevention pilot trial. Mixed model multivariate analysis procedures were used to assess associations of interest.
    Results: Daily cigarette smoking was reported by 36% of students. Cigarette smoking was positively associated with consumption of regular soda (p = 0.019), high-fat foods (p = 0.037), and fast food restaurant use (p = 0.002). Alcohol (p = 0.005) and marijuana use (p = 0.035) were positively associated with high-fat food intake. With increasing numbers of substances, a positive trend was observed in high-fat food intake (p = 0.0003). There were no significant associations between substance use and fruit and vegetable intake.
    Conclusions: Alternative high school students who use individual substances as well as multiple substances may be at high risk of unhealthful dietary practices. Comprehensive health interventions in alternative high schools have the potential of reducing health-compromising behaviors that are prevalent among this group of students. This study adds to the limited research examining substance use and diet among at-risk youth.
    Trial registration number: ClinicalTrials.gov NCT01315743 (http://www.clinicaltrials.gov/ct2/show/NCT01315743)

    NUIG at TIAD: Combining unsupervised NLP and graph metrics for translation inference

    In this paper, we present the NUIG system at the TIAD shared task. This system includes graph-based metrics calculated using novel algorithms, with an unsupervised document embedding tool called ONETA and an unsupervised multi-way neural machine translation method. The results are an improvement over our previous system and produce the highest precision among all systems in the task as well as very competitive F-Measure results. Incorporating features from other systems should be easy in the framework we describe in this paper, suggesting that it could readily be extended to achieve an even stronger result. This publication has emanated from research supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 P2, co-funded by the European Regional Development Fund, as well as by the H2020 project Prêt-à-LLOD under Grant Agreement number 825182. Peer-reviewed.

    Linking knowledge graphs across languages with semantic similarity and machine translation

    Knowledge graphs and ontologies underpin many natural language processing applications, and to apply them to new languages, these knowledge graphs must be translated. Up until now, this has been achieved either by direct label translation or by cross-lingual alignment, which matches the concepts in the graph to another graph in the target language. We show that these two approaches can, in fact, be combined and that the combination of machine translation and cross-lingual alignment can obtain improved results for translating a biomedical ontology from English to German. This work was supported by the Science Foundation Ireland under Grant Number SFI/12/RC/2289 (Insight). Non-peer-reviewed.