70 research outputs found
A semi-automatic approach to identifying and unifying ambiguously encoded Arabic-based characters.
In this study, we outline a potential problem in normalising texts that are based on a modified version of the Arabic alphabet. One of the main resources available for processing resource-scarce languages is raw text collected from the Internet. Many less-resourced languages, such as Kurdish, Farsi, Urdu, and Pashtu, use a modified version of the Arabic writing system. Many characters in data harvested from the Internet may have exactly the same form but be encoded with different Unicode values (ambiguous characters). The existence of ambiguous characters in words leads to word duplication, so it is important to identify and unify ambiguous characters during the normalisation stage. Here, we demonstrate cases related to ambiguous Kurdish and Farsi characters and propose a semi-automatic approach to identifying and unifying them.
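The duplication problem described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the mapping table, the function names, and the choice of canonical code points are assumptions made for illustration only. The two pairs used (Arabic Yeh U+064A vs. Farsi Yeh U+06CC, and Arabic Kaf U+0643 vs. Keheh U+06A9) are well-known look-alike cases. The report step stands in for the "semi-automatic" part, where a person inspects the code points before a mapping is applied.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping (an assumption, not taken from the paper):
# look-alike code point -> chosen canonical code point.
CANONICAL = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}


def report_codepoints(words):
    """List every code point in the corpus with its Unicode name and count,
    so that a human reviewer can spot ambiguous characters."""
    counts = Counter(ch for word in words for ch in word)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


def unify(text, mapping=CANONICAL):
    """Replace each ambiguous character with its canonical form."""
    return text.translate(str.maketrans(mapping))


def merge_duplicates(words, mapping=CANONICAL):
    """Count word types after unification; look-alike spellings collapse."""
    merged = Counter()
    for word in words:
        merged[unify(word, mapping)] += 1
    return merged


# The same word typed with Arabic Kaf and with Keheh: two strings, one form.
words = ["\u0643\u062A\u0627\u0628", "\u06A9\u062A\u0627\u0628"]
merged = merge_duplicates(words)
```

Before unification the two strings count as two distinct word types; after it they share a single entry in `merged`.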
The selection of classifiers for a data-driven parser.
There is a large number of classifiers that can be used for generating a parse model; i.e., as an oracle for guiding data-driven parsers when parsing natural languages. In this paper we present a general and simple approach for generating a parse model. Additionally, we present a large number of experiments on various classifiers. We also present the effect of various parse models, which are generated from different classifiers, on a data-driven parser to see the way each model contributes to parsing performance.
Deep Learning for Natural Language Parsing
Natural language processing problems (such as speech recognition, text-based data mining, and text or speech generation) are becoming increasingly important. Before effectively approaching many of these problems, it is necessary to process the syntactic structures of the sentences. Syntactic parsing is the task of constructing a syntactic parse tree over a sentence which describes the structure of the sentence. Parse trees are used as part of many language processing applications. In this paper, we present a multi-lingual dependency parser. Using advanced deep learning techniques, our parser architecture tackles common issues with parsing such as long-distance head attachment, while using ‘architecture engineering’ to adapt to each target language in order to reduce the feature engineering often required for parsing tasks. We implement a parser based on this architecture to utilize transfer learning techniques to address important issues related to limited-resource languages. We exceed the accuracy of state-of-the-art parsers on languages with limited training resources by a considerable margin. We present promising results for solving core problems in natural language parsing, while also performing at state-of-the-art accuracy on general parsing tasks.
Deterministic choices in a data-driven parser.
Data-driven parsers rely on recommendations from parse models, which are generated from a set of training data using a machine learning classifier, to perform parse operations. However, in some cases a parse model cannot recommend a parse action to a parser unless it learns from the training data what parse action(s) to take in every possible situation. Therefore, it will be hard for a parser to make an informed decision as to what parse operation to perform when a parse model recommends no/several parse actions to a parser. Here we examine the effect of various deterministic choices on a data-driven parser when it is presented with no/several recommendations from a parse model.
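As a rough illustration of what a deterministic choice could look like, the Python sketch below resolves the "no recommendation" and "several recommendations" cases with a fixed priority order. The action names (arc-standard style), the priority order, and the function `choose_action` are all assumptions made for this example; the abstract does not say which choices were actually tested.

```python
# Hypothetical action inventory and priority order (assumptions, not from the paper).
ACTIONS = ("SHIFT", "LEFT-ARC", "RIGHT-ARC")


def choose_action(recommended, legal, priority=ACTIONS):
    """Pick one parse action deterministically.

    recommended: actions proposed by the parse model (may be empty or several)
    legal:       actions allowed in the current parser state
    priority:    fixed tie-breaking order
    """
    candidates = [a for a in recommended if a in legal]
    if len(candidates) == 1:
        # The model gave exactly one usable recommendation: follow it.
        return candidates[0]
    # No usable recommendation: fall back to every legal action.
    # Several recommendations: choose among the recommended ones only.
    pool = candidates if candidates else list(legal)
    for action in priority:
        if action in pool:
            return action
    raise ValueError("no legal action available in this parser state")


single = choose_action(["LEFT-ARC"], ACTIONS)
none_given = choose_action([], ["SHIFT", "RIGHT-ARC"])
several = choose_action(["RIGHT-ARC", "LEFT-ARC"], ACTIONS)
```

Changing the `priority` tuple gives a different deterministic policy, which is the kind of variation whose effect on parsing the abstract says it examines.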
Towards the Development of a Hybrid Parser for Natural Languages
In order to understand natural languages, we have to be able to determine the relations between words; in other words, we have to be able to 'parse' the input text. This is a difficult task, especially for Arabic, which has a number of properties that make it particularly difficult to handle.
There are two approaches to parsing natural languages: grammar-driven and data-driven. Each of these approaches poses its own set of problems, which we discuss in this paper. The goal of our work is to produce a hybrid parser, which retains the advantages of the data-driven approach but is guided by grammar rules in order to produce more accurate output. This work consists of two stages: the first stage is to develop a baseline data-driven parser, which is guided by a machine learning algorithm for establishing dependency relations between words. The second stage is to integrate grammar rules into the baseline parser. In this paper, we describe the first stage of our work, which is now implemented, and a number of experiments that have been conducted on this parser. We also discuss the results of these experiments and highlight the different factors that affect parsing speed and the correctness of the parser's output.
- …
