Inductive Visual Localisation: Factorised Training for Superior Generalisation
End-to-end trained Recurrent Neural Networks (RNNs) have been successfully
applied to numerous problems that require processing sequences, such as image
captioning, machine translation, and text recognition. However, RNNs often
struggle to generalise to sequences longer than the ones encountered during
training. In this work, we propose to optimise neural networks explicitly for
induction. The idea is to first decompose the problem into a sequence of
inductive steps and then to explicitly train the RNN to reproduce such steps.
Generalisation is achieved as the RNN is not allowed to learn an arbitrary
internal state; instead, it is tasked with mimicking the evolution of a valid
state. In particular, the state is restricted to a spatial memory map that
tracks parts of the input image which have been accounted for in previous
steps. The RNN is trained for single inductive steps, where it produces updates
to the memory in addition to the desired output. We evaluate our method on two
different visual recognition problems involving visual sequences: (1) text
spotting, i.e. joint localisation and reading of text in images containing
multiple lines (or a block) of text, and (2) sequential counting of objects in
aerial images. We show that inductive training of recurrent models enhances
their generalisation ability on challenging image datasets. Comment: In BMVC 2018 (spotlight).
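No implementation accompanies this abstract; the following is a minimal PyTorch sketch of the single inductive step described above, assuming a simple convolutional cell. The module name (InductiveStep), channel sizes, and loss weighting are illustrative placeholders, not the authors' architecture.

```python
# Minimal sketch of one inductive step: consume image features plus a spatial
# memory map, emit the step output and an updated memory map. All names and
# dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class InductiveStep(nn.Module):
    def __init__(self, feat_ch=64, mem_ch=1, out_ch=1):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(feat_ch + mem_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.predict = nn.Conv2d(64, out_ch, 1)   # step output, e.g. a localisation heatmap
        self.update = nn.Conv2d(64, mem_ch, 1)    # next memory map (regions accounted for)

    def forward(self, feats, memory):
        h = self.encode(torch.cat([feats, memory], dim=1))
        return self.predict(h), torch.sigmoid(self.update(h))

# Training supervises a single step: a valid ground-truth memory goes in,
# and both the step output and the evolved memory are penalised.
step = InductiveStep()
feats = torch.randn(2, 64, 32, 32)
mem_gt_t = torch.rand(2, 1, 32, 32)
out_gt, mem_gt_next = torch.rand(2, 1, 32, 32), torch.rand(2, 1, 32, 32)

out, mem_next = step(feats, mem_gt_t)
loss = nn.functional.binary_cross_entropy_with_logits(out, out_gt) \
     + nn.functional.binary_cross_entropy(mem_next, mem_gt_next)
loss.backward()
```

At test time the step would be applied repeatedly, feeding its own predicted memory map back in, which is what lets single-step training generalise to much longer sequences.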
Learning to Read by Spelling: Towards Unsupervised Text Recognition
This work presents a method for visual text recognition without using any
paired supervisory data. We formulate the text recognition task as one of
aligning the conditional distribution of strings predicted from given text
images, with lexically valid strings sampled from target corpora. This enables
fully automated, unsupervised learning from just line-level text images and
unpaired text-string samples, obviating the need for large aligned
datasets. We present a detailed analysis of various aspects of the proposed
method, namely: (1) impact of the length of training sequences on convergence,
(2) relation between character frequencies and the order in which they are
learnt, (3) generalisation ability of our recognition network to inputs of
arbitrary lengths, and (4) impact of varying the text corpus on recognition
accuracy. Finally, we demonstrate excellent text recognition accuracy on both
synthetically generated text images, and scanned images of real printed books,
using no labelled training examples.
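As a rough illustration of the distribution-alignment idea, here is a hedged PyTorch sketch in which a recogniser's per-position character distributions are pitted against one-hot encodings of unpaired corpus strings via a discriminator. The network shapes, vocabulary size, and losses are assumptions for illustration, not the paper's models.

```python
# Adversarial alignment sketch: predicted "soft strings" vs. real corpus strings.
import torch
import torch.nn as nn

VOCAB, MAX_LEN = 40, 24

recogniser = nn.Sequential(            # image -> per-position character logits
    nn.Flatten(),
    nn.Linear(32 * 128, 512), nn.ReLU(),
    nn.Linear(512, MAX_LEN * VOCAB),
)
discriminator = nn.Sequential(         # string (as distributions) -> real/fake score
    nn.Flatten(),
    nn.Linear(MAX_LEN * VOCAB, 256), nn.ReLU(),
    nn.Linear(256, 1),
)

def predicted_strings(images):
    logits = recogniser(images).view(-1, MAX_LEN, VOCAB)
    return torch.softmax(logits, dim=-1)          # soft "strings"

def one_hot_strings(char_ids):                    # real, unpaired corpus text
    return nn.functional.one_hot(char_ids, VOCAB).float()

bce = nn.functional.binary_cross_entropy_with_logits
images = torch.randn(8, 1, 32, 128)
corpus = torch.randint(0, VOCAB, (8, MAX_LEN))

# Discriminator loss: tell real corpus strings apart from recogniser outputs.
d_loss = bce(discriminator(one_hot_strings(corpus)), torch.ones(8, 1)) + \
         bce(discriminator(predicted_strings(images).detach()), torch.zeros(8, 1))
# Recogniser loss: fool the discriminator, i.e. make predictions look like valid text.
g_loss = bce(discriminator(predicted_strings(images)), torch.ones(8, 1))
# In practice the two losses are minimised with separate optimisers.
```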
A Deep Generative Framework for Paraphrase Generation
Paraphrase generation is an important problem in NLP, with applications in question
answering, information retrieval, information extraction, and conversation systems,
to name a few. In this paper, we address the problem of generating paraphrases
automatically. Our proposed method is based on a combination of deep generative
models (VAE) with sequence-to-sequence models (LSTM) to generate paraphrases,
given an input sentence. Traditional VAEs, when combined with recurrent neural
networks, can generate free text, but they are not suited to generating
paraphrases of a given sentence. We address this problem by conditioning both
the encoder and decoder sides of the VAE on the original sentence, so that the
model can generate paraphrases of that sentence. Unlike most existing models,
our model is simple and modular, and can generate multiple paraphrases for a given
sentence. Quantitative evaluation of the proposed method on a benchmark
paraphrase dataset demonstrates its efficacy, with a significant performance
improvement over state-of-the-art methods, while qualitative human evaluation
indicates that the generated paraphrases are well-formed, grammatically
correct, and relevant to the input sentence. Furthermore, we
evaluate our method on a newly released question paraphrase dataset, and
establish a new baseline for future research.
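A minimal sketch of such a conditioned VAE-LSTM paraphraser in PyTorch is given below; the class name, dimensions, and wiring are assumptions chosen for brevity rather than the authors' exact model.

```python
# Conditional VAE sketch: both encoder and decoder see an encoding of the
# original sentence, so sampling different z values yields different paraphrases.
import torch
import torch.nn as nn

class CondVAEParaphraser(nn.Module):
    def __init__(self, vocab=10000, emb=128, hid=256, z_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.orig_enc = nn.LSTM(emb, hid, batch_first=True)   # original sentence (condition)
        self.para_enc = nn.LSTM(emb, hid, batch_first=True)   # paraphrase (training only)
        self.to_mu = nn.Linear(2 * hid, z_dim)
        self.to_logvar = nn.Linear(2 * hid, z_dim)
        self.decoder = nn.LSTM(emb + z_dim + hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, orig, para):
        _, (h_o, _) = self.orig_enc(self.embed(orig))
        _, (h_p, _) = self.para_enc(self.embed(para))
        h = torch.cat([h_o[-1], h_p[-1]], dim=-1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation
        cond = torch.cat([z, h_o[-1]], dim=-1)                    # decoder sees z + original
        dec_in = torch.cat(                                       # teacher forcing at training
            [self.embed(para), cond.unsqueeze(1).expand(-1, para.size(1), -1)], dim=-1)
        logits = self.out(self.decoder(dec_in)[0])
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return logits, kl
```

At generation time one would sample z from the prior and decode autoregressively, conditioned on the encoding of the input sentence, so that different samples of z yield different paraphrases.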
RARE-30. PEDIATRIC GLIOBLASTOMA IN THE POST-TEMOZOLOMIDE ERA: OUTCOMES AND CHARACTERISTICS
Abstract
INTRODUCTION
Glioblastoma (GBM) is the most common brain tumor; however, it is a rare occurrence in children and is poorly characterized. We evaluated the characteristics and outcomes of pediatric GBM (pGBM).
METHODS
Retrospective analysis of pediatric (age < 18) patients diagnosed with GBM undergoing first glioblastoma resection at our brain tumor center (2005-2016).
RESULTS
From 1457 GBM patients, we identified twenty-four (1.65%) pGBMs (median age = 9 years, females = 45.8%). Median overall survival (OS) was 32.1 months, while median progression-free survival was 11.5 months. The commonest symptoms at presentation were headaches (54.2%, n=13) and motor symptoms (50%, n=12). Mean tumor diameter was 4.5 cm, and 25% of the cohort underwent gross total resection (GTR) of their tumor. Univariate analysis revealed median OS significantly associated with tumor extent of resection (GTR = 56.4 months; STR/Biopsy = 13.7 months, p=0.001), age at surgery (>10 years = 43.9 months, <10 years = 17.2 months, p=0.01), tumor size (>4 cm = 9.1 months, <4 cm = 56.9 months, p=0.01), motor symptoms at presentation (present = 14.9 months, absent = 41.04 months, p=0.02) and infratentorial tumors (infratentorial = 17.4 vs supratentorial = 53.4 months, p=0.02). Multivariate analysis revealed GTR (HR 0.2 [95% CI 0.07–0.72]; p=0.03), age >10 years (HR 0.6 [95% CI 0.02–0.64]; p=0.002), tumor >4 cm (HR 2.89 [95% CI 1.88–4.11]; p=0.001) and EGFR amplification (HR 3.48 [95% CI 0.82–17.4]; p=0.005) to be independent predictors of OS. Comparing patients under and over 10 years, we found that older patients had smaller tumors at presentation (4.9 vs 3.6 cm, p=0.03), greater rates of preoperative temozolomide (n=1, 7.7% vs n=6, 54.5%) and bevacizumab (n=1, 7.7% vs n=4, 36.4%) treatment, and lower rates of EGFR amplification (66.7% vs 11.1%), which could explain survival disparities between groups.
CONCLUSION
Motor symptoms, larger tumors at presentation and tumor EGFR amplification may be indicative of poorer outcomes in pGBM. However, maximal tumor resection, aggressive chemoradiation and tumor presentation at age >10 years may confer a better prognosis in these patients.
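The hazard ratios reported above are consistent with a Cox proportional-hazards analysis, although the abstract does not state the exact model. Below is a minimal sketch of such an analysis using the lifelines package on placeholder data; the column names and values are hypothetical, not the study's.

```python
# Cox proportional-hazards sketch on synthetic placeholder data.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "gtr": rng.integers(0, 2, n),             # gross total resection (0/1)
    "age_over_10": rng.integers(0, 2, n),
    "tumor_over_4cm": rng.integers(0, 2, n),
    "egfr_amplified": rng.integers(0, 2, n),
})
# Placeholder survival outcomes, not the study's data.
df["os_months"] = rng.exponential(24, n)
df["death"] = rng.integers(0, 2, n)

cph = CoxPHFitter()
cph.fit(df, duration_col="os_months", event_col="death")
cph.print_summary()   # hazard ratios (exp(coef)) with 95% CIs and p-values
```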
Deep learning with synthetic, temporal, and adversarial supervision
In this thesis we explore alternatives to manually annotated training examples for supervising the training of deep learning models. Specifically, we develop methods for learning under three different supervision paradigms, namely — (1) synthetic data, (2) temporal data, and (3) adversarial supervision for learning from unaligned examples. The dominant application domain of our work is text spotting, i.e. detection and recognition of text instances in images. We learn text localisation networks on synthetic data, and harness an adversarial discriminator for training text recognition networks using no paired training examples. Further, we exploit the changing pose of objects in temporal sequences (videos) to learn object landmark detectors. The unifying objective is to scale deep learning methods beyond manually annotated training data.
We develop a large-scale, realistic synthetic scene text dataset. Armed with this large annotated dataset of scene images, we train a novel, fast, fully-convolutional text detection network, and show excellent performance on real images. This generalisation from synthetic to real images confirms the verisimilitude of our rendering process. The dataset, SynthText in the Wild, has been widely adopted by the research community, and has enabled the development of end-to-end text spotting models.
While synthetic text can be readily generated, it needs to be adapted for the specific application domain. However, unaligned examples of text images and valid language sentences are abundant. With this in mind, we develop a method for text recognition which learns from such unaligned data. We cast the text recognition problem as one of aligning the conditional distribution of strings predicted from given text images with lexically valid strings. This alignment is induced through an adversarial discriminator which tries to tell the predicted and real text strings apart. Our method achieves excellent text recognition accuracy, using no labelled training examples.
Temporal sequences (videos) of objects encode changes in their pose. We develop a method to harness this, and learn object landmark detectors which consistently track object parts across different poses and instances. We achieve this by conditionally generating a future frame given a past frame and a sparse, keypoint-like (learnt) representation extracted from the future frame. We demonstrate the generality of our method by learning landmarks for human faces (where we outperform existing landmark detectors), the articulated human body, and rigid 3D objects, with no modification to the method.
Finally, we propose one-step inductive training for improving the generalisation of recurrent neural networks to longer sequences. We restrict the recurrent state to a spatial memory map which tracks the regions of the image that have been accounted for, and train the network for valid evolution of this map. We show excellent generalisation to much longer sequences on two sequential visual recognition tasks — joint localisation and recognition of multiple lines of text, and counting objects in aerial images.
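Of the components described, the landmark-learning one hinges on a sparse, keypoint-like bottleneck. The PyTorch sketch below illustrates only that bottleneck (a spatial softmax over predicted heatmaps); the frame generator and reconstruction loss are omitted, and all names and sizes are illustrative assumptions rather than the thesis architecture.

```python
# Keypoint-bottleneck sketch for learning landmarks from video frame pairs.
import torch
import torch.nn as nn

class KeypointBottleneck(nn.Module):
    """Extract K soft keypoints (x, y) from an image via spatial softmax."""
    def __init__(self, k=10):
        super().__init__()
        self.heat = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, k, 3, stride=2, padding=1),
        )

    def forward(self, img):
        h = self.heat(img)                               # B x K x H x W heatmaps
        b, k, H, W = h.shape
        p = torch.softmax(h.view(b, k, -1), dim=-1).view(b, k, H, W)
        xs = torch.linspace(-1, 1, W, device=img.device)
        ys = torch.linspace(-1, 1, H, device=img.device)
        x = (p.sum(dim=2) * xs).sum(dim=-1)              # expected x per keypoint
        y = (p.sum(dim=3) * ys).sum(dim=-1)              # expected y per keypoint
        return torch.stack([x, y], dim=-1)               # B x K x 2

keypoints = KeypointBottleneck()
past, future = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
kp = keypoints(future)                                   # sparse pose code of the future frame
# A generator (omitted here) would render the future frame from (past, kp);
# minimising reconstruction error forces kp to track consistent object parts.
```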
Scenic: A Language for Scenario Specification and Scene Generation
We propose a new probabilistic programming language for the design and
analysis of perception systems, especially those based on machine learning.
Specifically, we consider the problems of training a perception system to
handle rare events, testing its performance under different conditions, and
debugging failures. We show how a probabilistic programming language can help
address these problems by specifying distributions encoding interesting types
of inputs and sampling these to generate specialized training and test sets.
More generally, such languages can be used for cyber-physical systems and
robotics to write environment models, an essential prerequisite to any formal
analysis. In this paper, we focus on systems like autonomous cars and robots,
whose environment is a "scene", a configuration of physical objects and agents.
We design a domain-specific language, Scenic, for describing "scenarios" that
are distributions over scenes. As a probabilistic programming language, Scenic
allows assigning distributions to features of the scene, as well as
declaratively imposing hard and soft constraints over the scene. We develop
specialized techniques for sampling from the resulting distribution, taking
advantage of the structure provided by Scenic's domain-specific syntax.
Finally, we apply Scenic in a case study on a convolutional neural network
designed to detect cars in road images, improving its performance beyond that
achieved by state-of-the-art synthetic data generation methods. Comment: 41 pages, 36 figures. Full version of a PLDI 2019 paper (extending UC
Berkeley EECS Department Tech Report No. UCB/EECS-2018-8).
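The snippet below is not Scenic syntax; it is a small Python sketch of the underlying idea, a scene expressed as a distribution over object placements with hard constraints, sampled here by plain rejection sampling (Scenic itself uses more specialised sampling techniques). All names and numeric ranges are illustrative.

```python
# Scene-distribution sketch: sample object placements, enforce hard constraints.
import math
import random

def sample_scene():
    ego = {"x": 0.0, "y": 0.0, "heading": random.uniform(-5, 5)}
    other = {
        "x": random.gauss(0.0, 1.5),                     # lateral offset
        "y": random.uniform(10.0, 40.0),                 # distance ahead
        "heading": random.gauss(0.0, 10.0),
    }
    return ego, other

def satisfies_constraints(ego, other):
    dist = math.hypot(other["x"] - ego["x"], other["y"] - ego["y"])
    return dist > 5.0 and abs(other["heading"]) < 30.0   # hard constraints

def rejection_sample(n):
    scenes = []
    while len(scenes) < n:
        ego, other = sample_scene()
        if satisfies_constraints(ego, other):
            scenes.append((ego, other))
    return scenes

training_scenes = rejection_sample(100)   # would be passed to a renderer / simulator
```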
A Time-Series-Based Feature Extraction Approach for Prediction of Protein Structural Class
This paper presents a novel feature vector based on physicochemical properties of amino acids for the prediction of protein structural classes. The proposed method is divided into three stages. First, a discrete time-series representation of protein sequences is obtained using a physicochemical scale. Next, a wavelet-based time-series technique is used to extract features from the mapped amino-acid sequence, and a fixed-length feature vector is constructed for classification. The proposed feature space summarizes the variance information of ten different biological properties of amino acids. Finally, an optimized support vector machine model is constructed to predict each protein structural class. The proposed approach is evaluated using leave-one-out cross-validation tests on two standard datasets. Comparison of our results with existing approaches shows that the overall accuracy achieved by our approach is better than that of existing methods.
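A rough sketch of this pipeline is shown below under simplifying assumptions: a single physicochemical scale (Kyte-Doolittle hydropathy) stands in for the paper's ten properties, PyWavelets provides the wavelet decomposition, and the variance of the coefficients at each level forms the feature vector fed to an SVM. The sequences and labels are toy placeholders.

```python
# Sequence -> physicochemical time series -> wavelet variance features -> SVM.
import numpy as np
import pywt
from sklearn.svm import SVC

HYDROPATHY = {  # Kyte-Doolittle hydropathy scale
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}

def sequence_to_series(seq):
    """Map an amino-acid string to a discrete time series."""
    return np.array([HYDROPATHY.get(a, 0.0) for a in seq])

def wavelet_features(seq, wavelet="db1", level=3):
    """Fixed-length feature vector: variance of coefficients at each level."""
    coeffs = pywt.wavedec(sequence_to_series(seq), wavelet, level=level)
    return np.array([np.var(c) for c in coeffs])

# Toy usage with made-up sequences and labels (0 = all-alpha, 1 = all-beta).
X = np.stack([wavelet_features(s) for s in
              ["MKTAYIAKQR", "GSHMLEDPVA", "VLSPADKTNV", "AQVKGHGKKV"]])
y = np.array([0, 1, 0, 1])
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.predict(X))
```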
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
We introduce an object-aware decoder for improving the performance of
spatio-temporal representations on ego-centric videos. The key idea is to
enhance object-awareness during training by tasking the model to predict hand
positions, object positions, and the semantic label of the objects using paired
captions when available. At inference time the model only requires RGB frames
as inputs, and is able to track and ground objects (although it has not been
trained explicitly for this). We demonstrate the performance of the
object-aware representations learnt by our model, by: (i) evaluating it for
strong transfer, i.e. through zero-shot testing, on a number of downstream
video-text retrieval and classification benchmarks; and (ii) using the
representations learned as input for long-term video understanding tasks (e.g.
Episodic Memory in Ego4D). In all cases the performance improves over the state
of the art -- even compared to networks trained with far larger batch sizes. We
also show that by using noisy image-level detection as pseudo-labels in
training, the model learns to provide better bounding boxes using video
consistency, as well as grounding the words in the associated text
descriptions. Overall, we show that the model can act as a drop-in replacement
for an ego-centric video model to improve performance through visual-text
grounding. Comment: ICCV 2023.
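As a hedged illustration of the object-aware decoder idea, the PyTorch sketch below uses a small set of learned queries to decode box and label predictions from video tokens during training; the dimensions, query count, and heads are placeholders, not the paper's configuration.

```python
# Auxiliary object-aware decoder sketch: learned queries attend over video
# tokens and predict boxes plus semantic labels during training only.
import torch
import torch.nn as nn

class ObjectAwareDecoder(nn.Module):
    def __init__(self, dim=256, num_queries=8, vocab=1000):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(dim, 4)        # (cx, cy, w, h), normalised
        self.label_head = nn.Linear(dim, vocab)  # hand / object semantic label

    def forward(self, video_tokens):             # B x T x dim from the video encoder
        q = self.queries.unsqueeze(0).expand(video_tokens.size(0), -1, -1)
        h = self.decoder(q, video_tokens)
        return self.box_head(h).sigmoid(), self.label_head(h)

decoder = ObjectAwareDecoder()
video_tokens = torch.randn(2, 64, 256)           # placeholder encoder output
boxes, labels = decoder(video_tokens)
# During training these would be matched against (possibly noisy) pseudo-boxes
# and caption nouns; at inference only the video-text representation is used.
```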
