
    Inductive Visual Localisation: Factorised Training for Superior Generalisation

    End-to-end trained Recurrent Neural Networks (RNNs) have been successfully applied to numerous problems that require processing sequences, such as image captioning, machine translation, and text recognition. However, RNNs often struggle to generalise to sequences longer than the ones encountered during training. In this work, we propose to optimise neural networks explicitly for induction. The idea is to first decompose the problem into a sequence of inductive steps and then to explicitly train the RNN to reproduce such steps. Generalisation is achieved as the RNN is not allowed to learn an arbitrary internal state; instead, it is tasked with mimicking the evolution of a valid state. In particular, the state is restricted to a spatial memory map that tracks parts of the input image which have been accounted for in previous steps. The RNN is trained for single inductive steps, where it produces updates to the memory in addition to the desired output. We evaluate our method on two different visual recognition problems involving visual sequences: (1) text spotting, i.e. joint localisation and reading of text in images containing multiple lines (or a block) of text, and (2) sequential counting of objects in aerial images. We show that inductive training of recurrent models enhances their generalisation ability on challenging image datasets. Comment: In BMVC 2018 (spotlight).
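    A minimal sketch of the single-step inductive training idea described above, assuming a toy PyTorch model; this is not the authors' code, and the network shapes, the four-number per-step output, and the losses are illustrative placeholders. The point is that both the per-step output and the next memory map are supervised, so the recurrent state cannot drift into an arbitrary internal code.

    import torch
    import torch.nn as nn

    class InductiveStep(nn.Module):
        def __init__(self, feat_ch=32):
            super().__init__()
            self.encoder = nn.Conv2d(3, feat_ch, 3, padding=1)      # image features
            self.update = nn.Conv2d(feat_ch + 1, 1, 3, padding=1)   # next memory map
            self.readout = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                         nn.Linear(feat_ch + 1, 4)) # e.g. a box for this step

        def forward(self, image, memory):
            x = torch.cat([self.encoder(image), memory], dim=1)
            return self.readout(x), torch.sigmoid(self.update(x))

    model = InductiveStep()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    image = torch.rand(2, 3, 64, 64)
    memory_t = torch.zeros(2, 1, 64, 64)    # regions of the image explained so far
    target_out = torch.rand(2, 4)           # ground-truth output for this single step
    target_mem = torch.rand(2, 1, 64, 64)   # ground-truth evolution of the memory map

    # One inductive step: supervise the output and the memory update jointly.
    out, memory_next = model(image, memory_t)
    loss = nn.functional.mse_loss(out, target_out) + \
           nn.functional.binary_cross_entropy(memory_next, target_mem)
    opt.zero_grad(); loss.backward(); opt.step()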

    Learning to Read by Spelling: Towards Unsupervised Text Recognition

    This work presents a method for visual text recognition without using any paired supervisory data. We formulate the text recognition task as one of aligning the conditional distribution of strings predicted from given text images with lexically valid strings sampled from target corpora. This enables fully automated, unsupervised learning from just line-level text images and unpaired text-string samples, obviating the need for large aligned datasets. We present detailed analysis for various aspects of the proposed method, namely: (1) impact of the length of training sequences on convergence, (2) relation between character frequencies and the order in which they are learnt, (3) generalisation ability of our recognition network to inputs of arbitrary lengths, and (4) impact of varying the text corpus on recognition accuracy. Finally, we demonstrate excellent text recognition accuracy on both synthetically generated text images and scanned images of real printed books, using no labelled training examples.
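    A minimal sketch of the adversarial alignment described above, not the paper's implementation: a recogniser maps an unlabelled text-line image to per-position character distributions, and a discriminator trained on unpaired corpus strings pushes those distributions towards lexically valid text. The network architectures, vocabulary size, and fixed sequence length are assumptions made only for illustration (PyTorch).

    import torch
    import torch.nn as nn

    VOCAB, T = 27, 16                           # character set and sequence length (assumed)

    recogniser = nn.Sequential(                 # image -> per-position character distributions
        nn.Flatten(), nn.Linear(32 * 100, T * VOCAB))
    discriminator = nn.Sequential(              # soft one-hot string -> real/fake score
        nn.Flatten(), nn.Linear(T * VOCAB, 128), nn.ReLU(), nn.Linear(128, 1))
    opt_r = torch.optim.Adam(recogniser.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    images = torch.rand(8, 1, 32, 100)          # unlabelled text-line images
    real_ids = torch.randint(0, VOCAB, (8, T))  # unpaired strings from a text corpus
    real = nn.functional.one_hot(real_ids, VOCAB).float()

    # Discriminator: tell real corpus strings apart from the recogniser's soft predictions.
    fake = recogniser(images).view(8, T, VOCAB).softmax(-1)
    loss_d = bce(discriminator(real), torch.ones(8, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(8, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Recogniser: fool the discriminator, i.e. predict strings that look lexically valid.
    loss_r = bce(discriminator(fake), torch.ones(8, 1))
    opt_r.zero_grad(); loss_r.backward(); opt_r.step()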

    A Deep Generative Framework for Paraphrase Generation

    Paraphrase generation is an important problem in NLP, with applications in question answering, information retrieval, information extraction, and conversation systems, to name a few. In this paper, we address the problem of generating paraphrases automatically. Our proposed method is based on a combination of deep generative models (VAE) with sequence-to-sequence models (LSTM) to generate paraphrases, given an input sentence. Traditional VAEs, when combined with recurrent neural networks, can generate free text but are not suitable for paraphrase generation for a given sentence. We address this problem by conditioning both the encoder and decoder sides of the VAE on the original sentence, so that it can generate paraphrases of the given sentence. Unlike most existing models, our model is simple, modular, and can generate multiple paraphrases for a given sentence. Quantitative evaluation of the proposed method on a benchmark paraphrase dataset demonstrates its efficacy and a significant performance improvement over state-of-the-art methods, while qualitative human evaluation indicates that the generated paraphrases are well-formed, grammatically correct, and relevant to the input sentence. Furthermore, we evaluate our method on a newly released question paraphrase dataset, and establish a new baseline for future research.
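    A minimal sketch of a VAE whose encoder and decoder are both conditioned on an encoding of the original sentence, with an LSTM decoder emitting the paraphrase, in the spirit of the model described above; it is not the authors' architecture, and the vocabulary, dimensions, and random data are placeholders (PyTorch).

    import torch
    import torch.nn as nn

    V, E, H, Z = 1000, 64, 128, 32               # vocab, embed, hidden, latent sizes (assumed)
    embed = nn.Embedding(V, E)
    enc_orig = nn.LSTM(E, H, batch_first=True)   # encodes the original sentence (the condition)
    enc_para = nn.LSTM(E, H, batch_first=True)   # encodes the target paraphrase
    to_mu, to_logvar = nn.Linear(2 * H, Z), nn.Linear(2 * H, Z)
    decoder = nn.LSTM(E + Z + H, H, batch_first=True)
    out_proj = nn.Linear(H, V)

    orig = torch.randint(0, V, (4, 12))          # original sentences (token ids)
    para = torch.randint(0, V, (4, 12))          # their paraphrases

    _, (h_o, _) = enc_orig(embed(orig))          # condition: last hidden state of the original
    _, (h_p, _) = enc_para(embed(para))
    mu = to_mu(torch.cat([h_o[-1], h_p[-1]], -1))
    logvar = to_logvar(torch.cat([h_o[-1], h_p[-1]], -1))
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # reparameterisation trick

    # Decoder sees the latent code AND the original-sentence encoding at every step.
    cond = torch.cat([z, h_o[-1]], -1).unsqueeze(1).expand(-1, 12, -1)
    logits = out_proj(decoder(torch.cat([embed(para), cond], -1))[0])

    recon = nn.functional.cross_entropy(logits.reshape(-1, V), para.reshape(-1))
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss = recon + kld                           # ELBO of the conditional VAE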

    Deep learning with synthetic, temporal, and adversarial supervision

    In this thesis we explore alternatives to manually annotated training examples for supervising the training of deep learning models. Specifically, we develop methods for learning under three different supervision paradigms, namely: (1) synthetic data, (2) temporal data, and (3) adversarial supervision for learning from unaligned examples. The dominant application domain of our work is text spotting, i.e. detection and recognition of text instances in images. We learn text localisation networks on synthetic data, and harness an adversarial discriminator for training text recognition networks using no paired training examples. Further, we exploit the changing pose of objects in temporal sequences (videos) to learn object landmark detectors. The unifying objective is to scale deep learning methods beyond manually annotated training data.

    We develop a large-scale, realistic synthetic scene text dataset. Armed with this large annotated dataset of scene images, we train a novel, fast, fully-convolutional text detection network, and show excellent performance on real images. This generalisation from synthetic to real images confirms the verisimilitude of our rendering process. The dataset, SynthText in the Wild, has been widely adopted by the research community, and has enabled the development of end-to-end text spotting models.

    While synthetic text can be readily generated, it needs to be adapted for the specific application domain. However, unaligned examples of text images and valid language sentences are abundant. With this in mind, we develop a method for text recognition which learns from such unaligned data. We cast the text recognition problem as one of aligning the conditional distribution of strings predicted from given text images with lexically valid strings. This alignment is induced through an adversarial discriminator which tries to tell the predicted and real text strings apart. Our method achieves excellent text recognition accuracy, using no labelled training examples.

    Temporal sequences (videos) of objects encode changes in their pose. We develop a method to harness this, and learn object landmark detectors which consistently track object parts across different poses and instances. We achieve this by conditionally generating a future frame given a past frame and a sparse, keypoint-like (learnt) representation extracted from the future frame. We demonstrate the generality of our method by learning landmarks for human faces (where we outperform existing landmark detectors), articulated human bodies, and rigid 3D objects, with no modification to the method.

    Finally, we propose one-step inductive training for improving the generalisation of recurrent neural networks to longer sequences. We restrict the recurrent state to a spatial memory map which tracks the regions of the image that have been accounted for, and train the network for valid evolution of this map. We show excellent generalisation to much longer sequences on two sequential visual recognition tasks: joint localisation and recognition of multiple lines of text, and counting objects in aerial images.
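    For the landmark-learning part above, the following is a minimal sketch of the keypoint-like bottleneck idea, not the thesis code: heatmaps computed from the future frame are reduced to 2D coordinates by a differentiable soft-argmax, and a (placeholder) generator must reconstruct the future frame from the past frame plus those coordinates, so reconstruction drives the emergence of landmarks. Network sizes and the loss are illustrative assumptions (PyTorch).

    import torch
    import torch.nn as nn

    K = 10                                                  # number of landmarks (assumed)
    heatmap_net = nn.Conv2d(3, K, 7, stride=4, padding=3)   # future frame -> K heatmaps

    def soft_argmax(heatmaps):
        """Differentiable heatmap -> (x, y) keypoints in [-1, 1]."""
        b, k, h, w = heatmaps.shape
        probs = heatmaps.flatten(2).softmax(-1).view(b, k, h, w)
        ys = torch.linspace(-1, 1, h).view(1, 1, h, 1)
        xs = torch.linspace(-1, 1, w).view(1, 1, 1, w)
        return torch.stack([(probs * xs).sum((2, 3)), (probs * ys).sum((2, 3))], -1)

    past = torch.rand(2, 3, 64, 64)
    future = torch.rand(2, 3, 64, 64)
    keypoints = soft_argmax(heatmap_net(future))            # (2, K, 2): the pose bottleneck

    # A real generator would render the keypoints as Gaussian maps and decode them together
    # with the past frame; a tiny placeholder network stands in for it here.
    generator = nn.Sequential(nn.Linear(3 * 64 * 64 + K * 2, 3 * 64 * 64))
    recon = generator(torch.cat([past.flatten(1), keypoints.flatten(1)], -1)).view_as(future)
    loss = nn.functional.mse_loss(recon, future)            # reconstruction drives landmark learning
    loss.backward()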

    Scenic: A Language for Scenario Specification and Scene Generation

    We propose a new probabilistic programming language for the design and analysis of perception systems, especially those based on machine learning. Specifically, we consider the problems of training a perception system to handle rare events, testing its performance under different conditions, and debugging failures. We show how a probabilistic programming language can help address these problems by specifying distributions encoding interesting types of inputs and sampling these to generate specialized training and test sets. More generally, such languages can be used for cyber-physical systems and robotics to write environment models, an essential prerequisite to any formal analysis. In this paper, we focus on systems like autonomous cars and robots, whose environment is a "scene", a configuration of physical objects and agents. We design a domain-specific language, Scenic, for describing "scenarios" that are distributions over scenes. As a probabilistic programming language, Scenic allows assigning distributions to features of the scene, as well as declaratively imposing hard and soft constraints over the scene. We develop specialized techniques for sampling from the resulting distribution, taking advantage of the structure provided by Scenic's domain-specific syntax. Finally, we apply Scenic in a case study on a convolutional neural network designed to detect cars in road images, improving its performance beyond that achieved by state-of-the-art synthetic data generation methods. Comment: 41 pages, 36 figures. Full version of a PLDI 2019 paper (extending UC Berkeley EECS Department Tech Report No. UCB/EECS-2018-8).
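    The snippet below is not Scenic itself; it is a small Python sketch of the underlying idea, i.e. treating a scenario as a distribution over scene features with declarative constraints and sampling scenes by rejection. The feature names and the constraint are invented for illustration, and Scenic's own syntax and specialised samplers go well beyond this naive approach.

    import random

    def sample_scenario():
        """One candidate scene: a list of cars with position, heading, and colour."""
        cars = []
        for _ in range(random.randint(1, 4)):
            cars.append({
                "x": random.uniform(0.0, 50.0),        # metres along the road (assumed)
                "heading": random.gauss(0.0, 10.0),    # degrees off the lane direction
                "colour": random.choice(["white", "black", "red"]),
            })
        return cars

    def satisfies_constraints(cars):
        """A hard constraint, e.g. a minimum spacing between any two cars."""
        return all(abs(a["x"] - b["x"]) > 4.0
                   for i, a in enumerate(cars) for b in cars[i + 1:])

    def sample_scene(max_tries=1000):
        """Rejection sampling: draw candidate scenes until the constraints hold."""
        for _ in range(max_tries):
            cars = sample_scenario()
            if satisfies_constraints(cars):
                return cars
        raise RuntimeError("constraints too tight for naive rejection sampling")

    print(sample_scene())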

    A Time-Series-Based Feature Extraction Approach for Prediction of Protein Structural Class

    This paper presents a novel feature vector, based on physicochemical properties of amino acids, for the prediction of protein structural classes. The proposed method is divided into three stages. First, a discrete time-series representation of protein sequences is obtained using a physicochemical scale. Next, a wavelet-based time-series technique is used to extract features from the mapped amino acid sequences, and a fixed-length feature vector for classification is constructed. The proposed feature space summarizes the variance information of ten different biological properties of amino acids. Finally, an optimized support vector machine model is constructed for the prediction of each protein structural class. The proposed approach is evaluated using leave-one-out cross-validation tests on two standard datasets. Comparison with existing approaches shows that the overall accuracy achieved by our approach is better than that of existing methods.
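    A minimal sketch of the pipeline described above under simplifying assumptions: a single illustrative physicochemical scale (the values are placeholders, not a published scale, whereas the paper uses ten biological properties), PyWavelets for the wavelet decomposition, the variance of each coefficient sub-band as a feature, and scikit-learn's SVC as the classifier.

    import numpy as np
    import pywt
    from sklearn.svm import SVC

    # Placeholder scale: one number per amino acid (NOT a real physicochemical scale).
    SCALE = {aa: i / 20.0 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

    def features(sequence, wavelet="db4", level=3):
        """Protein sequence -> fixed-length vector of wavelet-coefficient variances."""
        signal = np.array([SCALE[aa] for aa in sequence if aa in SCALE])
        coeffs = pywt.wavedec(signal, wavelet, level=level)
        return np.array([c.var() for c in coeffs])      # one variance per sub-band

    # Toy data: random sequences with random class labels, just to show the interface.
    rng = np.random.default_rng(0)
    seqs = ["".join(rng.choice(list(SCALE), size=80)) for _ in range(40)]
    labels = rng.integers(0, 4, size=40)                # four structural classes

    X = np.stack([features(s) for s in seqs])
    clf = SVC(kernel="rbf", C=1.0).fit(X, labels)
    print(clf.predict(X[:5]))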

    Helping Hands: An Object-Aware Ego-Centric Video Recognition Model

    We introduce an object-aware decoder for improving the performance of spatio-temporal representations on ego-centric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic labels of the objects using paired captions when available. At inference time the model only requires RGB frames as input, and is able to track and ground objects (although it has not been trained explicitly for this). We demonstrate the performance of the object-aware representations learnt by our model by: (i) evaluating them for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) using the learned representations as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D). In all cases the performance improves over the state of the art, even compared to networks trained with far larger batch sizes. We also show that, by using noisy image-level detections as pseudo-labels in training, the model learns to provide better bounding boxes using video consistency, as well as to ground the words in the associated text descriptions. Overall, we show that the model can act as a drop-in replacement for an ego-centric video model to improve performance through visual-text grounding. Comment: ICCV 2023.
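    A minimal sketch of the training idea, not the paper's architecture: a toy video backbone with a small query-based decoder and auxiliary heads for hand boxes, object boxes, and object names, used as extra losses when paired annotations or captions are available; at inference only the RGB frames are needed. The backbone, head sizes, and random targets are illustrative assumptions (PyTorch).

    import torch
    import torch.nn as nn

    class ObjectAwareModel(nn.Module):
        def __init__(self, dim=256, vocab=100, num_queries=4):
            super().__init__()
            self.backbone = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 32 * 32, dim))
            self.queries = nn.Parameter(torch.randn(num_queries, dim))
            self.decoder = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
            self.hand_head = nn.Linear(dim, 4)      # hand box per query
            self.obj_head = nn.Linear(dim, 4)       # object box per query
            self.name_head = nn.Linear(dim, vocab)  # object name taken from the paired caption

        def forward(self, frames):                  # frames: (B, T, 3, 32, 32)
            b, t = frames.shape[:2]
            tokens = self.backbone(frames.flatten(0, 1)).view(b, t, -1)  # per-frame tokens
            q = self.queries.unsqueeze(0).expand(b, -1, -1)
            x = self.decoder(q, tokens)             # query-based decoder over the frame tokens
            return self.hand_head(x), self.obj_head(x), self.name_head(x)

    model = ObjectAwareModel()
    frames = torch.rand(2, 8, 3, 32, 32)            # RGB frames only
    hand_boxes, obj_boxes, name_logits = model(frames)

    # Auxiliary supervision, applied only when annotations or captions are available.
    loss = nn.functional.l1_loss(hand_boxes, torch.rand_like(hand_boxes)) + \
           nn.functional.l1_loss(obj_boxes, torch.rand_like(obj_boxes)) + \
           nn.functional.cross_entropy(name_logits.flatten(0, 1),
                                       torch.randint(0, 100, (2 * 4,)))
    loss.backward()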