2,010 research outputs found

    Enhancing Feature Selection Using Word Embeddings: The Case of Flu Surveillance

    Health surveillance systems based on online user-generated content often rely on the identification of textual markers that are related to a target disease. Given the high volume of available data, these systems benefit from an automatic feature selection process. This is accomplished either by applying statistical learning techniques, which do not consider the semantic relationship between the selected features and the inference task, or by developing labour-intensive text classifiers. In this paper, we use neural word embeddings, trained on social media content from Twitter, to determine, in an unsupervised manner, how strongly textual features are semantically linked to an underlying health concept. We then refine conventional feature selection methods by a priori operating on textual variables that are sufficiently close to a target concept. Our experiments focus on the supervised learning problem of estimating influenza-like illness rates from Google search queries. A "flu infection" concept is formulated and used to reduce spurious and potentially confounding features that were selected by previously applied approaches. In this way, we also address forms of scepticism regarding the appropriateness of the feature space, alleviating potential cases of overfitting. Ultimately, the proposed hybrid feature selection method creates a more reliable model that, according to our empirical analysis, improves the inference performance (Mean Absolute Error) of linear and nonlinear regressors by 12% and 28.7%, respectively
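The embedding-based filtering step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the scoring rule (mean cosine similarity to a few concept terms), the threshold, the function names `cosine` and `filter_by_concept`, and the toy 3-dimensional vectors are all assumptions made for the example. The abstract only states that features sufficiently close to a target concept are retained before conventional feature selection is applied.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_by_concept(feature_vectors, concept_vectors, threshold):
    """Return the features whose mean cosine similarity to the concept
    vectors is at least `threshold` (hypothetical scoring rule)."""
    kept = []
    for feature, vec in feature_vectors.items():
        score = sum(cosine(vec, c) for c in concept_vectors) / len(concept_vectors)
        if score >= threshold:
            kept.append(feature)
    return kept


# Toy embeddings, invented for illustration only.
features = {
    "flu symptoms": [0.9, 0.2, 0.0],
    "cough remedy": [0.7, 0.3, 0.1],
    "football scores": [0.0, 0.1, 1.0],
}
# Vectors standing in for the terms that define a "flu infection" concept.
concept = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]

print(filter_by_concept(features, concept, threshold=0.5))
```

The surviving features would then be passed to a conventional selector or regressor; the off-topic query is dropped before the supervised step ever sees it.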

    Hepatocellular carcinoma: Review of disease and tumor biomarkers.

    © The Author(s) 2016. Hepatocellular carcinoma (HCC) is a common malignancy and now the second commonest global cause of cancer death. HCC tumorigenesis is relatively silent and patients experience late symptomatic presentation. As the option for curative treatments is limited to early stage cancers, diagnosis in non-symptomatic individuals is crucial. International guidelines advise regular surveillance of high-risk populations but the current tools lack sufficient sensitivity for early stage tumors on the background of a cirrhotic nodular liver. A number of novel biomarkers have now been suggested in the literature, which may reinforce the current surveillance methods. In addition, recent metabonomic and proteomic discoveries have established specific metabolite expressions in HCC, according to Warburg's phenomenon of altered energy metabolism. With clinical validation, a simple and non-invasive test from the serum or urine may be performed to diagnose HCC, particularly benefiting low resource regions where the burden of HCC is highest.

    A Concept Language Model for Ad-hoc Retrieval

    We propose an extension to language models for information retrieval. Typically, language models estimate the probability of a document generating the query, where the query is considered as a set of independent search terms. We extend this approach by considering the concepts implied by both the query and words in the document. The model combines the probability of the document generating the concept embodied by the query, and the traditional language model probability of the document generating the query terms. We use a word embedding space to express concepts. The similarity between two vectors in this space is estimated using a weighted cosine distance. The weighting significantly enhances the discrimination between vectors. We evaluate our model on benchmark datasets (TREC 6–8) and empirically demonstrate it outperforms state-of-the-art baselines

    Information-Theoretic Active Learning for Content-Based Image Retrieval

    We propose Information-Theoretic Active Learning (ITAL), a novel batch-mode active learning method for binary classification, and apply it for acquiring meaningful user feedback in the context of content-based image retrieval. Instead of combining different heuristics such as uncertainty, diversity, or density, our method is based on maximizing the mutual information between the predicted relevance of the images and the expected user feedback regarding the selected batch. We propose suitable approximations to this computationally demanding problem and also integrate an explicit model of user behavior that accounts for possible incorrect labels and unnameable instances. Furthermore, our approach takes into account not only the structure of the data but also the expected model output change caused by the user feedback. In contrast to other methods, ITAL turns out to be highly flexible and provides state-of-the-art performance across various datasets, such as MIRFLICKR and ImageNet. Comment: GCPR 2018 paper (14 pages text + 2 pages references + 6 pages appendix)

    Estimating the Population Impact of a New Pediatric Influenza Vaccination Program in England Using Social Media Content

    BACKGROUND: A new childhood live attenuated influenza vaccine program was launched in England in 2013, consisting of a national campaign for all 2- and 3-year-olds and several pilot locations offering the vaccine to primary school-age children (4-11 years of age) during the influenza season. The 2014/2015 influenza season saw the national program extended to include additional pilot regions, some of which offered the vaccine to secondary school children (11-13 years of age) as well. OBJECTIVE: We utilized social media content to obtain a complementary assessment of the population impact of the programs that were launched in England during the 2013/2014 and 2014/2015 flu seasons. The overall community-wide impact on transmission in pilot areas was estimated for the different age groups that were targeted for vaccination. METHODS: A previously developed statistical framework was applied, which consisted of a nonlinear regression model that was trained to infer influenza-like illness (ILI) rates from Twitter posts originating in pilot (school-age vaccinated) and control (unvaccinated) areas. The control areas were then used to estimate the ILI rates that pilot areas would have seen had the intervention not taken place. These predictions were compared with their corresponding Twitter-based ILI estimates. RESULTS: Results suggest a reduction in ILI rates of 14% (1-25%) and 17% (2-30%) across all ages in only the primary school-age vaccine pilot areas during the 2013/2014 and 2014/2015 influenza seasons, respectively. No significant impact was observed in areas where two age cohorts of secondary school children were vaccinated. CONCLUSIONS: These findings corroborate independent assessments from traditional surveillance data, thereby supporting the ongoing rollout of the program to primary school-age children and providing evidence of the value of social media content as an additional syndromic surveillance tool.

    An adaptive technique for content-based image retrieval

    We discuss an adaptive approach towards Content-Based Image Retrieval. It is based on the Ostensive Model of developing information needs—a special kind of relevance feedback model that learns from implicit user feedback and adds a temporal notion to relevance. The ostensive approach supports content-assisted browsing through visualising the interaction by adding user-selected images to a browsing path, which ends with a set of system recommendations. The suggestions are based on an adaptive query learning scheme, in which the query is learnt from previously selected images. Our approach is an adaptation of the original Ostensive Model based on textual features only, to include content-based features to characterise images. In the proposed scheme textual and colour features are combined using the Dempster-Shafer theory of evidence combination. Results from a user-centred, work-task oriented evaluation show that the ostensive interface is preferred over a traditional interface with manual query facilities. This is due to its ability to adapt to the user's need, its intuitiveness and the fluid way in which it operates. Studying and comparing the nature of the underlying information need, it emerges that our approach elicits changes in the user's need based on the interaction, and is successful in adapting the retrieval to match the changes. In addition, a preliminary study of the retrieval performance of the ostensive relevance feedback scheme shows that it can outperform a standard relevance feedback strategy in terms of image recall in category search
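The abstract above states that textual and colour features are combined using the Dempster-Shafer theory of evidence combination. As a rough illustration of what Dempster's rule of combination does, here is a minimal sketch. The two-hypothesis frame, the mass values, and the function name `dempster_combine` are hypothetical and chosen only for the example; the abstract does not say how the system assigns masses to each evidence source.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to a mass; the
    masses in each function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            intersection = a & b
            if intersection:
                combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
            else:
                # Mass that the two sources assign to incompatible hypotheses.
                conflict += mass_a * mass_b
    if 1.0 - conflict < 1e-12:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Renormalise by the non-conflicting mass.
    return {h: m / (1.0 - conflict) for h, m in combined.items()}


# Hypothetical frame of discernment: an image is either relevant or not.
REL = frozenset({"relevant"})
NOT = frozenset({"not_relevant"})
EITHER = REL | NOT  # mass left on the whole frame expresses ignorance

# Invented masses for the two evidence sources.
m_text = {REL: 0.6, NOT: 0.1, EITHER: 0.3}
m_colour = {REL: 0.5, NOT: 0.2, EITHER: 0.3}

for hypothesis, mass in dempster_combine(m_text, m_colour).items():
    print(sorted(hypothesis), round(mass, 3))
```

Leaving some mass on the whole frame lets a source express uncertainty without committing to either hypothesis, which is one reason this rule is used for fusing heterogeneous evidence.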

    Profiles of physical, emotional and psychosocial wellbeing in the Lothian birth cohort 1936

    BACKGROUND: Physical, emotional, and psychosocial wellbeing are important domains of function. The aims of this study were to explore the existence of separable groups among 70-year-olds with scores representing physical function, perceived quality of life, and emotional wellbeing, and to characterise any resulting groups using demographic, personality, cognition, health and lifestyle variables. METHODS: We used latent class analysis (LCA) to identify possible groups. RESULTS: Results suggested there were 5 groups. These included High (n = 515, 47.2% of the sample), Average (n = 417, 38.3%), and Poor Wellbeing (n = 37, 3.4%) groups. The two other groups had contrasting patterns of wellbeing: one group scored relatively well on physical function but low on emotional wellbeing (Good Fitness/Low Spirits, n = 60, 5.5%), whereas the other group showed low physical function but relatively good emotional wellbeing (Low Fitness/Good Spirits, n = 62, 5.7%). Salient characteristics that distinguished all the groups included smoking and drinking behaviours, personality, and illness. CONCLUSIONS: Despite there being some evidence of these groups, the results also support a largely one-dimensional construct of wellbeing in old age, for the domains assessed here, though with some evidence that some individuals have uneven profiles.

    E-NER - An Annotated Named Entity Recognition Corpus of Legal Text

    Identifying named entities, such as a person, location or organization, in documents can highlight key information to readers. Training Named Entity Recognition (NER) models requires an annotated data set, the creation of which can be a time-consuming, labour-intensive task. Nevertheless, there are publicly available NER data sets for general English. Recently there has been interest in developing NER for legal text. However, prior work and experimental results reported here indicate that there is a significant degradation in performance when NER methods trained on a general English data set are applied to legal text. We describe a publicly available legal NER data set, called E-NER, based on legal company filings available from the US Securities and Exchange Commission's EDGAR data set. Training a number of different NER algorithms on the general English CoNLL-2003 corpus but testing on our test collection confirmed significant degradations in accuracy, as measured by the F1-score, of between 29.4% and 60.4%, compared to training and testing on the E-NER collection.

    Retrieval of highly dynamic information in an unstructured peer-to-peer network

    We present a framework for the retrieval of highly dynamic information in an unstructured peer-to-peer network. Non-exhaustive search in an unstructured network is necessarily probabilistic, and we utilize the probably approximately correct (PAC) search architecture to determine the required replication rate for a document in order to guarantee a high probability of retrieval. Once this replication rate is determined, the problem becomes how to replicate a new document across the network to meet this requirement, without overloading the communication capacity of the network. To solve this, we model the problem as rumour spreading, and use techniques from this field to propagate new documents. Our document spreading algorithm is designed such that a document has a very high probability of being replicated to the required number of nodes, but the probability of spreading to fewer or more nodes is small. Apart from facilitating rapid and restrained dissemination, our proposed method also withstands sudden spikes in the data creation rate. We illustrate the utility of the framework in the context of a micro-blogging social network. However, it could also be used to index dynamic web pages in a distributed search engine or for a system which indexes newly created BitTorrents in a decentralized environment. Simulations performed on a network of 100,000 nodes validate our proposed framework.