40 research outputs found
Novel database design for extreme scale corpus analysis
This thesis presents the patterns and methods uncovered in the development of a new scalable corpus database management system, LexiDB, which can handle the ever-growing size of modern corpus datasets. Initially, an exploration of existing corpus data systems is conducted which examines their usage in corpus linguistics as well as their underlying architectures. From this survey, it is identified that existing systems are designed primarily to be vertically scalable (i.e. scalable through the usage of bigger, better and faster hardware). This motivates a wider examination of modern distributable database management systems and information retrieval techniques used for indexing and retrieval. These techniques are modified and adapted into an architecture that can be horizontally scaled to handle ever bigger corpora. Based on this architecture several new methods for querying and retrieval that improve upon existing techniques are proposed as modern approaches to query extremely large annotated text collections for corpus analysis. The effectiveness of these techniques and the scalability of the architecture is evaluated where it is demonstrated that the architecture is comparably scalable to two modern No-SQL database management systems and outperforms existing corpus data systems in token level pattern querying whilst still supporting character level pattern matching
Unfinished Business: Construction and Maintenance of a Semantically Tagged Historical Parliamentary Corpus, UK Hansard from 1803 to the present day
Creating, curating and maintaining modern political corpora is becoming an ever more involved task. As interest from various social bodies and the general public in political discourse grows so too does the need to enrich such datasets with metadata and linguistic annotations. Beyond this, such corpora must be easy to browse and search for linguists, social scientists, digital humanists and the general public. We present our efforts to compile a linguistically annotated and semantically tagged version of the Hansard corpus from 1803 right up to the present day. This involves combining multiple sources of documents and transcripts. We describe our toolchain for tagging; using several existing tools that provide tokenisation, part-of-speech tagging and semantic annotations. We also provide an overview of our bespoke web-based search interface built on LexiDB. In conclusion, we examine the completed corpus by looking at four case studies making use of semantic categories made available by our toolchain
LexiDB: Patterns & Methods for Corpus Linguistic Database Management
LexiDB is a tool for storing, managing and querying corpus data. In contrast to other database management systems (DBMSs), it is designed specifically for text corpora. It improves on other corpus management systems (CMSs) because data can be added and deleted from corpora on the fly with the ability to add live data to existing corpora. LexiDB sits between these two categories of DBMSs and CMSs, more specialised to language data than a general-purpose DBMS but more flexible than a traditional static corpus management system. Previous work has demonstrated the scalability of LexiDB in response to the growing need to be able to scale out for ever-growing corpus datasets. Here, we present the patterns and methods developed in LexiDB for storage, retrieval and querying of multi-level annotated corpus data. These techniques are evaluated and compared to an existing CMS (Corpus Workbench CWB - CQP) and indexer (Lucene). We find that LexiDB consistently outperforms existing tools for corpus queries. This is particularly apparent with large corpora and when handling queries with large result sets
Infrastructure for Semantic Annotation in the Genomics Domain
We describe a novel super-infrastructure for biomedical text mining which incorporates an end-to-end pipeline for the collection, annotation, storage, retrieval and analysis of biomedical and life sciences literature, combining NLP and corpus linguistics methods. The infrastructure permits extreme-scale research on the open access PubMed Central archive. It combines an updatable Gene Ontology Semantic Tagger (GOST) for entity identification and semantic markup in the literature, with an NLP pipeline scheduler (Buster) to collect and process the corpus, and a bespoke columnar corpus database (LexiDB) for indexing. The corpus database is distributed to permit fast indexing, and provides a simple web front-end with corpus linguistics methods for sub-corpus comparison and retrieval. GOST is also connected as a service in the Language Application (LAPPS) Grid, in which context it is interoperable with other NLP tools and data in the Grid and can be combined with them in more complex workflows. In a literature-based discovery setting, we have created an annotated corpus of 9,776 papers with 5,481,543 words
Enhancing discoverability & access to data
This report summarises the first Community Conversation on improving data discoverability and access, which focused on spatial data and the Land Cover Map. The conversation took place during an online workshop held by UKCEH on 6 May 2025. This Community Conversation is part of our co-design approach to the Digital and Data Integration work package (WP2) in the National Capability for UK (NC-UK) challenges programme. The event attracted over 90 attendees (70 from outside UKCEH) including researchers, policy professionals, students, and data users. The workshop introduced recent updates to the Environmental Information Data Centre (EIDC) Catalogue to improve data access, as well as opportunities to enhance the underlying metadata. The Land Cover Map (LCM) datasets were showcased, along with a new Spatial Data Explorer (Beta) tool that will be launched soon with LCM 2024 datasets and expanded to include more datasets. Early ideas for using AI tools such as Semantic Search and Large Language Models (LLMs) to enhance discoverability were also explored. Panel discussions and participant feedback will help shape future improvements to the EIDC, LCM tools, and community co-design efforts
Understanding the Impacts of Online Mental Health Peer Support Forums: Realist Synthesis
Background: Online forums are widely used for mental health peer support. However, evidence of their safety and effectiveness is mixed. Further research focused on articulating the contexts in which positive and negative impacts emerge from forum use is required to inform innovations in implementation. Objective: This study aimed to develop a realist program theory to explain the impacts of online mental health peer support forums on users. Methods: We conducted a realist synthesis of literature published between 2019 and 2023 and 18 stakeholder interviews with forum staff. Results: Synthesis of 102 evidence sources and 18 interviews produced an overarching program theory comprising 22 context-mechanism-outcome configurations. Findings indicate that users’ perceptions of psychological safety and the personal relevance of forum content are foundational to ongoing engagement. Safe and active forums that provide convenient access to information and advice can lead to improvements in mental health self-efficacy. Within the context of welcoming and nonjudgmental communities, users may benefit from the opportunity to explore personal difficulties with peers, experience reduced isolation and normalization of mental health experiences, and engage in mutual encouragement. The program theory highlights the vital role of moderators in creating facilitative online spaces, stimulating community engagement, and limiting access to distressing content. A key challenge for organizations that host mental health forums lies in balancing forum openness and anonymity with the need to enforce rules, such as restrictions on what users can discuss, to promote community safety. Conclusions: This is the first realist synthesis of online mental health peer support forums. The novel program theory highlights how successful implementation depends on establishing protocols for enhancing safety and strategies for maintaining user engagement to promote forum sustainability
‘In my experience …’ : The use of the word experience in peer online forums for mental health
Objective: Peer support online forums potentially offer accessible and inexpensive access to information and support through shared lived experience, including in relation to mental health. However, the impacts of participating in online communities are not fully understood. The present study takes a linguistic perspective to investigating how references to personal lived experience are 1) used, i.e., how forum contributors present their experience; and 2) responded to, i.e., how forum contributors react to experience of others. Methods: The study employs the methods of corpus-based discourse analysis using data from two mental health forums. The study design and results have been conducted in consultation with a PPI group. Results: When sharing what they call their experience, forum contributors typically give advice and/or provide information for the benefit of others. The most frequent information type is ‘information about treatment and medication’, while the most frequent advice type is ‘advice to seek help’. When contributors respond to what they call others’ experience, they typically express gratitude and reciprocally share their own experience. In some cases, they also explicitly articulate the impact of reading others’ experience, for example, by saying that they feel less alone. Conclusion: While we found some instances of negative judgements about health professionals, we did not find any clearcut instances of mis/disinformation or potentially harmful advice. Overall, the analysis supports the view that sharing lived experience in peer online mental health forums can be beneficial
The ParlaMint corpora of parliamentary proceedings
This paper presents the ParlaMint corpora containing transcriptions of the sessions of the 17 European national parliaments with half a billion words. The corpora are uniformly encoded, contain rich meta-data about 11 thousand speakers, and are linguistically annotated following the Universal Dependencies formalism and with named entities. Samples of the corpora and conversion scripts are available from the project’s GitHub repository, and the complete corpora are openly available via the CLARIN.SI repository for download, as well as through the NoSketch Engine and KonText concordancers and the Parlameter interface for on-line exploration and analysis
An Orally Bioavailable, Indole-3-glyoxylamide Based Series of Tubulin Polymerization Inhibitors Showing Tumor Growth Inhibition in a Mouse Xenograft Model of Head and Neck Cancer.
A number of indole-3-glyoxylamides have previously been reported as tubulin polymerization inhibitors, although none has yet been successfully developed clinically. We report here a new series of related compounds, modified according to a strategy of reducing aromatic ring count and introducing a greater degree of saturation, which retain potent tubulin polymerization activity but with a distinct SAR from previously documented libraries. A subset of active compounds from the reported series is shown to interact with tubulin at the colchicine binding site, disrupt the cellular microtubule network, and exert a cytotoxic effect against multiple cancer cell lines. Two compounds demonstrated significant tumor growth inhibition in a mouse xenograft model of head and neck cancer, a type of the disease which often proves resistant to chemotherapy, supporting further development of the current series as potential new therapeutics
