A Review of Trusted Broker Architectures for Data Sharing
Sharing data across organizational boundaries must strike a balance between the competing data quality dimensions of access and security. Without access to data, it cannot be used and, consequently, is of no value. At the same time, uncontrolled access to data, especially sensitive personal data, can result in dire legal and ethical consequences. This paper discusses the trade-offs between security and access for three styles of trusted broker architectures, in the hope that this will provide guidance for organizations trying to implement data sharing systems. (Naval Postgraduate School Acquisition Research Program)
Reconsidering Learning Communities: Expanding the Discourse by Challenging the Discourse
This article draws on historical and philosophical lenses and interviews with students to question some fundamental tenets underlying the practice of freshman learning communities (FLCs): that they develop community and improve students' learning experiences. The article brings to the discourse of FLCs some critical questions regarding their value and practice.
Methods to Measure Importance of Data Attributes to Consumers of Information Products
Errors in data sources of information product (IP) manufacturing systems can degrade overall IP quality as perceived by consumers. Data defects from inputs propagate throughout the IP manufacturing process. Information Quality (IQ) research has focused on improving the quality of inputs to mitigate error propagation and ensure an IP will be fit for use by consumers. However, the feedback loop from IP consumers to IP producers is often incomplete, since the overall quality of the IP is not based solely on the quality of inputs but rather on the IP's fitness for use as a whole. It remains uncertain whether high-quality inputs directly translate into a high-quality IP. The methods proposed in this paper investigate the effects of intentionally decreasing, or disrupting, the quality of inputs by measuring consumers' evaluations against an undisrupted IP; the paper also proposes scenarios illustrating the advantage of these methods over traditional survey methods. Fitness for use may then be increased in future IP revisions by improving those attributes deemed "important" by consumers.
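As a minimal sketch of the disruption idea described in this abstract (not the authors' actual procedure, instruments, or data), the fragment below scores attribute importance by the drop in hypothetical consumer ratings when each attribute's quality is deliberately degraded in turn; all names and numbers are made up for illustration.

```python
# Hedged sketch: estimating attribute importance from the drop in consumer
# ratings when one attribute's quality is deliberately disrupted.
# All attribute names and ratings below are hypothetical.

baseline_rating = 8.2          # mean consumer rating of the undisrupted IP

# mean ratings when a single attribute is disrupted (hypothetical values)
disrupted_rating = {
    "customer_name":   5.1,
    "postal_address":  6.0,
    "phone_number":    7.9,
    "account_balance": 4.3,
}

# importance = how much perceived fitness for use drops when the attribute
# is degraded; larger drops mark attributes worth prioritizing in revisions
importance = {attr: baseline_rating - r for attr, r in disrupted_rating.items()}

for attr, drop in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{attr:15s} importance ~ {drop:.1f}")
```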
CoDoSA: A Lightweight, XML-Based Framework for Integrating Unstructured Textual Information
One of the most fundamental dimensions of information quality is access. For many organizations, a large part of their information assets is locked away in Unstructured Textual Information (UTI) in the form of email, letters, contracts, call notes, and spreadsheets. In addition to internal UTI, there is also a wealth of publicly available UTI on websites, in newspapers, courthouse records, and other sources that can add value when combined with internally managed information. This paper describes a system called Compressed Document Set Architecture (CoDoSA), designed to facilitate the integration of UTI into a structured database environment where it can be more readily accessed and manipulated. The CoDoSA Framework comprises an XML-based metadata standard and an associated Application Program Interface (API). The paper further describes how CoDoSA can facilitate the storage and management of information during the ETL (Extract, Transform, and Load) process used to integrate UTI. It also explains how CoDoSA promotes higher information quality by providing several features that simplify the governance of metadata standards and the enforcement of data quality constraints across different UTI applications and development teams. In addition, CoDoSA provides a mechanism for inserting semantic tags into captured UTI, tags that can be used in later steps to drive semantic-mediated queries and processes.
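The abstract does not give the CoDoSA metadata schema, so the sketch below only illustrates the general pattern it describes: wrapping a captured UTI document in XML metadata with embedded semantic tags for later semantic-mediated querying. Every element name, attribute, and value here is a hypothetical placeholder, not CoDoSA's actual standard.

```python
# Hedged sketch: a hypothetical CoDoSA-style XML record wrapping captured UTI
# with metadata and inline semantic tags. Tag names are illustrative only.
import xml.etree.ElementTree as ET

doc = ET.Element("codosa_document", attrib={"id": "doc-0001"})

meta = ET.SubElement(doc, "metadata")
ET.SubElement(meta, "source").text = "courthouse_records"   # hypothetical source label
ET.SubElement(meta, "captured_on").text = "2015-06-01"
ET.SubElement(meta, "content_type").text = "contract"

# semantic tags inserted into the captured text, usable by later
# semantic-mediated queries in the ETL pipeline
body = ET.SubElement(doc, "content")
body.text = "This agreement is made between "
party = ET.SubElement(body, "semantic_tag", attrib={"type": "party"})
party.text = "Acme Corp"
party.tail = " and the county clerk's office."

print(ET.tostring(doc, encoding="unicode"))
```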
Towards Trustable Language Models: Investigating Information Quality of Large Language Models
Large language models (LLMs) are generating information at a rapid pace, requiring users to increasingly rely on and trust that output. Despite the remarkable advances of LLMs, the information they generate is not completely trustworthy, owing to challenges in information quality. Specifically, the integrity of information quality decreases due to unreliable and biased tokenization during LLM pre-training. These information quality issues in turn lead to hallucination and fabricated information. Unreliable information can lead to flawed decisions in businesses, which impacts economic activity. In this work, we introduce a novel mathematical information quality evaluation of LLMs; we furthermore analyze and highlight information quality challenges and scaling laws for systematically scaling language models. (Comment: 31 pages)
Temporal RNA Integrity Analysis of Archived Spaceflight Biological Samples from ALSDA from 1991 to 2016
The purpose of this study is to assess the quality of spaceflight tissues stored in Ames Life Science Data Archive (ALSDA) freezers. Garnering information for downstream functional analysis, such as the generation of omics datasets from tissues, depends in part on the state of sample preservation. To assess the viability of a select group of tissues, RNA integrity number (RIN) values were calculated for RNA extracted from rodent livers. Rat livers from Spacelab Life Sciences 1 (SLS-1) and mouse livers from Commercial Biomedical Test Module 3 (CBTM-3), Rodent Research 1 (RR1), and Rodent Research 3 (RR3) were tested. It was found that mean RIN values from CBTM-3, RR1, and RR3 were suitable for downstream functional analysis (RIN > 5), while the mean RIN value for SLS-1 was not (RIN = 2.5 ± 0.1). Information from this study could lay the foundation for future efforts in determining the types of assays that are most appropriate for different tissues in ALSDA freezers, which would maximize the scientific return on rare spaceflight samples.
Critical Cultural Success Factors for Achieving High Quality Information in an Organization
While information and data quality practitioners are in general agreement that social, cultural, and organizational factors are the most important in determining the success or failure of an organization's data quality programs, there is little to no existing research quantifying these factors. In this research we build on both our previous research and that of others to distill and clarify those cultural factors which are the Critical Cultural Success Factors (CCSFs) for successful Information and Data Quality programs in an organization. Using the Delphi method to gain consensus from a group of experts, we distilled fourteen factors down to six and clarified the definitions of those six factors. We begin by explaining how these CCSFs fit into Organizational Learning Theory, and we plan to ultimately define a new system dynamics model incorporating them so that organizations and information quality practitioners can positively affect the success of information and data quality programs.
The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution
This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked, the evaluation of the model's performance should take into account the transitive closure of its pairwise linking decisions, not just the pairwise classifications alone. Part of the problem is that the measures of precision and recall as calculated for data mining classification algorithms such as logistic regression are different from these same measures applied to entity resolution (ER) results. As a classifier, logistic regression precision and recall measure the algorithm's pairwise decision performance. When applied to ER, precision and recall measure how accurately the set of input references was partitioned into subsets (clusters) referencing the same entity. When applied to datasets containing more than two references, ER is a two-step process. Step One is to classify pairs of records as linked or not linked. Step Two applies transitive closure to these linked pairs to find the maximally connected subsets (clusters) of equivalent references. The precision and recall of the final ER result will generally be different from the precision and recall measures of the pairwise classifier used to power the ER process. The experiments described in the paper were performed using a well-tested set of synthetic customer data for which the correct linking is known. The best F-measure of precision and recall for the final ER result was obtained by substantially increasing the threshold of the logistic regression pairwise classifier.
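As a rough sketch of the two-step process this abstract describes (pairwise logistic-regression classification followed by transitive closure), the fragment below trains a classifier on synthetic pairs, closes the linked pairs with union-find, and then measures precision and recall on the resulting clusters rather than on the raw pairwise decisions. It is not the authors' experimental code; the features, threshold, and data are hypothetical stand-ins.

```python
# Hedged sketch: pairwise logistic-regression linking followed by transitive
# closure, evaluated at the cluster level rather than the pairwise level.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

# --- toy data: each "pair" is a vector of noisy similarity scores -----------
rng = np.random.default_rng(0)
records = list(range(8))
true_entity = [0, 0, 0, 1, 1, 2, 2, 3]           # hypothetical ground truth

def pair_features(a, b):
    """Simulated similarity features for a record pair (illustration only)."""
    base = 0.8 if true_entity[a] == true_entity[b] else 0.2
    return base + rng.normal(0, 0.15, size=3)

pairs = list(combinations(records, 2))
X = np.array([pair_features(a, b) for a, b in pairs])
y = np.array([int(true_entity[a] == true_entity[b]) for a, b in pairs])

# --- Step One: pairwise classification ---------------------------------------
clf = LogisticRegression().fit(X, y)
threshold = 0.7     # the paper reports raising this threshold improved final ER F-measure
linked = [p for p, prob in zip(pairs, clf.predict_proba(X)[:, 1]) if prob >= threshold]

# --- Step Two: transitive closure via union-find ------------------------------
parent = list(records)
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i
def union(a, b):
    parent[find(a)] = find(b)
for a, b in linked:
    union(a, b)

# --- evaluate the ER result (clusters after closure), not just the classifier --
closed_pairs = {(a, b) for a, b in pairs if find(a) == find(b)}
true_pairs   = {(a, b) for a, b in pairs if true_entity[a] == true_entity[b]}
tp = len(closed_pairs & true_pairs)
precision = tp / len(closed_pairs) if closed_pairs else 0.0
recall    = tp / len(true_pairs) if true_pairs else 0.0
print(f"cluster-level precision={precision:.2f} recall={recall:.2f}")
```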
Theme-weighted Ranking of Keywords from Text Documents using Phrase Embeddings
Keyword extraction is a fundamental task in natural language processing that facilitates mapping of documents to a concise set of representative single- and multi-word phrases. Keywords from text documents are primarily extracted using supervised and unsupervised approaches. In this paper, we present an unsupervised technique that uses a combination of a theme-weighted personalized PageRank algorithm and neural phrase embeddings for extracting and ranking keywords. We also introduce an efficient way of processing text documents and training phrase embeddings using existing techniques. We share an evaluation dataset derived from an existing dataset that is used for choosing the underlying embedding model. The evaluations for ranked keyword extraction are performed on two benchmark datasets comprising short abstracts (Inspec) and long scientific papers (SemEval 2010), and the technique is shown to produce results better than the state-of-the-art systems. (Comment: preprint of a paper accepted in the Proceedings of the 1st IEEE International Conference on Multimedia Information Processing and Retrieval)
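As a hedged illustration of the general approach rather than the paper's implementation, the sketch below runs personalized PageRank over a small co-occurrence graph of candidate phrases, with made-up theme weights standing in for the embedding-derived weights the paper uses.

```python
# Hedged sketch: ranking candidate phrases with personalized PageRank over a
# co-occurrence graph. The theme weights below are invented placeholders for
# the phrase-embedding similarities described in the abstract.
import networkx as nx

# hypothetical candidate phrases and co-occurrence edges from one document
candidates = ["keyword extraction", "phrase embeddings", "pagerank",
              "text documents", "evaluation dataset"]
cooccurrence = [("keyword extraction", "phrase embeddings"),
                ("keyword extraction", "pagerank"),
                ("phrase embeddings", "text documents"),
                ("pagerank", "text documents"),
                ("text documents", "evaluation dataset")]

# hypothetical theme weights, e.g. similarity of each phrase embedding to a
# document/theme embedding (values here are made up for illustration)
theme_weight = {"keyword extraction": 0.9, "phrase embeddings": 0.8,
                "pagerank": 0.6, "text documents": 0.4,
                "evaluation dataset": 0.3}

G = nx.Graph()
G.add_nodes_from(candidates)
G.add_edges_from(cooccurrence)

# personalized PageRank: the restart distribution is biased toward phrases
# that score highly against the document's theme
total = sum(theme_weight.values())
personalization = {p: w / total for p, w in theme_weight.items()}
scores = nx.pagerank(G, alpha=0.85, personalization=personalization)

for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {phrase}")
```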
A Curriculum for a Master of Science in Information Quality
The first Master of Science in Information Quality (IQ) degree is designed and being offered to prepare students for careers in industry and government as well as for advanced graduate studies. The curriculum is guided by the Model Curriculum and Guidelines for Graduate Degree Programs in Information Systems, which are endorsed by the Association for Computing Machinery and the Association for Information Systems. The curriculum integrates two key educational innovations: (1) an interdisciplinary approach to curriculum design, and (2) a balance between theoretical rigor and practical relevance. In response to demand from industry, the curriculum aims to educate students who can lead the effort to solve current and future information quality problems. As such, problem-based learning is balanced with foundation-building learning to effectively deliver the intellectual content of the curriculum. Much of the individual course content is based on accumulated research results and practices developed over the last two decades. The curriculum is designed to balance information quality theory with industry best practices using modern tools and technology. It includes the skill sets that are critical to succeed as IQ professionals. Since IQ is an interdisciplinary field, the curriculum draws upon total quality management, databases, core knowledge of IQ, change management, project management, and IQ policy and strategy. The courses are delivered using case studies, hands-on laboratories, theory building, and team projects to enhance the student's learning experience. Upon completing the program, students will be equipped with sufficient breadth and depth in the IQ field to solve real-world problems and pursue further studies.
