Query-Time Data Integration
Today, data is collected at ever-increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources preclude up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear if and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established.
This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all other potential data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced through fully automated retrieval and mapping methods is compensated for by answering those queries with ranked lists of alternative results. Each result is based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need.
To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which constructs a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state of the art by producing a set of individually consistent, but mutually diverse, alternative solutions while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which is able to process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality compared to using separate systems for the two tasks. Finally, we study the management of large-scale dataset corpora such as data lakes or Open Data platforms, which serve as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort.
Collectively, these three contributions form the foundation of a Query-time Data Integration architecture that enables ad-hoc data search and integration queries over large heterogeneous dataset collections.
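To make the Open World SQL idea concrete, the following sketch shows what such a query and its ranked alternative answers could look like; all table, column, and source names are invented for illustration and are not taken from the DrillBeyond system itself.

```python
# Hypothetical sketch (all table, column, and field names are invented): the
# attribute 'population' is not defined in the local database and would be
# augmented from Web data sources at query time by a DrillBeyond-style system.

open_world_query = """
SELECT c.name, c.gdp / c.population AS gdp_per_capita
FROM   countries c
WHERE  c.population > 1000000          -- 'population' is an open-world attribute
ORDER  BY gdp_per_capita DESC
"""

def show_alternatives(ranked_results):
    """Each ranked result is one consistent answer built from a different set of
    Web sources or query interpretations; the user picks the most suitable one."""
    for rank, result in enumerate(ranked_results, start=1):
        print(f"#{rank}: sources={result['sources']}, rows={len(result['rows'])}")

# Made-up example of what such a ranked answer list could look like.
show_alternatives([
    {"sources": ["worldbank_table_17"], "rows": [("France", 44229.0)]},
    {"sources": ["wiki_countries_3", "un_stats_9"], "rows": [("France", 43500.0)]},
])
```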
Providing Insight into the Performance of Distributed Applications Through Low-Level Metrics
The field of high-performance computing (HPC) has always dealt with the bleeding edge of computational hardware and software to achieve the maximum possible performance for a wide variety of workloads. When dealing with brand-new technologies, it can be difficult to understand how these technologies work and why they work the way they do. One of the more prevalent approaches to providing insight into modern hardware and software is to provide tools that allow developers to access low-level metrics about their performance. The modern HPC ecosystem supports a wide array of technologies, but in this work, I will focus on two particularly influential technologies: the Message Passing Interface (MPI) and Graphics Processing Units (GPUs).
For many years, MPI has been the dominant programming paradigm in HPC. Indeed, over 90% of applications that are part of the U.S. Exascale Computing Project plan to use MPI in some fashion. The MPI Standard provides programmers with a wide variety of methods to communicate between processes, along with several other capabilities. The high-level MPI Profiling Interface has been the primary method for profiling MPI applications since the inception of the MPI Standard, and more recently the low-level MPI Tool Information Interface was introduced.
Accelerators like GPUs have been increasingly adopted as the primary computational workhorse for modern supercomputers. GPUs provide more parallelism than traditional CPUs through a hierarchical grid of lightweight processing cores. NVIDIA provides profiling tools for their GPUs that give access to low-level hardware metrics.
In this work, I propose research on applying low-level metrics to both the MPI and GPU paradigms, in the form of an implementation of low-level metrics for MPI and a new method for analyzing GPU load imbalance with a synthetic efficiency metric. I introduce Software-based Performance Counters (SPCs) to expose internal metrics of the Open MPI implementation, along with a new interface for exposing these counters to users and tool developers. I also analyze a modified load-imbalance formula for GPU-based applications that uses low-level hardware metrics provided through nvprof in a hierarchical approach to take the internal load imbalance of the GPU into account.
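As an illustration of the load-imbalance idea, the following sketch computes the classic load-balance efficiency (mean load divided by maximum load) from per-GPU busy times; the modified, hierarchical formula proposed in the work itself is not reproduced here, and all numbers are made up.

```python
from statistics import mean

def load_balance_efficiency(busy_times):
    """Classic load-balance efficiency: mean load divided by max load.
    1.0 means perfectly balanced work; lower values indicate imbalance.
    The proposed work refines this with GPU-internal imbalance measured via
    low-level hardware counters; that refinement is not reproduced here."""
    if not busy_times:
        raise ValueError("need at least one measurement")
    return mean(busy_times) / max(busy_times)

# Example: per-GPU kernel busy times in milliseconds (made-up numbers, e.g. as
# one might collect with nvprof).
print(load_balance_efficiency([120.0, 118.5, 131.2, 95.7]))
```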
Effect of arsenic-phosphorus interaction on arsenic-induced oxidative stress in chickpea plants
Arsenic-induced oxidative stress in chickpea was investigated under glasshouse conditions in response to application of arsenic and phosphorus. Three levels of arsenic (0, 30 and 60 mg kg−1) and four levels of P (50, 100, 200, and 400 mg kg−1) were applied to soil-grown plants. Increasing levels of both arsenic and P significantly increased arsenic concentrations in the plants. Shoot growth was reduced with increased arsenic supply regardless of applied P levels. Applied arsenic induced oxidative stress in the plants, and the concentrations of H2O2 and lipid peroxidation were increased. Activity of superoxide dismutase (SOD) and concentrations of non-enzymatic antioxidants decreased in these plants, but activities of catalase (CAT) and ascorbate peroxidase (APX) were significantly increased under arsenic phytotoxicity. Increased supply of P decreased activities of CAT and APX, and decreased concentrations of non-enzymatic antioxidants, but the high-P plants had lowered lipid peroxidation. It can be concluded that P increased the uptake of arsenic from the soil, probably by making it more available, but that, although plant growth was inhibited by arsenic, the P may have partially protected the membranes from arsenic-induced oxidative stress.
A Domain-Specific Language for Do-It-Yourself Analytical Mashups
The increasing amount and variety of data available on the web leads to new possibilities in end-user-focused data analysis. While the classic database technologies for data integration and analysis (ETL and BI) are too complex for the needs of end users, newer technologies like web mashups are not optimal for data analysis. To make productive use of the data available on the web, end users need easy ways to find, join and visualize it. We propose a domain-specific language (DSL) for querying a repository of heterogeneous web data. In contrast to query languages such as SQL, this DSL describes the visualization of the queried data in addition to the selection, filtering and aggregation of the data. The resulting data mashup can be made interactive by leaving parts of the query variable. We also describe an abstraction layer above this DSL that uses a recommendation-driven natural language interface to reduce the difficulty of creating queries in this DSL.
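Since the abstract does not reproduce the DSL's concrete syntax, the following sketch only illustrates the idea in Python terms: one declarative description that combines filtering, aggregation, and visualization, with variable parts that keep the mashup interactive; all names are invented.

```python
# Purely illustrative sketch of an analytical-mashup description (invented API).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MashupQuery:
    source: str
    filters: list = field(default_factory=list)
    group_by: Optional[str] = None
    aggregate: Optional[str] = None
    chart: Optional[str] = None
    variables: dict = field(default_factory=dict)

    def where(self, condition):
        self.filters.append(condition)
        return self

    def summarize(self, group_by, aggregate):
        self.group_by, self.aggregate = group_by, aggregate
        return self

    def visualize(self, chart):
        # Unlike SQL, the visualization is part of the query description itself.
        self.chart = chart
        return self

    def let(self, **variables):
        # Parts left variable can later be bound interactively by the end user.
        self.variables.update(variables)
        return self

# "Average fuel price per region as a bar chart, with the year left variable."
query = (MashupQuery("fuel_prices")
         .where("year == $year")
         .summarize(group_by="region", aggregate="avg(price)")
         .visualize("bar_chart")
         .let(year=2012))
print(query)
```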
Rule-based Construction of Matching Processes
Mapping complex metadata structures is crucial in a number of domains such as data integration, ontology alignment, and model management. To speed up this process, automatic matching systems have been developed to compute mapping suggestions that can be corrected by a user. However, constructing and tuning match strategies still requires a high manual effort by matching experts, as well as correct mappings to evaluate the generated mappings. We therefore propose a self-configuring schema matching system that is able to automatically adapt to the mapping problem at hand. Our approach is based on analyzing the input schemas as well as intermediate matching results. A variety of matching rules use the analysis results to automatically construct and adapt an underlying matching process for a given match task. We comprehensively evaluate our approach on different mapping problems from the schema, ontology, and model management domains. The evaluation shows that our system is able to robustly return good-quality mappings across different mapping problems and domains.
Comment: 10 pages
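The abstract does not list the concrete analysis features, rules, or matchers, so the following sketch is only a rough illustration of the idea that rules inspect schema features and assemble a matching pipeline; all names and thresholds are invented.

```python
# Illustrative sketch only: the actual analysis features, rules and matchers of
# the proposed system are not specified in the abstract; names are invented.

def analyze(schema):
    """Derive simple features of an input schema (a list of element dicts)."""
    names = [el["name"] for el in schema]
    return {
        "size": len(schema),
        "avg_name_length": sum(len(n) for n in names) / max(len(names), 1),
        "has_descriptions": any(el.get("description") for el in schema),
    }

def construct_matching_process(source, target):
    """Rules inspect the schema features and assemble a matcher pipeline."""
    features = (analyze(source), analyze(target))
    pipeline = ["name_similarity"]                    # always-useful baseline
    if all(f["has_descriptions"] for f in features):  # rule: exploit documentation
        pipeline.append("description_similarity")
    if max(f["size"] for f in features) > 500:        # rule: prune large problems first
        pipeline.insert(0, "candidate_filtering")
    return pipeline

source = [{"name": "orderDate", "description": "date of purchase"}]
target = [{"name": "purchase_date", "description": "when the order was placed"}]
print(construct_matching_process(source, target))
```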
Regulation of Investment Advice and Behavioral Finance: The Investor Model: Homo Oeconomicus vs. Reality
This work examines the regulation of investment advice and shows why, despite the legislator's best intentions, it has so far not been possible to create a legal framework that reduces the number of disappointed and deceived investors. It draws on the findings of the increasingly widespread field of behavioral finance (the study of investor behavior). The analysis shows how and why the legislator's model of the investor and the regulation of investment advice diverge from reality. In order to steer the behavior of the actors before, during, and after the advisory process, one must understand their behavioral patterns and use this knowledge both to adapt the instruments of investor protection used so far (keyword: information duties) and to draw on new methods (e.g., choice architecture). The book is addressed to legislators and interested legal scholars, but also to the courts and lawyers dealing with investment advice and, last but not least, to investors who want to learn something about themselves.
A Qualitative Study of Sloyd Teachers' Perceptions of How Authority Is Legitimized
The teacher's role as an authority in the classroom is something that has always been present, is present, and will continue to be present in the future. The purpose of the study is to examine how some practicing certified teachers in the subject of sloyd understand authority as a concept, how the teacher's role as an authority in the classroom is legitimized, and whether there are situations in which acting in an authoritarian manner is considered legitimate. The theory used in the study is Max Weber's (1983) theory of authority. The study presents a compilation of relevant prior research on how teachers establish authority in the classroom and on how the view of authority in schools has changed along with the development of society. The study takes a qualitative approach and uses semi-structured interviews, presented through thematic analysis. Four themes are presented in the results: Authority means order and structure; The importance of building relationships; Teachers legitimize their authority through charisma; and If the student's safety is at risk, it is permissible to be authoritarian. The results show that teachers' view of authority as a concept is positive, and that teachers consider it an individual task to establish authority, which is done by building relationships between teacher and student and by the teacher displaying charisma. Finally, the results show that if the teacher perceives a risk of personal injury, it is permissible to act in an authoritarian manner in order to prevent harm.
Frontiers in Crowdsourced Data Integration
There is an ever-increasing amount and variety of open web data available that is insufficiently examined or not considered at all in decision-making processes. This is because of the lack of end-user-friendly tools that help to reuse this public data and to create knowledge out of it. Therefore, we propose a schema-optional data repository that provides the flexibility necessary to store and gradually integrate heterogeneous web data. Based on this repository, we propose a semi-automatic schema enrichment approach that efficiently augments the data in a "pay-as-you-go" fashion. Due to the ambiguities that inherently arise, we further propose a crowd-based verification component that is able to resolve such conflicts in a scalable manner.
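As a rough, hypothetical illustration of the pay-as-you-go enrichment and crowd verification idea (none of the names or heuristics below come from the proposed system), ambiguous automatically derived schema suggestions could be turned into simple verification questions for the crowd:

```python
# Hedged sketch (invented names and heuristics): ambiguous, automatically
# derived schema suggestions become yes/no questions for the crowd instead of
# being resolved up front.

def enrichment_suggestions(column_values):
    """Very naive 'pay-as-you-go' type guesses for a schema-less column."""
    suggestions = []
    if all(v.replace(".", "", 1).isdigit() for v in column_values):
        suggestions.append("numeric measurements")
    if all(len(v) == 4 and v.isdigit() for v in column_values):
        suggestions.append("years")
    return suggestions or ["free text"]

def crowd_tasks(column_name, column_values):
    guesses = enrichment_suggestions(column_values)
    if len(guesses) <= 1:
        return []    # unambiguous: no crowd input needed
    return [f"Does column '{column_name}' contain {g}? (yes/no)" for g in guesses]

print(crowd_tasks("col_3", ["1999", "2004", "2010"]))
```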
Identifying And Weighting Integration Hypotheses On Open Data Platforms
Open data platforms such as data.gov or opendata.socrata.com provide a huge amount of valuable information. Their free-for-all nature, the lack of publishing standards and the multitude of domains and authors represented on these platforms lead to new integration and standardization problems. At the same time, crowd-based data integration techniques are emerging as a new way of dealing with these problems. However, these methods still require input in the form of specific questions or tasks that can be passed to the crowd. This paper discusses integration problems on Open Data platforms and proposes a method for identifying and ranking integration hypotheses in this context. We will evaluate our findings by conducting a comprehensive evaluation on one of the largest Open Data platforms.
Comment: Presented at the First International Workshop On Open Data, WOD-2012 (http://arxiv.org/abs/1204.3726)
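The paper's actual hypothesis generation and weighting method is not detailed in the abstract; as a rough illustration, the sketch below merely ranks column pairs from two published datasets by a simple name-similarity score (all names and the threshold are invented):

```python
# Sketch only: ranks column pairs that might describe the same attribute.
from difflib import SequenceMatcher
from itertools import product

def integration_hypotheses(dataset_a, dataset_b, threshold=0.5):
    """Yield (column_a, column_b, score) pairs sorted by descending similarity."""
    hypotheses = []
    for a, b in product(dataset_a, dataset_b):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            hypotheses.append((a, b, round(score, 2)))
    return sorted(hypotheses, key=lambda h: h[2], reverse=True)

print(integration_hypotheses(["Zip Code", "School Name"], ["zipcode", "school"]))
```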
Top-k Entity Augmentation using Consistent Set Covering
Entity augmentation is a query type in which, given a set of entities and a large corpus of possible data sources, the values of a missing attribute are to be retrieved. State-of-the-art methods return a single result that, to cover all queried entities, is fused from a potentially large set of data sources. We argue that queries on large corpora of heterogeneous sources using information retrieval and automatic schema matching methods cannot easily return a single result that the user can trust, especially if the result is composed from a large number of sources that the user has to verify manually. We therefore propose to process these queries in a top-k fashion, in which the system produces multiple minimal consistent solutions from which the user can choose in order to resolve the uncertainty of the data sources and methods used. In this paper, we introduce and formalize the problem of consistent, multi-solution set covering, and present algorithms based on a greedy and a genetic optimization approach. We then apply these algorithms to Web table-based entity augmentation. The publication further includes a Web table corpus with 100M tables, and a Web table retrieval and matching system in which these algorithms are implemented. Our experiments show that the consistency and minimality of the augmentation results can be improved using our set covering approach, without loss of precision or coverage and while producing multiple alternative query results.
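The following sketch illustrates the greedy flavor of consistent, multi-solution set covering described above: each solution covers all queried entities with few sources, and sources used in earlier solutions are down-weighted so that later solutions stay diverse; the paper's actual consistency scoring and the genetic variant are not reproduced here.

```python
# Hedged sketch of greedy top-k set covering in the spirit of the paper.

def greedy_cover(entities, sources, used_before):
    """Cover all entities greedily; sources used in earlier solutions count only half."""
    remaining, solution = set(entities), []
    while remaining:
        def gain(item):
            name, covered = item
            weight = 0.5 if name in used_before else 1.0
            return len(covered & remaining) * weight
        name, covered = max(sources.items(), key=gain)
        if not covered & remaining:
            return None    # the queried entities cannot be fully covered
        solution.append(name)
        remaining -= covered
    return solution

def top_k_covers(entities, sources, k=3):
    solutions, used = [], set()
    for _ in range(k):
        cover = greedy_cover(entities, sources, used)
        if cover is None or cover in solutions:
            break
        solutions.append(cover)
        used.update(cover)
    return solutions

sources = {"tableA": {"France", "Spain"}, "tableB": {"Spain", "Italy"},
           "tableC": {"France", "Italy"}, "tableD": {"France", "Spain", "Italy"}}
print(top_k_covers({"France", "Spain", "Italy"}, sources, k=3))
```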
