Causality and the semantics of provenance
Provenance, or information about the sources, derivation, custody or history
of data, has been studied recently in a number of contexts, including
databases, scientific workflows and the Semantic Web. Many provenance
mechanisms have been developed, motivated by informal notions such as
influence, dependence, explanation and causality. However, there has been
little study of whether these mechanisms formally satisfy appropriate policies
or even how to formalize relevant motivating concepts such as causality. We
contend that mathematical models of these concepts are needed to justify and
compare provenance techniques. In this paper we review a theory of causality
based on structural models that has been developed in artificial intelligence,
and describe work in progress on a causal semantics for provenance graphs. Comment: Workshop submissio
Multiparameter shallow-seismic waveform inversion based on the Jensen-Shannon divergence
ABSTRACT: Seismic full-waveform inversion (FWI) or waveform inversion (WI) has gained extensive attention as a cutting-edge imaging method that is expected to reveal high-resolution images of complex geological structures. In this paper, we regard each 1-D signal in the inversion system as a 1-D probability distribution, then use the Jensen–Shannon divergence from information theory to measure the discrepancy between the predicted and observed signals, and finally implement a novel 2-D multiparameter shallow-seismic WI (MSWI). Essentially, the novel approach achieves an implicit weighting along the time axis for each 1-D adjoint source defined by the classical WI (CWI), thus enhancing the illumination of the deeper medium compared with the CWI. In inversion tests on a two-layer model and a fault model, the reconstruction accuracy of the new method for S-wave velocity and density is about 30 and 20 per cent higher, respectively, than that of the CWI under the same conditions. The reconstruction performance for P-wave velocity of the two methods is almost equal. In addition, the new 2-D MSWI is also resilient to white Gaussian noise in the data. Numerically, the inversion system shows almost its strongest sensitivities to the S-wave velocity and density, and its poorest sensitivity to the P-wave velocity. Finally, we test the novel method with a detection case for a power tunnel
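As a rough illustration of the misfit idea described in this abstract, the sketch below computes the Jensen-Shannon divergence between two short 1-D signals. The abstract does not say how a signed seismic trace is turned into a probability distribution, so the `to_distribution` step (absolute value plus normalization) is purely an assumption made here, and the sample traces are invented. This is not the authors' implementation.

```python
import math

def to_distribution(signal, eps=1e-12):
    """Map a signed 1-D signal to a discrete probability distribution.

    ASSUMPTION: take absolute values, add a small eps, and normalize to sum 1.
    The abstract does not specify the mapping; this is one simple choice.
    """
    mags = [abs(s) + eps for s in signal]
    total = sum(mags)
    return [m / total for m in mags]

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) using natural logarithms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical observed and predicted traces (made-up sample values).
observed = [0.0, 1.0, -2.0, 1.0, 0.0]
predicted = [0.0, 0.5, -1.0, 2.5, 0.0]

misfit = js_divergence(to_distribution(predicted), to_distribution(observed))
print(f"JS divergence misfit: {misfit:.6f}")
```

The divergence is symmetric in its arguments and bounded above by ln 2 when natural logarithms are used, which makes it a well-behaved scalar misfit regardless of the amplitude scale of the two traces.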
Proteogenomic Data and Resources for Pan-Cancer Analysis
The National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (CPTAC) investigates tumors from a proteogenomic perspective, creating rich multi-omics datasets connecting genomic aberrations to cancer phenotypes. To facilitate pan-cancer investigations, we have generated harmonized genomic, transcriptomic, proteomic, and clinical data for >1000 tumors in 10 cohorts to create a cohesive and powerful dataset for scientific discovery. We outline efforts by the CPTAC pan-cancer working group in data harmonization, data dissemination, and computational resources for aiding biological discoveries. We also discuss challenges for multi-omics data integration and analysis, specifically the unique challenges of working with both nucleotide sequencing and mass spectrometry proteomics data
Storing Auxiliary Data for Efficient Maintenance and Lineage Tracing of Complex Views
As views in a data warehouse become more complex, the view maintenance process can become very complicated and potentially very inefficient. Storing auxiliary views in the warehouse can reduce the complexity and improve the efficiency of view maintenance, and the same auxiliary views can help in efficiently answering lineage tracing queries over the warehouse views. In this paper, we study the problem of selecting auxiliary views to materialize in order to minimize the total view maintenance and lineage tracing cost. We consider relational views with arbitrary use of aggregation operators, and we define an initial search space for our optimization problem based on a normal form for such view definitions. We present several auxiliary view selection algorithms, and to study their performance we conduct experiments using the TPC-D benchmark in addition to synthetic view definitions and statistics. The results of our experiments show: (1) the exhaustive algorithm that selects the optimal set of auxiliary views is far too expensive in many cases; (2) two heuristic algorithms that we present select good (often optimal) sets of auxiliary views in a much shorter time; (3) even auxiliary views selected by a very simple algorithm can significantly reduce the overall view maintenance and lineage tracing cost
Lineage Tracing for General Data Warehouse Transformations
Data warehousing systems integrate information from operational data sources into a central repository to enable analysis and mining of the integrated information. During the integration process, source data typically undergoes a series of transformations, which may vary from simple algebraic operations or aggregations to complex "data cleansing" procedures. In a warehousing environment, the data lineage problem is that of tracing warehouse data items back to the original source items from which they were derived. We formally define the lineage tracing problem in the presence of general data warehouse transformations, and we present algorithms for lineage tracing in this environment. Our tracing procedures take advantage of known structure or properties of transformations when present, but also work in the absence of such information. Our results can be used as the basis for a lineage tracing tool in a general warehousing setting, and also can guide the design of data warehouses that enable efficient lineage tracing.
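The lineage problem stated in this abstract can be illustrated with a toy pipeline. The sketch below is only a minimal example under assumed data: the `ORDERS` table, the two transformations (a selection followed by a group-by sum), and the `trace_lineage` helper are all hypothetical and are not the algorithms presented in the paper.

```python
from collections import defaultdict

# Hypothetical source table: (order_id, region, amount).
ORDERS = [
    (1, "east", 120),
    (2, "west", 80),
    (3, "east", 40),
    (4, "west", 15),
    (5, "north", 60),
]

def select_large(rows, min_amount=50):
    """Transformation 1: keep only orders with amount >= min_amount."""
    return [r for r in rows if r[2] >= min_amount]

def total_by_region(rows):
    """Transformation 2: aggregate the amount per region."""
    totals = defaultdict(int)
    for _, region, amount in rows:
        totals[region] += amount
    return dict(totals)

def build_view(rows):
    """Warehouse view: total of large orders per region."""
    return total_by_region(select_large(rows))

def trace_lineage(region, rows, min_amount=50):
    """Return the source rows that produced the view item for `region`.

    The aggregation is traced back to the rows in its group, and the
    selection is traced back to the rows that passed the filter.
    """
    return [r for r in select_large(rows, min_amount) if r[1] == region]

view = build_view(ORDERS)
lineage = trace_lineage("east", ORDERS)
print(view)
print(lineage)
```

Here the transformations are simple enough that each can be inverted by inspection. The abstract's point is that tracing should exploit such known structure when it is available and still work when a transformation is opaque.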
Lineage Tracing in a Data Warehousing System
e system applies the tracing procedures to the source tables and/or auxiliary views to obtain the lineage results and show the specific view data derivation process. 1 Lineage Tracing System 1.1 Lineage Example Given a view data item I, the exact set of source data that produced I is called I's lineage. We use an example to illustrate the concepts; a full formalization of the problem along with solutions and algorithms are given in [2]. Consider a financial data warehouse with the three source tables shown in Figure 3. A view Promising (Figure 4) is defined to contain all "promising" industries, where an industry is regarded as promising if some stock in that industry is gaining money over all purchases, and the stock has a price-earnings ratio below 40. Over our sample source data the view contains two tuples, ⟨computer⟩ and ⟨m
Lineage Tracing in a Data Warehousing System (Demonstration Proposal)
A data warehousing system collects data from multiple distributed sources and stores the integrated information as materialized views in a local data warehouse. Users then perform data analysis and mining on the warehouse views. Figure 1 shows the basic architecture of a data warehousing system. In many cases, the warehouse view contents alone are not sufficient for in-depth analysis. It is often useful to be able to "drill through" from interesting (or potentially erroneous) view data to the original source data that derived the view data. For a given view data item, identifying the exact set of base data items that produced the view data item is termed the view data lineage problem. Motivation for and applications of lineage tracing in a warehousing environment are provided in [2]. In the context of the WHIPS data warehousing project at Stanford [3], we have developed a complete prototype that performs efficient and consistent lineage tracing. Some commercial data warehousing systems support schema-level lineage tracing, or provide specialized drill-down and/or drill-through facilities for multi-dimensional warehouse views. Our lineage tracing prototype supports more fine-grained instance-level lineage tracing for arbitrarily complex relational views, including aggregation. Our prototype automatically generates lineag
Practical Lineage Tracing in Data Warehouses
We consider the view data lineage problem in a warehousing environment: For a given data item in a materialized warehouse view, we want to identify the set of source data items that produced the view item. We formalize the problem and present a lineage tracing algorithm for relational views with aggregation. Based on our tracing algorithm, we propose a number of schemes for storing auxiliary views that enable consistent and efficient lineage tracing in a multisource data warehouse. We report on a performance study of the various schemes, identifying which schemes perform best in which settings. Based on our results, we have implemented a lineage tracing package in the WHIPS data warehousing system prototype at Stanford. With this package, users can select view tuples of interest, then efficiently "drill down" to examine the source data that produced them. 1 Introduction Data warehousing systems collect data from multiple distributed sources, integrate the information as materialized v..
