Finding a Second Wind: Speeding Up Graph Traversal Queries in RDBMSs Using Column-Oriented Processing
Recursive queries and recursive derived tables constitute an important part
of the SQL standard. Their efficient processing is important for many real-life
applications that rely on graph or hierarchy traversal. Position-enabled
column-stores offer a novel opportunity to improve run times for queries of
this type. Such systems allow the engine to explicitly use data positions (row
ids) inside its core and thus enable novel, efficient implementations of query
plan operators.
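To make the targeted workload concrete, the following is a minimal sketch of a recursive graph-traversal query of the kind discussed above, expressed as a standard WITH RECURSIVE derived table and run through Python's sqlite3 module purely for illustration; the edges table and the start vertex are assumptions, and nothing here reflects PosDB or PostgreSQL internals.

```python
# Minimal illustration of a recursive (BFS-style) graph-traversal query.
# The schema edges(src, dst) and the start vertex are assumed for the example;
# SQLite is used only because it ships with Python and supports WITH RECURSIVE.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    INSERT INTO edges VALUES (1, 2), (1, 3), (2, 4), (3, 4), (4, 5);
""")

# The recursive derived table expands the frontier one edge hop at a time.
rows = conn.execute("""
    WITH RECURSIVE reachable(vertex, depth) AS (
        SELECT 1, 0                       -- assumed start vertex
        UNION
        SELECT e.dst, r.depth + 1
        FROM reachable AS r
        JOIN edges AS e ON e.src = r.vertex
    )
    SELECT vertex, MIN(depth) AS depth
    FROM reachable
    GROUP BY vertex
    ORDER BY vertex;
""").fetchall()

print(rows)  # [(1, 0), (2, 1), (3, 1), (4, 2), (5, 3)]
```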
In this paper, we present an approach that significantly speeds up recursive
query processing inside RDBMSs. Its core idea is to employ a particular aspect
of column-store technology (late materialization), which enables the query
engine to manipulate data positions during query execution. Building on this
idea, we propose two sets of Volcano-style operators intended to handle
different query cases.
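The operators themselves live inside PosDB's C++ engine; purely as a conceptual sketch (all names and data structures below are ours, not PosDB's), the following Python fragment illustrates the late-materialization idea behind the position-based operators: each traversal step joins on row ids of the edge table and reads the actual column values only for the positions it needs.

```python
# Conceptual sketch of a position-based BFS expansion (not PosDB code).
# Edges are stored as two parallel columns; a row id identifies an edge.
from collections import defaultdict

src_col = [1, 1, 2, 3, 4]
dst_col = [2, 3, 4, 4, 5]

# Position index: vertex value -> row ids of its outgoing edges.
index = defaultdict(list)
for row_id, src in enumerate(src_col):
    index[src].append(row_id)

def bfs_positions(start):
    """Return {vertex: depth}; frontiers are expanded via edge row ids."""
    depths = {start: 0}
    frontier = [start]
    level = 0
    while frontier:
        level += 1
        # 1. Operate on positions only: row ids of edges leaving the frontier.
        positions = [rid for v in frontier for rid in index[v]]
        # 2. Late materialization: fetch dst values for exactly those row ids.
        next_frontier = []
        for rid in positions:
            v = dst_col[rid]
            if v not in depths:
                depths[v] = level
                next_frontier.append(v)
        frontier = next_frontier
    return depths

print(bfs_positions(1))  # {1: 0, 2: 1, 3: 1, 4: 2, 5: 3}
```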
To validate our ideas, we have implemented the proposed approach in
PosDB, a column-store RDBMS with SQL support. We experimentally demonstrate
the viability of our approach by providing a comparison with PostgreSQL.
Experiments show that for breadth-first search: 1) our position-based approach
yields up to 6x better results than PostgreSQL, 2) our tuple-based one results
in only a 3x improvement when using a special rewriting technique, but it
applies to a larger number of cases, and 3) neither approach can be emulated
efficiently in row-stores.
Solving Data Quality Problems with Desbordante: a Demo
Data profiling is an essential process in modern data-driven industries. One
of its critical components is the discovery and validation of complex
statistics, including functional dependencies, data constraints, association
rules, and others.
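As a small, self-contained illustration of what validating one such statistic means (the table and column names below are made up and independent of any particular profiler), checking the functional dependency zip_code -> city amounts to verifying that no zip_code value maps to more than one distinct city:

```python
# Toy validation of a functional dependency zip_code -> city on made-up data.
import pandas as pd

df = pd.DataFrame({
    "zip_code": ["10001", "10001", "94105", "94105"],
    "city":     ["New York", "New York", "San Francisco", "San Franciso"],  # typo
})

# The FD holds iff every zip_code determines exactly one distinct city.
cities_per_zip = df.groupby("zip_code")["city"].nunique()
violations = cities_per_zip[cities_per_zip > 1]

print("FD zip_code -> city holds:", violations.empty)
print(violations)  # 94105 maps to two spellings of the same city
```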
However, most existing data profiling systems that focus on complex
statistics do not provide proper integration with the tools used by
contemporary data scientists. This creates a significant barrier to the
adoption of these tools in the industry. Moreover, existing systems were not
created with industrial-grade workloads in mind. Finally, they do not aim to
provide descriptive explanations, i.e., why a given pattern does not hold. This
is a significant issue, as understanding the underlying reasons for a specific
pattern's absence is essential for making informed decisions based on the data.
Because of that, these patterns are effectively left hanging in the air: their
application scope is rather limited, and they are rarely used by the broader
public. At the same time, as we are going to demonstrate in this presentation,
complex statistics can be efficiently used to solve many classic data quality
problems.
Desbordante is an open-source data profiler that aims to close this gap. It
is built with an emphasis on industrial application: it is efficient, scalable,
resilient to crashes, and provides explanations. Furthermore, it provides
seamless Python integration by offloading various costly operations, not only
mining, to the C++ core.
In this demonstration, we show several scenarios that allow end users to
solve different data quality problems. Namely, we showcase typo detection, data
deduplication, and data anomaly detection scenarios.
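To give a flavour of one of these scenarios, the following is a conceptual stand-in written with plain pandas (made-up data and column names, not the actual Desbordante demo code): data deduplication can be framed as grouping records on attributes expected to identify a single entity and flagging groups that contain more than one, slightly differing, row.

```python
# Conceptual stand-in for a deduplication scenario (pandas only, made-up data);
# the real demo relies on dependencies discovered by Desbordante instead.
import pandas as pd

customers = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@y.com"],
    "name":  ["Ann Lee", "Ann  Lee", "Bob Roe"],   # extra space: near-duplicate
    "phone": ["555-0101", "555-0101", "555-0202"],
})

# Attributes expected to identify a single customer.
key = ["email", "phone"]

# Groups with more than one row are duplicate candidates to review or merge.
dupes = customers[customers.duplicated(subset=key, keep=False)]
print(dupes.sort_values(key))
```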
