DPQL: The Data Profiling Query Language
Seeger, Marcian and Schmidl, Sebastian and Vielhauer, Alexander and Papenbrock, Thorsten
The PDF is available here.
Abstract
Data profiling describes the activity of extracting implicit metadata, such as schema
descriptions, data types, and various kinds of data dependencies, from a given data set. The
considerable number of research papers about novel metadata types and ever-faster data profiling
algorithms emphasizes the importance of data profiling in practice. Unfortunately, though, the current
state of data profiling research fails to address practical application needs: Typical data profiling
algorithms (i.e., challenging-to-operate structures) discover all (i.e., too many) minimal (i.e., the
wrong) data dependencies within minutes to hours (i.e., too long). Consequently, if we look at the
practical success of our research, we find that data profiling targets data cleaning, but most cleaning
systems still use only hand-picked dependencies; data profiling targets query optimization, but hardly
any query optimizer uses modern discovery algorithms for dependency extraction; data profiling targets
data integration, but the application of automatically discovered dependencies for matching purposes
is yet to be shown; the list goes on. We aim to solve the profiling-and-application disconnect
with a novel data profiling engine that integrates modern profiling techniques for various types of data
dependencies and provides the applications with a versatile, intuitive, and declarative Data Profiling
Query Language (DPQL). The DPQL enables applications to specify precisely what dependencies are
needed, which not only refines the results and makes the data profiling process more accessible but
also enables much faster and (in terms of dependency types and selections) holistic profiling runs. We
expect that integrating modern data profiling techniques and the post-processing of their results under
a single application endpoint will result in a series of significant algorithmic advances, new pruning
concepts, and a profiling engine with innovative components for workload autoconfiguration, query
optimization, and parallelization. With this paper, we present the first version of the DPQL syntax and
its semantics, which introduces a fundamentally new line of research in data profiling.
Bibliography
@article{dpql,
  title={DPQL: The Data Profiling Query Language},
  author={Seeger, Marcian and Schmidl, Sebastian and Vielhauer, Alexander and Papenbrock, Thorsten},
  journal={BTW},
  volume={20},
  pages={319--415},
  year={2023},
  publisher={Gesellschaft f{\"u}r Informatik eV}
}

%0 Journal Article
%T DPQL: The Data Profiling Query Language
%A Seeger, Marcian
%A Schmidl, Sebastian
%A Vielhauer, Alexander
%A Papenbrock, Thorsten
%@ 3885797259
%D 2023
%I Gesellschaft für Informatik eV

TY  - JOUR
AU  - Seeger, Marcian
AU  - Schmidl, Sebastian
AU  - Vielhauer, Alexander
AU  - Papenbrock, Thorsten
TI  - DPQL: The Data Profiling Query Language
T2  - BTW
VL  - 20
SP  - 319-415
PY  - 2023
DA  - 2023
PB  - Gesellschaft für Informatik eV
ER  -