DPQL: Applications for Holistic Data Profiling
Seeger, Marcian and Papenbrock, Thorsten
The PDF is available here.
Abstract
Data profiling is the process of extracting implicit metadata, such as data types, attribute
statistics, and various types of data dependencies, from raw datasets. Because structural metadata
is often not stored explicitly, many data management applications rely on data profiling to identify,
for example, functional dependencies, inclusion dependencies, or unique column combinations. The
utilization of automatically discovered metadata within use cases, such as data discovery, integration,
normalization, cleaning, or optimization, is, however, still a very complicated process. This is because
existing data profiling approaches consider different types of metadata in isolation, although in practice
many use cases require specific combinations, i.e., patterns, of structural metadata. For this reason,
we recently proposed the Data Profiling Query Language (DPQL), which can express (and, in the near
future, holistically discover) arbitrary patterns of various types of metadata. DPQL query patterns
allow data scientists to express exactly what metadata real-world applications require. This eliminates
otherwise complicated post-processing efforts, reduces result sizes, and might lead to significantly
shorter profiling times. To demonstrate the expressiveness and versatility of DPQL, we survey a
variety of data engineering applications in this paper and solve their metadata requirements with
concrete DPQL query patterns. We also measure DPQL result sizes on two benchmark datasets and
compare them to the result sizes of standard data profiling algorithms to demonstrate that holistic data
profiling with DPQL produces more suitable and in many cases also much smaller result sets.
Bibliography
BibTeX
@article{dpql,
  title={DPQL: Applications for Holistic Data Profiling},
  author={Seeger, Marcian and Papenbrock, Thorsten},
  journal={BTW},
  year={2025},
  publisher={Gesellschaft f{\"u}r Informatik eV}
}
EndNote
%0 Journal Article
%T DPQL: Applications for Holistic Data Profiling
%A Seeger, Marcian
%A Papenbrock, Thorsten
%D 2025
%I Gesellschaft für Informatik eV
RefMan
TY - JOUR
AU - Seeger, Marcian
AU - Papenbrock, Thorsten
TI - DPQL: Applications for Holistic Data Profiling
T2 - BTW
PY - 2025
DA - 2025
PB - Gesellschaft für Informatik eV