Theses
Writing a thesis is the final step in obtaining a Bachelor's or Master's degree. A thesis is always coupled to a scientific project in some field of expertise. Candidates who want to write their thesis in the Big Data Analytics group should, therefore, be interested and trained in a field related to our research areas.
A thesis is an independent, scientific, and practical work. This means that the thesis and its related project are conducted exclusively by the candidate, that the execution follows proper scientific practices, and that all necessary artifacts, algorithms, and evaluations have actually been implemented and submitted as part of the thesis. A proper way of sharing code and evaluation artifacts is to create a public GitHub repository, which can then be referenced in the thesis. The thesis serves as documentation of the project and as a scientific analysis and reflection on the insights gained.
For students interested in a thesis, we offer interesting topics and close, continuous supervision throughout the entire thesis period. Every thesis is supervised by at least one member of our team, who can give advice and help in critical situations. The condensed results of our best Master theses have been published at top scientific venues, such as VLDB, CIKM, and EDBT.
A selection of open thesis topics can be found on this page. We also encourage interested students to suggest their own ideas in the context of our research areas and to contact individual members of the group directly. An ideal thesis topic is connected in some form to the research projects of a group member; that group member will then become the supervisor of the thesis. Hence, taking a look at the personal pages and our current projects is a good starting point for a thesis project. Recent publications at conferences such as VLDB or SIGMOD, or open research challenges on, for example, Kaggle, are good resources for finding interesting thesis ideas.
Organizational information
- Exposé: Before starting a thesis, Master students have to write a two- to five-page exposé. The exposé is a description of the planned project and includes a motivation for the topic, a literature review of related work, a draft of the research/project idea, and a plan for the final evaluation. Please consider our template with initial instructions when starting your exposé. The exposé can be created in the context of the "Selbstständiges wissenschaftliches Arbeiten" module.
- Timetable: Once the thesis project is started, it must be finished within six months for Master theses and four months for Bachelor theses. Only special circumstances, such as times of sickness, can extend this period. If you are working a regular job or need to take further courses during your thesis time, the thesis period can be extended as well. A thesis can be started at any time, either aligned with the semester schedule or independently of it.
- Presentations: The work on a Master thesis requires students to give at least two talks. A mid-term talk serves to get some additional feedback from a larger audience and to practice the final thesis defense; this talk is not graded. The final talk is a proper defense of the thesis and the final results; this talk is graded as one part of the academic performance.
Hints for the thesis
- Length: A typical thesis is 30-60 pages (Bachelor) and 40-90 pages (Master) long.
- Language: A thesis can be written in German or English. We recommend English, though.
- Format: We highly recommend writing a thesis in LaTeX, as in this way many structural defects can easily be avoided.
- Tips for writing a thesis
- Tips for writing a paper (short)
- Tips for writing a paper (long)
Bachelor and Master Theses
- Data Profiling
- DPQL: The Data Profiling Query Language
- The data profiling language DPQL is a recently developed metadata profiling interface that supports the discovery of complex metadata patterns.
- We aim to develop efficient profiling approaches that find these metadata patterns as fast as possible (a minimal validation sketch follows after the challenge list below).
- Open challenges:
- Theoretical foundations for the integration of ODs, MDs, MvDs, DCs, and further dependency types into the language.
- Theoretical foundations for the integration of SIZE, CARDINALITY, NEGATION, SPLIT and other filters into the language.
- Practical implementation of a static batch DPQL engine for UCC, FD, and IND patterns based on existing theory.
- Practical implementation of an incremental DPQL query engine for UCC, FD, and IND patterns based on existing theory.
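- To make the targeted profiling workload concrete, the following minimal Python sketch (not DPQL syntax and not the planned engine) shows the kind of validation primitives such an engine has to evaluate for UCC and FD patterns; the table, column names, and functions are made up for illustration.
```python
# Minimal sketch of two profiling primitives a DPQL engine would have to evaluate:
# unique column combinations (UCCs) and functional dependencies (FDs) on a pandas
# DataFrame. Illustrative only; not the project's actual implementation.
import pandas as pd

def is_ucc(df: pd.DataFrame, columns: list) -> bool:
    """A column combination is a UCC if no value combination occurs twice."""
    return not df.duplicated(subset=columns).any()

def holds_fd(df: pd.DataFrame, lhs: list, rhs: str) -> bool:
    """An FD lhs -> rhs holds if every lhs value combination maps to one rhs value."""
    return bool((df.groupby(lhs)[rhs].nunique(dropna=False) <= 1).all())

if __name__ == "__main__":
    data = pd.DataFrame({
        "id":   [1, 2, 3, 4],
        "zip":  ["35037", "35037", "35039", "35041"],
        "city": ["Marburg", "Marburg", "Marburg", "Cölbe"],
    })
    print(is_ucc(data, ["id"]))              # True: 'id' uniquely identifies rows
    print(holds_fd(data, ["zip"], "city"))   # True: each zip maps to exactly one city
```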
- GraphDC: Efficient Denial Constraint Profiling for Graphs
- Denial constraints describe data patterns that must not occur in the data. In this project, we aim to discover such denial constraints in graph data. For example, a graph may "not contain a clique of size four (or larger)" or it may "not contain a path between nodes of type X that is equal to (or shorter than) 4 hops" (a minimal validation sketch follows below).
- The project will be conducted with a partner research group that specializes in graph generation and the group of Prof. Dr. Gabriele Taentzer.
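- The following minimal Python sketch only illustrates how the two example constraints above could be validated with networkx; it is not the discovery algorithm the thesis would develop, and the example graph and node types are made up.
```python
# Validating the two example graph denial constraints: "no clique of size >= 4"
# and "no path of <= 4 hops between two distinct nodes of type X".
import networkx as nx
from itertools import combinations

def violates_clique_constraint(g: nx.Graph, max_allowed: int = 3) -> bool:
    """True if the graph contains a clique larger than max_allowed."""
    return any(len(clique) > max_allowed for clique in nx.find_cliques(g))

def violates_path_constraint(g: nx.Graph, node_type: str, max_hops: int = 4) -> bool:
    """True if two distinct nodes of the given type are within max_hops of each other."""
    typed = [n for n, data in g.nodes(data=True) if data.get("type") == node_type]
    for u, v in combinations(typed, 2):
        if nx.has_path(g, u, v) and nx.shortest_path_length(g, u, v) <= max_hops:
            return True
    return False

if __name__ == "__main__":
    g = nx.Graph()
    g.add_nodes_from([(0, {"type": "X"}), (5, {"type": "X"})])
    g.add_edges_from([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)])
    print(violates_clique_constraint(g))     # False: largest clique has size 2
    print(violates_path_constraint(g, "X"))  # False: nodes 0 and 5 are 5 hops apart
```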
- Sindy++: Reactive, Distributed Discovery of Inclusion Dependencies
- We aim to translate Sindy, a batch-processing-based algorithm for the discovery of inclusion dependencies, into a reactive, more efficient data profiling approach using Akka.
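- As a point of reference, the following minimal, single-machine Python sketch illustrates the attribute-set intersection idea behind Sindy-style unary IND discovery (as we understand it); it is a simplification, not the reactive, distributed Akka design the thesis targets, and the example tables are made up.
```python
# For every value, collect the set of attributes it appears in; an IND A ⊆ B is a
# candidate iff B occurs in the attribute set of every value of A.
from collections import defaultdict

def discover_unary_inds(tables: dict) -> set:
    value_attrs = defaultdict(set)
    for table, columns in tables.items():
        for column, values in columns.items():
            for value in values:
                value_attrs[value].add((table, column))

    # Intersect the attribute sets per dependent attribute.
    candidates = {}
    for attrs in value_attrs.values():
        for attr in attrs:
            if attr in candidates:
                candidates[attr] &= attrs
            else:
                candidates[attr] = set(attrs)

    return {(dep, ref) for dep, refs in candidates.items() for ref in refs if ref != dep}

if __name__ == "__main__":
    tables = {
        "orders":    {"customer_id": [1, 2, 2, 3]},
        "customers": {"id": [1, 2, 3, 4]},
    }
    print(discover_unary_inds(tables))  # orders.customer_id ⊆ customers.id
```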
- Many++: Fuzzy Inclusion Dependency Discovery for Data Integration
- We aim to translate the Many algorithm for inclusion dependency discovery on Web Tables into a partial IND discovery algorithm that is better suited for data integration scenarios.
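- The following minimal Python sketch only illustrates one possible notion of "partial" such an algorithm could build on: instead of requiring every value of the dependent column to occur in the referenced column, measure the fraction that does and accept the IND above a threshold. The function name, threshold, and example columns are made up.
```python
# Coverage-based partial inclusion dependency check (illustrative only).
def partial_ind_coverage(dependent: list, referenced: list) -> float:
    """Fraction of distinct dependent values that also appear in the referenced column."""
    dep, ref = set(dependent), set(referenced)
    return len(dep & ref) / len(dep) if dep else 1.0

if __name__ == "__main__":
    web_table_col = ["Berlin", "Hamburg", "Munich", "Köln"]
    reference_col = ["Berlin", "Hamburg", "Munich", "Cologne", "Frankfurt"]
    coverage = partial_ind_coverage(web_table_col, reference_col)
    print(coverage >= 0.75)  # True: 3 of 4 distinct values are covered
```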
- HPIValid++ and FDHits++
- HPIValid and FDHits are highly effective data profiling algorithms written in Rust and C++. Because these languages are inherently faster than Java, it is unclear how well the algorithms themselves perform compared to their Java-based competitors. We therefore aim to translate both algorithms into Java to enable a fair comparison.
- DPBench: A Survey and Benchmark of Data Profiling Platforms
- In this project, we aim to survey and benchmark existing data profiling systems, including Astera, OpenRefine, Desbordante, Viadotto, Talend Data Fabric, Informatica Data Explorer, IBM InfoSphere Information Analyser, Alteryx Designer Cloud, Apache Griffin, Integrate.io, Tibco Clarity, DemandTools, RingLead, Melissa Clean Suite, WinPure Clean & Match, Informatica Cloud Data Quality, Oracle Enterprise Data Quality, SAS Data Quality, and IBM Infosphere Information Server.
- Time Series Analytics
- Präzisions-LDS: Time Series Analytics for Thermal Spraying Processes
- Thermal Spraying is a coating technique for metal surfaces. We can monitor these processes and generate time series data about their observable behavior. In this project, we aim to assist these processes with AI technology.
- Challenges:
- Predicting the thickness of coating surfaces.
- Predicting the stability of coating processes.
- Improving the robustness and scalability of analyzing large quantities of time series data for anomalies and other properties.
- Assessing spray streams and their proper functioning based on image recordings of the spraying process.
- Crisis in emergenCity - Intelligent Info-Station Planning
- Based on the movement events of agents in cities, we aim to plan the placement of info-stations such that these stations inform as many nearby agents as possible within a fixed time period (a greedy baseline sketch follows below).
- The project will be conducted in collaboration with the emergenCity project.
- We will use the streams of movement data and the Lambda engine that is currently in development at the UMR.
- Keywords: Lambda queries, lattice search
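- The following minimal Python sketch shows a greedy maximum-coverage baseline for the placement problem: given which agents each candidate station location would reach within the time budget, repeatedly pick the location that covers the most still-uncovered agents. The coverage sets, station names, and budget are made up; in the project they would be derived from the movement streams.
```python
# Greedy baseline for info-station placement (illustrative only).
def greedy_station_placement(coverage: dict, k: int) -> list:
    chosen, covered = [], set()
    for _ in range(k):
        best = max(coverage, key=lambda loc: len(coverage[loc] - covered), default=None)
        if best is None or not coverage[best] - covered:
            break  # no remaining location reaches any new agent
        chosen.append(best)
        covered |= coverage[best]
    return chosen

if __name__ == "__main__":
    coverage = {
        "station_A": {"agent1", "agent2", "agent3"},
        "station_B": {"agent3", "agent4"},
        "station_C": {"agent4"},
    }
    print(greedy_station_placement(coverage, k=2))  # ['station_A', 'station_B']
```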
- SemanticWindows: Slicing of Time Series into Variable-Length, Meaningful Subsequences
- In this project, we aim to slice time series into semantically meaningful subsequences. In contrast to traditional sliding or hopping windows, semantic windows should capture variable-length concepts, such as heartbeats in ECG data. These subsequences will then help anomaly detection and clustering algorithms produce better results.
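- The following minimal Python sketch only illustrates the intended interface, under the simplifying assumption that a new segment starts wherever the signal jumps by more than a threshold; real semantic windows (e.g., one window per heartbeat) would require a proper segmentation or change-point model. The threshold and example series are made up.
```python
# Variable-length windows via a naive jump-based split (illustrative only).
import numpy as np

def semantic_windows(series: np.ndarray, jump_threshold: float) -> list:
    """Split the series at positions where consecutive values differ strongly."""
    change_points = np.where(np.abs(np.diff(series)) > jump_threshold)[0] + 1
    return np.split(series, change_points)

if __name__ == "__main__":
    series = np.array([0.1, 0.2, 0.1, 2.0, 2.1, 2.2, 0.0, 0.1])
    for window in semantic_windows(series, jump_threshold=1.0):
        print(window)  # three variable-length windows instead of fixed-size ones
```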
- Anomaly Detection on Time Series Data Streams
- Discovering anomalies in streaming data is a challenging task; hence, we aim to translate batch anomaly detection algorithms into the streaming scenario.
- Our goal is to detect anomalies as quickly as possible while sacrificing as little precision as possible (a minimal streaming baseline follows below).
- Keywords: stream processing
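- The following minimal Python sketch is only a constant-memory streaming baseline, not the algorithm a thesis would develop: it maintains a running mean and variance with Welford's method and flags points that deviate strongly. The threshold and example stream are made up.
```python
# Online z-score anomaly detector for data streams (illustrative baseline).
import math

class StreamingZScoreDetector:
    def __init__(self, threshold: float = 3.0):
        self.threshold, self.n, self.mean, self.m2 = threshold, 0, 0.0, 0.0

    def update(self, value: float) -> bool:
        """Consume one point from the stream; return True if it looks anomalous."""
        is_anomaly = False
        if self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))
            is_anomaly = std > 0 and abs(value - self.mean) / std > self.threshold
        # Welford's online update of mean and variance.
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)
        return is_anomaly

if __name__ == "__main__":
    detector = StreamingZScoreDetector()
    stream = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.1, 9.0]
    print([detector.update(x) for x in stream])  # only the last point is flagged
```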
- Data Integration
- Second-Line Schema Matching
- First-line schema matching produces similarity matrices that indicate how likely it is that two attributes of different schemata represent the same semantic concept.
- Second-line schema matching consumes such similarity matrices and aims to produce improved ones.
- There are two main approaches to second-line matching: 1) similarity matrix boosting and 2) ensemble matching. While the former transforms a given similarity matrix into a more valuable one, the latter consumes multiple matrices and combines them into a single new similarity matrix (a minimal sketch of the ensemble variant follows below).
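- The following minimal Python sketch only illustrates the ensemble variant under a simple assumption: the matrices of several first-line matchers are combined by a weighted average. The matcher names, weights, and matrices are made up; real second-line matchers would use more sophisticated combination strategies.
```python
# Ensemble matching: weighted average of first-line similarity matrices.
import numpy as np

def ensemble_matrix(matrices, weights=None) -> np.ndarray:
    weights = weights if weights is not None else [1.0] * len(matrices)
    stacked = np.stack(matrices)                 # shape: (num_matchers, |A|, |B|)
    return np.average(stacked, axis=0, weights=weights)

if __name__ == "__main__":
    matcher_1 = np.array([[0.9, 0.1], [0.2, 0.7]])   # e.g., a name-based matcher
    matcher_2 = np.array([[0.8, 0.3], [0.1, 0.9]])   # e.g., an instance-based matcher
    combined = ensemble_matrix([matcher_1, matcher_2], weights=[0.5, 0.5])
    print(combined)   # the improved matrix handed to the final 1:1 matching step
```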
- HungarianMethod++: Matching Web-sized schemata
- We aim to improve the efficiency of the Hungarian Method in exchange for a bit of fuzziness/approximation (i.e., reduced correctness); a baseline using an exact solver is sketched below.
- Also interesting: Can we allow 1:n and n:m mappings (to some extent) in the attribute matching?
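- The following minimal Python sketch shows the exact 1:1 matching baseline using SciPy's assignment-problem solver on a similarity matrix; an approximate variant, as envisioned above, would trade some of this optimality for speed. The matrix values are made up.
```python
# Exact 1:1 attribute assignment on a similarity matrix (baseline, illustrative).
import numpy as np
from scipy.optimize import linear_sum_assignment

similarity = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.1, 0.3, 0.7],
])
rows, cols = linear_sum_assignment(similarity, maximize=True)  # optimal 1:1 assignment
for a, b in zip(rows, cols):
    print(f"attribute A{a} -> attribute B{b} (score {similarity[a, b]})")
```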
- OntologyShredding: Transforming ontology data to relational tables
- Knowledge bases are a valuable source of publicly available data and data integration scenarios. To make these scenarios usable for relational data integration systems as well, this project aims to develop a shredding algorithm that translates linked open data into meaningful relational tables for data integration purposes.
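- The following minimal Python sketch illustrates one naive shredding strategy, not the algorithm the thesis would develop: group triples by subject, use the subject's rdf:type as the table name, and pivot predicates into columns. The triples and predicate names are made up.
```python
# Naive shredding of RDF-style triples into relational rows (illustrative only).
from collections import defaultdict

TYPE_PREDICATE = "rdf:type"

def shred_triples(triples: list) -> dict:
    by_subject = defaultdict(dict)
    for subject, predicate, obj in triples:
        by_subject[subject][predicate] = obj

    tables = defaultdict(list)
    for subject, attributes in by_subject.items():
        table = attributes.pop(TYPE_PREDICATE, "Unknown")
        tables[table].append({"subject": subject, **attributes})
    return tables

if __name__ == "__main__":
    triples = [
        ("ex:Marburg", "rdf:type", "City"),
        ("ex:Marburg", "ex:population", "77000"),
        ("ex:Lahn", "rdf:type", "River"),
        ("ex:Lahn", "ex:length_km", "245"),
    ]
    for table, rows in shred_triples(triples).items():
        print(table, rows)   # one relational table per entity type
```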
- RelationDecomposer: Decomposing relational databases into schema matching scenarios
- Data integration test scenarios are very rare, especially scenarios that offer special properties, such as join- and unionable tables, unary and complex attribute matches, a broad selection of data types, schema-based and schema-less data, real-world data values, and many other properties. This project, therefore, aims to develop a relation decomposer that takes existing, integrated datasets as input and automatically generates different integration scenarios with specific properties from these seed datasets via relational decomposition.
- WDCIntegration - Fusing the Web Data Commons Data
- The Web Data Commons Crawl is a large corpus of relational tables that stem from crawled HTML Web tables. These tables often store data about the same or similar concepts but, due to the crawling process, are completely unconnected. Hence, we aim to integrate the WDC corpus as meaningfully and correctly as possible, which is both a technically and conceptually challenging task.
- LakeHouse: Virtual Integration on the Dynamic Data of Data Lakes
- Data in data lakes is subject to constant change. At the same time, data lakes lack most of the control mechanisms that traditional database systems use to, for example, standardise schemata, maintain indexes, or enforce constraints. In this project, we aim to develop a system named LakeHouse that dynamically integrates certain parts of a data lake to serve certain user-defined queries.
- IntegrationPrepper: Effective Preprocessing of Datasets for Data Integration
- In this project, we investigate the effect of different data pre-processing steps on the quality and effectiveness of traditional data integration approaches. Such pre-processing can include schema normalization, data correction/imputation, value standardization, hashing/encryption, sampling, etc.
- Data Integration in NFDI
- The National Research Data Infrastructure (NFDI) deploys a meta-search engine for research data. We aim to improve their results by integrating multi-modal and heterogeneous research data.
- Integration and Cleaning of COVID Data
- In this project, we aim to integrate questionnaire and patient data on (long) COVID patients; the data is also expected to contain many errors, so data cleaning will be necessary as well.
- The data was recorded by the UKGM, and the project will be conducted together with the "Institut für Künstliche Intelligenz in der Medizin".
- Data Cleaning
- Mimir++: An Effective Data Correction Ensemble
- Mimir is an error correction system based on data profiling, large language models, and value imputation. In this project, we aim to extend Mimir's corrector ensemble with novel (and existing) correction approaches (a minimal sketch of the ensemble idea follows below).
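- The following minimal Python sketch only illustrates how an ensemble of correctors could be combined: each corrector proposes a value for a suspicious cell and the ensemble picks the most frequent proposal. The corrector functions are trivial placeholders, not Mimir's actual components.
```python
# Majority-vote combination of cell correction proposals (illustrative only).
from collections import Counter

def ensemble_correct(cell_value: str, correctors: list) -> str:
    proposals = [corrector(cell_value) for corrector in correctors]
    proposals = [p for p in proposals if p is not None]
    return Counter(proposals).most_common(1)[0][0] if proposals else cell_value

if __name__ == "__main__":
    correctors = [
        lambda v: v.strip().title(),                            # stand-in for a rule-based cleaner
        lambda v: "Marburg" if "marb" in v.lower() else None,   # stand-in for a dictionary lookup
        lambda v: v.strip().title(),                            # stand-in for an imputation model
    ]
    print(ensemble_correct("  marburg ", correctors))  # 'Marburg'
```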
- ViroClean: Cleaning and Deduplicating Scientific Data
- In this project, we aim to clean scientific data from the Virologie of the University of Marburg; the dataset in question is expected to contain many inconsistencies, data errors, and duplicates that should be cleansed with a holistic data cleaning approach.
- The Virologie of the University of Marburg will be our partner for this project.
- Distributed Computing
- DistDBScan: Efficient Distributed DBScan Processing
- DBScan is a popular clustering algorithm. In this project, we aim to develop a novel version of the DBScan algorithm that uses Voronoi diagrams for data partitioning.
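- The following minimal, single-machine Python sketch only illustrates the partitioning idea: assigning every point to its nearest pivot yields exactly a Voronoi partitioning, and each cell can then be clustered independently with DBSCAN. Merging clusters across cell borders, the actually hard part of a distributed version, is deliberately omitted; the pivots and data are made up.
```python
# Voronoi partitioning followed by per-cell DBSCAN (illustrative only).
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import DBSCAN

def voronoi_partition_dbscan(points, pivots, eps=0.5, min_samples=3) -> dict:
    _, cell_of_point = cKDTree(pivots).query(points)   # nearest pivot index = Voronoi cell
    labels_per_cell = {}
    for cell in np.unique(cell_of_point):
        cell_points = points[cell_of_point == cell]
        labels_per_cell[int(cell)] = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(cell_points)
    return labels_per_cell

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(5, 0.2, (50, 2))])
    pivots = np.array([[0.0, 0.0], [5.0, 5.0]])
    for cell, labels in voronoi_partition_dbscan(points, pivots).items():
        print(cell, set(labels))   # each Voronoi cell is clustered locally
```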
- Machine Learning
- DataGossip++: A Data Exchange Extension for Distributed Machine Learning Algorithms
- The federated learning technique DataGossip proposes to exchange not only model weights but also some training data items to achieve better convergence on skewed data distributions; we aim to improve this technique with more intelligent training data selection strategies (a minimal selection sketch follows below).
- Keywords: federated learning, distributed computing
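- The following minimal Python sketch only illustrates one possible "more intelligent" selection strategy: instead of sampling the items to gossip at random, each worker shares the local examples with the highest loss. The loss values here are made up; in practice they would come from the local model's forward pass.
```python
# Loss-based selection of training items to gossip (illustrative only).
import numpy as np

def select_items_to_gossip(per_item_loss: np.ndarray, budget: int) -> np.ndarray:
    """Return the indices of the `budget` local training items with the highest loss."""
    return np.argsort(per_item_loss)[::-1][:budget]

if __name__ == "__main__":
    per_item_loss = np.array([0.05, 2.30, 0.10, 1.70, 0.02])
    print(select_items_to_gossip(per_item_loss, budget=2))  # [1 3]
```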
- Current:
- Image2Surface: Predicting Surface Properties of Workpieces from High-Resolution Images
- Large ML Models on Edge Devices
- Action: An Evaluation Platform for Distributed Systems
- TimeScale - A Time Series Engineering Library
- TheraBuddy
- Similarity Flooding
- CardiacPredict: An AI System for Mortality Prediction and Medical Decision Support
- Anomaly Detection on Time Series Data Streams
- Schema Matching in der Logistik
- On-the-fly Data Deduplication
- DataGossip
- Generated Dataset Detector
- Many++ for WDC Integration
- Completed:
- Entwicklung eines Statistischen Dashboards für die BYTE Challenge - Frontend (2024)
- Entwicklung eines Statistischen Dashboards für die BYTE Challenge - Backend (2024)
- Large Language Modelle für Data Engineering (2024)
- Order Dependency Discovery on Web Tables (2024)
- Efficient Partial Inclusion Dependency Discovery (2024)
- DBSENCForest: Enhancing Novel Class Detection in Machine Learning (2024)
- Data Augmentation for Steel Classification (2024)
- Efficient Data Discovery in Data Lakes (2024)
- Erkennung anomaler medizinischer Muster – Analyse nicht invasiver medizinischer Daten mittels maschinellen Lernens (2024)
- Data Generation and Machine Learning in the Context of Optimizing a Twin Wire Arc Spray Process (2023)
- A Clustering Approach to Column Type Annotation: Effects of Pre-Clustering (2023)
- Holistische Integration von WebDaten (2023)
- User-Centric Explainable Deep Reinforcement Learning for Decision Support Systems (2023)
- Combining Time Series Anomaly Detection Algorithms (2023)
- DPQLEngine: Processing the Data Profiling Query Language (2023)
- Aggregating Machine Learning Models for the Energy Consumption Forecast of Heat Generators (2023)
- Correlation Anomaly Detection in High-Dimensional Time Series (2023)
- HYPAAD: Hyper Parameter Optimization in Anomaly Detection (2022)
- Time Series Anomaly Detection: An Aircraft Turbine Case Study (2022)
- Distributed Duplicate Detection on Streaming Data (2021)
- UltraMine - Scalable Analytics on Time Series Data (2021)
- Distributed Graph Based Approximate Nearest Neighbor Search (2020)
- A2DB: A Reactive Database for Theta-Joins (2020)
- Distributed Detection of Sequential Anomalies in Time Related Sequences (2020)
- Efficient Distributed Discovery of Bidirectional Order Dependencies (2020)
- Distributed Unique Column Combination Discovery (2019)
- Reactive Inclusion Dependency Discovery (2019)
- Inclusion Dependency Discovery on Streaming Data (2019)
- Generating Data for Functional Dependency Profiling (2018)
- Efficient Detection of Genuine Approximate Functional Dependencies (2018)
- Efficient Discovery of Matching Dependencies (2017)
- Discovering Interesting Conditional Functional Dependencies (2017)
- Multivalued Dependency Detection (2016)
- DataRefinery - Scalable Offer Processing with Apache Spark (2016)
- Spinning a Web of Tables through Inclusion Dependencies (2014)
- Discovery of Conditional Unique Column Combination (2014)
- Discovering Matching Dependencies (2013)