Paper accepted to EDBT 2024

Pythagoras: Semantic Type Detection of Numerical Data in Enterprise Data Lakes

Authors: Sven Langenecker, Christoph Sturm, Christian Schalles, Carsten Binnig

Detecting semantic types of table columns is a crucial task to enable dataset discovery in data lakes. However, prior semantic type detection approaches have primarily focused on non-numerical data despite the fact that numerical data play an essential role in many real-world enterprise data lakes. Therefore, existing models are typically rather inadequate when applied to data lakes that contain a high proportion of numerical data. In this paper, we introduce Pythagoras, our new learned semantic type detection approach specially designed to support numerical along with non-numerical data. Pythagoras uses a GNN in combination with a novel graph representation of tables to predict the semantic types for numerical data with high accuracy. In our experiments, we compare Pythagoras against five state-of-the-art approaches using two different datasets and show that our model significantly outperforms these baselines on numerical data. In comparison to the best existing approach, we achieve F1-Score increases of around +22%, which sets new benchmarks.

https://dx.doi.org/10.48786/edbt.2024.62

 

Paper accepted to LWDA 2023

Pythagoras: Semantic Type Detection of Numerical Data Using Graph Neural Networks

Authors: Sven Langenecker, Christoph Sturm, Christian Schalles, Carsten Binnig

Detecting semantic types of table columns is a crucial task to enable dataset discovery in data lakes. However, prior semantic type detection approaches have primarily focused on non-numeric data despite the fact that numeric data play an essential role in many enterprise data lakes. Therefore, existing models are typically rather inadequate when applied to data lakes that contain a high proportion of numerical data. In this paper, we introduce Pythagoras, our new learned semantic type detection approach specially designed to support numerical data along with non-numerical data. Pythagoras uses a graph neural network based on a new graph representation of tables to predict the semantic types for numerical data with high accuracy. In our initial experiments, we achieve F1-Scores of 0.829 (support-weighted) and 0.790 (macro), exceeding the state-of-the-art performance significantly.

https://ceur-ws.org/Vol-3630/LWDA2023-paper13.pdf

 

Article accepted to Datenbank-Spektrum Vol. 23 (2023)

SportsTables: A New Corpus for Semantic Type Detection (Extended Version)

Authors: Sven Langenecker, Christoph Sturm, Christian Schalles, Carsten Binnig

Table corpora such as VizNet or TURL which contain annotated semantic types per column are important to build machine learning models for the task of automatic semantic type detection. However, there is a huge discrepancy between these corpora and real-world data lakes, since the latter contain a huge fraction of numerical data which are not present in existing corpora. Hence, in this paper, we introduce a new corpus that contains a much higher proportion of numerical columns than existing corpora. To reflect the distribution in real-world data lakes, our corpus SportsTables has on average approx. 86% numerical columns, posing new challenges to existing semantic type detection models which have mainly targeted non-numerical columns so far. To demonstrate this effect, we show in this extended version of [18] the results of an extensive study using four different state-of-the-art approaches for semantic type detection on our new corpus. Overall, the results demonstrate significant performance differences in predicting semantic types for textual and numerical data.

https://doi.org/10.1007/s13222-023-00457-y

 

Paper accepted to SIGMOD 2023

Steered Training Data Generation for Learned Semantic Type Detection

Authors: Sven Langenecker, Christoph Sturm, Christian Schalles, Carsten Binnig

In this paper, we introduce STEER to adapt learned semantic type extraction approaches to a new, unseen data lake. STEER provides a data programming framework for semantic labeling which is used to generate new labeled training data with minimal overhead. At its core, STEER comes with a novel training data generation procedure called Steered-Labeling that can generate high-quality training data not only for non-numeric but also for numerical columns. With this generated training data, STEER is able to fine-tune existing learned semantic type extraction models. We evaluate our approach on four different data lakes and show that we can significantly improve the performance of two different types of learned models across all data lakes.

https://doi.org/10.1145/3589786

 

Poster accepted to DHBW AI Transfer Congress 2023

Steered Training Data Generation for Learned Semantic Type Detection

Authors: Sven Langenecker, Christoph Sturm, Christian Schalles, Carsten Binnig

The poster introduces STEER to adapt learned semantic type extraction approaches to a new, unseen data lake. STEER provides a data programming framework for semantic labeling which is used to generate new labeled training data with minimal overhead. At its core, STEER comes with a novel training data generation procedure called Steered-Labeling that can generate high-quality training data not only for non-numeric but also for numerical columns. With this generated training data, STEER is able to fine-tune existing learned semantic type extraction models. We evaluate our approach on four different data lakes and show that we can significantly improve the performance of two different types of learned models across all data lakes.

https://www.dhbw.de/fileadmin/user_upload/Dokumente/Forschung/AI_Transfer_Congress/Proceedings_DHBW_AITC_2023.pdf

 

Paper accepted to BTW 2023

SportsTables: A new Corpus for Semantic Type Detection

Authors: Sven Langenecker, Christoph Sturm, Christian Schalles, Carsten Binnig

Table corpora such as VizNet or TURL which contain annotated semantic types per column are important to build machine learning models for the task of automatic semantic type detection. However, there is a huge discrepancy between corpora that are used for training and testing since real-world data lakes contain a huge fraction of numerical data which are not present in existing corpora. Hence, in this paper, we introduce a new corpus that contains a much higher proportion of numerical columns than existing corpora. To reflect the distribution in real-world data lakes, our corpus SportsTables has on average approx. 86% numerical columns, posing new challenges to existing semantic type detection models which have mainly targeted non-numerical columns so far. To demonstrate this effect, we show the results of a first study using a state-of-the-art approach for semantic type detection on our new corpus and demonstrate significant performance differences in predicting semantic types for textual and numerical data.

https://doi.org/10.18420/BTW2023-68