‘Semantics-Aware Dataset Discovery From Data Lakes With Contextualized Column-Based Representation Learning’

“Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy.”

Find the paper and the full list of authors in the Proceedings of the VLDB Endowment.

View on Site: ‘Semantics-Aware Dataset Discovery From Data Lakes With Contextualized Column-Based Representation Learning’