Authors
Sintija Stevanoska, Katharina Dost, Christian L. Camacho Villalón, Sašo Džeroski
Publication
International Conference on Discovery Science, 2025
Abstract

Self-supervised learning (SSL) offers a promising solution to the problem of label scarcity by leveraging large amounts of unlabeled data to learn transferable representations. However, using extensive unlabeled datasets can introduce substantial computational costs during pretext training and may include noisy or unrepresentative samples that degrade learning. In this work, we explore whether selecting a subset of the unlabeled examples available for the pretext task can reduce computation time while maintaining satisfactory performance of SSL methods for tabular data. In particular, we investigate whether uncertainty estimates for predictions on the unlabeled data (obtained from models trained on the available labeled data) can inform sampling and lead to superior embeddings. To answer these questions, we carry out large-scale experiments across 28 tabular benchmark datasets with TabNet and SCARF, with varying amounts of labeled data and several strategies for sampling unlabeled data. Our results show that reducing the pool of unlabeled data yields significant computational gains with only marginal reductions in performance.
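
The abstract does not specify the sampling procedure, so the following is only a minimal sketch of one plausible variant: score each unlabeled row by the predictive uncertainty of a model fit on the labeled data, then keep the most uncertain fraction as the pool for the SSL pretext task. The scoring model (a random forest), the entropy criterion, and the keep_fraction parameter are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of uncertainty-informed subsampling of unlabeled data
# before SSL pretext training. The scorer, entropy criterion, and
# keep_fraction are assumptions for illustration only.
import numpy as np
from scipy.stats import entropy
from sklearn.ensemble import RandomForestClassifier


def select_unlabeled_subset(X_labeled, y_labeled, X_unlabeled, keep_fraction=0.25):
    """Keep the most uncertain fraction of unlabeled rows, where uncertainty
    is the predictive entropy of a model trained on the labeled data."""
    scorer = RandomForestClassifier(n_estimators=100, random_state=0)
    scorer.fit(X_labeled, y_labeled)
    proba = scorer.predict_proba(X_unlabeled)      # shape: (n_unlabeled, n_classes)
    uncertainty = entropy(proba, axis=1)           # per-row predictive entropy
    n_keep = int(keep_fraction * len(X_unlabeled))
    keep_idx = np.argsort(uncertainty)[-n_keep:]   # indices of most uncertain rows
    return X_unlabeled[keep_idx]


# Toy usage with synthetic tabular data.
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(200, 10))
y_lab = (X_lab[:, 0] > 0).astype(int)
X_unlab = rng.normal(size=(10_000, 10))
X_pretext = select_unlabeled_subset(X_lab, y_lab, X_unlab)
print(X_pretext.shape)  # (2500, 10): reduced pool for the SSL pretext task
```

The reduced pool X_pretext would then be fed to an SSL method such as SCARF or TabNet's pretraining stage in place of the full unlabeled set; other selection rules (e.g., keeping the least uncertain rows, or random sampling as a baseline) fit the same interface.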