Authors
Sintija Stevanoska, Katharina Dost, Christian L. Camacho Villalón, Sašo Džeroski
Publication
International Conference on Discovery Science, 2025
Abstract

Self-supervised learning (SSL) offers a promising solution to the problem of label scarcity by leveraging large amounts of unlabeled data to learn transferable representations. However, using extensive unlabeled datasets can introduce substantial computational costs during pretext training and may include noisy or unrepresentative samples that degrade learning. In this work, we explore whether selecting a subset of the unlabeled examples available for the pretext task can reduce computation time while maintaining satisfactory performance of SSL methods for tabular data. In particular, we investigate whether uncertainty estimates for predictions on the unlabeled data (obtained from models trained on the available labeled data) can inform sampling and lead to superior embeddings. To answer these questions, we carry out large-scale experiments across 28 tabular benchmark datasets with TabNet and SCARF, with varying amounts of labeled data and several strategies for sampling unlabeled data. Our results show that reducing the pool of unlabeled data yields significant computational gains with only marginal reductions in performance.
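
The abstract does not specify the sampling procedure, so the following is only a minimal sketch of one plausible variant: score each unlabeled row by the predictive uncertainty of a model fit on the labeled data, then keep the most uncertain fraction as the pool for the SSL pretext task. The scoring model (a random forest), the entropy criterion, and the keep_fraction parameter are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of uncertainty-informed subsampling of unlabeled data
# before SSL pretext training. The scorer, entropy criterion, and
# keep_fraction are assumptions for illustration only.
import numpy as np
from scipy.stats import entropy
from sklearn.ensemble import RandomForestClassifier


def select_unlabeled_subset(X_labeled, y_labeled, X_unlabeled, keep_fraction=0.25):
    """Keep the most uncertain fraction of unlabeled rows, where uncertainty
    is the predictive entropy of a model trained on the labeled data."""
    scorer = RandomForestClassifier(n_estimators=100, random_state=0)
    scorer.fit(X_labeled, y_labeled)
    proba = scorer.predict_proba(X_unlabeled)      # shape: (n_unlabeled, n_classes)
    uncertainty = entropy(proba, axis=1)           # per-row predictive entropy
    n_keep = int(keep_fraction * len(X_unlabeled))
    keep_idx = np.argsort(uncertainty)[-n_keep:]   # indices of most uncertain rows
    return X_unlabeled[keep_idx]


# Toy usage with synthetic tabular data.
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(200, 10))
y_lab = (X_lab[:, 0] > 0).astype(int)
X_unlab = rng.normal(size=(10_000, 10))
X_pretext = select_unlabeled_subset(X_lab, y_lab, X_unlab)
print(X_pretext.shape)  # (2500, 10): reduced pool for the SSL pretext task
```

The reduced pool X_pretext would then be fed to an SSL method such as SCARF or TabNet's pretraining stage in place of the full unlabeled set; other selection rules (e.g., keeping the least uncertain rows, or random sampling as a baseline) fit the same interface.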