Authors
Sintija Stevanoska, Jurica Levatić, Sašo Džeroski
Publication
Machine Learning, 2025
Abstract

Labeled data scarcity remains a significant challenge in machine learning. Semi-supervised learning (SSL) offers a promising solution to this problem by leveraging both labeled and unlabeled examples during training. While SSL with neural networks has been successful on image classification tasks, its application to tabular data remains limited. In this work, we propose SSLAE, a lightweight yet effective autoencoder-based SSL architecture that integrates reconstruction and classification losses into a single composite objective. We conduct an extensive evaluation of the proposed approach across 90 tabular benchmark datasets, comparing SSLAE's performance to its supervised baseline and several other neural approaches for both supervised and semi-supervised learning, under varying amounts of labeled data. Our results show that SSLAE consistently outperforms its competitors, particularly in low-label regimes. To better understand when unlabeled data can improve performance, we perform a meta-analysis linking dataset characteristics to SSLAE's relative gains over its supervised baseline. This analysis reveals key properties, such as class imbalance, feature variability, and alignment between features and labels, that influence the success of SSL, contributing to a deeper understanding of when the inclusion of unlabeled data is beneficial in neural tabular learning.
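To make the core idea concrete, the sketch below shows one plausible way to combine a reconstruction loss (computed on all examples, labeled and unlabeled) with a classification loss (computed on labeled examples only) into a single composite objective. This is a minimal illustration in PyTorch, not the paper's SSLAE implementation: the layer sizes, the MSE reconstruction loss, the weighting term `lam`, and the names `SSLAutoencoder` and `composite_loss` are all assumptions made for the example.

```python
# Minimal sketch of an autoencoder-based composite SSL objective for tabular
# data, in the spirit of the abstract. Architectural details are illustrative
# assumptions, not the paper's exact SSLAE configuration.
import torch
import torch.nn as nn


class SSLAutoencoder(nn.Module):
    def __init__(self, n_features: int, n_classes: int, latent_dim: int = 32):
        super().__init__()
        # Shared encoder: maps tabular features to a latent representation.
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, latent_dim), nn.ReLU(),
        )
        # Decoder reconstructs the input from the latent representation.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, n_features),
        )
        # Classification head operates on the same latent representation.
        self.classifier = nn.Linear(latent_dim, n_classes)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)


def composite_loss(model, x_labeled, y_labeled, x_unlabeled, lam: float = 1.0):
    """Reconstruction loss on all examples plus a weighted classification
    loss on the labeled subset; `lam` balances the two terms (assumed)."""
    # Reconstruction term uses labeled and unlabeled examples alike.
    x_all = torch.cat([x_labeled, x_unlabeled], dim=0)
    x_hat, _ = model(x_all)
    recon = nn.functional.mse_loss(x_hat, x_all)
    # Classification term uses only the labeled examples.
    _, logits = model(x_labeled)
    cls = nn.functional.cross_entropy(logits, y_labeled)
    return recon + lam * cls
```

Under this formulation, the unlabeled examples shape the latent representation through the reconstruction term, which is one common way an autoencoder-based SSL objective can help in the low-label regimes the abstract highlights.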