Authors
Sintija Stevanoska, Jurica Levatić, Sašo Džeroski
Publication
Machine Learning, 2025
Abstract

Labeled data scarcity remains a significant challenge in machine learning. Semi-supervised learning (SSL) offers a promising solution to this problem by leveraging both labeled and unlabeled examples during training. While SSL with neural networks has been successful on image classification tasks, its application to tabular data remains limited. In this work, we propose SSLAE, a lightweight yet effective autoencoder-based SSL architecture that integrates reconstruction and classification losses into a single composite objective. We conduct an extensive evaluation of the proposed approach across 90 tabular benchmark datasets, comparing SSLAE's performance to its supervised baseline and several other neural approaches for both supervised and semi-supervised learning, under varying amounts of labeled data. Our results show that SSLAE consistently outperforms its competitors, particularly in low-label regimes. To better understand when unlabeled data can improve performance, we perform a meta-analysis linking dataset characteristics to SSLAE's relative gains over its supervised baseline. This analysis reveals key properties, such as class imbalance, feature variability, and alignment between features and labels, that influence the success of SSL, contributing to a deeper understanding of when the inclusion of unlabeled data is beneficial in neural tabular learning.
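To make the core idea concrete, the sketch below shows one plausible way to combine a reconstruction loss (computed on all examples, labeled and unlabeled) with a classification loss (computed on labeled examples only) into a single composite objective. This is a minimal illustration in PyTorch, not the paper's SSLAE implementation: the layer sizes, the MSE reconstruction loss, the weighting term `lam`, and the names `SSLAutoencoder` and `composite_loss` are all assumptions made for the example.

```python
# Minimal sketch of an autoencoder-based composite SSL objective for tabular
# data, in the spirit of the abstract. Architectural details are illustrative
# assumptions, not the paper's exact SSLAE configuration.
import torch
import torch.nn as nn


class SSLAutoencoder(nn.Module):
    def __init__(self, n_features: int, n_classes: int, latent_dim: int = 32):
        super().__init__()
        # Shared encoder: maps tabular features to a latent representation.
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, latent_dim), nn.ReLU(),
        )
        # Decoder reconstructs the input from the latent representation.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, n_features),
        )
        # Classification head operates on the same latent representation.
        self.classifier = nn.Linear(latent_dim, n_classes)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), self.classifier(z)


def composite_loss(model, x_labeled, y_labeled, x_unlabeled, lam: float = 1.0):
    """Reconstruction loss on all examples plus a weighted classification
    loss on the labeled subset; `lam` balances the two terms (assumed)."""
    # Reconstruction term uses labeled and unlabeled examples alike.
    x_all = torch.cat([x_labeled, x_unlabeled], dim=0)
    x_hat, _ = model(x_all)
    recon = nn.functional.mse_loss(x_hat, x_all)
    # Classification term uses only the labeled examples.
    _, logits = model(x_labeled)
    cls = nn.functional.cross_entropy(logits, y_labeled)
    return recon + lam * cls
```

Under this formulation, the unlabeled examples shape the latent representation through the reconstruction term, which is one common way an autoencoder-based SSL objective can help in the low-label regimes the abstract highlights.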