The increasing volume of remotely sensed imagery (RSI) calls for efficient processing and extraction of meaningful information. Modern deep learning architectures excel at a wide range of tasks but typically require large labeled datasets, which are often scarce in RSI because labeling complex, heterogeneous landscapes containing multiple semantic categories is tedious. This scarcity can limit the potential of supervised deep learning methods. To address it, we propose SSL-MAE, a novel semi-supervised learning method based on a masked autoencoder. Our approach unifies self-supervision and discriminative learning within a single, end-to-end framework, leveraging both abundant unlabeled data and limited labeled data. Additionally, we introduce an adaptive mechanism that controls the level of supervision during learning, which is crucial for balancing prediction quality with effective use of unlabeled data. To validate the effectiveness of SSL-MAE, we conducted comprehensive experiments on 10 publicly available RSI datasets (5 multi-class and 5 multi-label classification tasks). The results show that our method outperforms state-of-the-art semi-supervised methods and performs favorably against self-supervised methods when labeled data are scarce. Finally, our results suggest that the proposed adaptive joint learning strategy is not tied to a single design choice; when integrated with different state-of-the-art self-supervised approaches, it significantly enhances their predictive performance.