Authors
Boshko Koloski, Senja Pollak, Roberto Navigli, Blaž Škrlj
Publication
Machine Learning, 2026
Abstract

Building on the success of large language models (LLMs), LLM-based representations have come to dominate the document representation landscape, achieving strong performance on document embedding benchmarks. However, high-dimensional, computationally expensive LLM embeddings can be too generic or inefficient for domain-specific and resource-scarce applications. To address these limitations, we introduce FuDoBa, a Bayesian optimisation-based representation learning method that integrates LLM embeddings with domain-specific structured knowledge, sourced both locally and from external repositories such as WikiData. This fusion produces low-dimensional, task-relevant representations while reducing training complexity and yielding interpretable early-fusion weights that improve classification performance. We demonstrate the effectiveness of our approach on six datasets across two domains, showing that, when paired with robust AutoML-based classifiers, our method matches or surpasses proprietary LLM-only embedding baselines while offering modality-wise interpretability and a smaller dimensional footprint.
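
To make the core idea concrete, below is a minimal sketch of weighted early fusion with Bayesian-optimised modality weights. It is not the authors' FuDoBa implementation: the random data, the scikit-optimize `gp_minimize` optimiser, the logistic-regression objective, and the `TruncatedSVD` projection standing in for the low-dimensional output are all illustrative assumptions. The recoverable weights `w_llm` and `w_kg` are what would provide the modality-wise interpretability the abstract mentions.

```python
# Hedged sketch: early fusion of LLM and knowledge-graph embeddings with
# modality weights tuned by Bayesian optimisation. All data, model choices,
# and dimensions are hypothetical stand-ins, not the FuDoBa method itself.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize
from skopt.space import Real

rng = np.random.default_rng(0)
n = 500
X_llm = rng.normal(size=(n, 768))  # stand-in for high-dimensional LLM embeddings
X_kg = rng.normal(size=(n, 100))   # stand-in for knowledge-graph embeddings
y = rng.integers(0, 2, size=n)     # hypothetical binary document labels

def fuse(w_llm, w_kg, dim=32):
    # Early fusion: scale each modality by its weight, concatenate,
    # then project the joint space to a low-dimensional representation.
    X = np.hstack([w_llm * X_llm, w_kg * X_kg])
    return TruncatedSVD(n_components=dim, random_state=0).fit_transform(X)

def objective(params):
    # Score a candidate weight pair by cross-validated classifier accuracy;
    # gp_minimize minimises, so return the negated score.
    w_llm, w_kg = params
    X = fuse(w_llm, w_kg)
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean()
    return -acc

space = [Real(0.0, 1.0, name="w_llm"), Real(0.0, 1.0, name="w_kg")]
result = gp_minimize(objective, space, n_calls=20, random_state=0)
print("fusion weights (LLM, KG):", result.x, "cv accuracy:", -result.fun)
```

Because the weights are plain scalars on each modality, inspecting them after optimisation indicates how much the task relies on the LLM signal versus the structured knowledge, which is the interpretability benefit the abstract describes.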