SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková, Viktória Ondrejová | June 11, 2026 | arXiv

Key Takeaway

For low-resource languages, adapting existing multilingual embedding models through vocabulary trimming and task-specific fine-tuning can produce efficient, locally deployable alternatives to large proprietary models without sacrificing performance.
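
To make the idea of vocabulary trimming concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `trim_vocabulary`, the toy vocabulary, and the two-dimensional embedding rows are all hypothetical, chosen only to illustrate the general technique of keeping the tokens that occur in a target-language corpus (plus special tokens) and dropping the remaining embedding rows.

```python
from typing import Dict, Iterable, List, Tuple


def trim_vocabulary(
    vocab: Dict[str, int],
    embeddings: List[List[float]],
    corpus_tokens: Iterable[Iterable[str]],
    special_tokens: Iterable[str],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Keep only tokens seen in the corpus (plus special tokens).

    Returns a remapped vocabulary and the matching subset of embedding rows.
    """
    # Collect every token that must survive trimming.
    used = set(special_tokens)
    for sentence in corpus_tokens:
        used.update(sentence)

    # Preserve the original id order so the remapping is deterministic.
    kept = sorted((tok for tok in vocab if tok in used), key=vocab.__getitem__)

    # Assign new contiguous ids and copy over the corresponding rows.
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    new_embeddings = [embeddings[vocab[tok]] for tok in kept]
    return new_vocab, new_embeddings


# Toy data (hypothetical): a multilingual vocabulary with two Slovak tokens.
vocab = {"[PAD]": 0, "[UNK]": 1, "dobrý": 2, "hello": 3, "deň": 4, "world": 5}
embeddings = [[float(i), float(i) + 0.5] for i in range(len(vocab))]
corpus = [["dobrý", "deň"]]

new_vocab, new_embeddings = trim_vocabulary(
    vocab, embeddings, corpus, special_tokens=["[PAD]", "[UNK]"]
)
print(len(vocab), "->", len(new_vocab))
```

In this toy case the vocabulary shrinks from six entries to four, and the embedding table shrinks with it. In a real model the embedding matrix typically accounts for a large share of the parameters, which is why trimming can reduce model size while leaving the transformer layers untouched.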

Summary

This paper introduces SkMTEB, the first comprehensive benchmark for evaluating text embedding models on Slovak, a low-resource language.

Key Terms

text-embeddings, vocabulary-trimming, retrieval-augmented-generation, multilingual-model, instruction-tuned