Effective Biological Representation Learning by Masking Gene Expression

Kian Kenyon-Dean, Alina Selega, Ihab Bendidi, Jordan M. Sorokin, Luca Bertinetto et al. | May 29, 2026 | arXiv

Key Takeaway

For gene expression analysis, a model trained on a smaller, carefully curated dataset with the right architecture can beat models trained on massive datasets, showing that thoughtful data selection and design choices, not just scale, are critical for biological AI.

Summary

This paper develops TxFM, a self-supervised learning model for analyzing gene expression data from RNA sequencing. Using masked autoencoding (hiding parts of the input and training the model to reconstruct them), the model learns useful representations of genes that outperform those from larger foundation models.
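
To make the masking idea concrete, here is a minimal sketch of the data side of masked autoencoding on a toy expression matrix. It is not TxFM's implementation: the function names `mask_expression` and `masked_mse`, the 30% mask ratio, the zero fill value for hidden entries, and the Poisson toy data are all assumptions chosen for illustration.

```python
import numpy as np

def mask_expression(x, mask_ratio=0.3, rng=None):
    """Hide a random subset of genes in each sample.

    x: (n_samples, n_genes) expression matrix.
    Returns the masked copy and a boolean mask (True = hidden).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_samples, n_genes = x.shape
    n_masked = int(round(mask_ratio * n_genes))
    # Rank random noise within each row; the n_masked lowest ranks are hidden,
    # so every sample hides exactly the same number of genes.
    ranks = rng.random((n_samples, n_genes)).argsort(axis=1).argsort(axis=1)
    mask = ranks < n_masked
    # Assumption: hidden entries are replaced by 0; a real model might use
    # a learned mask token or drop the hidden genes from the input instead.
    x_masked = np.where(mask, 0.0, x)
    return x_masked, mask

def masked_mse(x_true, x_pred, mask):
    """Mean squared error computed only over the hidden entries."""
    return float(((x_true - x_pred) ** 2)[mask].mean())

rng = np.random.default_rng(0)
x = rng.poisson(5.0, size=(4, 10)).astype(float)  # toy counts, not real RNA-seq
x_masked, mask = mask_expression(x, mask_ratio=0.3, rng=rng)

# A trivial "model" that predicts each gene's mean over the visible entries;
# a real masked autoencoder would replace this with a trained network.
visible_counts = np.maximum((~mask).sum(axis=0), 1)
baseline_pred = np.broadcast_to(x_masked.sum(axis=0) / visible_counts, x.shape)
loss = masked_mse(x, baseline_pred, mask)
```

Scoring the reconstruction only on the hidden entries is what forces a model to infer masked genes from the visible ones rather than copy its input.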

Key Terms

masked-language-modeling self-supervised-learning transfer-learning batch-effects foundation-models