Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets

M. Ross Kunz, John Merickel, Keith Wilson | May 28, 2026 | arXiv

Key Takeaway

You can now retrieve similar numeric datasets and see which variables correspond across them without needing shared column names, which is useful for finding relevant training data or initializing models for new datasets.

Summary

This paper presents a method to compare and retrieve numeric datasets by converting their statistical properties into embeddings. Instead of requiring datasets to share the same variables, the approach describes each dataset through statistical summaries (such as means and distributional properties), embeds those summaries with sentence transformers, and then applies Canonical Correlation Analysis (CCA) to find which variables align across datasets.
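The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several pieces are assumptions made for illustration: the choice of summary statistics, the mean pooling used for dataset-level similarity, and the decision to run CCA on transposed column-embedding matrices (embedding dimensions as paired samples, columns as variables). To keep the example self-contained, the hypothetical `toy_embed` function stands in for a sentence transformer by hashing tokens; a real model would replace it.

```python
import hashlib
import numpy as np


def summarize_column(values):
    """Describe one numeric column as a short text summary of its statistics."""
    v = np.asarray(values, dtype=float)
    q25, q50, q75 = np.percentile(v, [25, 50, 75])
    return (f"mean {v.mean():.3g} std {v.std():.3g} min {v.min():.3g} "
            f"q25 {q25:.3g} median {q50:.3g} q75 {q75:.3g} max {v.max():.3g}")


def toy_embed(text, dim=32):
    """Hypothetical stand-in for a sentence transformer: hashed token counts.

    A real pipeline would call a sentence-transformer model here instead.
    """
    vec = np.zeros(dim)
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    return vec / np.linalg.norm(vec)


def embed_dataset(columns, embed=toy_embed):
    """Return a (n_columns, dim) matrix with one embedding per column."""
    return np.vstack([embed(summarize_column(c)) for c in columns])


def dataset_similarity(emb_a, emb_b):
    """Cosine similarity of mean-pooled column embeddings (pooling is assumed)."""
    a, b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def cca(x, y):
    """Classical CCA via QR and SVD; rows are paired samples, columns variables.

    Returns the canonical correlations and the weight matrices for x and y.
    The weights are meaningful when the centered inputs have full column rank.
    """
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    qx, rx = np.linalg.qr(xc)
    qy, ry = np.linalg.qr(yc)
    u, s, vt = np.linalg.svd(qx.T @ qy, full_matrices=False)
    wx = np.linalg.pinv(rx) @ u
    wy = np.linalg.pinv(ry) @ vt.T
    return np.clip(s, 0.0, 1.0), wx, wy


# Two toy datasets with different numbers of columns and no shared names.
rng = np.random.default_rng(0)
data_a = [rng.normal(0, 1, 200), rng.normal(50, 10, 200), rng.uniform(0, 1000, 200)]
data_b = [rng.uniform(0, 1000, 200), rng.normal(0, 1, 200)]

emb_a, emb_b = embed_dataset(data_a), embed_dataset(data_b)
print("dataset similarity:", dataset_similarity(emb_a, emb_b))

# Transpose so that embedding dimensions act as the paired samples for CCA.
corrs, w_a, w_b = cca(emb_a.T, emb_b.T)
print("canonical correlations:", corrs)
print("weights of A's variables on the first component:", w_a[:, 0])
```

In this sketch the rows of `w_a` and `w_b` indicate how strongly each variable contributes to each canonical component, which is one way to read off candidate correspondences. With the hashing stand-in the numbers carry no semantic meaning; they only show the data flow.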

data evaluation

Key Terms

canonical-correlation-analysis sentence-transformers differential-privacy tabular-foundation-models sparse-vector-representation