You can now retrieve similar datasets and identify which variables correspond across them, even when they share no column names. This is useful for finding relevant training data or initializing models for new datasets.
This paper presents a method to compare and retrieve numeric datasets by converting their statistical properties into embeddings. Instead of requiring datasets to share the same variables, the approach computes statistical summaries (such as means and distributions), embeds them with sentence transformers, and then applies Canonical Correlation Analysis (CCA) to find which variables align across datasets.
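To make the pipeline concrete, here is a minimal sketch of one way it could be put together. It is not the paper's implementation, and several pieces are assumptions: `describe_column`, `toy_embed`, `embed_dataset`, `canonical_correlations`, and `align_variables` are hypothetical helper names; `toy_embed` is a hashed character-trigram stand-in used only so the example runs without downloading a sentence-transformer model; the choice of summary statistics is illustrative; and treating embedding dimensions as the "samples" for CCA is one possible reading of the method. The per-variable matching at the end uses plain cosine similarity as a simple proxy, not CCA.

```python
import hashlib
import numpy as np

def describe_column(values):
    """Render a numeric column's summary statistics as a short sentence."""
    v = np.asarray(values, dtype=float)
    return (f"mean {v.mean():.2f} std {v.std():.2f} min {v.min():.2f} "
            f"median {np.median(v):.2f} max {v.max():.2f}")

def toy_embed(text, dim=64):
    """ASSUMPTION: stand-in for a sentence-transformer embedding.
    Counts hashed character trigrams and L2-normalizes the result."""
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-12)

def embed_dataset(columns, dim=64):
    """Return variable names and a (dim, n_vars) matrix, one column per variable."""
    names = list(columns)
    E = np.column_stack([toy_embed(describe_column(columns[n]), dim) for n in names])
    return names, E

def canonical_correlations(X, Y):
    """Canonical correlations between the column spaces of X and Y
    (rows are treated as paired samples), computed via QR followed by SVD."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def align_variables(names_a, Ea, names_b, Eb):
    """Simple proxy for alignment: match each variable in A to its
    nearest variable in B by cosine similarity of unit-norm embeddings."""
    sims = Ea.T @ Eb
    return {names_a[i]: names_b[int(j)] for i, j in enumerate(sims.argmax(axis=1))}

# Two toy datasets holding the same measurements under different
# column names and in a different column order.
rng = np.random.default_rng(0)
age = rng.normal(40, 10, 500)
income = rng.normal(50_000, 8_000, 500)
height = rng.normal(170, 7, 500)

dataset_a = {"age": age, "income": income, "height": height}
dataset_b = {"h": height, "a": age, "salary": income}

names_a, Ea = embed_dataset(dataset_a)
names_b, Eb = embed_dataset(dataset_b)

corrs = canonical_correlations(Ea, Eb)   # dataset-level similarity
matches = align_variables(names_a, Ea, names_b, Eb)
print("canonical correlations:", corrs)
print("variable matches:", matches)
```

With a real sentence-transformer in place of `toy_embed`, the same structure would apply: the canonical correlations give a dataset-level similarity score for retrieval, while the per-variable matching suggests which columns correspond.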