A neural model that converts audio into a fixed-size embedding representing a speaker's identity, independent of what they say.