Specialized components that convert different input types (text, audio, video, motion) into a common token format.
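
As a rough illustration of what a "common token format" could mean, the sketch below assumes a single shared vocabulary in which each modality owns a disjoint range of ids. The `Token` type, the `BLOCK` size, the `OFFSETS` layout, and the `tokenize_*` functions are all hypothetical names chosen for this example, and the fixed quantization stands in for whatever encoders an actual system would use.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Token:
    """Common output format shared by every tokenizer (hypothetical)."""
    modality: str  # which component produced the token
    index: int     # id within the shared vocabulary


# Assumption: each modality gets its own block of 256 ids in one vocabulary.
BLOCK = 256
OFFSETS: Dict[str, int] = {
    "text": 0,
    "audio": BLOCK,
    "video": 2 * BLOCK,
    "motion": 3 * BLOCK,
}


def _quantize(values: Sequence[float], modality: str) -> List[Token]:
    """Map floats in [0, 1] onto the id range reserved for `modality`."""
    tokens = []
    for value in values:
        clipped = min(max(value, 0.0), 1.0)
        local = min(int(clipped * BLOCK), BLOCK - 1)  # keep 1.0 inside the block
        tokens.append(Token(modality, OFFSETS[modality] + local))
    return tokens


def tokenize_text(text: str) -> List[Token]:
    """Byte-level text tokenizer: one token per UTF-8 byte."""
    return [Token("text", OFFSETS["text"] + b) for b in text.encode("utf-8")]


def tokenize_audio(samples: Sequence[float]) -> List[Token]:
    """One token per audio sample, assuming samples are normalized to [0, 1]."""
    return _quantize(samples, "audio")


def tokenize_video(frames: Sequence[Sequence[float]]) -> List[Token]:
    """One token per non-empty frame, taken from its mean pixel intensity."""
    means = [sum(frame) / len(frame) for frame in frames]
    return _quantize(means, "video")


def tokenize_motion(positions: Sequence[float]) -> List[Token]:
    """One token per motion value, assuming values are normalized to [0, 1]."""
    return _quantize(positions, "motion")


# All four components emit the same type, so their outputs concatenate directly.
sequence = (
    tokenize_text("hi")
    + tokenize_audio([0.0, 0.5, 1.0])
    + tokenize_video([[0.0, 1.0]])
    + tokenize_motion([0.25])
)
```

Because every component returns a list of `Token`, downstream code can treat `sequence` uniformly without knowing which modality each element came from. Systems of this kind often rely on learned encoders rather than fixed quantization; the sketch only shows the shared output contract.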