Technique that assigns timestamps to the spoken words in an audio recording by constraining the alignment to match a known transcript.
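A minimal sketch of the idea, assuming an acoustic model has already produced per-frame log-probabilities over a small token vocabulary. The function name `forced_align`, the `frame_sec` parameter, and the toy numbers are hypothetical and chosen only for illustration. Practical aligners usually work on phonemes or characters and also handle silence or blank symbols, which this sketch leaves out. Because the transcript is known, the search only decides where each token starts and ends, never which tokens occur.

```python
import math
from typing import List, Tuple


def forced_align(
    log_probs: List[List[float]],  # log_probs[t][v]: log-prob of vocab item v at frame t
    tokens: List[int],             # known transcript as vocabulary indices, in order
    frame_sec: float = 0.02,       # assumed duration of one frame, in seconds
) -> List[Tuple[int, float, float]]:
    """Return (token, start_sec, end_sec) for each transcript token."""
    T, N = len(log_probs), len(tokens)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per transcript token")

    NEG = float("-inf")
    # score[t][j]: best total log-prob of frames 0..t with frame t assigned to tokens[j]
    score = [[NEG] * N for _ in range(T)]
    # advanced[t][j]: True if the best path entered tokens[j] at frame t
    advanced = [[False] * N for _ in range(T)]
    score[0][0] = log_probs[0][tokens[0]]

    for t in range(1, T):
        for j in range(min(t + 1, N)):
            stay = score[t - 1][j]                        # remain on the same token
            move = score[t - 1][j - 1] if j > 0 else NEG  # advance to the next token
            advanced[t][j] = move > stay
            score[t][j] = max(stay, move) + log_probs[t][tokens[j]]

    # Backtrace from the last frame, which must be assigned to the last token.
    frame_to_token = [0] * T
    j = N - 1
    for t in range(T - 1, -1, -1):
        frame_to_token[t] = j
        if t > 0 and advanced[t][j]:
            j -= 1

    # Merge consecutive frames assigned to the same transcript position into spans.
    spans = []
    start = 0
    for t in range(1, T + 1):
        if t == T or frame_to_token[t] != frame_to_token[start]:
            spans.append((tokens[frame_to_token[start]], start * frame_sec, t * frame_sec))
            start = t
    return spans


# Toy example: 4 frames, vocabulary of 3 items, transcript is item 0 followed by item 1.
toy = [[math.log(p) for p in row] for row in [
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
]]
print(forced_align(toy, [0, 1]))
```

In the toy input the first two frames favour item 0 and the last two favour item 1, so the boundary falls between frames 1 and 2. Spans are built from transcript positions rather than token identities, so a transcript that repeats the same token still yields one span per occurrence.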