For high-stakes AI applications, you can improve both accuracy and confidence calibration by combining supervised reasoning examples with unsupervised learning in a single training procedure, rather than treating the two separately.
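To make the idea concrete, here is a minimal sketch of one way such a combined objective could look: a supervised cross-entropy term on labeled reasoning examples plus an unsupervised entropy regularizer that discourages overconfident predictions on unlabeled inputs. The regularizer, the `alpha` weight, and all names here are illustrative assumptions, not the paper's actual method.

```python
import torch
import torch.nn.functional as F

def combined_loss(model, labeled_batch, unlabeled_batch, alpha=0.5):
    """Supervised cross-entropy plus an unsupervised confidence
    regularizer (a hypothetical stand-in for the paper's objective)."""
    x_l, y_l = labeled_batch

    # Supervised term: standard cross-entropy on labeled reasoning examples.
    sup = F.cross_entropy(model(x_l), y_l)

    # Unsupervised term: reward high predictive entropy on unlabeled data,
    # i.e. penalize overconfidence -- one common calibration-oriented
    # choice, assumed here rather than taken from the paper.
    probs = F.softmax(model(unlabeled_batch), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()

    return sup - alpha * entropy  # maximizing entropy tempers confidence

# Toy demo: a linear classifier over 8 features and 3 classes (made-up shapes).
model = torch.nn.Linear(8, 3)
labeled = (torch.randn(4, 8), torch.tensor([0, 2, 1, 0]))
loss = combined_loss(model, labeled, torch.randn(16, 8))
loss.backward()  # gradients flow through both terms jointly
```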
This paper addresses a critical problem in AI safety: large language models that are confidently wrong in exactly the settings where errors matter most.
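One standard way to quantify that "confidently wrong" failure mode is expected calibration error (ECE), which compares a model's stated confidence to its realized accuracy; a short sketch:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then average the |accuracy - confidence|
    gap across bins, weighted by how many predictions land in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# A well-calibrated model's 90%-confidence answers are right ~90% of the time.
print(expected_calibration_error([0.9, 0.95, 0.8, 0.6], [1, 1, 0, 1]))
```

A model can score well on accuracy yet badly on ECE, which is the "confidently wrong" regime described above.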