Sequence probability is useful for ranking answers within a dataset, but it should not be trusted as a guide for choosing decoding methods or hyperparameters: optimizing for probability does not guarantee better answers.
This paper investigates whether higher sequence probability in language models actually correlates with correct answers. The researchers test this across different decoding methods, models, and benchmarks, finding that while probability predicts correctness within a dataset, changing decoding parameters to increase probability doesn't reliably improve accuracy.
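To make the "within a dataset" claim concrete, here is a minimal sketch of how one might score answers by sequence log-probability and check how well that score separates correct from incorrect answers. The summary above does not say which metric the paper uses, so the AUROC-style ranking measure, the function names, and the toy token log-probabilities are all illustrative assumptions.

```python
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Log of the sequence probability: the sum of per-token log-probabilities."""
    return sum(token_logprobs)


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Chance that a random correct answer outscores a random incorrect one.

    Ties count as half a win. 0.5 means the score carries no ranking signal.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: (per-token log-probs of a generated answer, is it correct?)
answers = [
    ([-0.1, -0.2, -0.1], True),
    ([-0.5, -0.3], True),
    ([-1.2, -0.9, -0.4], False),
    ([-0.3, -0.2], False),
]

scores = [sequence_logprob(lp) for lp, _ in answers]
labels = [ok for _, ok in answers]
ranking_quality = auroc(scores, labels)
print(f"AUROC of sequence log-probability vs. correctness: {ranking_quality:.2f}")
```

A score like this only describes ranking under a fixed decoding setup. It says nothing about whether switching to a decoding method that yields higher-probability sequences would raise accuracy, which is the distinction the paper draws.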