Reward-Guided Fine-Tuning — Glossary — ThinkLLM