Use the same optimizer for finetuning as you used for pretraining: it significantly reduces catastrophic forgetting while maintaining task performance, even outperforming parameter-efficient methods like LoRA. In other words, when finetuning a large language model, reusing the pretraining optimizer (rather than switching to a different one) preserves previously learned knowledge while maintaining or improving performance on the new task.
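To make the recommendation concrete, here is a minimal sketch (pure Python, framework-agnostic) of carrying the pretraining optimizer configuration over to finetuning. The config keys and the helper name `reuse_pretraining_optimizer` are illustrative assumptions, not an established API; the point is that everything defining the update rule stays fixed and only the learning rate changes.

```python
def reuse_pretraining_optimizer(pretrain_cfg, finetune_lr):
    """Build a finetuning optimizer config that matches pretraining.

    Everything except the learning rate (optimizer family, betas, eps,
    weight decay) is copied verbatim, so the finetuning update rule is
    identical to the one the model was trained under.
    """
    finetune_cfg = dict(pretrain_cfg)   # copy optimizer family + hyperparameters
    finetune_cfg["lr"] = finetune_lr    # only the learning rate changes
    return finetune_cfg


# Hypothetical pretraining recipe (values for illustration only).
pretrain_cfg = {
    "optimizer": "adamw",
    "lr": 3e-4,
    "betas": (0.9, 0.95),
    "eps": 1e-8,
    "weight_decay": 0.1,
}

finetune_cfg = reuse_pretraining_optimizer(pretrain_cfg, finetune_lr=2e-5)
print(finetune_cfg["optimizer"], finetune_cfg["lr"])  # adamw 2e-05
```

In a real training stack the same idea applies at the framework level: instantiate the identical optimizer class with the identical momentum and weight-decay settings for finetuning, rather than defaulting to a different optimizer.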