A large collection of source code written in many different programming languages, used to train the model.