A volume of data used to train an AI model. Many bodies of data are combined to train small and large language models, and data set sizes are measured in tokens, not parameters (parameters describe the model itself). For example, the RedPajama small model data set comprises 300 billion tokens from books, GitHub, Wikipedia and other sources. The Dolma data set's three trillion tokens came from sources such as Reddit, Project Gutenberg, Wikipedia and Wikibooks (Dolma stands for "Data to feed OLMo's Appetite"). See OLMo, large language model and Hugging Face.
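Because corpora such as these are distributed through hubs like Hugging Face, a few lines of Python can sample one. The sketch below is a minimal illustration using the Hugging Face `datasets` library in streaming mode; the repository id and the `text` field name are assumptions based on the data sets named above and may differ from the currently hosted versions.

```python
# A minimal sketch: sample records from a public training data set
# hosted on Hugging Face, without downloading the whole corpus.
from datasets import load_dataset

# Stream rather than download: these corpora span hundreds of
# billions to trillions of tokens, far too large for local disk.
dataset = load_dataset(
    "togethercomputer/RedPajama-Data-1T-Sample",  # assumed repo id
    split="train",
    streaming=True,
)

# Inspect the first few documents that would feed model training.
for i, record in enumerate(dataset):
    print(record["text"][:200])  # field name assumed; varies by data set
    if i == 2:
        break
```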