The nonprofit Allen Institute for Artificial Intelligence (AI2) has released Dolma (Data to feed OLMo’s Appetite), a massive new open source dataset for training AI language models. Weighing in at 3 trillion tokens, Dolma is the largest openly available dataset of its kind to date.
The Allen Institute created Dolma as part of its OLMo project to build an open, transparent language model. In curating Dolma, the team focused on five principles: openness, representativeness, size, reproducibility and risk mitigation.
- Openness: Addressing the research community's concern about the limited access to pretraining corpora and corresponding language models, they aimed for a dataset that is transparent and open for scrutiny.
- Representativeness: The content should be comparable to that of existing datasets used to train both private and open language models.
- Size: The team acknowledged the potential advantages of having a large dataset. Research indicates that larger training datasets can lead to improved model performance.
- Reproducibility: Any tools developed during the creation of Dolma would be openly available for others to reproduce or adapt.
- Risk Mitigation: The dataset would be designed to minimize the risk posed to individuals, ensuring that web-crawled content isn't traceable back to real-world individuals.
Creating Dolma presented numerous challenges. The Allen Institute had to decide on sources, preprocessing techniques, the languages to include, and strategies for removing personal information. The team used established practices for filtering and deduplicating the raw data, and also took steps to remove potentially harmful content. The final dataset combines web data, English books from Project Gutenberg, scientific manuscripts from peS2o, Wikipedia, and code from GitHub repositories.
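To make the filtering, deduplication and personal-information steps described above more concrete, here is a minimal Python sketch. It is not the Allen Institute's actual pipeline: the function names, the word-count threshold, the exact-match hashing strategy and the `<EMAIL>` placeholder are all assumptions chosen for illustration.

```python
import hashlib
import re

# Illustrative pattern for email addresses; real pipelines cover more PII types.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def mask_emails(text):
    """Replace email addresses with a placeholder token (hypothetical choice)."""
    return EMAIL_RE.sub("<EMAIL>", text)


def clean_corpus(docs, min_words=5):
    """Drop very short documents, remove exact duplicates, and mask emails.

    min_words is an assumed quality threshold, not a value from the source.
    """
    seen = set()
    kept = []
    for doc in docs:
        # Quality filter: skip documents with too few words.
        if len(doc.split()) < min_words:
            continue
        # Exact deduplication on a hash of the normalized text.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(mask_emails(doc))
    return kept


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog.",
        "the quick  brown fox jumps over the lazy dog.",
        "Too short",
        "Contact me at jane.doe@example.com for more details please.",
    ]
    for line in clean_corpus(sample):
        print(line)
```

Hashing normalized text only catches exact or near-exact copies. At web scale, deduplication typically also needs approximate matching and memory-efficient data structures, which this sketch leaves out.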
Compared to closed datasets, Dolma sets itself apart through its transparency, since many of the datasets behind large-scale models are kept private. Compared with other open datasets, Dolma’s strengths lie in its size and licensing terms: it is the largest released so far, and it is licensed under the Allen Institute's ImpACT terms as a medium-risk artifact.
Those interested in accessing Dolma must agree to terms covering transparency, risk mitigation and responsible use. The license prohibits military applications, surveillance and the generation of disinformation. This licensing approach balances ease of access against the potential risks of disseminating vast datasets.
Prospective users must provide their contact details and state their intended use of the dataset. Once approved, they can download Dolma for free from Hugging Face. For more information, consult the data sheet. The team has also provided a GitHub repository with code and instructions for generating and inspecting Dolma.
The Allen Institute plans to expand Dolma with more data sources and languages. The team hopes Dolma will advance openness in AI research and lead to the development of better language technologies.