OpenAI Announces Data Partnerships to Diversify AI Training Sets

By Chris McKay, November 9, 2023

Artificial intelligence research lab OpenAI announced an ambitious new initiative called Data Partnerships aimed at collaborating with third parties to build more diverse and comprehensive datasets for training AI models.

In a blog post, the company explained that the goal of Data Partnerships is to enable the creation of models with a deeper understanding of all subject matters, industries, cultures and languages.

"To ultimately make AGI that is safe and beneficial to all of humanity, we'd like AI models to deeply understand all subject matters, industries, cultures and languages, which requires as broad a training dataset as possible."

By expanding the sources of data used for model training beyond what is readily available online, OpenAI hopes to steer the future of AI in an inclusive direction that provides value to people across languages, geographies and walks of life.

The company is currently offering two ways to collaborate:

Open-Source Archive: OpenAI is seeking partners to help build a public open-source dataset that anyone can use to train AI models. The company says it would also "explore using it to safely train additional open-source models [themselves]".
Private Datasets: Organizations can also work directly with OpenAI to have their private data included in training for OpenAI's proprietary foundation models and custom models, while retaining control over data privacy and access.

OpenAI has already partnered with several organizations under early Data Partnerships, including the government of Iceland and the legal nonprofit Free Law Project. These collaborations have improved model performance in the Icelandic language and in understanding legal documents.

The company says its state-of-the-art in-house technologies, including advanced optical character recognition and automatic speech recognition, can facilitate the digitization and structuring of a wide range of data types and modalities.

Diversity and breadth of training data are crucial for developing capable and socially aware AI systems. Models trained on narrow datasets can inadvertently amplify harmful biases and fail to generalize across geographic and cultural boundaries. By expanding data sources beyond readily available online content, OpenAI aims to train models that embody more inclusive perspectives.

If you are interested in exploring a partnership, complete this form to reach out to the OpenAI team.

Chris McKay is the founder and chief editor of Maginative. His thought leadership in AI literacy and strategic AI adoption has been recognized by top academic institutions, media, and global brands.