ChatGPT Gets An Upgrade: OpenAI Rolls Out Multimodal Capabilities

OpenAI has announced a major expansion of ChatGPT's capabilities, integrating voice and image functionalities to create a more intuitive, multimodal conversational AI. These new features begin to bring to life OpenAI's vision for artificial general intelligence that can perceive and interact with the world much like humans.

Launched in November 2022, ChatGPT took the world by storm as a remarkably human-like chatbot that could engage in natural conversations, answer follow-up questions, and perform tasks like explaining concepts, correcting essays, and generating creative content. However, it was limited to text.

Now, OpenAI is unveiling voice and image capabilities that allow users to verbally communicate with ChatGPT and show it photos for a more interactive experience. This upgrade represents a significant step toward OpenAI's goal of developing an AI assistant that can be helpful across many aspects of everyday life.

The voice feature allows fluid back-and-forth conversation with ChatGPT. Users can simply speak to ask questions or make requests, with the assistant responding through a natural-sounding voice.

OpenAI crafted five distinct voices in collaboration with professional voice actors. The voices were generated using a sophisticated text-to-speech model that can produce human-like audio from just text and a few seconds of sample speech.
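For readers curious what driving such a text-to-speech model programmatically might look like, here is a minimal sketch of assembling a TTS request payload. The model name and voice identifier are illustrative placeholders, not confirmed details of the system described above:

```python
def build_tts_request(text: str, voice: str, model: str = "tts-1") -> dict:
    """Assemble a simple text-to-speech request payload (illustrative only).

    The field names mirror common TTS API conventions: a model identifier,
    a named voice, and the input text to synthesize.
    """
    if not text:
        raise ValueError("text must be non-empty")
    return {"model": model, "voice": voice, "input": text}

# Example: request speech in a hypothetical "juniper" voice.
request = build_tts_request("Hello! How can I help you today?", voice="juniper")
print(request)
```

In a real integration, this payload would be sent to the provider's speech endpoint and the response streamed back as audio.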

[Audio samples of the five voices: Juniper, Cove, Sky, Ember, and Breeze]

The immediacy and conversational flow enabled by voice input add convenience and open up more real-world applications. Users can get cooking help by snapping photos of their fridge and asking questions, help kids with homework by reading problems aloud, or get travel tips by describing scenes from a trip.

Users can now also visually guide ChatGPT by sending one or more images. For instance, travelers could share photos from a landmark and ask for historical facts. Professionals could diagram a workflow to request process optimization advice. The options are vast.

To make sense of images, OpenAI leveraged its new multimodal GPT-3.5 and GPT-4 models. They apply the reasoning and context-parsing abilities of language models to images, text, and combinations of the two.
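As a rough illustration of how a multimodal request combining text and an image might be structured through an API, here is a sketch of a chat payload with mixed content parts. The model name, image URL, and content-part schema are assumptions for illustration, not details from the announcement:

```python
def build_image_request(question: str, image_url: str,
                        model: str = "gpt-4-vision-preview") -> dict:
    """Assemble a chat payload mixing a text question with an image reference.

    The "content" field is a list of typed parts, so a single user turn can
    carry both text and one or more images.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

# Example: the traveler use case from the article, with a placeholder URL.
payload = build_image_request(
    "What landmark is this, and what is its history?",
    "https://example.com/landmark.jpg",
)
print(payload["messages"][0]["content"][0]["type"])
```

Adding more images is a matter of appending further `image_url` parts to the same content list.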

Drawing tools in the mobile apps let users focus ChatGPT's attention on specific parts of an image. The assistant can interpret complex screenshots, data visualizations, diagrams, photographs, and documents.

Given the risks associated with synthetic media and image analysis, OpenAI is gradually enabling these features for select user groups. Plus and Enterprise customers will gain initial access.

The company collaborated with accessibility app Be My Eyes to ensure responsible image usage that assists people's daily lives without overstepping privacy boundaries. Features like chat transcription were intentionally excluded from the rollout.

OpenAI also conducted tests to identify potential harms in high-risk domains and implemented technical safeguards around image analysis. Transparency about model limitations aims to prevent misuse.

This measured approach allows OpenAI to refine protections while expanding access to more users soon. It reflects the company's commitment to developing AI that is both profoundly capable and broadly beneficial.

Today's announcement of voice and image functionalities pushes ChatGPT closer to the cutting edge of artificial intelligence, while keeping ethical considerations at the forefront.

Chris McKay is the founder and chief editor of Maginative. His thought leadership in AI literacy and strategic AI adoption has been recognized by top academic institutions, media, and global brands.
