Microsoft

Microsoft Shares New Prompting Techniques to Push the Boundaries of Frontier AI Models

December 12, 2023 • 3 min read

Microsoft researchers have achieved a new state-of-the-art performance on a key AI benchmark by developing an advanced prompting technique to better steer OpenAI's powerful GPT-4 language model.

The researchers focused on the well-established MMLU benchmark, which tests AI systems across 57 diverse areas of knowledge from math to medicine. By modifying an approach called Medprompt, originally designed for medical questions, the team attained the highest MMLU score ever recorded - 90.10%.

Medprompt can efficiently steer generalist models like GPT-4 to achieve top performance, even when compared to models specifically finetuned for medicine

Medprompt combines three distinct prompting strategies:

Dynamic few-shot selection
Self-generated chain of thought
Choice-shuffle ensembling

Firstly, the dynamic few-shot selection leverages the power of few-shot learning to adapt quickly to specific domains. This method innovatively selects different few-shot examples for varied task inputs, enhancing relevance and representation.

The second strategy involves self-generated Chain of Thought (CoT). This approach prompts the model to generate intermediate reasoning steps, improving its ability to tackle complex reasoning tasks. Unlike traditional methods relying on human-crafted examples, Medprompt automates this process, reducing the risk of erroneous reasoning paths.

Finally, the Majority Vote Ensembling strategy amplifies the predictive performance by combining outputs from various algorithms. This includes a unique trick called choice-shuffling for multiple-choice questions, enhancing the robustness of the model's responses.

Visual illustration of Medprompt components and additive contributions to performance on the MedQA benchmark. Prompting strategy combines kNN-based few-shot example selection, GPT-4-generated chain-of-thought prompting, and answer-choice shuffled ensembling.

While initially specialized for the medical domain, Microsoft found Medprompt’s careful prompting methodology transferred well to the wide range of disciplines. This can be seen when it is applied to the comprehensive MMLU (Measuring Massive Multitask Language Understanding) benchmark. The original application of Medprompt on GPT-4 achieved an impressive 89.1% score on MMLU. However, refining the method by increasing the number of ensembled calls and integrating a simpler prompting method alongside the original strategy led to the development of Medprompt+.

Reported performance of multiple models and methods on the MMLU benchmark.

Medprompt+ achieved a landmark performance, scoring a record 90.10% on MMLU. This was made possible by integrating outputs from both the base Medprompt strategy and simpler prompts, guided by a control strategy that utilizes inferred confidences of candidate answers. Notably, this approach leverages GPT-4's ability to access confidence scores (logprobs), a feature soon to be publicly available.

Microsoft also open-sourced promptbase, an evolving collection of resources, best practices, and example scripts for eliciting the best performance from foundation models like GPT-4. The goal is to lower barriers for researchers and developers aiming to tap into models’ potential through optimized prompting.

Microsoft's success with GPT-4 and Medprompt gains additional significance, particularly when juxtaposed against Google's latest AI system, Gemini Ultra. Google's announcement of Gemini Ultra (which won't actually be available until some time next year) was accompanied by a highly publicized but misleading demo video that has become a point of contention in the AI community. The video, initially acclaimed for its portrayal of real-time AI interactions, was later admitted by Google to be a compilation of staged responses using still images and text prompts, rather than the dynamic, live interaction it depicted.

Now, Microsoft's prompting innovations have expanded GPT-4’s competencies beyond what Gemini Ultra is expected to achieve when launched next year. On all the popular benchmarks, prompted GPT-4 now outscores the results anticipated from Gemini Ultra.

With GPT-4 already displaying intelligence akin to subject matter experts when sufficiently guided, this milestone demonstrates both the models’s extraordinary natural language capabilities as well as the promising opportunity to extend its abilities further through more sophisticated prompting. Microsoft’s latest advances highlight that as AI models progress, better prompting strategies will likely be key to steering these powerful tools toward reliable, ethical and beneficial outcomes.

Chris McKay is the founder and chief editor of Maginative. His thought leadership in AI literacy and strategic AI adoption has been recognized by top academic institutions, media, and global brands.

An Exclusive Leadership Retreat

Leading in the Intelligence Age

Microsoft Shares New Prompting Techniques to Push the Boundaries of Frontier AI Models