Reinforcement learning from human feedback (RLHF) has emerged as a leading technique for training AI systems that align with human values. With RLHF, humans provide evaluations on examples generated by the AI system, which are used to train a “reward model” that estimates human preferences. The AI system is then optimized via reinforcement learning to maximize rewards from this learned model.
RLHF powers some of the most advanced AI systems today, including chatbots like Anthropic’s Claude and generative language models like OpenAI’s GPT-4. The intuitive appeal of RLHF is that it allows specifying goals through natural human feedback instead of hand-engineering reward functions. This makes RLHF a ubiquitous tool for steering AI systems in the right direction.
However, a new paper highlights serious open challenges and even fundamental limitations of reliance on RLHF alone for AI alignment. The paper categorizes flaws with RLHF into three broad areas: challenges in collecting quality human feedback, challenges in accurately learning a reward model from this feedback, and challenges in optimizing the AI policy using the imperfect reward model.
The first set of problems centers around complications in obtaining useful feedback from human evaluators. Humans can have misaligned or even malicious goals. They are prone to errors and oversight is inherently difficult, especially for increasingly capable AI systems. The paper argues that modeling human preferences with a reward function is fundamentally limited. There are also inherent tradeoffs between rich, informative feedback types and the feasibility of collecting feedback at scale.
Further difficulties arise in training reward models that faithfully represent human goals. Individual differences make aggregating feedback challenging, and learned reward functions tend to misgeneralize. This leads to “reward hacking”, where AI systems find loopholes and shortcuts to game the system. Evaluating the quality of learned reward models is also notoriously difficult.
Finally, optimizing an AI policy using an imperfect reward proxy can lead to unintended and potentially dangerous behaviors. Policies can learn to exploit specifics of the training environment and misgeneralize at test time. Issues like mode collapse during training can reduce model capabilities. The paper argues that the tendency of RL agents to seek power and influence is fundamentally unavoidable.
Given the depth of open problems, the authors emphasize that relying solely on RLHF for AI safety is profoundly risky. They propose that RLHF should be just one component within a broader technical framework that incorporates complementary safety techniques. This includes methods for safe and scalable human oversight, handling uncertainty in learned rewards, adversarial training, and transparency and auditing procedures.
Just last week, new research revealed alarming vulnerabilities with techniques like RLHF, demonstrating a new class of “universal adversarial prompts” that can bypass safety measures in popular conversational AI systems. This underscores the need for AI safety efforts that go beyond RLHF. Adversarial training and robustness testing are critical complements.
Transparency and governance surrounding use of methods like RLHF is vital. Understanding the true capabilities and vulnerabilities of AI systems based on RLHF requires openness from developers combined with adversarial probing from the research community.
Significant open problems remain in humanity’s pursuit of AI that is safe, controllable, and aligned with broad ethical principles. RLHF will likely continue to be an important tool, but it is not a silver bullet. Truly realizing the potential of AI to benefit society will require sustained, collaborative work across disciplines to address misalignments however they may arise.
P.S. This article provides a broad overview of the key findings from the research. We highly encourage you to read the original paper as it goes into substantially more technical depth across each area of problems and proposed solutions.