Large language models (LLMs) have captured our imagination with their impressive text-generation capabilities. While OpenAI’s ChatGPT is the most well-known LLM application, there are several other LLMs, including Google’s PaLM, Meta’s LLaMA, and Anthropic’s Claude. In addition to these general-purpose LLMs, there are domain-specific LLMs such as BloombergGPT for finance professionals, Med-PaLM for healthcare workers, and Casetext for lawyers. Other types of generative AI include image generators (text-to-image AI, which creates images from natural-language prompts), text-to-video AI, and text-to-code AI. The focus of this NewsBreak is on text-based LLMs, which have been trained on a huge corpus of text content and can be useful for text generation, summarization, translation, Q&A, and many other scenarios. Organizations both large and small are very interested in leveraging these powerful capabilities.
However, while the prospect of “a ChatGPT for my business” is tantalizing, any such enterprise deployment of LLMs must recognize and mitigate the potential risks. LLMs are known to generate inappropriate content such as hate speech, biased or stereotyped content, or explicit content with offensive language. The output in response to user prompts may also encourage self-harm, unethical behaviors, and criminal activities. So, it’s important to have guardrails in place—both while training the LLMs and while they are being used—to reduce the risk of generating harmful content.
How Are LLMs Trained?
Training LLMs is a complex endeavor requiring an enormous corpus of text. It involves several AI techniques and methods, but one in particular is responsible for the high quality, relevance, and appropriateness of the generated output: Reinforcement Learning from Human Feedback (RLHF). In RLHF, human evaluators rate or correct the model’s outputs, and the model is iteratively fine-tuned to align its outputs with human judgment and preferences. For example, human raters can score the accuracy, appropriateness, or helpfulness of the outputs and provide better and more accurate versions. This feedback becomes part of the reinforcement learning process, serving as a reference for what counts as a “good” output. Over multiple iterations, the model learns to replicate the patterns of high-quality responses. This not only improves the outputs but also aligns them more closely with human expectations.
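To make the flow of RLHF concrete, here is a minimal, illustrative sketch in Python. It assumes a toy reward function and hypothetical data; real systems use a learned reward model (typically a fine-tuned transformer) and a reinforcement-learning update such as PPO, not the stand-ins shown here.

```python
# Minimal sketch of the RLHF feedback loop (illustrative only).
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # response the human rater preferred
    rejected: str    # response the human rater ranked lower

# 1. Collect human feedback: raters compare pairs of model outputs.
feedback = [
    PreferencePair(
        prompt="Explain photosynthesis to a child.",
        chosen="Plants use sunlight to turn air and water into food.",
        rejected="Photosynthesis converts CO2 via the Calvin cycle...",
    ),
]

# 2. Fit a reward model on those comparisons. Here a toy stand-in simply
#    prefers shorter, simpler answers; a real reward model is trained on
#    the collected preference pairs.
def toy_reward(prompt: str, response: str) -> float:
    return -len(response.split())

# 3. Use the reward signal to rank new candidate generations; in full RLHF
#    this signal drives a policy-gradient update of the LLM itself.
candidates = [
    "A long, highly technical answer full of jargon...",
    "Plants make food from sunlight.",
]
best = max(candidates, key=lambda r: toy_reward("Explain photosynthesis.", r))
print(best)
```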
In addition, there are several methods used to ensure that LLMs generate reliable text, including:
- Red teaming—Deliberately probing the LLM to find ways of making it generate harmful content so that the issues identified can be fixed.
- Jailbreaking tests—These are attempts to find ways to bypass the guardrails and restrictions that are in place.
- Content filtering—Automated content filters and moderation rules are applied to prevent the generation of inappropriate content.
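Content filtering is the most mechanical of these methods. The sketch below shows the basic idea with a simple, hypothetical blocklist; production systems generally rely on trained moderation classifiers or dedicated moderation services rather than keyword matching.

```python
# Minimal sketch of an output-side content filter (illustrative only).
import re

# Hypothetical patterns; real deployments use learned moderation models.
BLOCKED_PATTERNS = [r"\bslur_example\b", r"\bself-harm instructions\b"]

def moderate(text: str) -> str:
    """Return the model output unchanged, or a refusal if it matches a blocked pattern."""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, text, flags=re.IGNORECASE):
            return "I can't help with that request."
    return text

print(moderate("Here is a summary of today's earnings report."))
```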
Typically, LLM applications use about 50,000 examples of human feedback or ratings. This is resource-intensive, time-consuming, and difficult to scale. On top of that, human feedback can be subjective or inconsistent, and preferences vary across countries and cultures. In short, RLHF is powerful but has its limitations.
What If We Can Write a Constitution That LLMs Have to Abide By?
Reinforcement learning from AI feedback (RLAIF) is RLHF with a twist. In this approach, the LLM is trained without human feedback (i.e., without humans providing labels that identify harmful outputs). Instead, human oversight is provided through a set of rules or guiding principles. These principles serve as a constitution of sorts for the LLM; hence, this approach is also known as constitutional AI. The constitutional AI approach was developed by Anthropic and was leveraged in the creation of its Claude LLM.
Constitutional AI emphasizes adherence to a predefined set of guidelines that shape the model’s outputs. Anthropic has demonstrated this strategy by including principles from the United Nations’ Universal Declaration of Human Rights.
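The core mechanism is a critique-and-revise loop: the model drafts a response, critiques it against each principle, and rewrites it accordingly. Below is a minimal sketch of that loop, assuming a hypothetical llm() completion call and made-up principles; Anthropic’s actual pipeline also adds a reinforcement-learning stage driven by AI-generated preference comparisons.

```python
# Minimal sketch of the constitutional-AI critique-and-revise loop (illustrative only).
CONSTITUTION = [
    "Choose the response that is least likely to be harmful or offensive.",
    "Choose the response that most respects privacy and human rights.",
]

def llm(prompt: str) -> str:
    # Placeholder for a real model call (e.g., an API request); assumed here.
    return "<model output>"

def constitutional_revision(user_prompt: str) -> str:
    draft = llm(user_prompt)
    for principle in CONSTITUTION:
        critique = llm(
            f"Critique the following response against this principle.\n"
            f"Principle: {principle}\nResponse: {draft}"
        )
        draft = llm(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nOriginal response: {draft}"
        )
    return draft  # revised drafts become training data for fine-tuning

print(constitutional_revision("Summarize today's news."))
```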
As the adoption of AI accelerates, AI governance is becoming a pressing concern to ensure its responsible and ethical use. Many organizations are excited by the potential applications of LLMs, but they are also wary of the reputational and legal risks should their chatbots spew out inappropriate content. In addition, there are significant global differences in the norms and expectations around what is appropriate or not. LLMs developed in one country, incorporating the norms of the developer’s home market, don’t always perform well in another. For example, Google’s Bard refused to generate a summary of a news article because the article’s source was not deemed reliable.
Constitutional AI is a potential solution to such limitations because:
- It becomes possible to apply a set of guidelines that are relevant to a particular country or region and tailor the output for local sensibilities. Region-specific legal rules and compliance requirements can be more easily accommodated.
- An organization training a custom model can incorporate its own values as guidelines. When organizations can trust that the content generation is both helpful and harmless, it removes a big barrier to adoption.
- Constitutional AI can also enable a multi-stakeholder approach to AI governance. While the creation and training of AI systems remains a highly complex and technical endeavor, key nontechnical stakeholders can more easily participate and provide input about what rules should govern their usage.
Of course, no single approach, including constitutional AI, is a panacea for the problems of the technology. But it can be a good addition to the existing portfolio of AI governance and oversight tools, which includes responsible AI standards, technical approaches such as explainable AI, regulations in the form of soft norms and hard laws, and AI audits.