If there were an award for webinar titles, Data Conversion Laboratory (DCL) and its Oct. 12 program, “Hallucinate, Confabulate, Obfuscate: The Perils of Generative AI Going Rogue,” would surely be in the running. It’s quite a mouthful. To explain what the webinar was actually about, analyzing that title is a good place to begin.

What do these words actually mean? “Hallucinate”: to perceive things—sights, sounds, etc.—that aren’t really there. “Confabulate”: to make up experiences or events that didn’t really happen. “Obfuscate”: to create confusion or make unclear or uncertain. These words convey that the “going rogue” in the title is not the popular fear that AI is going to take over the world and make humanity obsolete. Instead, it’s “going rogue” like a puppy that slips out an open door, runs away, and can’t find its way back home.
In this case, the open door is the portal to a fantasy world, in which AI “perceives” things that aren’t really there, makes up answers that aren’t real and true, and thus adds to society’s general state of confusion. It can’t be trusted to impart knowledge, and you don’t know if what it’s telling you is true or false. It needs close supervision. Don’t leave the door open.
The webinar, which was presented by DCL systems architect Rich Dominelli and CIO Tammy Bilitzky, illustrated the problems with AI (and more specifically, large language models, or LLMs), traced their causes, and offered mitigation strategies drawn from DCL’s experience in applying LLMs to its document-processing work. For example, Dominelli recounted his attempts to use an LLM, ChatGPT 3.5, to answer questions using information found in corporate 10-K reports (annual reports mandated by the Securities and Exchange Commission). The application supplied incorrect information, behaving, he said, “like a toddler who didn’t want to give you an answer that it knew you wouldn’t like, so instead it just makes up something on the spot.”
WHY HALLUCINATIONS HAPPEN
The presenters went on to list four reasons why LLMs hallucinate:
- “Temperature” is a parameter that sets the balance between determinism and creativity. It is adjustable by the application programmer. The lower the “temperature,” the more deterministic the answer provided by the LLM. The higher the “temperature,” the greater the level of creativity—and the greater the likelihood that the application will make up (confabulate) an answer.
- “Top_P” is another adjustable parameter; it controls the diversity of the language the model draws on in composing an answer. It’s analogous to a search engine’s behavior in presenting more or fewer of its best retrieved results in response to a query. The greater the Top_P value, the greater the creativity, and the greater the risk in the answer. (Both parameters appear in the short code sketch after this list.)
- Bad training data is responsible for much of the widely reported perpetuation of gender, racial, and other stereotypes by AI applications. More fundamentally, the problem is that the applications can’t encompass novel cases. If an application is trained on a dataset in which all engineers are male, it doesn’t grasp the possibility that an engineer could just as well be female and that female candidates could be equally (or better) qualified for engineering jobs. Since no training of an LLM can encompass all possible cases, this problem is ubiquitous.
- Model drift and decay derive from another universal phenomenon. No matter how good their testing and training, LLMs will absorb new and unexpected information over time. If these new inputs are of poor quality, such as disinformation, the applications’ answers will drift toward misinformation.
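To make these two knobs concrete, here is a minimal sketch of where an application programmer sets them, using the OpenAI Python client; the model name, prompt, and parameter values are illustrative assumptions, not details from the webinar.

```python
# Minimal sketch: setting the two sampling parameters described above.
# Requires the openai package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[
        {"role": "user",
         "content": "Summarize the risk factors in this 10-K filing."}
    ],
    temperature=0.0,  # 0.0 = most deterministic; higher values (up to 2.0
                      # in this API) make the output more "creative"
    top_p=1.0,        # 1.0 = sample from the full token distribution; lower
                      # values confine sampling to the likeliest tokens,
                      # reducing diversity (and risk)
)

print(response.choices[0].message.content)
```

As a practical note, OpenAI’s documentation suggests adjusting temperature or top_p, but generally not both at once.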
TACTICS FOR USING AI
As serious as all of these problems may seem, DCL’s message was not to discourage the use of AI. Instead, the presenters saw it as an essential tool, but one that requires careful management. They recommended five tactics:
- Using verified datasets to regularly test the model—Apply carefully constructed training data to begin with. Over time, continue to subject the LLM to test questions against the verified set. If it continues to give the same accurate responses, the chances are that it hasn’t drifted or decayed. (A sketch of such a regression check appears after this list.)
- Assessing multiple datapoints—In other words, apply different machine learning technologies to the same task to determine whether the LLM’s responses are consistent with the results of those other methods.
- Following a consensus-based approach that compares the results of different language models to identify outliers (a second sketch after this list illustrates the idea)
- Retraining models if they are found to be flawed
- Making models unlearn information
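As a concrete illustration of the first tactic, here is a hedged sketch of a periodic regression check against a verified question set. The questions, expected answers, and the ask_llm() stub are hypothetical stand-ins rather than DCL’s actual tooling; a real deployment would call the production model (for instance, via the client snippet shown earlier).

```python
# Hedged sketch: periodically re-run a verified question set against the
# model and measure the pass rate. All data here is hypothetical.

verified_qa = [
    {"question": "Which fiscal year does Acme Corp.'s 2022 10-K cover?",
     "expected": "fiscal year 2022"},
    {"question": "Who audits Acme Corp.'s financial statements?",
     "expected": "example auditors llp"},
    # ...more curated question/answer pairs from the verified dataset
]

def ask_llm(question: str) -> str:
    # Hypothetical stand-in for the deployed model; replace with a real
    # API call in production.
    return "The filing covers fiscal year 2022, audited by Example Auditors LLP."

def regression_pass_rate(qa_pairs: list[dict]) -> float:
    # Score a response as passing if the verified answer appears in it.
    # Real systems would likely use fuzzier matching than substring tests.
    passed = sum(
        1 for pair in qa_pairs
        if pair["expected"].lower() in ask_llm(pair["question"]).lower()
    )
    return passed / len(qa_pairs)

# Run on a schedule; a falling pass rate is an early signal of drift or decay.
print(f"pass rate: {regression_pass_rate(verified_qa):.0%}")
```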
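The consensus-based approach lends itself to a similar sketch. The snippet below poses the same question to several models and flags any model whose answer disagrees with all of the others; the model names and answers are invented, and the crude lexical similarity measure is a placeholder for the stronger semantic comparison a production system would likely use.

```python
# Hedged sketch of the consensus tactic: compare answers from several
# models and flag outliers. Model names and answers are invented.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Crude lexical similarity in [0, 1]; a production system would more
    # likely compare embeddings or use an entailment model.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def flag_outliers(answers: dict[str, str], threshold: float = 0.5) -> list[str]:
    # An answer is an outlier if it is dissimilar to every other answer.
    outliers = []
    for name, answer in answers.items():
        others = [a for other, a in answers.items() if other != name]
        if all(similarity(answer, a) < threshold for a in others):
            outliers.append(name)
    return outliers

answers = {
    "model_a": "Net revenue for 2022 was $4.2 billion.",
    "model_b": "Revenue in fiscal 2022 totaled $4.2 billion.",
    "model_c": "The company was founded in 1987.",  # confabulated outlier
}
print(flag_outliers(answers))  # -> ['model_c']
```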
Unlearning was discussed at length, for three reasons. First, it’s new. It was recently described in a preprint in which the authors started with a model that had ingested all of the Harry Potter novels and then erased all of its knowledge of Harry Potter. Second, it is much less expensive and time-consuming than complete retraining. And third, it raises serious ethical concerns. Given the dismal current state of the management and oversight of web search engines and social media, as well as the proliferation of disinformation, imagine an LLM that has been deliberately purged of valid information and taught to dispense lies that can’t be traced back to any specific source but are accepted as true because they come from AI. If LLMs take the place of search engines and determine what people believe, as some forecasters expect, the possibilities are disturbing.
A NEW PERSPECTIVE ON AI
Ultimately, DCL’s webinar provided a realistic and reasonably detailed non-technical counterpoint to both the hype put forth by AI advocates and the doomsday scenarios promoted by others. Countering claims that these systems are sentient, Dominelli held that “LLMs are not intelligent. They are prediction engines. They are attempting to predict what answer you want. They are not reasoning. …” Even so, they have become an important tool for many professions and businesses, not least for content- and document-processing tasks.
“At the end of the day, we can’t wait for AI to be perfect,” noted Bilitzky. So, responsible management must adopt a “trust but verify” approach. Moreover, the technology will continue to evolve. Current capabilities may grow, and current problems may be solved or at least mitigated. New problems will emerge. Any analysis of such a rapidly developing technology can only be a snapshot—but a clear and well-composed snapshot is still a valuable resource. Meanwhile, enjoy this puppy, but don’t let it out the door.