All About the Hallucination Metric in Large Language Models (LLMs)
Large Language Models (LLMs) are at the forefront of technological discussions, but these models come with flaws too. They face challenges like "hallucinations" - generating incorrect or irrelevant content. Understanding these hallucinations is crucial for effective use of LLMs. It provides insight into AI's potential and limitations. Additionally, evaluating LLMs and their outputs is essential for developing robust applications.
Keep reading to learn in detail what LLM evaluation metrics are, how they can be used to evaluate LLM systems, and what the possible causes of hallucinations are.
So, What Are Hallucinations?
The dictionary meaning of the word hallucination is experiencing something that does not exist. A real-life analogy would be when someone remembers a conversation or episode incorrectly and ends up providing distorted information.
And how does this translate to hallucinations in LLMs? Well, large language models are infamous for generating details that are factually incorrect or simply irrelevant to the prompt. Since LLMs are designed to process and generate human-like text based on the data they have been trained on, the nature and extent of their hallucinations depend heavily on that training data and on the model itself.
Possible Causes of Hallucinations
- Data Biases: LLMs may hallucinate due to biases and errors present in the training data. For instance, if the training data was assembled by scraping Wikipedia or Reddit, we cannot be absolutely sure of its accuracy. LLMs summarize and generalize from vast amounts of such data, and that generalization can go wrong.
- Lack of Context: Inadequate context can lead to irrelevant or inaccurate output. In fact, the more context you add to your prompt, the better the LLM can understand what you are asking for, and the more refined its answer will be.
- Noise in Input: Noisy or incomplete input data can cause LLMs to hallucinate.
An example of noisy input in the context of using a language model like GPT could be: "Can, like, uh, you provide, like, a summary or something of the key points, like, in the latest, uh, research paper on climate change, you know?"
In this example, the noisy input contains filler words, repetitions, and hesitations that add no meaningful information to the query. This kind of input can confuse or distract the language model, leading to a less accurate or relevant response. Cleaning up the input by removing unnecessary elements can help improve the model's performance and output quality (a minimal clean-up sketch follows this list). A better input would be: “Can you provide a summary of the key points in the latest research paper on climate change?”
- Generation Method: Whether it is the model architecture or the fine-tuning and decoding parameters, the generation method plays a huge role in how often a model hallucinates.
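To illustrate the clean-up step mentioned in the noisy-input example above, here is a minimal, hypothetical sketch that strips common filler words from a prompt before it is sent to a model. The filler list and the `clean_prompt` helper are illustrative only, and a production pipeline would need more careful handling of punctuation.

```python
import re

# Hypothetical list of filler tokens to strip; a real pipeline would need to be
# more careful about commas that belong to the sentence rather than to fillers.
FILLERS = r"\b(like|uh|um|you know|or something)\b"

def clean_prompt(noisy_prompt: str) -> str:
    """Strip filler words, then collapse the commas and spaces they leave behind."""
    cleaned = re.sub(FILLERS, "", noisy_prompt, flags=re.IGNORECASE)
    cleaned = re.sub(r"[,\s]+", " ", cleaned)        # collapse leftover commas and spaces
    cleaned = re.sub(r"\s+([?.!])", r"\1", cleaned)  # no space before final punctuation
    return cleaned.strip()

noisy = ("Can, like, uh, you provide, like, a summary or something of the key "
         "points, like, in the latest, uh, research paper on climate change, you know?")
print(clean_prompt(noisy))
# -> "Can you provide a summary of the key points in the latest research paper on climate change?"
```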
Introducing the Hallucination Metric
LLMs make assessments based on their training data in a manner that resembles how a human would evaluate a given situation. While this offers efficiency and scalability, considerations around bias, accuracy (hallucinations), and human oversight are critical for ethical and fair decision-making. We have already discussed what hallucinations are and why they happen. Let's now look at the hallucination metric, which is used to evaluate a model's propensity to generate inaccurate information and helps assess the reliability and trustworthiness of its outputs.
Calculation Methods
- Overlap with Ground Truth: Comparing the generated text with a set of expected outcomes or ground truth data.
- Semantic Coherence: Evaluating the logical consistency and coherence of the generated text.
- Fact-Checking: Cross-referencing facts stated in the output with reliable sources to detect LLM hallucinations.
- Simple Calculations Based on Context: Some frameworks, such as DeepEval, a popular open-source LLM evaluation framework, directly use the supplied context to calculate the hallucination metric. See the DeepEval docs for details, and the usage sketch after this list.
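To make that last point concrete, here is a minimal sketch of a context-based hallucination score computed with DeepEval, following the pattern in its docs. Class and parameter names may differ slightly between versions, the example texts are made up, and the metric needs an evaluator model (for example, an OpenAI API key) configured before it will run. For this particular metric, a lower score indicates less hallucination.

```python
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# The context acts as the ground truth that the model output is judged against.
context = ["The Eiffel Tower was completed in 1889 and is located in Paris, France."]

test_case = LLMTestCase(
    input="When was the Eiffel Tower completed?",
    actual_output="The Eiffel Tower was completed in 1925 in Lyon.",  # contradicts the context
    context=context,
)

# Scores how much of the output contradicts the supplied context; the test
# passes when the score is at or below the threshold.
metric = HallucinationMetric(threshold=0.5)
metric.measure(test_case)

print(metric.score)
print(metric.reason)
```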
Evaluating Hallucination Scores and Improving Responses
Lower hallucination scores indicate higher accuracy and trustworthiness of the model's judgments. Evaluating hallucination scores involves assessing the frequency and severity of inaccuracies or fabricated information in LLM-generated outputs. Metrics such as BLEU, ROUGE, and METEOR can provide insights into the text's accuracy, but they focus on similarity to reference texts rather than factual correctness.
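As a quick illustration of that limitation, here is a sketch using NLTK's sentence-level BLEU on two made-up sentences: the candidate keeps a respectable overlap score even though it states the wrong year, which is exactly why overlap metrics alone cannot catch hallucinated facts.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The Eiffel Tower was completed in 1889 and stands in Paris.".lower().split()
candidate = "The Eiffel Tower was completed in 1925 and stands in Paris.".lower().split()

# BLEU rewards n-gram overlap with the reference; smoothing avoids zero scores
# when some higher-order n-grams are missing.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")

# The candidate still scores well despite the wrong year, so surface overlap
# says little about factual correctness.
```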
It is important to note that after evaluating the severity of inaccuracies, the next step would be to minimize the hallucination itself.
How would you do that?
- Improve the context of the prompt itself.
- Add more background information. Better context means better results. Give examples of what you would want.
- Tell the system how to structure the response. If you are the one building the application, you can also tune parameters like temperature to decrease the randomness of the response (see the sketch after this list).
- Implementing post-processing steps, such as filtering and fact-checking, can help identify and correct hallucinations before presenting the final output.
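To tie the prompt-side suggestions above together, here is a minimal sketch using the OpenAI Python client. It assumes an OPENAI_API_KEY in the environment, and the model name, system prompt, and placeholder context are illustrative rather than prescriptive. A lower temperature reduces randomness but does not by itself guarantee factual output, so the post-processing step above still applies.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The system message supplies context rules, an expected structure, and an
# explicit instruction to admit uncertainty instead of guessing.
system_prompt = (
    "You are a research assistant. Answer only from the context provided. "
    "Respond as three bullet points. If the context does not contain the "
    "answer, say 'I don't know' rather than inventing one."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",   # placeholder model name
    temperature=0,         # lower temperature -> less random output
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Context: <paste source text here>\n\n"
                                    "Question: Summarize the key findings."},
    ],
)
print(response.choices[0].message.content)
```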
ProArch’s AI capabilities
With deep expertise in implementing AI technologies like generative AI and machine learning, ProArch’s AI consulting services deconstruct business problems and integrate AI where it makes the greatest impact. This also involves understanding how Gen AI applications are built, testing them and making sure they produce the correct output.
Reach out to us to work with an AI consulting company that turns obstacles into opportunities.