Popular Hallucination Detection Metrics and Tools for LLMs
1. FEQA (Faithfulness Evaluation with Question Answering)
- Description: FEQA evaluates the faithfulness of generated text by generating questions from it and answering those questions against the source document; answers that disagree with the source indicate hallucinated content.
- Usage: Commonly used in abstractive summarization and other settings where generated content must be verifiable against a source.
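A minimal sketch of this QA-based check, assuming the Hugging Face transformers library and the deepset/roberta-base-squad2 model (any extractive QA model would do); the questions are hand-written here, whereas FEQA produces them with a question-generation model run over the summary:

```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

source = ("The Eiffel Tower was completed in 1889 and is located in Paris. "
          "It was designed by Gustave Eiffel's engineering company.")
summary = "The Eiffel Tower, completed in 1887, stands in Paris."

# FEQA derives these questions automatically from the summary; they are
# hand-written here to keep the sketch short.
questions = ["When was the Eiffel Tower completed?",
             "Where is the Eiffel Tower located?"]

for q in questions:
    ans_from_summary = qa(question=q, context=summary)["answer"]
    ans_from_source = qa(question=q, context=source)["answer"]
    consistent = ans_from_summary.strip().lower() == ans_from_source.strip().lower()
    print(f"{q}\n  summary: {ans_from_summary} | source: {ans_from_source} | consistent: {consistent}")
```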
2. QAGS (Question Answering and Generation for Summarization)
- Description: QAGS measures factual consistency by generating questions from the model's summary, answering them using both the summary and the source document, and comparing the two sets of answers; large disagreements signal hallucinated content.
- Usage: Useful for summarization tasks and for assessing factual accuracy in generated content.
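The comparison step can be as simple as token-level F1 between the answer obtained from the summary and the answer obtained from the source. A self-contained sketch, with hypothetical hand-picked answers standing in for a QA model's output:

```python
from collections import Counter

def answer_f1(pred: str, gold: str) -> float:
    """SQuAD-style token-level F1 between two short answers."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# Hypothetical answers a QA model produced for the same questions, once from
# the source document and once from the generated summary.
answers_from_source = ["the Eiffel Tower", "1889"]
answers_from_summary = ["the Eiffel Tower", "1887"]

score = sum(answer_f1(s, g) for s, g in zip(answers_from_summary, answers_from_source)) / len(answers_from_source)
print(f"QAGS-style consistency score: {score:.2f}")
```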
3. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- Description: ROUGE measures the overlap of n-grams, word sequences, and word pairs between the generated text and reference summaries. While primarily used for summarization, it can be adapted for hallucination detection.
- Usage: Provides a coarse, overlap-based signal of how much generated content is grounded in the reference; low overlap can flag unsupported material, but ROUGE does not directly measure factuality.
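A minimal sketch using the rouge-score package (the choice of implementation is an assumption; others expose the same scores):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "The Eiffel Tower was completed in 1889 and is located in Paris."
generated = "The Eiffel Tower, finished in 1889, stands in Paris."

# score(target, prediction) returns precision/recall/F1 per ROUGE variant.
scores = scorer.score(reference, generated)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```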
4. BLEU (Bilingual Evaluation Understudy)
- Description: BLEU scores measure the precision of n-grams in the generated text compared to reference texts. Although designed for translation, it can help identify inaccuracies by comparing generated text with ground truth data.
- Usage: Assesses n-gram overlap between generated content and reference texts; like ROUGE, it is a surface-level proxy rather than a direct factuality measure.
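A minimal sketch using sacreBLEU (one of several BLEU implementations; the choice is an assumption):

```python
import sacrebleu

generated = "The Eiffel Tower was finished in 1889 and stands in Paris."
references = ["The Eiffel Tower was completed in 1889 and is located in Paris."]

# corpus_bleu takes a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu([generated], [references])
print(f"BLEU: {bleu.score:.1f}")
```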
5. FactCC
- Description: FactCC is a BERT-based classifier, trained on automatically corrupted summaries, that verifies whether a generated summary is factually consistent with the original document. It focuses on detecting factual discrepancies in text generation.
- Usage: Applied in text summarization and other tasks requiring factual integrity.
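The released FactCC checkpoint is a fine-tuned BERT classifier; as a rough stand-in, an off-the-shelf NLI model can score whether the source entails a generated claim. The roberta-large-mnli model name is an assumption, and NLI entailment only approximates FactCC's consistency label:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

source = "The Eiffel Tower was completed in 1889 and is located in Paris."
claim = "The Eiffel Tower was completed in 1887."

# Encode (premise, hypothesis) and read off entailment vs. contradiction.
inputs = tokenizer(source, claim, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1).squeeze()

print({model.config.id2label[i]: round(float(p), 3) for i, p in enumerate(probs)})
```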
6. BERTScore
- Description: BERTScore uses contextual embeddings from pre-trained BERT-family models to measure the semantic similarity between generated text and reference text at the token level. It can help surface hallucinatory content by flagging passages that are semantically inconsistent with the reference.
- Usage: Measures the semantic alignment of generated content with reference texts; low similarity is an indirect signal rather than a direct factuality check.
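A minimal sketch using the bert-score package (the package choice and default model are assumptions):

```python
from bert_score import score

candidates = ["The Eiffel Tower, completed in 1887, stands in Paris."]
references = ["The Eiffel Tower was completed in 1889 and is located in Paris."]

# Returns precision, recall, and F1 tensors, one entry per candidate.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0].item():.3f}")
```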
7. SummaQA
- Description: SummaQA generates cloze-style questions by masking named entities in the source document and answers them using the generated summary; unanswerable or incorrect answers indicate missing or inconsistent content.
- Usage: Commonly used for evaluating summarization models.
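A rough sketch of the cloze idea, assuming spaCy for entity masking and an extractive QA model to answer from the summary; SummaQA's released implementation uses its own question and answering components, so this is an approximation:

```python
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

source = "The Eiffel Tower was completed in 1889 in Paris."
summary = "The Eiffel Tower, finished in 1889, stands in Paris."

# Mask each named entity in the source to form a cloze question, then try to
# recover the masked entity from the summary.
for ent in nlp(source).ents:
    cloze = source.replace(ent.text, "[MASK]")
    ans = qa(question=cloze, context=summary)
    print(f"masked {ent.text!r} -> predicted {ans['answer']!r} (confidence {ans['score']:.2f})")
```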
8. MIMICS (Metrics for Informative Model-Generated Content Summarization)
- Description: MIMICS uses a suite of metrics designed to evaluate informativeness, coherence, and factual accuracy of summaries generated by AI models.
- Usage: Specifically tailored for summarization tasks.
9. Trained Classifiers for Hallucination Detection
- Description: Custom classifiers can be trained to detect hallucinations by learning from annotated datasets in which hallucinated outputs are labeled.
- Usage: Versatile application across different text generation tasks.
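A minimal sketch of the idea using scikit-learn; real systems typically fine-tune a transformer on large annotated corpora, and the tiny inline dataset here is purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each example pairs a source passage with a generated claim; label 1 marks a
# hallucination. A real dataset would contain thousands of annotated pairs.
texts = [
    "Source: The tower was completed in 1889. Claim: The tower was completed in 1889.",
    "Source: The tower was completed in 1889. Claim: The tower was completed in 1887.",
    "Source: The capital of France is Paris. Claim: Paris is the capital of France.",
    "Source: The capital of France is Paris. Claim: The capital of France is Lyon.",
]
labels = [0, 1, 0, 1]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

test = "Source: The tower was completed in 1889. Claim: The tower opened in 1900."
print("hallucination probability:", clf.predict_proba([test])[0][1])
```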
10. Human Evaluation
- Description: Although not a tool, human evaluation remains one of the most reliable methods for detecting hallucinations. Subject matter experts review and assess the factual accuracy of the model’s output.
- Usage: Critical for high-stakes applications where factual accuracy is paramount.
11. GPT-3/GPT-4 Based Fact-Checking
- Description: Leveraging a strong LLM as a judge to cross-check generated content, either against provided or retrieved evidence or against the model's own knowledge, in order to identify discrepancies and potential hallucinations.
- Usage: Effective for automated, near-real-time verification of generated content, with the caveat that the judge model can itself make mistakes.
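A minimal LLM-as-judge sketch using the OpenAI Python SDK; the model name, prompt, and label scheme are assumptions, and in practice the claim is usually paired with retrieved evidence rather than checked against the judge's parametric knowledge alone:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

source = "The Eiffel Tower was completed in 1889 and is located in Paris."
claim = "The Eiffel Tower was completed in 1887."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    messages=[
        {"role": "system",
         "content": ("You are a fact-checking assistant. Given a source and a claim, "
                     "reply with SUPPORTED, REFUTED, or NOT ENOUGH INFO, followed by "
                     "a one-sentence reason.")},
        {"role": "user", "content": f"Source:\n{source}\n\nClaim:\n{claim}"},
    ],
)
print(response.choices[0].message.content)
```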