Monday, June 17, 2024

7.1 Metrics for Detecting Hallucination

Popular metrics and tools for detecting hallucinations in LLM outputs

1. FEQA (Faithfulness Evaluation with Question Answering)

  • Description: This metric evaluates the faithfulness of generated text by generating questions from it and checking whether the answers extracted from the source document agree with the answers implied by the generated text.
  • Usage: Commonly used for abstractive summarization and other settings where generated content must stay faithful to a source (a minimal pipeline sketch follows below).
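The sketch below illustrates an FEQA-style pipeline under some assumptions: the question-generation and extractive-QA checkpoints are just examples of suitable models, and the answer-matching rule is a crude substring check rather than the paper's exact scoring.

```python
# Minimal FEQA-style pipeline sketch (not the official implementation).
# The model names are illustrative; any question-generation and extractive-QA
# checkpoints with similar interfaces can be substituted.
from transformers import pipeline

# Question generation from the generated summary (T5-style QG model assumed;
# this checkpoint expects input of the form "answer: ... context: ...").
qg = pipeline("text2text-generation",
              model="mrm8488/t5-base-finetuned-question-generation-ap")
# Extractive QA used to answer those questions from the source document.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def feqa_style_score(summary: str, source: str) -> float:
    """Fraction of summary-derived questions whose answer found in the source
    matches the answer found in the summary (a rough faithfulness proxy)."""
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    matches, total = 0, 0
    for sent in sentences:
        # Crudely condition question generation on each summary sentence.
        prompt = f"answer: {sent}  context: {summary}"
        question = qg(prompt, max_new_tokens=32)[0]["generated_text"]
        ans_summary = qa(question=question, context=summary)["answer"].lower()
        ans_source = qa(question=question, context=source)["answer"].lower()
        total += 1
        if ans_summary in ans_source or ans_source in ans_summary:
            matches += 1
    return matches / total if total else 0.0
```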

2. QAGS (Question Answering and Generation Score)

  • Description: QAGS measures factual consistency by generating questions from the model’s output, answering them against both the output and the source document, and comparing the two sets of answers; low answer agreement signals hallucinated content.
  • Usage: Useful for summarization tasks and for assessing factual consistency in generated content (the answer-comparison step is sketched below).
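QAGS shares the question-generation and question-answering stages sketched under FEQA; its distinctive final step is comparing the two answer sets. A minimal, self-contained version of that comparison, with illustrative function names, might look like this:

```python
# Sketch of the answer-comparison step used in QAGS-style scoring: answers
# obtained from the model output and from the source document are compared
# with SQuAD-style token F1 and averaged over all questions.
from collections import Counter

def token_f1(pred: str, ref: str) -> float:
    """Token-level F1 between two answer strings."""
    pred_toks, ref_toks = pred.lower().split(), ref.lower().split()
    common = Counter(pred_toks) & Counter(ref_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(ref_toks)
    return 2 * precision * recall / (precision + recall)

def qags_style_score(answers_from_output, answers_from_source) -> float:
    """Average answer agreement; lower values suggest hallucinated content."""
    pairs = list(zip(answers_from_output, answers_from_source))
    return sum(token_f1(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
```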

3. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

  • Description: ROUGE measures the overlap of n-grams, word sequences, and word pairs between the generated text and reference summaries. While primarily used for summarization, it can be adapted for hallucination detection.
  • Usage: Provides a coarse lexical-overlap signal of how well generated content is grounded in a reference (see the example below); it does not directly verify facts.
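A quick way to compute ROUGE in Python is the rouge-score package; the snippet below is a minimal example with made-up sentences.

```python
# Computing ROUGE between generated text and a reference with the
# rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The drug was approved by the FDA in 2021."
generated = "The FDA approved the drug in 2021."
scores = scorer.score(reference, generated)  # argument order: (target, prediction)
for name, s in scores.items():
    print(name, round(s.fmeasure, 3))
```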

4. BLEU (Bilingual Evaluation Understudy)

  • Description: BLEU scores measure the precision of n-grams in the generated text compared to reference texts. Although designed for translation, it can help identify inaccuracies by comparing generated text with ground truth data.
  • Usage: Assesses n-gram overlap between generated content and reference texts (see the example below); like ROUGE, it gives only an indirect signal of hallucination.
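A minimal sacreBLEU example (the sentences are illustrative):

```python
# Corpus-level BLEU with sacreBLEU (pip install sacrebleu). A low score against
# trusted references can flag content that drifted from the ground truth,
# though it cannot localize individual hallucinated facts.
import sacrebleu

hypotheses = ["The FDA approved the drug in 2021."]
references = [["The drug was approved by the FDA in 2021."]]  # one list per reference set
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 2))
```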

5. FactCC (Fact-Consistent Conditional Generation)

  • Description: FactCC is a neural model that verifies the factual consistency of a generated summary with the original document. It focuses on detecting factual discrepancies in text generation.
  • Usage: Applied in text summarization and other tasks requiring factual integrity (a usage sketch with a consistency classifier follows below).
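The sketch below shows how such a consistency classifier is typically invoked through Hugging Face transformers; the checkpoint name is a placeholder, not a specific published model id, so substitute a released FactCC (or comparable NLI/consistency) checkpoint you have access to.

```python
# Sketch of running a FactCC-style consistency classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "path/or/hub-id-of-factcc-checkpoint"  # placeholder, not a real hub id
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def consistency_probs(document: str, summary_sentence: str):
    """Return class probabilities (e.g. CONSISTENT vs INCONSISTENT) for a
    (source document, summary sentence) pair."""
    inputs = tokenizer(document, summary_sentence,
                       truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).squeeze().tolist()
```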

6. BERTScore

  • Description: BERTScore uses pre-trained BERT embeddings to measure the similarity between generated text and reference text at a semantic level. It can help in identifying hallucinatory content by checking semantic consistency.
  • Usage: Measures the semantic alignment and relevance of generated content with reference texts (see the example below).
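The bert-score package exposes a simple scoring function; a minimal example with made-up sentences:

```python
# Semantic similarity with the bert-score package (pip install bert-score).
# A low F1 against the reference suggests the generation drifted semantically,
# which can accompany hallucinated content.
from bert_score import score

cands = ["The FDA approved the drug in 2021."]
refs = ["The drug was approved by the FDA in 2021."]
P, R, F1 = score(cands, refs, lang="en", verbose=False)
print(round(F1.mean().item(), 3))
```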

7. SummaQA

  • Description: SummaQA evaluates summaries by generating cloze-style questions from the source document (masking named entities) and answering them from the summary, checking how many source facts the summary can recover.
  • Usage: Commonly used for evaluating summarization models (a rough sketch follows below).
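A rough approximation of the SummaQA idea, assuming spaCy for entity masking and a generic SQuAD-style QA model as a stand-in for the cloze-trained QA model used in the original work:

```python
# SummaQA-style sketch: mask named entities in the source to form cloze
# questions, then try to answer them from the summary. Requires spaCy and the
# en_core_web_sm model (python -m spacy download en_core_web_sm); the QA
# checkpoint is only a rough stand-in for the original cloze-trained model.
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def summaqa_style(source: str, summary: str) -> float:
    """Average QA confidence when answering source-derived cloze questions
    from the summary; low values suggest missing or inconsistent facts."""
    scores = []
    for sent in nlp(source).sents:
        for ent in sent.ents:
            cloze = sent.text.replace(ent.text, "[MASK]")
            scores.append(qa(question=cloze, context=summary)["score"])
    return sum(scores) / len(scores) if scores else 0.0
```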

8. MIMICS (Metrics for Informative Model-Generated Content Summarization)

  • Description: MIMICS uses a suite of metrics designed to evaluate informativeness, coherence, and factual accuracy of summaries generated by AI models.
  • Usage: Specifically tailored for summarization tasks.

9. Trained Classifiers for Hallucination Detection

  • Description: Custom classifiers can be trained to detect hallucination by learning from annotated datasets where instances of hallucination are labeled.
  • Usage: Versatile across different text generation tasks (a toy training sketch follows below).
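As a toy illustration, the sketch below trains a TF-IDF plus logistic-regression classifier on labeled (source, generation) pairs. The data and feature choices are placeholders; in practice one would fine-tune a transformer encoder on a real annotated dataset.

```python
# Minimal sketch of training a hallucination classifier from labeled examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data: 1 = hallucinated, 0 = faithful (placeholder examples).
pairs = [
    ("The drug was approved in 2021.", "The drug was approved in 2021."),
    ("The drug was approved in 2021.", "The drug was approved in 1999 by NASA."),
]
labels = [0, 1]

# Concatenate source and generation so the classifier sees both sides.
texts = [f"{src} [SEP] {gen}" for src, gen in pairs]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict([f"{pairs[1][0]} [SEP] {pairs[1][1]}"]))  # expect [1]
```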

10. Human Evaluation

  • Description: Although not a tool, human evaluation remains one of the most reliable methods for detecting hallucinations. Subject matter experts review and assess the factual accuracy of the model’s output.
  • Usage: Critical for high-stakes applications where factual accuracy is paramount.

11. GPT-3/GPT-4 Based Fact-Checking

  • Description: Using a strong LLM as a judge to compare generated content against the source text or retrieved evidence and flag unsupported claims or discrepancies.
  • Usage: Useful for automated, near-real-time verification of generated content (see the sketch below).
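A minimal sketch using the OpenAI Python client; the model name and prompt wording are illustrative choices, not a fixed recipe.

```python
# LLM-based fact-checking sketch with the OpenAI Python client
# (pip install openai; requires OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()

def check_claims(source: str, generated: str) -> str:
    """Ask the judge model to list claims in `generated` unsupported by `source`."""
    prompt = (
        "You are a fact-checking assistant. Compare the GENERATED text to the "
        "SOURCE. List any claims in GENERATED that are not supported by SOURCE, "
        "or reply 'No unsupported claims.'\n\n"
        f"SOURCE:\n{source}\n\nGENERATED:\n{generated}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model can be substituted
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```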

 

