VERITAS-NLI: Validation and Extraction of Reliable Information Through Automated Scraping and Natural Language Inference
Arjun Shah, Hetansh Shah, Vedica Bafna, Charmi Khandor
https://github.com/Arjun-254/VERITAS-NLI
https://arxiv.org/abs/2410.09455v1
- Abstract
- I Introduction
- II Literature Survey
- III Methodology
- IV Proposed Solution
- V Results
- V-A How effective is our Proposed Solution for the given task?
- V-B Which NLI-based Model performs best for Headline Inconsistency Detection?
- V-C Which Small Language Models (SLMs) perform best, and how effective are the generated questions?
- V-D What are the performance characteristics and relationships between the Top-performing Classical models, BERT-based models and our Proposed Solution?
- V-E How efficient are our Proposed Pipelines?
- VI Conclusion
Fake News Detection
- Challenges: Rapid spread of information through online platforms, reliance on training data, inability to generalize on unseen headlines
- Proposed solution: Leveraging web-scraping techniques and Natural Language Inference (NLI) models for verification
Web-Scraping:
- Retrieves external knowledge necessary for verifying the accuracy of a headline
Natural Language Inference (NLI) Models:
- Used to find support for a claimed headline in the corresponding externally retrieved knowledge
Evaluation:
- Diverse self-curated dataset spanning multiple news channels and broad domains
- Best performing pipeline achieves an accuracy of 84.3%
- Surpasses best classical Machine Learning model by 33.3% and Bidirectional Encoder Representations from Transformers (BERT) by 31.0%
Conclusion:
- Combining dynamic web-scraping with Natural Language Inference is effective for fake news detection
Impact of Fake News
- Exacerbated by the rise of digital platforms as primary sources of information
- Undermines trust in reliable sources
- Potential to skew public opinion [1]
Spread of Misinformation and Erosion of Trust
- False narratives spread easily across social media and online platforms
- Urgent need to address the issue [2]
- Proliferation of fake news in the 2016 US elections [3]
- Individuals overwhelmed, struggling to judge the authenticity of information
Click-Bait Articles and Misinformation
- Google AdSense revenue as a primary metric for news-site success [4]
- Erosion of democracy, justice, and public trust [5]
- Intentionally created and disseminated as misinformation campaigns [1]
- Decline in user interactions with misinformation on Facebook, rise on Twitter [6]
VERITAS-NLI Solution
- Leverages state-of-the-art Natural Language Inference (NLI) models for verifying claims
- Compares claims against externally retrieved information from reputable sources via web scraping techniques [1] [2]
- Employs small language models to generate questions based on headlines, enhancing verification process through a question-answering approach
- Consistency between scraped article and input headlines assessed using NLI models of varying granularity
- Evaluation dataset constructed by manually curating real headlines from reputable sources and synthetically generating fake headlines to simulate misinformation scenarios
- Enables model to adapt to dynamic content, as it does not rely on static, outdated data
- Continuously validates claims against real-time information to ensure model remains relevant and effective in rapidly changing news environments
Identifying Indicators of Fake News
- Analyzes language, sources, factual claims, pronoun-swapping, sentence-negation, named-entity preservation, and overall credibility of information presented
- Cross-references claims made in news content with latest reliable and authoritative sources.
Fact-Checking False Information
Initial Approaches:
- Manual fact-checking by humans, e.g., the site factchecker.in [9]
- Resource-intensive and limited scalability
- Potential bias and subjectivity
Machine Learning Approaches:
Classical Machine Learning Algorithms:
- Support Vector Machines (SVM) [10, 11]
- Higher performance in detecting fake news
- Naïve Bayes [12, 13]
- Effective for detecting fake news
- Random Forest Classifier [14]
- Achieved an accuracy of 68% on the PolitiFact and GossipCop datasets
- K-Nearest Neighbor (KNN) [15]
- Obtained an accuracy of 79% on the BuzzFeed News dataset
Deep Learning Approaches:
- Deep learning network [20]
- Detects patterns in language of fake news
- Trained on data up to 2016, suggested for use with automated fact-checkers
- Graph-based models [21]
- Self-supervised contrastive learning and Bayesian graph Local extrema Convolution (BLC)
- Aggregates node features while accounting for uncertainties in social media interactions
- High accuracy on Twitter datasets but struggles with feature learning challenges in power-law distribution of node degrees
- Neural network utilizing graph attention networks, KGAT [22]
- Performs fine-grained fact verification by analyzing relationships between claim and evidence presented as a graph
Transformer-based models:
- BERT [23, 24]
- Outperforms other models using contextualized embeddings
- Achieves high accuracy across multiple datasets
- DistilBERT paired with SHAP (SHapley Additive exPlanations) for explainable predictions [26]
- FakeBERT [28]
- Employs bidirectional learning to enhance contextual understanding of articles
Hybrid Models:
- Combination of Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks [30]
- Achieved an accuracy of 96% on a dataset sourced from Kaggle
- Combination of supervised and unsupervised learning paths with CNNs [31]
- Performs well even with limited labeled data
- Self-serving mechanism to improve detection over traditional methods [32]
Natural Language Processing (NLP) Techniques:
- Difficulties in using current NLP techniques [33]
- Large language models like GPT-3 [34]
- Used for generating questions, responses and verifying facts.
Limitation of Early Fake News Detection Methods
- Depends on the initial set of questions generated by Large Language Models (LLMs)
- Operates on small-scale datasets, limiting scalability and generalizability
- Focuses mainly on specific claim styles or periods (e.g., coronavirus)
Advancements in Fake News Detection Research
- Utilizes Small Language Models (SLMs) for comparable results
- Emphasizes importance of social context for detection
- Incorporates user engagement metrics and hybrid approaches
Datasets and Evaluation Metrics for Fake News Detection
- FEVER dataset: fact extraction and verification benchmark [Thorne et al.]
- Data collection site with ground-truth labels [Hassan et al.]
- ClaimBuster model and Bart-large filtering system [Reddy et al.]
Summary of Fake News Detection Research Progress
- Significant progress in developing methods and models for Fake News Detection (FND)
- Several research gaps: lack of real-world implementation, scalability, robustness, ethical implications
Future Directions
- Address practical issues related to scalability, robustness, and ethical implications
- Explore novel Machine Learning algorithms specifically for FND that can keep up with current news in a dynamic environment.
LIAR Dataset for Fake News Detection
Source: PolitiFact.com (POLITIFACT), publicly available resource
- Compiled from PolitiFact, a Pulitzer Prize-winning website
- Multiple human-evaluated statements
- Statements authenticated by a POLITIFACT editor
Data:
- News claims from various multimedia sources: news releases, TV/radio interviews, campaign speeches, advertisements, social media platforms (Twitter, Facebook)
- Societal and government related news topics: economy, taxes, campaign-biography, elections, education, jobs
Characteristics:
- Statements collected over a ten-year period
- Each statement accompanied by metadata: speaker, context, detailed judgment report
Fields in the Dataset:
- Label Description: Truthfulness rating assigned to each statement
- Multilabel system: "True", "Mostly True", "Half True", "Barely True", "False", "Pants on Fire"
- Label Assignment: Rationale or explanation for the label assignment by fact-checkers
- Evidence and reasoning based on research, reports, data analysis
- Label Distribution: Figure 1 illustrates the label distribution in the LIAR dataset
Preprocessing:
- Converted 6 classes into two distinct categories (True/False)
- Mitigates subjective disagreements by reducing ambiguity
- Consistent labeling standards: Mapping = { "true": True, "mostly-true": True, "half-true": True, "barely-true": False, "false": False, "pants-fire": False }
Usage:
- Training of classical machine learning models and transformer-based models for binary classification using the entire transformed LIAR dataset.
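A minimal sketch of this binarization in Python with pandas (the LIAR TSV layout, with the label in column 1 and the statement in column 2, is assumed from the public release):

```python
import pandas as pd

# Six-way LIAR labels collapsed to binary, per the mapping above.
LABEL_MAP = {
    "true": True, "mostly-true": True, "half-true": True,
    "barely-true": False, "false": False, "pants-fire": False,
}

def load_liar_binary(path: str) -> pd.DataFrame:
    """Load one LIAR split and collapse its labels to True/False."""
    df = pd.read_csv(path, sep="\t", header=None)
    df = df[[1, 2]].rename(columns={1: "label", 2: "statement"})
    df["label"] = df["label"].map(LABEL_MAP)
    return df.dropna(subset=["label"])

train_df = load_liar_binary("train.tsv")  # likewise for valid/test splits
```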
Evaluating Model Performance
New Dataset for Headline Analysis:
- Consists of latest headlines from various domains (science, sports, pop culture, politics, business)
- Collected from trusted sources (CNN, USNews, The Guardian, BBC, NYT, Reuters, Times of India, CNBC)
Dataset Specifications:
- Contains 346 credible headlines
- Generates corresponding unreliable headlines using Small Language Model (SLM), microsoft/Phi-3-mini-4k-instruct
- Total of 692 headlines in evaluation dataset
Unreliable Headline Generation:
- Prompted to modify reliable headlines through: sentence negation, number swaps, replacing named entities, noise injection
- Resulting in convincing fake news headlines for model evaluation
Zero-shot Prompt Approach:
- "Given the true headline, you have to make changes to it using a combination of Sentence negation, Number swaps, Replacing Named Entities and Noise Injection to generate a fake news headline. The output should be only the generated fake news headline."
Example:
- Figure 2 shows an example of an unreliable headline generated using Microsoft’s Phi-3 with our zero-shot prompting approach.
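A hedged sketch of this generation step, pairing the stated zero-shot prompt with microsoft/Phi-3-mini-4k-instruct through the Hugging Face transformers pipeline (generation settings here are illustrative, not the authors' exact configuration):

```python
from transformers import pipeline

generator = pipeline(
    "text-generation", model="microsoft/Phi-3-mini-4k-instruct",
    device_map="auto", trust_remote_code=True)

PROMPT = ("Given the true headline, you have to make changes to it using a "
          "combination of Sentence negation, Number swaps, Replacing Named "
          "Entities and Noise Injection to generate a fake news headline. "
          "The output should be only the generated fake news headline.")

def make_fake_headline(true_headline: str) -> str:
    # Phi-3-mini is instruction-tuned, so the prompt goes in as a chat turn;
    # the pipeline returns the conversation, the last message being the reply.
    messages = [{"role": "user",
                 "content": f"{PROMPT}\nHeadline: {true_headline}"}]
    out = generator(messages, max_new_tokens=40, do_sample=False)
    return out[0]["generated_text"][-1]["content"].strip()
```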
Classical Machine Learning Models for Fact-Checking News Articles
Preprocessing Steps:
- Stopword Removal: Removing common words with little meaning
- Token Filtering: Excluding non-alphabetic tokens
- Lemmatization: Reducing words to root form using NLTK's WordNetLemmatizer
- TF-IDF Vectorizer: Representing word importance based on frequency in document and corpus
Classical Machine Learning Models:
- Linear SVC: Support Vector Classifier that finds a hyperplane maximizing the margin between classes
- Multinomial Naive-Bayes: Probabilistic classifier based on Bayes' Theorem, suitable for document classification
- Random Forest Classifier: Ensemble model creating decision trees to improve predictive accuracy
- Logistic Regression: Models the relationship between dependent and independent variables to predict the likelihood of class membership
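The preprocessing steps and one of the classical models can be combined as in this scikit-learn sketch (a minimal illustration under assumed defaults, not the authors' exact configuration):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

nltk.download("stopwords"); nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    # Keep alphabetic tokens, drop stopwords, reduce words to their root form.
    tokens = [lemmatizer.lemmatize(t) for t in text.lower().split()
              if t.isalpha() and t not in stop_words]
    return " ".join(tokens)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=preprocess)),
    ("model", LinearSVC()),  # swap in MultinomialNB, RandomForest, etc.
])
# clf.fit(train_df["statement"], train_df["label"])
```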
Transformer Models:
- BERT: Pre-trained model based on Transformer architecture, capturing context of words through left and right contexts
- Fine-tuning BERT on LIAR dataset provides an advantage due to its deep understanding of language patterns and context
Model Performance Metrics:
- Accuracy, Precision, Recall, F1-Score for models trained and tested on LIAR and Evaluation datasets
Importance of Transformers:
- Effectively handle polysemous words with bidirectional context understanding
- Can be fine-tuned with a small number of task-specific examples to achieve high performance
Text Normalization:
- Convert all text to lowercase (for BERT-base-uncased)
- Perform unicode normalization
Punctuation Splitting:
- Split the text at punctuation, adding spaces around each punctuation mark
WordPiece Tokenization:
- Break words into subword units based on a trained vocabulary
- Unknown words are split into smaller subwords that exist in the vocabulary
Adding Special Tokens:
- Add special tokens such as [CLS], [SEP] for specific tasks (e.g., classification, question answering)
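These tokenization steps can be observed directly with the bert-base-uncased tokenizer (the subword splits shown in the comments are illustrative):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "Max Verstappen wins 2023 F1 world title"

# WordPiece splits out-of-vocabulary words into subwords marked with '##'.
print(tokenizer.tokenize(text))
# e.g. ['max', 'vers', '##tap', '##pen', 'wins', '2023', 'f1', 'world', 'title']

# encode() wraps the sequence with the [CLS] and [SEP] special tokens.
ids = tokenizer.encode(text)
print(tokenizer.convert_ids_to_tokens(ids))
# e.g. ['[CLS]', 'max', 'vers', '##tap', '##pen', ..., '[SEP]']
```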
Fine-Tuning:
- Initialization: Fine-tuning begins with pre-trained models initially trained on a large corpus
- Task-Specific Adjustments: Tweak the last few layers of the model to align outputs with task requirements
- Hyperparameter Optimization:
- Number of epochs: A single epoch is often sufficient; training for more epochs can lead to overfitting
- Batch sizes: Smaller batch sizes help maintain a balance between memory usage and effective learning
- Warmup steps: Implement a warmup phase at the beginning of training where learning rates are gradually increased
- Learning rate: A value of 5e-3 (learning_rate) with 0.001 (weight_decay) suits many models
- FP16: Enables faster training and reduced memory usage while maintaining precision
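A hedged sketch of these settings with the Hugging Face Trainer; the values mirror the list above and may differ from the authors' final configuration:

```python
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # binary True/False classification

args = TrainingArguments(
    output_dir="bert-liar",
    num_train_epochs=1,              # a single epoch is often sufficient
    per_device_train_batch_size=16,  # smaller batches balance memory and learning
    warmup_steps=100,                # gradually ramp the learning rate up
    learning_rate=5e-3,
    weight_decay=0.001,
    fp16=True,                       # mixed precision: faster, less memory
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```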
Proposed Solution: VERITAS-NLI
- VERITAS-NLI utilizes web-scraping, Small Language Models (SLM), and Natural Language Inference (NLI) models to mimic human cognitive processes in evaluating claim reliability.
- Three pipelines - Question-Answer Pipeline, Article Pipeline, and Small Language Model Pipeline - showcase this intuition and demonstrate its effectiveness.
Web Scraping Techniques for Fact-Checking News Headlines:
- Retrieve external knowledge using: web scraping, chromedriver, XML Tree traversal
- Approaches to scrape relevant information (see the sketch after this list):
- Fact-Based Headline Verification: check the Google Search 'Quick Answer' box
- Top-K Articles Retrieval: get titles and contents from the top search results
- 'People Also Asked' and SLM-Generated Questions: generate targeted questions using Small Language Models
- Sources used: trusted news outlets, reputable sources
- Scraping rules: follow 'robots.txt' directives to access content responsibly.
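A simplified sketch of the Top-K link retrieval step with Selenium and chromedriver, as referenced above (the CSS selectors are assumptions; live search-page markup changes often, and 'robots.txt' should be honored):

```python
from urllib.parse import quote_plus

from selenium import webdriver
from selenium.webdriver.common.by import By

def top_k_article_links(headline: str, k: int = 3) -> list[str]:
    """Return up to k result links for a headline query."""
    driver = webdriver.Chrome()  # requires chromedriver on PATH
    try:
        driver.get("https://www.google.com/search?q=" + quote_plus(headline))
        # Result titles commonly render as <h3> inside <a>; this selector is
        # brittle and illustrative only.
        titles = driver.find_elements(By.CSS_SELECTOR, "a h3")
        return [t.find_element(By.XPATH, "..").get_attribute("href")
                for t in titles[:k]]
    finally:
        driver.quit()
```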
Google Search Methods: Amalgamation of Techniques
- Quick Answer: Dependent on whether input headline is a hard fact or not
- "People Also Asked" (PAA) incorporated, but heavily dependent on user interaction; uncommon occurrences may not be represented in the PAA section
- Terminating fallback: naive scraping of top-k articles if the previous methods fail (see the fallback sketch below)
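This amalgamation amounts to a simple fallback chain. A sketch with hypothetical helpers (scrape_quick_answer, scrape_people_also_asked, scrape_top_k_articles are illustrative names, not the authors' API):

```python
def scrape_quick_answer(headline: str) -> str | None: ...       # hypothetical
def scrape_people_also_asked(headline: str) -> str | None: ...  # hypothetical
def scrape_top_k_articles(headline: str) -> str: ...            # hypothetical

def retrieve_evidence(headline: str) -> str:
    """Try progressively broader scraping strategies until one succeeds."""
    # 1. 'Quick Answer' box: works when the headline states a hard fact.
    if (evidence := scrape_quick_answer(headline)):
        return evidence
    # 2. 'People Also Asked': user-interaction dependent, so uncommon
    #    occurrences may be missing from the PAA section.
    if (evidence := scrape_people_also_asked(headline)):
        return evidence
    # 3. Terminating fallback: naive scraping of the top-k articles.
    return scrape_top_k_articles(headline)
```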
Scraping and Claim Verification Process (Pipeline #3)
- Web scraping retrieves external evidence for the headline
- Scraped content passed to a headline-verifying Natural Language Inference (NLI) model
- Input headline is claim (hypothesis), scraped content is premise
- Model generates prediction and confidence score
- Represents how well claims in headline are supported by external information.
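A minimal sketch of this verification step, with roberta-large-mnli standing in for the paper's NLI models (the actual pipelines use FactCC and SummaC, covered below):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def verify(headline: str, scraped_content: str) -> tuple[str, float]:
    """Score how well the claim (hypothesis) is supported by the
    scraped content (premise)."""
    inputs = tok(scraped_content, headline, return_tensors="pt",
                 truncation=True)
    with torch.no_grad():
        probs = torch.softmax(nli(**inputs).logits, dim=-1)[0]
    labels = ["contradiction", "neutral", "entailment"]  # this model's order
    idx = int(probs.argmax())
    return labels[idx], float(probs[idx])  # prediction + confidence score
```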
Small Language Models for Question Generation
- Pipeline: Small Language Models (SLM) used to generate questions for more relevant content scraping
- Approach: Identical to Question-Answer Pipeline with an added SLM question generation step
- Transform: User-entered input headline converted into a context-aware question using open-source SLMs: Mistral-7B-Instruct-v0.3 and Microsoft’s Phi-3-mini-4k-instruct
- Benefits: Retrieve more direct answers from the web for fact-checking claims against external sources
- Question Generation: Transforms input headline into a question using SLMs to increase probability of obtaining a direct response on the web
Example Question Generation Prompt:
- You’re tasked with fact-checking news headlines for authenticity. Given a headline, generate 1 question that needs to be true to verify its accuracy using a Google search and scrape the answer from the quick search box. Ask the crucial questions first.
Pipeline 2: Question Generation Process
- Input headline: "India wins 2023 ICC Men’s Cricket World Cup"
- Prompt: Generate 1 question that needs to be true to verify the authenticity of this headline using a Google search and scrape the answer from the quick search box.
- Question generated by Mistral-7B-Instruct-v0.3: "Who won the 2023 ICC Men’s Cricket World Cup?"
- This question simplifies the scraping workflow and helps retrieve a direct answer to verify the veracity of the claim.
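A sketch of this question-generation step with Mistral-7B-Instruct-v0.3 via its chat template (the prompt text follows the paper; generation settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tok = AutoTokenizer.from_pretrained(model_id)
slm = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

PROMPT = ("You're tasked with fact-checking news headlines for authenticity. "
          "Given a headline, generate 1 question that needs to be true to "
          "verify its accuracy using a Google search and scrape the answer "
          "from the quick search box. Ask the crucial questions first.")

def generate_question(headline: str) -> str:
    messages = [{"role": "user", "content": f"{PROMPT}\nHeadline: {headline}"}]
    ids = tok.apply_chat_template(messages, add_generation_prompt=True,
                                  return_tensors="pt").to(slm.device)
    out = slm.generate(ids, max_new_tokens=40, do_sample=False)
    return tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True).strip()

# e.g. "India wins 2023 ICC Men's Cricket World Cup"
#   -> "Who won the 2023 ICC Men's Cricket World Cup?"
```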
Article Pipeline Methodology:
- Gathers data from online sources systematically using web scraping [43]
- Extracts relevant content based on Top K links to articles
- Acquires headings and content from these links for verification process
- Structures pipeline efficiently and reliably
- This naive top-K approach leads to more accurate outcomes than targeted snippet scraping
Pipeline Components:
- Web scraper: acquires headings and content of articles related to a specific input headline
- Model: verifies the claim based on retrieved information
- Natural Language Inference (NLI) models: FactCC [7] and SummaC [8] for analyzing potential inconsistencies within claimed headlines
Benefits:
- Efficient and reliable process
- Rich contextual understanding of topic at hand from article content
- Improves model performance compared to incomplete or misleading summaries like 'Quick Answer' and 'People Also Asked'
Improved Results:
- Scraping the right amount of information (balancing quantity and quality) yields the best results.
FactCC Approach for Ensuring Factually Consistent Summaries:
- VERITAS-NLI: leverages FactCC [7], a model ensuring factual consistency in text summarization
- Architecture of FactCC: built upon BERT, operates under weakly-supervised learning framework, trained on synthetic data, and uses multi-task learning for Consistency Identification, Support Extraction, and Error Localization
- Impact on Text Summarization: ensures that generated summaries remain factually consistent with their source documents
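A hedged sketch of scoring a claim with a FactCC-style BERT classifier; the checkpoint path is a placeholder for the weights released with FactCC, and the label ordering is an assumption:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

CHECKPOINT = "path/to/factcc-checkpoint"  # placeholder for released weights
tok = BertTokenizer.from_pretrained(CHECKPOINT)
factcc = BertForSequenceClassification.from_pretrained(CHECKPOINT)

def consistency_score(source_text: str, claim: str) -> float:
    """Probability that the claim is consistent with the source document."""
    inputs = tok(source_text, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(factcc(**inputs).logits, dim=-1)[0]
    return float(probs[0])  # assumes index 0 = CONSISTENT, 1 = INCONSISTENT
```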
Comparison of Headline and Source Text:
- Inconsistencies detected using Natural Language Inference (NLI) models: Entailment, Contradiction, or Neutral
- NLI models perform well thanks to large training datasets like MNLI [52], SNLI [53], and VitaminC [54]
- Modern attention-based architectures exhibit human performance for this task [8]
FactCC Output:
- Input Headline, generated question(s), retrieved article or direct answer, and NLI-Model Output Score are presented for comparison
- Example of a correct output: 'Max Verstappen wins 2023 F1 world title' -> Entailment (high NLI score)
SummaC Approach:
- Obtains state-of-the-art results for inconsistency detection in summarization of documents
- Segments longer documents into sentence units, changing the granularity at which NLI models are applied
- Calculates likelihood of relationships between each summary and document-sentence pair using an NLI model
- Aggregates these scores to produce final output: SummaC-ZS (maximum value per column) or SummaC-Conv (1D-convolution on histograms)
- Determines optimal decision threshold for binary predictions through calibration process.
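The SummaC-ZS aggregation can be made explicit with a short sketch, again using a generic MNLI model for the sentence-pair matrix (the released summac library provides this directly; entailment-minus-contradiction is one of the scoring choices it supports):

```python
import nltk
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nltk.download("punkt")
tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def summac_zs(document: str, headline: str) -> float:
    """Rows = document sentences, columns = headline sentences; take the max
    per column, then average, as in the zero-shot SummaC variant."""
    doc_sents = nltk.sent_tokenize(document)
    sum_sents = nltk.sent_tokenize(headline)
    matrix = np.zeros((len(doc_sents), len(sum_sents)))
    for i, d in enumerate(doc_sents):
        for j, s in enumerate(sum_sents):
            inputs = tok(d, s, return_tensors="pt", truncation=True)
            with torch.no_grad():
                probs = torch.softmax(nli(**inputs).logits, dim=-1)[0]
            matrix[i, j] = float(probs[2] - probs[0])  # entail - contradict
    return float(matrix.max(axis=0).mean())
```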
Study on Fake News Detection:
- Finds significant performance differences between Machine Learning (ML) models, BERT models, and proposed Transformer-based pipelines for classifying news headlines
- Classical ML models: rely heavily on static training data, severely limiting their ability to adapt in dynamic environments
- Average F1-score: 0.675, accuracy: 60.04% (LIAR dataset)
- Decline in performance when tested on larger corpus of news headlines: average F1-score: 0.537, accuracy: 49.86%
- BERT models: initially outperform classical ML models by ~3.5 percentage points but also suffer from reliance on static training data
Performance Metrics:
- Table III shows results for each pipeline (Question-Answer, SLM Mistral-7B, SLM Phi-3, Article) paired with the NLI models FactCC, SummaC-ZS, and SummaC-Conv
- Each pipeline is evaluated on precision, recall, F1-score, and accuracy
- The baselines' sole reliance on static training data limits their adaptability in dynamic environments, a limitation the scraping-based pipelines avoid
Performance of Proposed Solutions (VERITAS-NLI)
- Significantly outperforms prior approaches: highest accuracy = 84.3% (Article-Pipeline with SummaC-ZS)
- Substantial gains over best classical machine learning model (MultinomialNB): 33.3%
- Transformer-based BERT model: 31.0%
- Ten out of twelve pipeline configurations demonstrated superior accuracy compared to all baseline models
- Novel web-scraping techniques for dynamic information retrieval
- Robust solution for verification of claims without having to retrain the model on newer data
- Explainability by providing externally retrieved knowledge and Top-K articles (Figure 5)
- Detailed performance view in Table II, Figure 6: Common and Unique Correct-Incorrect Decisions Between Pipelines.
SummaC vs FactCC for Headline Inconsistency Detection
Performance:
- SummaC consistently outperforms FactCC due to:
- Application of NLI models on sentence-level pairs
- Enhanced granularity for detecting subtle inconsistencies
Approaches:
- SummaC designed for fine-grained inconsistency analysis (sentence level)
- FactCC focuses on document-level fact checking
Variants and Performance:
- SummaC has two variants: SummaC-ZS (zero-shot), SummaC-Conv (Convolution)
- Both outperform FactCC across all evaluation metrics
- Superior performance due to unique approach of leveraging NLI pair-matrix
Results:
- Average accuracy for pipelines: FactCC = 59.83%, SummaC-ZS = 77.15%, SummaC-Conv = 69.33%
- SummaC-ZS and SummaC-Conv pipelines outperform FactCC by 17.32 and 9.50 percentage points, respectively
Drawbacks:
- SummaC-ZS: sensitive to extreme values because of its max-then-mean aggregation
- Less of a hindrance for news headlines, which are predominantly single sentences
Study Overview:
- Two small language models used for question generation to scrape specific information and verify claims
- Mistral-7B pipelines have an average accuracy of 64.93%, Phi-3 pipelines 63.37%
- Both models exhibit similar proficiency, despite Phi-3 having fewer parameters
- Small language model pipelines reach a maximum accuracy of 75.80%, while Article pipeline achieves 84.3%
- Quality of evidence retrieval is dependent on the initial questions generated by the language models (FLEEK [34])
- This could be the reason for lower performance compared to the naive article approach
V-D What are the performance characteristics and relationships between the Top-performing Classical models, BERT-based models and our Proposed Solution?
Research Question Insights:
- Assessing performance characteristics of legacy models and proposed solution through Venn diagrams
- Comparison of distinct correct and incorrect decisions:
- Legacy Models vs. Best-Performing Pipeline (Multinomial Naive Bayes, Logistic Regression, BERT-base-uncased, Article Pipeline with SummaC Zero-Shot)
- Figure 6A and 6B show that our solution outperforms the legacy models, achieving both the most correct decisions and the fewest incorrect decisions
- Significant overlap in incorrect decisions among legacy models indicates lack of generalization for latest news headlines
- Comparison between Best Performers (Article Pipeline with SummaC Zero-Shot, QNA_SummaC_ZS, Mistral_SummaC_ZS, Article_SummaC_ZS)
- Figure 6C and 6D: Question-based pipelines uniquely classify more distinct correct decisions than article-based pipelines
- Both types of pipelines have some unique incorrect decisions, with question-based pipelines exhibiting a higher number due to their focus on nuanced questions
- Article_SummaC_ZS has significantly fewer distinct incorrect decisions compared to its counterpart, Article_FactCC.
Additional Insights:
- Execution times for the various steps of the proposed pipelines are illustrated in Figure 7 using boxplots.
Performance Assessment
- Calculated average processing time through stratified sampling to ensure balanced subset of true and false headlines
- All experiments conducted on Google Colab:
- NVIDIA Tesla T4 GPU (15.4GB GDDR6 memory)
- Intel(R) Xeon(R) CPU @ 2.00GHz with 12.7GB RAM
- Divided pipeline into two stages: web-scraping stage and NLI for inconsistency detection stage
- Time taken for each sub-task visualized as box-plot in Figure 7
- Detailed information on time metrics for each pipeline in Table IV
Web Scraping Stage
- Most time-consuming stage; average scraping time per pipeline:
- SLM-Phi3: 2.9 seconds
- SLM-Mistral: 4.1 seconds
- Question-Answer: 5.7 seconds
- Article: 6.7 seconds
NLI Stage (Inconsistency Detection)
- Average inference time across pipelines: 0.047s (FactCC), 0.188s (SummaC-ZS), 0.101s (SummaC-Conv)
- Increased time due to increased granularity and accuracy improvement
- Pipeline-specific average inference times:
- Article: 0.192s
- SLM Mistral-7B: 0.0486s
- SLM Phi-3: 0.0502s
- Question-Answer: 0.040s
- Highest average times occur for pipelines that process larger volumes of scraped text
Study on Fake News Detection
- Investigates Machine Learning and Transformer Model efficacy for fake news detection
- Classical models' inadequate performance due to reliance on static training data
- Proposes VERITAS-NLI, a solution that combines advanced web-scraping techniques and Natural Language Inference Models
- Finds significant improvement in accuracy (over 30%) compared to benchmark models
- Enhancements essential for advancing the field and keeping automated systems apace with evolving fake-news strategies
- Aims to safeguard public discourse integrity.