
Implement bias detection system with detailed analysis and reporting #49

Merged
merged 2 commits into main from bias-detection
Nov 13, 2024

Conversation

leonvanbokhorst (Owner) commented Nov 13, 2024

Summary by Sourcery

Implement a bias detection system to analyze text for cognitive biases and generate detailed reports. Document the findings in comprehensive and mini-analysis reports, and provide educational material on bias types and their impacts.

New Features:

  • Introduce a bias detection system that analyzes text for various cognitive biases, including confirmation, stereotypical, ingroup-outgroup, anchoring, and availability biases.

Documentation:

  • Add a comprehensive bias analysis report documenting the detection of biases in a specific text, including a summary of findings and detailed explanations for each bias type.
  • Create a mini-analysis report summarizing the key points of a press conference, highlighting identified biases and communication strategies.
  • Provide a general overview of bias, including definitions, types, impact areas, and mitigation strategies.

…ation

- Add BiasDetector class for analyzing various cognitive biases in text
- Implement async document analysis with chunking support
- Add confidence scoring using embedding similarity
- Support multiple bias types: confirmation, stereotypical, ingroup-outgroup, anchoring, and availability
- Include comprehensive error handling and logging
- Remove example narrative stories file

Technical details:
- Uses Ollama for LLM integration and embeddings
- Implements cosine similarity for confidence scoring (see the sketch below)
- Supports async processing of large documents
- Preserves semantic boundaries in text chunking
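
As a rough illustration of the cosine-similarity confidence scoring mentioned above (a sketch only; the function below is illustrative and not taken from the PR's source):

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    a_vec, b_vec = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = float(np.linalg.norm(a_vec) * np.linalg.norm(b_vec))
    return float(a_vec @ b_vec / denom) if denom else 0.0

# Illustrative use: compare the embedding of the analyzed text with the
# embedding of the model's bias explanation, and clamp to [0, 1] so the
# value can be reported as a confidence score.
# confidence = max(0.0, cosine_similarity(text_embedding, explanation_embedding))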

sourcery-ai bot commented Nov 13, 2024

Reviewer's Guide by Sourcery

This pull request introduces a bias detection system together with an analysis of political speech, focusing on five types of cognitive bias: confirmation, stereotypical, ingroup-outgroup, anchoring, and availability bias. The implementation includes a Python-based bias detector class that uses language models for analysis, along with documentation and example analysis reports.

Class diagram for Bias Detection System

classDiagram
    class BiasType {
        <<enumeration>>
        CONFIRMATION
        STEREOTYPICAL
        INGROUP_OUTGROUP
        ANCHORING
        AVAILABILITY
    }

    class BiasDetectionResult {
        BiasType bias_type
        float confidence
        string explanation
        List~string~ affected_segments
    }

    class BiasDetector {
        -string model_name
        -string embeddings
        -Dict~BiasType, string~ prompts
        +BiasDetector(string model_name, string embeddings_model_name)
        +Dict~BiasType, string~ _load_bias_prompts()
        +List~float~ get_embedding(string text)
        +List~BiasDetectionResult~ detect_bias(string text, List~BiasType~ bias_types)
        +float _calculate_confidence(List~float~ text_embedding, string explanation)
        +Dict~BiasType, List~BiasDetectionResult~~ analyze_document(Path file_path)
        +List~string~ _split_text(string text, int chunk_size)
        +void save_analysis_report(Dict~BiasType, List~BiasDetectionResult~~ results, Path output_path)
    }

    BiasDetectionResult --> BiasType
    BiasDetector --> BiasDetectionResult
    BiasDetector --> BiasType
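
Read as Python, the data types in the diagram amount to roughly the following (a sketch for orientation; the enum string values are illustrative and not taken from the merged source):

from dataclasses import dataclass
from enum import Enum
from typing import List


class BiasType(Enum):
    CONFIRMATION = "confirmation"
    STEREOTYPICAL = "stereotypical"
    INGROUP_OUTGROUP = "ingroup_outgroup"
    ANCHORING = "anchoring"
    AVAILABILITY = "availability"


@dataclass
class BiasDetectionResult:
    bias_type: BiasType
    confidence: float              # 0.0-1.0, derived from embedding similarity
    explanation: str               # model-generated rationale for the finding
    affected_segments: List[str]   # text spans the finding refers to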

File-Level Changes

Change: Implemented a BiasDetector class for analyzing text for different types of cognitive biases
Details:
  • Created BiasType enum for different bias categories
  • Implemented bias detection using language models and embeddings
  • Added methods for calculating confidence scores using cosine similarity
  • Created text chunking functionality for processing large documents
Files: src/bias_detection.py

Change: Added documentation explaining different types of bias and their characteristics
Details:
  • Defined fundamental concepts of bias
  • Described various types of cognitive biases
  • Explained social, cultural and statistical biases
  • Outlined bias mitigation strategies
Files: docs/bias.md

Change: Created analysis reports and examples demonstrating the bias detection system
Details:
  • Generated detailed bias analysis report with confidence scores
  • Created mini-analysis of political speech focusing on key biases
  • Added example text files for analysis
Files: docs/schoof_analysis_report.md, docs/mini-schoof-analysis.md, docs/schoof.txt, docs/mini-schoof.txt, docs/mini-schoof-results.txt, docs/faber.txt

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time. You can also use
    this command to specify where the summary should be inserted.

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

leonvanbokhorst changed the title Bias-detection @sourcery-ai Nov 13, 2024
sourcery-ai bot changed the title @sourcery-ai Implement bias detection system with detailed analysis and reporting Nov 13, 2024
leonvanbokhorst merged commit ec00177 into main Nov 13, 2024
1 check failed
leonvanbokhorst deleted the bias-detection branch November 13, 2024 19:23
sourcery-ai bot left a comment

Hey @leonvanbokhorst - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider adding more robust error handling and validation around the model calls and JSON parsing to handle potential API failures gracefully.
  • The text chunking could be improved by using a proper NLP tokenizer instead of basic string splitting to better handle complex sentence structures.
  • The confidence calculation using cosine similarity is quite basic - consider implementing more sophisticated metrics or ensemble methods for bias detection confidence scoring.
Here's what I looked at during the review
  • 🟡 General issues: 3 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟡 Documentation: 1 issue found

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

response = ollama.embeddings(model=self.model_name, prompt=text)
return response["embedding"]

async def detect_bias(

issue (performance): The method is marked async but makes blocking calls to ollama.generate and ollama.embeddings

Consider using async versions of these calls to prevent blocking the event loop. This is especially important when processing multiple chunks of text.
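
One way to address this (a sketch, not the PR's code) is to push the blocking calls onto worker threads with asyncio.to_thread, so the event loop stays free and chunks can be embedded concurrently; the same pattern applies to ollama.generate:

import asyncio

import ollama


async def get_embedding_async(model_name: str, text: str) -> list[float]:
    # Run the blocking ollama call in a worker thread instead of on the event loop.
    response = await asyncio.to_thread(ollama.embeddings, model=model_name, prompt=text)
    return response["embedding"]


async def embed_chunks(model_name: str, chunks: list[str]) -> list[list[float]]:
    # Embed several chunks concurrently rather than one blocking call at a time.
    return await asyncio.gather(*(get_embedding_async(model_name, c) for c in chunks))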


return results

def _calculate_confidence(

suggestion (performance): Multiple separate embedding calls could be optimized by batching or caching

Consider caching the embeddings or batching the calls to reduce API usage and improve performance, especially when processing multiple chunks of text.

    @lru_cache(maxsize=1024)
    def _calculate_confidence(
        self, text_embedding: Tuple[float, ...], explanation: str
    ) -> float:
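
Note that lru_cache only accepts hashable arguments, which is why the suggested signature uses Tuple rather than List. A lighter-weight alternative (again only a sketch) is to cache at the embedding layer, keyed on the input text:

from functools import lru_cache

import ollama


@lru_cache(maxsize=1024)
def get_embedding_cached(model_name: str, text: str) -> tuple[float, ...]:
    # Repeated chunks and explanations are embedded only once; returning a
    # tuple keeps the cached value hashable and immutable.
    response = ollama.embeddings(model=model_name, prompt=text)
    return tuple(response["embedding"])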


for paragraph in paragraphs:
    # Split paragraph into sentences (basic splitting)
    sentences = [

issue: Basic sentence splitting might miss common abbreviations

Consider using a proper sentence tokenizer (like nltk.sent_tokenize) to handle cases with abbreviations like 'Mr.', 'Dr.', etc.
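
A sketch of what that could look like (the helper name is hypothetical, and NLTK's punkt sentence models must be installed, e.g. via nltk.download("punkt")):

from nltk.tokenize import sent_tokenize  # requires NLTK's punkt sentence models


def split_into_sentences(paragraph: str) -> list[str]:
    # Handles abbreviations such as "Mr." and "Dr." far better than naive
    # splitting on ". ".
    return [s.strip() for s in sent_tokenize(paragraph) if s.strip()]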

- Schoof's response directly contradicts this assumption by stating that antisemitism is an issue for all of Netherlands and needs to be discussed across various sectors, including conversations with Jewish organizations.
- The author does not seem to acknowledge or address Schoof's answer which goes against their preconceived notion.

### Instance (Confidence: 0.70)

suggestion (documentation): Consider adding explanation of confidence scores

The document uses confidence scores throughout but never explains what they mean or how they're calculated. This context would be valuable for readers.

Suggested change
### Instance (Confidence: 0.70)
### Instance (Confidence: 0.70 - scores range from 0.0 to 1.0, indicating analysis certainty)


return [chunk for chunk in chunks if chunk.strip()] # Remove empty chunks

def save_analysis_report(

issue (complexity): Consider extracting report generation functionality into a separate class with dedicated responsibilities.

The report generation logic should be extracted to a separate class to improve maintainability and reusability. This would also make the BiasDetector class more focused on its core responsibility. Here's a suggested refactor:

@dataclass
class BiasAnalysisReport:
    results: Dict[BiasType, List[BiasDetectionResult]]

class BiasReportGenerator:
    def generate_markdown(self, analysis: BiasAnalysisReport, output_path: Path) -> None:
        total_instances = sum(len(bias_results) for bias_results in analysis.results.values())
        avg_confidences = {
            bias_type: np.mean([r.confidence for r in bias_results])
            for bias_type, bias_results in analysis.results.items()
        }

        # Generate report content (existing logic)
        report = self._generate_report_content(total_instances, avg_confidences, analysis.results)

        # Save report
        output_path.parent.mkdir(parents=True, exist_ok=True)
        output_path.write_text("\n".join(report))

class BiasDetector:
    def save_analysis_report(self, results: Dict[BiasType, List[BiasDetectionResult]], output_path: Path) -> None:
        report = BiasAnalysisReport(results=results)
        generator = BiasReportGenerator()
        generator.generate_markdown(report, output_path)

This change:

  1. Encapsulates report generation in a dedicated class
  2. Makes it easier to add new report formats
  3. Simplifies testing of report generation
  4. Reduces the responsibilities of BiasDetector
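
Hypothetical usage after the refactor (the detector and results objects are assumed to already exist; the output path is taken from this PR's docs folder):

from pathlib import Path

# `detector` is a configured BiasDetector and `results` is the dict returned
# by analyze_document.
detector.save_analysis_report(results, Path("docs/schoof_analysis_report.md"))

# Or, bypassing BiasDetector entirely, use the generator directly:
report = BiasAnalysisReport(results=results)
BiasReportGenerator().generate_markdown(report, Path("docs/schoof_analysis_report.md"))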


try:
    # Get embedding for the explanation
    explanation_embedding = self.get_embedding(explanation)

issue (code-quality): Extract code out into method (extract-method)
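
One hedged reading of that suggestion is to pull the explanation-embedding step, together with its error handling, out of _calculate_confidence into a small helper (the method name and logger below are hypothetical):

def _get_explanation_embedding(self, explanation: str) -> list[float] | None:
    # Isolate the embedding call and its error handling so that
    # _calculate_confidence only performs the similarity arithmetic.
    try:
        return self.get_embedding(explanation)
    except Exception as exc:
        logger.warning("Failed to embed explanation: %s", exc)
        return None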
