"What's your vector, Victor?" - Now you'll know!
Vector Victor is an intelligent documentation processing system that extracts, analyzes, and organizes technical documentation for semantic search and LLM consumption, mapping relationships across entire documentation sets. Unlike traditional embedding-based approaches, it pairs LLM-based parsing with DSPy Chain-of-Thought reasoning and GPT-4 to understand documentation structure and relationships at a deeper semantic level.
During our development, we discovered that traditional vector embeddings, while useful for simple similarity searches, often miss the nuanced relationships and hierarchical structure present in technical documentation. By leveraging DSPy's Chain-of-Thought capabilities and GPT-4's advanced reasoning, we can:
- Extract meaningful section hierarchies
- Identify complex relationships between concepts
- Understand prerequisite knowledge chains
- Map implementation patterns and examples
- Generate more accurate documentation graphs
- Flexible Documentation Extraction: Support for both GitHub repositories and web-based documentation
- Intelligent Analysis: Uses GPT-4 for content understanding and summarization
- Smart Documentation Processing: hierarchical section extraction, code example detection and categorization, relationship mapping between concepts, prerequisite chain identification, and framework-specific pattern recognition
- Relationship Types: IMPLEMENTS, DEPENDS_ON, EXTENDS, RELATED_TO, EXAMPLE_OF, PREREQUISITE, NEXT_TOPIC, NEXT_IN_SECTION (modeled as an enum in the sketch after this list)
- DSPy Integration: Structured information extraction using DSPy's ChainOfThought
- Progress Tracking: Resumable processing for large documentation sets
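Purely for illustration, downstream consumers could model the relationship vocabulary above as a Python enum; this is a sketch, not code shipped with the project:

```python
from enum import Enum

class RelationshipType(Enum):
    # The relationship vocabulary listed in the features above
    IMPLEMENTS = "IMPLEMENTS"
    DEPENDS_ON = "DEPENDS_ON"
    EXTENDS = "EXTENDS"
    RELATED_TO = "RELATED_TO"
    EXAMPLE_OF = "EXAMPLE_OF"
    PREREQUISITE = "PREREQUISITE"
    NEXT_TOPIC = "NEXT_TOPIC"
    NEXT_IN_SECTION = "NEXT_IN_SECTION"
```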
```
doc_scraper/
├── scraped_docs/                # Raw scraped documentation
│   └── [project_name]/          # Documentation organized by project
├── llm_docs/                    # LLM-optimized documentation
│   └── [project_name]/          # Project-specific processed docs
│       ├── index.json           # Document metadata and organization
│       ├── content.json         # Processed document content
│       ├── relationships.json   # Relationship mapping
│       └── graph.json           # Documentation graph
├── scraper.py                   # Documentation scraper
├── doc_processor.py             # Content processor and analyzer
├── relationship_extractor.py    # Relationship mapping
└── test_api.py                  # API connection tester
```
- Create and activate a virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Create a `.env` file with your OpenAI API key:

```
OPENAI_API_KEY=your_api_key_here
```

- Test your setup:

```bash
python test_api.py
```
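The contents of `test_api.py` aren't reproduced here; a minimal connectivity check along these lines (using the official `openai` and `python-dotenv` packages) would verify that the key works:

```python
# Hypothetical stand-in for test_api.py: verifies the OpenAI key in .env works.
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads OPENAI_API_KEY from .env
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print("API connection OK:", response.choices[0].message.content)
```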
The scraper supports two types of documentation sources:
For GitHub-hosted documentation:

```bash
python scraper.py --url https://github.com/username/repo/docs --project project_name
```
The scraper will:
- Detect if the URL points to a GitHub docs directory
- Clone the repository if needed
- Extract markdown and other documentation files
- Save raw content in `scraped_docs/[project_name]`
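The detect-and-clone steps in this list might look roughly like the following; this is an illustration, not the actual `scraper.py` code:

```python
# Illustrative sketch of GitHub URL detection and shallow cloning.
import subprocess
from urllib.parse import urlparse

def is_github_docs_url(url: str) -> bool:
    # e.g. https://github.com/username/repo/docs
    return urlparse(url).netloc == "github.com"

def clone_repo(url: str, dest: str) -> None:
    owner, repo = urlparse(url).path.strip("/").split("/")[:2]
    repo_url = f"https://github.com/{owner}/{repo}.git"
    subprocess.run(["git", "clone", "--depth", "1", repo_url, dest], check=True)
```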
For web-based documentation:

```bash
python scraper.py --url https://docs.example.com --project project_name --content-selector "article" --link-selector "a"
```
Optional arguments:

- `--content-selector`: CSS selector for main content (default: `"article"`)
- `--link-selector`: CSS selector for navigation links (default: `"a"`)
- `--max-depth`: Maximum recursion depth for link following (default: 5)
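To make the selector options concrete, here is roughly how CSS selectors like these are applied with `requests` and BeautifulSoup; this illustrates the mechanism rather than reproducing the project's code:

```python
# Illustration of how CSS selectors drive extraction (not scraper.py itself).
import requests
from bs4 import BeautifulSoup

def fetch_page(url: str, content_selector: str = "article", link_selector: str = "a"):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Main content under the content selector
    content = [node.get_text(" ", strip=True) for node in soup.select(content_selector)]
    # Candidate links to follow, up to --max-depth
    links = [a["href"] for a in soup.select(link_selector) if a.get("href")]
    return content, links
```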
The processor uses GPT-4 and DSPy to analyze documentation:
```bash
python doc_processor.py --project [project_name]
```
Processing steps:
- Content extraction from scraped documents
- Hierarchical section extraction
- Relationship mapping between concepts
- Prerequisite chain identification
- Framework-specific pattern recognition
- Progress tracking in `llm_docs/progress.json`
- Index maintenance in `llm_docs/index.json`
Extract relationships between concepts:
```bash
python relationship_extractor.py --input llm_docs/[project_name]
```
This extraction:
- Maps connections between concepts
- Identifies relationship types (IMPLEMENTS, DEPENDS_ON, etc.)
- Saves relationships in `llm_docs/[project_name]/relationships.json`
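The schema of `relationships.json` isn't documented above; as a rough illustration, an entry might take a shape like this (all field names here are assumptions):

```json
[
  {
    "source": "Installation",
    "target": "Quick Start",
    "type": "PREREQUISITE"
  }
]
```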
Example of using the processed documentation:
```python
import json
from pathlib import Path

def load_documentation(project_dir: str):
    # Load document content and metadata
    with open(Path(project_dir) / 'content.json') as f:
        content = json.load(f)

    # Load relationships
    with open(Path(project_dir) / 'relationships.json') as f:
        relationships = json.load(f)

    # Use the loaded data for further analysis or visualization
    return content, relationships
```
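For example, `content, relationships = load_documentation("llm_docs/my_project")` (with a hypothetical project name) returns both structures, ready for further analysis or visualization.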
DSPy is used for structured information extraction through its ChainOfThought module:
```python
import dspy

class DocumentAnalyzer:
    def __init__(self):
        # One Chain-of-Thought module that maps raw content to a set of
        # structured output fields in a single pass
        self.analyze = dspy.ChainOfThought(
            "content -> title, summary, key_concepts, code_examples, dependencies, related_topics"
        )
```
This creates a chain that:
- Takes documentation content as input
- Uses GPT-4 to understand the content
- Extracts structured information in a consistent format
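A usage sketch, assuming DSPy has been pointed at a GPT-4 backend (the model string and file path below are illustrative):

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4"))  # illustrative model configuration

# Hypothetical path to a scraped document
page_text = open("scraped_docs/my_project/intro.md").read()

analyzer = DocumentAnalyzer()
result = analyzer.analyze(content=page_text)  # returns a dspy.Prediction
print(result.title)
print(result.key_concepts)
```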
The system uses DSPy's Chain-of-Thought reasoning to map connections between concepts:
```python
import dspy

def map_relationships(content: str) -> list:
    # Chain-of-Thought module that maps raw content to a list of relationships
    extractor = dspy.ChainOfThought("content -> relationships")
    prediction = extractor(content=content)
    return prediction.relationships
```
These relationships enable:
- Accurate documentation graphs
- Contextual processing
- Nuanced understanding of documentation structure
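As one concrete take on the documentation-graphs point, the extracted relationships can be loaded into a graph library such as networkx; this sketch assumes the hypothetical `source`/`target`/`type` entry format shown earlier:

```python
import json
import networkx as nx

def build_doc_graph(relationships_path: str) -> nx.DiGraph:
    graph = nx.DiGraph()
    with open(relationships_path) as f:
        for rel in json.load(f):
            # Field names follow the assumed schema, not a documented one
            graph.add_edge(rel["source"], rel["target"], type=rel["type"])
    return graph
```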
The system maintains processing state in two files:
- `progress.json`: Tracks processed files (a possible shape is sketched after the list below)
- `index.json`: Maintains document organization
This enables:
- Resumable processing for large documentation sets
- Progress monitoring
- Efficient updates of specific documents
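The exact layout of `progress.json` isn't shown here; assuming it stores a `processed_files` list, a resume check could look like this sketch:

```python
import json
from pathlib import Path

def pending_files(all_files: list, progress_path: str = "llm_docs/progress.json") -> list:
    # "processed_files" is an assumed key, not a documented schema
    done = set()
    path = Path(progress_path)
    if path.exists():
        done = set(json.loads(path.read_text()).get("processed_files", []))
    return [name for name in all_files if name not in done]
```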
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.