"What's your vector, Victor?" - Now you'll know!
Vector Victor is an intelligent documentation processing system that extracts, analyzes, and organizes technical documentation for semantic search and LLM consumption, mapping relationships across entire documentation sets. Unlike traditional embedding-based approaches, it pairs LLM-based parsing with DSPy Chain-of-Thought reasoning and GPT-4 to understand documentation structure and relationships at a deeper semantic level.
During our development, we discovered that traditional vector embeddings, while useful for simple similarity searches, often miss the nuanced relationships and hierarchical structure present in technical documentation. By leveraging DSPy's Chain-of-Thought capabilities and GPT-4's advanced reasoning, we can:
- Extract meaningful section hierarchies
- Identify complex relationships between concepts
- Understand prerequisite knowledge chains
- Map implementation patterns and examples
- Generate more accurate documentation graphs
- Flexible Documentation Extraction: Support for both GitHub repositories and web-based documentation
- Intelligent Analysis: Uses GPT-4 for content understanding and summarization
- Smart Documentation Processing: hierarchical section extraction, code example detection and categorization, relationship mapping between concepts, prerequisite chain identification, and framework-specific pattern recognition
- Relationship Types: IMPLEMENTS, DEPENDS_ON, EXTENDS, RELATED_TO, EXAMPLE_OF, PREREQUISITE, NEXT_TOPIC, NEXT_IN_SECTION (modeled as an enum in the sketch after this list)
- DSPy Integration: Structured information extraction using DSPy's ChainOfThought
- Progress Tracking: Resumable processing for large documentation sets
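Purely for illustration, downstream consumers could model the relationship vocabulary above as a Python enum; this is a sketch, not code shipped with the project:

```python
from enum import Enum

class RelationshipType(Enum):
    # The relationship vocabulary listed in the features above
    IMPLEMENTS = "IMPLEMENTS"
    DEPENDS_ON = "DEPENDS_ON"
    EXTENDS = "EXTENDS"
    RELATED_TO = "RELATED_TO"
    EXAMPLE_OF = "EXAMPLE_OF"
    PREREQUISITE = "PREREQUISITE"
    NEXT_TOPIC = "NEXT_TOPIC"
    NEXT_IN_SECTION = "NEXT_IN_SECTION"
```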
```
doc_scraper/
├── scraped_docs/                # Raw scraped documentation
│   └── [project_name]/          # Documentation organized by project
├── llm_docs/                    # LLM-optimized documentation
│   └── [project_name]/          # Project-specific processed docs
│       ├── index.json           # Document metadata and organization
│       ├── content.json         # Processed document content
│       ├── relationships.json   # Relationship mapping
│       └── graph.json           # Documentation graph
├── scraper.py                   # Documentation scraper
├── doc_processor.py             # Content processor and analyzer
├── relationship_extractor.py    # Relationship mapping
└── test_api.py                  # API connection tester
```
- Create and activate a virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Create a `.env` file with your OpenAI API key:

```
OPENAI_API_KEY=your_api_key_here
```

- Test your setup:

```bash
python test_api.py
```
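The contents of `test_api.py` aren't reproduced here; a minimal connectivity check along these lines (using the official `openai` and `python-dotenv` packages) would verify that the key works:

```python
# Hypothetical stand-in for test_api.py: verifies the OpenAI key in .env works.
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads OPENAI_API_KEY from .env
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print("API connection OK:", response.choices[0].message.content)
```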
The scraper supports two types of documentation sources:
For GitHub-hosted documentation:

```bash
python scraper.py --url https://github.com/username/repo/docs --project project_name
```
The scraper will:
- Detect if the URL points to a GitHub docs directory
- Clone the repository if needed
- Extract markdown and other documentation files
- Save raw content in `scraped_docs/[project_name]`
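The detect-and-clone steps in this list might look roughly like the following; this is an illustration, not the actual `scraper.py` code:

```python
# Illustrative sketch of GitHub URL detection and shallow cloning.
import subprocess
from urllib.parse import urlparse

def is_github_docs_url(url: str) -> bool:
    # e.g. https://github.com/username/repo/docs
    return urlparse(url).netloc == "github.com"

def clone_repo(url: str, dest: str) -> None:
    owner, repo = urlparse(url).path.strip("/").split("/")[:2]
    repo_url = f"https://github.com/{owner}/{repo}.git"
    subprocess.run(["git", "clone", "--depth", "1", repo_url, dest], check=True)
```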
For web-based documentation:

```bash
python scraper.py --url https://docs.example.com --project project_name --content-selector "article" --link-selector "a"
```
Optional arguments:

- `--content-selector`: CSS selector for main content (default: `"article"`)
- `--link-selector`: CSS selector for navigation links (default: `"a"`)
- `--max-depth`: Maximum recursion depth for link following (default: 5)
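To make the selector options concrete, here is roughly how CSS selectors like these are applied with `requests` and BeautifulSoup; this illustrates the mechanism rather than reproducing the project's code:

```python
# Illustration of how CSS selectors drive extraction (not scraper.py itself).
import requests
from bs4 import BeautifulSoup

def fetch_page(url: str, content_selector: str = "article", link_selector: str = "a"):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # Main content under the content selector
    content = [node.get_text(" ", strip=True) for node in soup.select(content_selector)]
    # Candidate links to follow, up to --max-depth
    links = [a["href"] for a in soup.select(link_selector) if a.get("href")]
    return content, links
```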
The processor uses GPT-4 and DSPy to analyze documentation:
```bash
python doc_processor.py --project [project_name]
```
Processing steps:
- Content extraction from scraped documents
- Hierarchical section extraction
- Relationship mapping between concepts
- Prerequisite chain identification
- Framework-specific pattern recognition
- Progress tracking in `llm_docs/progress.json`
- Index maintenance in `llm_docs/index.json`
Extract relationships between concepts:
```bash
python relationship_extractor.py --input llm_docs/[project_name]
```
This extraction:
- Maps connections between concepts
- Identifies relationship types (IMPLEMENTS, DEPENDS_ON, etc.)
- Saves relationships in `llm_docs/[project_name]/relationships.json`
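The schema of `relationships.json` isn't documented above; as a rough illustration, an entry might take a shape like this (all field names here are assumptions):

```json
[
  {
    "source": "Installation",
    "target": "Quick Start",
    "type": "PREREQUISITE"
  }
]
```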
Example of using the processed documentation:
```python
import json
from pathlib import Path

def load_documentation(project_dir: str):
    # Load document content and metadata
    with open(Path(project_dir) / 'content.json') as f:
        content = json.load(f)

    # Load relationships
    with open(Path(project_dir) / 'relationships.json') as f:
        relationships = json.load(f)

    # Use the loaded data for further analysis or visualization
    return content, relationships
```
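For example, `content, relationships = load_documentation("llm_docs/my_project")` (with a hypothetical project name) returns both structures, ready for further analysis or visualization.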
DSPy is used for structured information extraction through its ChainOfThought module:
```python
import dspy

class DocumentAnalyzer:
    def __init__(self):
        # One Chain-of-Thought module that maps raw content to a set of
        # structured output fields in a single pass
        self.analyze = dspy.ChainOfThought(
            "content -> title, summary, key_concepts, code_examples, dependencies, related_topics"
        )
```
This creates a chain that:
- Takes documentation content as input
- Uses GPT-4 to understand the content
- Extracts structured information in a consistent format
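A usage sketch, assuming DSPy has been pointed at a GPT-4 backend (the model string and file path below are illustrative):

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4"))  # illustrative model configuration

# Hypothetical path to a scraped document
page_text = open("scraped_docs/my_project/intro.md").read()

analyzer = DocumentAnalyzer()
result = analyzer.analyze(content=page_text)  # returns a dspy.Prediction
print(result.title)
print(result.key_concepts)
```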
The system uses DSPy's Chain-of-Thought reasoning to map connections between concepts:
```python
import dspy

def map_relationships(content: str) -> list:
    # Chain-of-Thought module that maps raw content to a list of relationships
    extractor = dspy.ChainOfThought("content -> relationships")
    prediction = extractor(content=content)
    return prediction.relationships
```
These relationships enable:
- Accurate documentation graphs
- Contextual processing
- Nuanced understanding of documentation structure
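As one concrete take on the documentation-graphs point, the extracted relationships can be loaded into a graph library such as networkx; this sketch assumes the hypothetical `source`/`target`/`type` entry format shown earlier:

```python
import json
import networkx as nx

def build_doc_graph(relationships_path: str) -> nx.DiGraph:
    graph = nx.DiGraph()
    with open(relationships_path) as f:
        for rel in json.load(f):
            # Field names follow the assumed schema, not a documented one
            graph.add_edge(rel["source"], rel["target"], type=rel["type"])
    return graph
```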
The system maintains processing state in two files:
- `progress.json`: Tracks processed files (a possible shape is sketched after the list below)
- `index.json`: Maintains document organization
This enables:
- Resumable processing for large documentation sets
- Progress monitoring
- Efficient updates of specific documents
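The exact layout of `progress.json` isn't shown here; assuming it stores a `processed_files` list, a resume check could look like this sketch:

```python
import json
from pathlib import Path

def pending_files(all_files: list, progress_path: str = "llm_docs/progress.json") -> list:
    # "processed_files" is an assumed key, not a documented schema
    done = set()
    path = Path(progress_path)
    if path.exists():
        done = set(json.loads(path.read_text()).get("processed_files", []))
    return [name for name in all_files if name not in done]
```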
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.