πŸ”§ Blueprint Addon Classifier

A machine learning project to automatically classify Minecraft mods as valid Create mod add-ons, reducing human effort in mod validation and curation.

πŸ“‹ Project Overview

Goal: Build an ML classifier that can automatically identify which Minecraft mods are legitimate Create mod addons based on mod metadata.

Dataset: ~2,000 mod records with human-verified "isValid" labels
Problem Type: Binary classification (Valid Create add-on: true/false)
Current Status: βœ… Exploration phase complete, 🚧 Production pipeline in development

�️ Project Structure

ml-create-addon-classifier/
β”œβ”€β”€ πŸ“Š data/
β”‚   └── addons.json              # Raw dataset (~2K mod records)
β”œβ”€β”€ πŸ““ notebooks/
β”‚   └── 01-exploration.ipynb     # βœ… Complete EDA & baseline models
β”œβ”€β”€ πŸ”§ src/
β”‚   β”œβ”€β”€ api/
β”‚   β”‚   └── server.py           # 🚧 FastAPI inference endpoint  
β”‚   β”œβ”€β”€ data/
β”‚   β”‚   └── loader.py           # 🚧 Data loading utilities
β”‚   β”œβ”€β”€ features/
β”‚   β”‚   └── feature_engineering.py  # 🚧 Feature processing pipeline
β”‚   β”œβ”€β”€ models/                 # 🚧 Model implementations
β”‚   └── evaluation/             # 🚧 Model evaluation framework
β”œβ”€β”€ πŸ“ scripts/
β”‚   └── train_model.py          # 🚧 Training pipeline
β”œβ”€β”€ βš™οΈ config/                   # 🚧 Configuration management
└── πŸ§ͺ tests/                    # 🚧 Unit tests

Legend: βœ… Complete | 🚧 In Development | ❌ Not Started

πŸš€ Current Status & Key Achievements

βœ… Exploration Phase Complete

  • Comprehensive EDA: Analyzed text patterns, categorical distributions, and numerical features
  • Feature Engineering: Created keyword-based features, text metrics, and categorical encodings
  • Baseline Models: Implemented Logistic Regression and Random Forest with educational explanations
  • Performance: Achieved ~80-90% accuracy with basic features
  • Educational Framework: Added detailed markdown cells explaining ML concepts for each code cell

πŸ” Key Findings

  • Strong Signal: Create-specific keywords in mod names are highly predictive
  • Text Patterns: Valid Create add-ons follow distinct naming conventions
  • Feature Importance: Keyword count, author patterns, and categories are most discriminative
  • Data Quality: Clean dataset with consistent labeling and minimal missing values
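The keyword signal above can be sketched in a few lines. This is a minimal illustration, not the notebook's actual feature code; the keyword list and field names are assumptions chosen for the example.

```python
# Illustrative sketch of the "Create-specific keywords" signal.
# The keyword set below is an assumption, not the project's real list.
CREATE_KEYWORDS = {"create", "mechanical", "kinetic", "cogwheel", "andesite"}

def keyword_count(name: str, description: str) -> int:
    """Count Create-specific keywords across a mod's name and description."""
    words = f"{name} {description}".lower().split()
    return sum(w.strip(".,:!'") in CREATE_KEYWORDS for w in words)

print(keyword_count("Create Extended Cogs", "More cogwheel variants"))
```

A single integer like this is already a strong feature for the baseline models, which is consistent with the feature-importance finding above.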

🎯 Getting Started

Prerequisites

# Python 3.8+ required
pip install -r requirements.txt

Quick Start - Exploration

# 1. Start Jupyter
jupyter lab

# 2. Open and run the exploration notebook
notebooks/01-exploration.ipynb

Current Capabilities

  • βœ… Data Loading: Load and preprocess mod dataset from JSON
  • βœ… Feature Engineering: Extract text, categorical, and numerical features
  • βœ… Baseline Classification: Train and evaluate Logistic Regression & Random Forest
  • βœ… Model Comparison: Compare multiple algorithms with ROC curves and confusion matrices
  • βœ… Educational Content: Comprehensive explanations of ML concepts for each analysis step

πŸ“Š Dataset Details

Source: Aggregated Minecraft mod data from CurseForge and Modrinth
Size: ~2,000 mod records with rich metadata
Features:

  • name: Mod name/title
  • description: Mod description text
  • author: Mod creator
  • categories: List of assigned categories
  • downloads: Download count
  • sources: Platform (CurseForge/Modrinth)
  • isValid: Target variable (human-verified Create add-on status)

Sample Record:

{
  "name": "Create",
  "description": "Aesthetic Technology that empowers the Player",
  "categories": ["decoration", "technology", "utility"],
  "downloads": 116041058,
  "isValid": true
}
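Records shaped like the sample above load directly into a DataFrame. A minimal sketch (inline records stand in for `json.load(open("data/addons.json"))`; the second record is invented for illustration):

```python
import pandas as pd

# Minimal sketch: records shaped like the sample above; in the project
# they would come from json.load(open("data/addons.json")).
records = [
    {"name": "Create", "description": "Aesthetic Technology that empowers the Player",
     "categories": ["decoration", "technology", "utility"],
     "downloads": 116041058, "isValid": True},
    {"name": "Lucky Blocks", "description": "Adds lucky blocks",
     "categories": ["adventure"], "downloads": 1000, "isValid": False},
]
df = pd.DataFrame(records)
print(df["isValid"].value_counts())  # class balance of the target label
```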

🧠 Machine Learning Approach

Current Models (Baseline)

  1. Logistic Regression: Linear classifier for interpretability and feature importance
  2. Random Forest: Tree-based ensemble for non-linear patterns and interactions
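Training both baselines with scikit-learn takes a few lines. The sketch below uses a synthetic numeric matrix as a stand-in for the engineered features, so the scores are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix: 200 mods, 5 features,
# with the label driven by the first two features (a toy "keyword signal").
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

logreg = LogisticRegression().fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"LogReg accuracy: {logreg.score(X_te, y_te):.2f}")
print(f"Forest accuracy: {forest.score(X_te, y_te):.2f}")
```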

Feature Engineering Strategy

  • Text Features: Create-specific keyword extraction, name/description length, word counts
  • Categorical Features: Author encoding, category analysis, source platform
  • Boolean Features: "Create" in name detection, category count features
  • Numerical Features: Download counts, author productivity metrics
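The four feature groups above can be sketched as one function over the raw DataFrame. Column names here are illustrative, not the project's actual schema beyond the fields listed under Dataset Details:

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the feature groups above; output column names are illustrative."""
    out = pd.DataFrame(index=df.index)
    # Text features: length metrics on name and description
    out["name_len"] = df["name"].str.len()
    out["desc_words"] = df["description"].str.split().str.len()
    # Boolean feature: "Create" appears in the mod name
    out["create_in_name"] = df["name"].str.lower().str.contains("create")
    # Categorical/count feature: number of assigned categories
    out["n_categories"] = df["categories"].str.len()
    # Numerical feature: log-scaled downloads to tame the heavy tail
    out["log_downloads"] = np.log1p(df["downloads"])
    return out

sample = pd.DataFrame([{"name": "Create Crafts", "description": "adds things",
                        "categories": ["tech"], "downloads": 10}])
print(engineer_features(sample))
```

Keeping feature construction in one function makes it easy to move from the notebook into `src/features/feature_engineering.py` later.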

Performance Metrics

  • Accuracy: Overall prediction correctness (~85-90% achieved)
  • AUC-ROC: Ability to distinguish between valid/invalid mods (~0.85-0.90)
  • Precision/Recall: Balance between false positives and false negatives
  • Feature Importance: Random Forest reveals most predictive features
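The metrics above come straight from scikit-learn. A small sketch with made-up labels and probabilities, just to show which function computes which number:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Toy predictions for eight mods (labels and probabilities are invented).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.6]

print("accuracy :", accuracy_score(y_true, y_pred))   # label agreement
print("precision:", precision_score(y_true, y_pred))  # penalizes false positives
print("recall   :", recall_score(y_true, y_pred))     # penalizes false negatives
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))    # needs probabilities, not labels
```

Note that AUC-ROC is computed from predicted probabilities, not hard labels, which is why it can rank models that tie on accuracy.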

πŸ—ΊοΈ Next Steps & Roadmap

🚧 Phase 2: Advanced Modeling (Next Sprint)

Priority: High | Effort: 2-3 weeks

  • Ensemble Models: Implement XGBoost, LightGBM, and CatBoost
  • Text Vectorization: Add TF-IDF and word embeddings for full description analysis
  • Cross-Validation: Implement proper stratified k-fold validation
  • Hyperparameter Tuning: Grid search and Bayesian optimization
  • Advanced Features: Dependency parsing, version pattern analysis
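Two of these roadmap items, TF-IDF on descriptions and stratified k-fold validation, compose naturally in a scikit-learn pipeline. A sketch on invented descriptions (the real notebook would use the `description` field from the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Invented descriptions standing in for the dataset's "description" field;
# repeated so every fold contains both classes.
texts = [
    "create addon with mechanical gears", "create trains and tracks",
    "cogwheel kinetic machinery for create", "create steam engines addon",
    "lucky block mod", "biome expansion pack",
    "new swords and armor", "magic spells overhaul",
] * 5
labels = ([1] * 4 + [0] * 4) * 5

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, texts, labels, cv=cv)
print(f"mean CV accuracy: {scores.mean():.2f}")
```

Wrapping the vectorizer in the pipeline matters: it refits TF-IDF inside each fold, so no vocabulary statistics leak from validation data into training.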

🚧 Phase 3: Production Pipeline (Following Sprint)

Priority: High | Effort: 3-4 weeks

  • Module Implementation: Complete all src/ package implementations
  • Training Pipeline: Automated model training and validation scripts
  • Inference API: FastAPI endpoint for real-time classification
  • Model Persistence: Save/load trained models with versioning
  • Logging & Monitoring: Comprehensive logging and performance tracking
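Model persistence with versioned artifacts is the simplest of these items to sketch. Using `joblib` (shipped alongside scikit-learn); the filename scheme is an assumption, and the API in `src/api/server.py` would load the artifact the same way at startup:

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Train a toy model (the label simply tracks the first feature).
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit([[0, 0], [0, 1], [1, 0], [1, 1]] * 5, [0, 0, 1, 1] * 5)

# Versioned filename is an illustrative convention, not a project standard.
path = os.path.join(tempfile.gettempdir(), "addon-classifier-v1.joblib")
joblib.dump(model, path)

# Round-trip: the restored model gives identical predictions.
restored = joblib.load(path)
print(restored.predict([[1, 1], [0, 0]]))
```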

🚧 Phase 4: Deployment & Scaling (Future)

Priority: Medium | Effort: 2-3 weeks

  • Containerization: Docker setup for consistent deployment
  • Cloud Deployment: Deploy inference API to cloud platform
  • Batch Processing: Handle bulk mod classification efficiently
  • Feedback Loop: System for continuous model improvement
  • Web Interface: User-friendly interface for manual validation

πŸ§ͺ Running Tests

# Install test dependencies
pip install pytest pytest-cov

# Run all tests (when implemented)
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=src --cov-report=html
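Once the `src/` modules exist, tests in `tests/` will follow the usual pytest shape. A sketch with a stand-in function (the real tests would import from `src/features/feature_engineering.py`):

```python
# Stand-in for a helper that would live in src/features/feature_engineering.py.
def create_in_name(name: str) -> bool:
    """True if the mod name mentions Create (case-insensitive)."""
    return "create" in name.lower()

# pytest discovers any function named test_* and runs its assertions.
def test_create_in_name():
    assert create_in_name("Create: Steam 'n' Rails")
    assert not create_in_name("Lucky Blocks")

test_create_in_name()  # invoked directly here; pytest would do this for you
```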

πŸ“ˆ Performance Benchmarks

Current Baseline Results (from exploration notebook):

  • Best Model: Random Forest
  • Accuracy: ~85-90%
  • AUC Score: ~0.85-0.90
  • Training Time: <30 seconds
  • Inference Time: <1ms per prediction

Target Production Goals:

  • Accuracy: >92%
  • AUC Score: >0.95
  • Precision: >90% (minimize false positives)
  • Recall: >85% (catch most valid add-ons)
  • Scalability: Handle 1000+ predictions/minute

🀝 Contributing

Development Workflow

  1. Experimentation: Use notebooks (notebooks/) for rapid prototyping
  2. Implementation: Move proven concepts to src/ modules with proper structure
  3. Testing: Add comprehensive unit tests with >80% coverage
  4. Documentation: Update README and add detailed docstrings

Code Standards

  • Style: Follow PEP 8 with Black formatter
  • Type Hints: Use type annotations for all function signatures
  • Documentation: Comprehensive docstrings and inline comments
  • Testing: Unit tests for all production code with pytest

πŸ“š Educational Value

The exploration notebook (01-exploration.ipynb) includes detailed educational content:

  • Machine Learning Concepts: Feature engineering, model evaluation, cross-validation
  • Data Science Workflow: EDA, preprocessing, modeling, interpretation
  • Domain Knowledge: Minecraft modding ecosystem and Create mod characteristics
  • Best Practices: Code organization, reproducibility, visualization techniques

Perfect for: Onboarding new team members, teaching ML concepts, understanding the problem domain

πŸ› οΈ Dependencies

Core ML Stack:

  • pandas>=2.0.0: Data manipulation and analysis
  • scikit-learn>=1.3.0: Machine learning algorithms
  • numpy>=1.24.0: Numerical computing
  • matplotlib>=3.7.0, seaborn>=0.12.0: Visualization

Advanced ML (for future phases):

  • xgboost>=1.7.0, lightgbm>=4.0.0: Gradient boosting
  • transformers>=4.30.0: Neural language models
  • fastapi>=0.100.0: API development

🎯 Success Criteria

Technical Goals:

  • βœ… Proof of Concept: Demonstrate feasibility (COMPLETE)
  • 🎯 Production Model: >92% accuracy with robust evaluation
  • 🎯 Deployment: Real-time API with <100ms response time
  • 🎯 Scalability: Handle production traffic loads

Business Impact:

  • Primary: Reduce manual validation effort by 80%+
  • Secondary: Improve consistency in Create add-on curation
  • Long-term: Enable automatic mod discovery and recommendation

Status: πŸ”¬ Research Complete β†’ 🚧 Development Phase
Next Milestone: Advanced modeling and production pipeline
Last Updated: June 2025 | Team: Spencer + AI Assistant

⚑ Quick Start Commands

# 1. Environment setup
git clone <repo-url> && cd ml-create-addon-classifier
pip install -r requirements.txt

# 2. Run exploration (educational)
jupyter lab notebooks/01-exploration.ipynb

# 3. Future: Train production model
python scripts/train_model.py --config config/production.yaml

# 4. Future: Start inference API  
uvicorn src.api.server:app --reload
