A machine learning project that automatically classifies Minecraft mods as valid Create mod add-ons, reducing the human effort needed for mod validation and curation.
Goal: Build an ML classifier that can automatically identify which Minecraft mods are legitimate Create mod addons based on mod metadata.
Dataset: ~2,000 mod records with human-verified `isValid` labels
Problem Type: Binary classification (Valid Create add-on: true/false)
Current Status: ✅ Exploration phase complete, 🚧 Production pipeline in development
```
ml-create-addon-classifier/
├── 📊 data/
│   └── addons.json                 # Raw dataset (~2K mod records)
├── 📓 notebooks/
│   └── 01-exploration.ipynb        # ✅ Complete EDA & baseline models
├── 🔧 src/
│   ├── api/
│   │   └── server.py               # 🚧 FastAPI inference endpoint
│   ├── data/
│   │   └── loader.py               # 🚧 Data loading utilities
│   ├── features/
│   │   └── feature_engineering.py  # 🚧 Feature processing pipeline
│   ├── models/                     # 🚧 Model implementations
│   └── evaluation/                 # 🚧 Model evaluation framework
├── 📜 scripts/
│   └── train_model.py              # 🚧 Training pipeline
├── ⚙️ config/                      # 🚧 Configuration management
└── 🧪 tests/                       # 🚧 Unit tests
```

Legend: ✅ Complete | 🚧 In Development | ❌ Not Started
- Comprehensive EDA: Analyzed text patterns, categorical distributions, and numerical features
- Feature Engineering: Created keyword-based features, text metrics, and categorical encodings
- Baseline Models: Implemented Logistic Regression and Random Forest with educational explanations
- Performance: Achieved ~80-90% accuracy with basic features
- Educational Framework: Added detailed markdown cells explaining ML concepts for each code cell
- Strong Signal: Create-specific keywords in mod names are highly predictive
- Text Patterns: Valid Create add-ons follow distinct naming conventions
- Feature Importance: Keyword count, author patterns, and categories are most discriminative
- Data Quality: Clean dataset with consistent labeling and minimal missing values
```bash
# Python 3.8+ required
pip install -r requirements.txt

# 1. Start Jupyter
jupyter lab

# 2. Open and run the exploration notebook: notebooks/01-exploration.ipynb
```
- ✅ Data Loading: Load and preprocess mod dataset from JSON
- ✅ Feature Engineering: Extract text, categorical, and numerical features
- ✅ Baseline Classification: Train and evaluate Logistic Regression & Random Forest
- ✅ Model Comparison: Compare multiple algorithms with ROC curves and confusion matrices
- ✅ Educational Content: Comprehensive explanations of ML concepts for each analysis step
Source: Aggregated Minecraft mod data from CurseForge and Modrinth
Size: ~2,000 mod records with rich metadata
Features:
- `name`: Mod name/title
- `description`: Mod description text
- `author`: Mod creator
- `categories`: List of assigned categories
- `downloads`: Download count
- `sources`: Platform (CurseForge/Modrinth)
- `isValid`: Target variable (human-verified Create add-on status)
Sample Record:

```json
{
  "name": "Create",
  "description": "Aesthetic Technology that empowers the Player",
  "categories": ["decoration", "technology", "utility"],
  "downloads": 116041058,
  "isValid": true
}
```
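For reference, a minimal loading sketch, assuming `data/addons.json` is a flat JSON array of records like the one above (`src/data/loader.py` is slated to formalize this):

```python
# Minimal loading sketch; assumes data/addons.json is a JSON array of records.
import pandas as pd

df = pd.read_json("data/addons.json")

print(df.shape)                      # expect roughly (2000, 7)
print(df["isValid"].value_counts())  # class balance of the binary target
```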
- Logistic Regression: Linear classifier for interpretability and feature importance
- Random Forest: Tree-based ensemble for non-linear patterns and interactions
- Text Features: Create-specific keyword extraction, name/description length, word counts (sketched after the metrics below)
- Categorical Features: Author encoding, category analysis, source platform
- Boolean Features: "Create" in name detection, category count features
- Numerical Features: Download counts, author productivity metrics
- Accuracy: Overall prediction correctness (~85-90% achieved)
- AUC-ROC: Ability to distinguish between valid/invalid mods (~0.85-0.90)
- Precision/Recall: Balance between false positives and false negatives
- Feature Importance: Random Forest reveals most predictive features
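To make the feature and model bullets above concrete, here is a hedged sketch of the feature pipeline; the keyword list and column names are illustrative, not the notebook's exact implementation:

```python
import numpy as np
import pandas as pd

# Illustrative keyword list; the notebook's actual list may differ.
CREATE_KEYWORDS = ["create", "mechanical", "kinetic", "gear", "cog"]

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the text, boolean, categorical, and numerical features above."""
    name = df["name"].fillna("").str.lower()
    desc = df["description"].fillna("").str.lower()

    out = pd.DataFrame(index=df.index)
    # Text features: keyword hits, lengths, word counts
    out["keyword_count"] = sum(
        name.str.count(kw) + desc.str.count(kw) for kw in CREATE_KEYWORDS
    )
    out["name_length"] = name.str.len()
    out["desc_word_count"] = desc.str.split().str.len()
    # Boolean feature: "create" appears in the mod name
    out["create_in_name"] = name.str.contains("create").astype(int)
    # Categorical/numerical features
    out["category_count"] = df["categories"].str.len()
    out["log_downloads"] = np.log1p(df["downloads"])
    return out
```

And a representative baseline run producing the accuracy, AUC, and precision/recall figures above (the notebook's exact split and settings may differ):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

X = engineer_features(df)  # df from the loading sketch above
y = df["isValid"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=42)):
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print(type(model).__name__, "AUC:", round(roc_auc_score(y_test, proba), 3))
    print(classification_report(y_test, model.predict(X_test)))
```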
Priority: High | Effort: 2-3 weeks
- Ensemble Models: Implement XGBoost, LightGBM, and CatBoost
- Text Vectorization: Add TF-IDF and word embeddings for full description analysis
- Cross-Validation: Implement proper stratified k-fold validation
- Hyperparameter Tuning: Grid search and Bayesian optimization (a grid-search sketch follows this list)
- Advanced Features: Dependency parsing, version pattern analysis
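A possible starting point for the stratified k-fold and grid-search items above; the search space and split count are assumptions, not decided values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Illustrative search space; the production grid is still to be decided.
param_grid = {"n_estimators": [200, 500], "max_depth": [None, 10, 20]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # matches the project's AUC target
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
)
search.fit(X, y)  # X, y as in the baseline sketch
print(search.best_params_, round(search.best_score_, 3))
```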
Priority: High | Effort: 3-4 weeks
- Module Implementation: Complete all `src/` package implementations
- Training Pipeline: Automated model training and validation scripts
- Inference API: FastAPI endpoint for real-time classification (see the sketch after this list)
- Model Persistence: Save/load trained models with versioning
- Logging & Monitoring: Comprehensive logging and performance tracking
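One possible shape for `src/api/server.py`, covering the inference-API and model-persistence items above. Everything here is a sketch: the artifact path, the request schema, and the import of the `engineer_features` sketch are assumptions:

```python
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

from src.features.feature_engineering import engineer_features  # assumed module path

app = FastAPI(title="Create Add-on Classifier")
model = joblib.load("models/classifier.joblib")  # hypothetical persisted artifact

class ModRecord(BaseModel):
    name: str
    description: str = ""
    categories: list[str] = []
    downloads: int = 0

@app.post("/predict")
def predict(record: ModRecord) -> dict:
    # Rebuild the same features used at training time
    features = engineer_features(pd.DataFrame([record.model_dump()]))
    proba = float(model.predict_proba(features)[0, 1])
    return {"isValid": proba >= 0.5, "probability": proba}
```

Served with `uvicorn src.api.server:app` (see the quick reference at the end), a POST to `/predict` with a JSON body matching `ModRecord` returns the predicted label and its probability.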
Priority: Medium | Effort: 2-3 weeks
- Containerization: Docker setup for consistent deployment
- Cloud Deployment: Deploy inference API to cloud platform
- Batch Processing: Handle bulk mod classification efficiently
- Feedback Loop: System for continuous model improvement
- Web Interface: User-friendly interface for manual validation
```bash
# Install test dependencies
pip install pytest pytest-cov

# Run all tests (when implemented)
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=src --cov-report=html
```
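An illustrative unit test for the feature sketch above; the import path mirrors the planned `src/` layout and is an assumption:

```python
# tests/test_features.py
import pandas as pd

from src.features.feature_engineering import engineer_features  # assumed path

def test_create_in_name_flag():
    # Hypothetical mod records covering both classes of the boolean feature
    df = pd.DataFrame([
        {"name": "Create Gadgets", "description": "", "categories": [], "downloads": 0},
        {"name": "OptiFine", "description": "", "categories": [], "downloads": 0},
    ])
    features = engineer_features(df)
    assert features.loc[0, "create_in_name"] == 1
    assert features.loc[1, "create_in_name"] == 0
```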
Current Baseline Results (from exploration notebook):
- Best Model: Random Forest
- Accuracy: ~85-90%
- AUC Score: ~0.85-0.90
- Training Time: <30 seconds
- Inference Time: <1ms per prediction
Target Production Goals:
- Accuracy: >92%
- AUC Score: >0.95
- Precision: >90% (minimize false positives)
- Recall: >85% (catch most valid add-ons; a threshold-tuning sketch follows this list)
- Scalability: Handle 1000+ predictions/minute
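Because the precision target is stricter than the recall target, the decision threshold can be tuned on held-out data instead of fixed at 0.5. A sketch, reusing `y_test` and `proba` from the baseline sketch:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, proba)

# First threshold whose precision clears the >90% target;
# then check whether recall still meets the >85% target there.
meets_target = precision[:-1] >= 0.90
if meets_target.any():
    i = int(np.argmax(meets_target))
    print(f"threshold={thresholds[i]:.3f} "
          f"precision={precision[i]:.3f} recall={recall[i]:.3f}")
```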
- Experimentation: Use notebooks (`notebooks/`) for rapid prototyping
- Implementation: Move proven concepts to `src/` modules with proper structure
- Testing: Add comprehensive unit tests with >80% coverage
- Documentation: Update README and add detailed docstrings
- Style: Follow PEP 8 with Black formatter
- Type Hints: Use type annotations for all function signatures
- Documentation: Comprehensive docstrings and inline comments
- Testing: Unit tests for all production code with pytest
The exploration notebook (`01-exploration.ipynb`) includes detailed educational content:
- Machine Learning Concepts: Feature engineering, model evaluation, cross-validation
- Data Science Workflow: EDA, preprocessing, modeling, interpretation
- Domain Knowledge: Minecraft modding ecosystem and Create mod characteristics
- Best Practices: Code organization, reproducibility, visualization techniques
Perfect for: Onboarding new team members, teaching ML concepts, understanding the problem domain
Core ML Stack:
- `pandas>=2.0.0`: Data manipulation and analysis
- `scikit-learn>=1.3.0`: Machine learning algorithms
- `numpy>=1.24.0`: Numerical computing
- `matplotlib>=3.7.0`, `seaborn>=0.12.0`: Visualization

Advanced ML (for future phases):
- `xgboost>=1.7.0`, `lightgbm>=4.0.0`: Gradient boosting
- `transformers>=4.30.0`: Neural language models
- `fastapi>=0.100.0`: API development
Technical Goals:
- ✅ Proof of Concept: Demonstrate feasibility (COMPLETE)
- 🎯 Production Model: >92% accuracy with robust evaluation
- 🎯 Deployment: Real-time API with <100ms response time
- 🎯 Scalability: Handle production traffic loads
Business Impact:
- Primary: Reduce manual validation effort by 80%+
- Secondary: Improve consistency in Create addon curation
- Long-term: Enable automatic mod discovery and recommendation
Status: 🔬 Research Complete → 🚧 Development Phase
Next Milestone: Advanced modeling and production pipeline
Last Updated: June 2025 | Team: Spencer + AI Assistant
```bash
# 1. Environment setup
git clone <repo-url> && cd ml-create-addon-classifier
pip install -r requirements.txt

# 2. Run exploration (educational)
jupyter lab notebooks/01-exploration.ipynb

# 3. Future: Train production model
python scripts/train_model.py --config config/production.yaml

# 4. Future: Start inference API
uvicorn src.api.server:app --reload
```