lab-rasool/HoneyBee

HoneyBee

A Scalable Modular Framework for Multimodal AI in Oncology


🚀 Overview

HoneyBee is a multimodal AI framework designed for oncology research and clinical applications. It integrates and processes diverse medical data types (clinical text, radiology images, pathology slides, and molecular data) through a unified, modular architecture. Built for scalability and extensibility, HoneyBee helps researchers develop AI models for cancer diagnosis, prognosis, and treatment planning.

Warning

Alpha Release: This framework is currently in alpha. APIs may change, and some features are still under development.

✨ Key Features

🏗️ Modular Architecture

  • 3-Layer Design: Clean separation between data loaders, embedding models, and processors
  • Unified API: Consistent interface across all modalities
  • Extensible: Easy to add new models and data sources
  • Production-Ready: Optimized for both research and clinical deployment

📊 Comprehensive Data Support

Medical Imaging

  • Pathology: Whole Slide Images (WSI) - SVS, TIFF formats with tissue detection
  • Radiology: DICOM, NIfTI processing with 3D support
  • Preprocessing: Advanced augmentation and normalization pipelines

Clinical Text

  • Document Processing: PDF support with OCR for scanned documents
  • NLP Pipeline: Cancer entity extraction, temporal parsing, medical ontology integration
  • Database Integration: Native MINDS format support
  • Long Document Handling: Multiple tokenization strategies for clinical notes
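
One common strategy for long clinical notes is to split the token sequence into overlapping windows so that each piece fits a model's context limit. The sketch below illustrates that idea only; `sliding_window_chunks`, `max_len`, and `stride` are hypothetical names, and whether HoneyBee implements this exact strategy is an assumption.

```python
def sliding_window_chunks(tokens, max_len=512, stride=256):
    """Split a token list into overlapping windows of at most max_len tokens.

    Hypothetical helper for illustration; not taken from the HoneyBee API.
    Consecutive windows start `stride` tokens apart, so they overlap by
    max_len - stride tokens.
    """
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("require max_len > 0 and 0 < stride <= max_len")
    if not tokens:
        return []

    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        # Stop once the current window reaches the end of the document.
        if start + max_len >= len(tokens):
            break
        start += stride
    return chunks


if __name__ == "__main__":
    note = "patient presents with stage II invasive ductal carcinoma".split()
    for chunk in sliding_window_chunks(note, max_len=4, stride=2):
        print(chunk)
```

Each window would then be embedded separately, and the per-window embeddings combined (for example by averaging) into one vector per document.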

Molecular Data

  • Genomics: Support for expression data and mutation profiles
  • Integration: Seamless combination with imaging and clinical data

🧠 State-of-the-Art Embedding Models

Clinical Text Embeddings

  • GatorTron: Domain-specific clinical language model
  • BioBERT: Biomedical text understanding
  • PubMedBERT: Scientific literature embeddings
  • Clinical-T5: Text-to-text clinical transformers

Medical Image Embeddings

  • REMEDIS: Self-supervised medical image representations
  • RadImageNet: Pre-trained radiological feature extractors
  • UNI: Universal medical image encoder
  • Custom Models: Easy integration of proprietary models

🛠️ Advanced Capabilities

Multimodal Integration

  • Cross-Modal Learning: Unified representations across modalities
  • Attention Mechanisms: Interpretable fusion strategies
  • Patient-Level Aggregation: Comprehensive patient profiles
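
As a rough illustration of patient-level aggregation, the sketch below mean-pools the embeddings of each modality and concatenates the results into one vector per patient. This is a simple baseline under stated assumptions, not HoneyBee's fusion method: `mean_pool` and `patient_embedding` are hypothetical names, and the attention-based fusion mentioned above is not shown.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def patient_embedding(modalities):
    """Build one vector per patient from per-modality embedding lists.

    `modalities` maps a modality name (e.g. "text", "image") to a list of
    embedding vectors. Each modality is mean-pooled, then the pooled vectors
    are concatenated in sorted key order so the layout is deterministic.
    """
    combined = []
    for name in sorted(modalities):
        combined.extend(mean_pool(modalities[name]))
    return combined


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real embeddings are much larger.
    patient = {
        "image": [[0.0, 2.0], [2.0, 4.0]],
        "text": [[1.0, 1.0]],
    }
    print(patient_embedding(patient))
```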

Analysis Tools

  • Survival Analysis: Cox PH, Random Survival Forest, DeepSurv
  • Classification: Multi-class cancer type prediction
  • Retrieval: Similar patient identification
  • Visualization: Interactive t-SNE dashboards
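
Similar-patient retrieval is often done by ranking stored patient embeddings by cosine similarity to a query embedding. The following is a minimal sketch of that general technique, assuming embeddings are plain lists of floats; `cosine_similarity` and `most_similar` are hypothetical names, not HoneyBee functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, database, k=3):
    """Return the k (patient_id, score) pairs with the highest similarity.

    `database` maps patient IDs to embedding vectors.
    """
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy embeddings for illustration only.
    db = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [1.0, 1.0]}
    for pid, score in most_similar([1.0, 0.0], db, k=2):
        print(pid, round(score, 3))
```

A linear scan like this is fine for small cohorts; larger collections usually call for an approximate nearest-neighbor index.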

Clinical Applications

  • Risk Stratification: Patient outcome prediction
  • Treatment Planning: Personalized therapy recommendations
  • Biomarker Discovery: Multi-omic pattern identification

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • PyTorch 2.0+
  • CUDA 11.7+ (optional, for GPU acceleration)
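
The prerequisites above can be checked from Python before installing. This is a minimal sketch and not part of HoneyBee: `check_prerequisites` is a hypothetical helper, and it only reports what it finds rather than enforcing the PyTorch or CUDA version numbers.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum version taken from the prerequisites list


def check_prerequisites(min_python=MIN_PYTHON):
    """Report whether the interpreter and the optional GPU stack look usable."""
    report = {
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["torch_version"] = torch.__version__
            report["cuda_available"] = bool(torch.cuda.is_available())
        except (ImportError, OSError):
            # A broken install can fail at import time; treat it as missing.
            report["torch_installed"] = False
    return report


if __name__ == "__main__":
    for key, value in check_prerequisites().items():
        print(f"{key}: {value}")
```

If `cuda_available` is false, processing would be expected to fall back to the CPU, which is slower but does not require a GPU.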

System Dependencies

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y openslide-tools tesseract-ocr

# macOS
brew install openslide tesseract

# Windows
# Install from official websites:
# - OpenSlide: https://openslide.org/download/
# - Tesseract: https://github.com/UB-Mannheim/tesseract/wiki

Installation

# Clone the repository
git clone https://github.com/lab-rasool/HoneyBee.git
cd HoneyBee

# Install dependencies
pip install -r requirements.txt

# Download required NLTK data
python -c "import nltk; nltk.download('punkt')"

# Install HoneyBee in development mode
pip install -e .

Environment Setup

Create a .env file in the project root:

# MINDS database credentials (if using MINDS format)
HOST=your_server
PORT=5433
DB_USER=postgres
PASSWORD=your_password
DATABASE=minds

# HuggingFace API (for some models)
HF_API_KEY=your_huggingface_api_key
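
To sanity-check the `.env` file before running anything that needs it, the values can be parsed with the standard library alone. The sketch below is an illustration under assumptions: `parse_env_text`, `load_env_file`, and `missing_keys` are hypothetical helpers, the parser handles only simple `KEY=value` lines, and HoneyBee itself may read the file differently.

```python
import os
from pathlib import Path

# Keys listed in the MINDS section of the .env example above.
REQUIRED_MINDS_KEYS = ("HOST", "PORT", "DB_USER", "PASSWORD", "DATABASE")


def parse_env_text(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def load_env_file(path=".env", export=False):
    """Read a .env file; optionally copy values into os.environ."""
    values = parse_env_text(Path(path).read_text(encoding="utf-8"))
    if export:
        for key, value in values.items():
            os.environ.setdefault(key, value)  # do not overwrite existing vars
    return values


def missing_keys(values, required=REQUIRED_MINDS_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env = load_env_file(".env")
    print("Missing MINDS settings:", missing_keys(env) or "none")
```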

🔬 Research Applications

HoneyBee has been successfully applied to:

  • Cancer Subtype Classification: Automated identification of cancer subtypes from multimodal data
  • Survival Prediction: Risk stratification and outcome prediction for treatment planning
  • Similar Patient Retrieval: Finding patients with similar clinical profiles for precision medicine
  • Biomarker Discovery: Identifying multimodal patterns associated with treatment response

🀝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

# Fork and clone your fork
git clone https://github.com/YOUR_USERNAME/HoneyBee.git
cd HoneyBee

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -r requirements.txt
pip install -e .

🐛 Known Issues & Limitations

  • Alpha Status: Some features are still under development
  • Memory Requirements: WSI processing requires significant RAM (16 GB or more recommended)
  • GPU Recommended: While CPU fallback exists, GPU acceleration significantly improves performance
  • Limited Test Coverage: A comprehensive test suite is planned for future releases

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

📝 Citation

If you use HoneyBee in your research, please cite our paper:

@article{tripathi2024honeybee,
    title={HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets with Foundational Embedding Models},
    author={Aakash Tripathi and Asim Waqas and Yasin Yilmaz and Ghulam Rasool},
    journal={arXiv preprint arXiv:2405.07460},
    year={2024},
    eprint={2405.07460},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

Made with ❤️ by the Lab Rasool team

About

🐝 | From Data to Prognosis: Embedding Multimodal Oncology Data for Precision Medicine
