A Python tool for finding and managing duplicate media files, with specialized handling for audiobooks and OpenLibrary integration.
- Advanced duplicate detection with intelligent pattern matching
- OpenLibrary API integration with rate limiting
- Comprehensive reporting system:
  - HTML reports with CSS styling
  - JSON export for data persistence
  - Directory-based statistics
  - Space savings calculations
  - Processing statistics and memory monitoring
- Smart chapter and series pattern detection
- OpenLibrary metadata enrichment
- Rate-limited API integration (100 requests/5 minutes)
- Multiple format support (.mp3, .m4b, .aac)
- Advanced pattern matching for:
  - Series detection (e.g., "Book #1", "Volume 2/3")
  - Chapter identification
  - Track numbering
  - Multiple narrators/versions
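
As a rough illustration of the series detection listed above, the following minimal sketch matches names such as "Book #1" and "Volume 2/3" with regular expressions. The patterns and the `detect_series` helper are assumptions made for this example; the expressions DupAssassin actually uses may differ.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns for illustration; not taken from DupAssassin itself.
SERIES_PATTERNS = [
    # "Book #1", "Book 1", "book#12"
    re.compile(r"\bBook\s*#?\s*(\d+)", re.IGNORECASE),
    # "Volume 2", "Vol. 2", "Volume 2/3" (index and optional total)
    re.compile(r"\bVol(?:ume)?\.?\s*(\d+)(?:\s*/\s*(\d+))?", re.IGNORECASE),
]


def detect_series(name: str) -> Optional[Tuple[int, Optional[int]]]:
    """Return (index, total) if the name carries a series marker, else None.

    total is None when the name does not state how many volumes exist.
    """
    for pattern in SERIES_PATTERNS:
        match = pattern.search(name)
        if match:
            index = int(match.group(1))
            # Only the "Volume" pattern defines a second (total) group.
            total = match.group(2) if pattern.groups > 1 else None
            return index, int(total) if total else None
    return None


if __name__ == "__main__":
    for sample in ("Dune Book #1.m4b", "Saga Volume 2/3.mp3", "Random Title.aac"):
        print(sample, "->", detect_series(sample))
```

Returning the total separately makes it possible to tell "Volume 2/3" of one set apart from a standalone "Volume 2" when grouping candidates.
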
- Clone the repository
- Create a virtual environment:
  python -m venv venv
- Activate the virtual environment:
  - Windows:
    venv\Scripts\activate
  - Unix/macOS:
    source venv/bin/activate
- Install requirements:
  pip install -r requirements.txt
- Run the tool:
  python DupAssassin.py
- Select media type (Audiobooks, Movies, TV Shows, Ebooks)
- Choose directories to scan
- Review potential duplicates with options:
  - View duplicate groups
  - View directory statistics
  - View file patterns
  - Process with OpenLibrary verification
- Memory usage monitoring and garbage collection
- Multi-threaded processing
- Rate-limited API requests
- Progress tracking with ETA
- Graceful API timeout handling
- Interrupt signal management
- Invalid file structure detection
- Logging with configurable levels
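
The rate-limited API requests mentioned above (100 requests per 5 minutes) could be enforced with a sliding-window limiter along these lines. This is a sketch under assumptions: the `SlidingWindowLimiter` class and its method names are hypothetical, and the tool may use a different strategy.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any rolling window of period seconds."""

    def __init__(
        self,
        max_calls: int = 100,       # limit quoted in this README
        period: float = 300.0,      # 5 minutes, in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self._calls: Deque[float] = deque()  # timestamps of recent calls

    def wait_time(self) -> float:
        """Seconds until the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.period - (now - self._calls[0])

    def acquire(self) -> None:
        """Block until a call is allowed, then record it."""
        delay = self.wait_time()
        if delay > 0:
            time.sleep(delay)
        self._calls.append(self.clock())


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_calls=3, period=1.0)
    for i in range(5):
        limiter.acquire()  # the 4th call pauses until the window frees up
        print("request", i)
```

Calling `acquire()` before each API request keeps the client under the limit; the injectable `clock` exists only to make the limiter easy to test without real waiting.
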
- Detailed HTML reports
- JSON data export
- Directory-based analysis
- Space savings calculations
- Processing statistics:
  - File processing rates
  - API success rates
  - Memory usage tracking
  - Error logging
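
To make the space savings figure concrete, here is a minimal sketch of one way it could be computed. It assumes each duplicate group maps file paths to sizes in bytes and that the largest file in each group is kept; both the data shape and the keep-the-largest rule are assumptions for illustration, not the tool's documented behavior.

```python
from typing import Dict, List


def space_savings(groups: List[Dict[str, int]]) -> int:
    """Bytes reclaimable if only one file per duplicate group is kept.

    Assumption: the largest file in each group is the one retained.
    """
    total = 0
    for sizes in groups:
        if len(sizes) > 1:  # a single file is not a duplicate group
            total += sum(sizes.values()) - max(sizes.values())
    return total


def format_bytes(n: float) -> str:
    """Render a byte count with a binary-scaled unit for reports."""
    for unit in ("B", "KB", "MB", "GB"):
        if n < 1024:
            return f"{n:.1f} {unit}"
        n /= 1024
    return f"{n:.1f} TB"


if __name__ == "__main__":
    # Hypothetical sample data: path -> size in bytes.
    groups = [
        {"a/book.mp3": 4096, "b/book.mp3": 2048},
        {"c/solo.m4b": 1000},
    ]
    print(format_bytes(space_savings(groups)))
```
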
- Python 3.8+
- OpenLibrary API access
- Required packages in requirements.txt
MIT License - See LICENSE file