A Model Context Protocol (MCP) server for advanced audio transcription and processing using OpenAI's Whisper and GPT-4o models.
MCP Server Whisper provides a standardized way to process audio files through OpenAI's transcription services. By implementing the Model Context Protocol, it enables AI assistants like Claude to seamlessly interact with audio processing capabilities.
Key features:
- Advanced file searching with regex patterns, file metadata filtering, and sorting capabilities
- Parallel batch processing for multiple audio files
- Format conversion between supported audio types
- Automatic compression for oversized files
- Enhanced transcription with specialized prompts
- Text-to-speech generation with customizable voices and models
- Comprehensive metadata including duration, file size, and format support
- High-performance caching for repeated operations
```bash
# Clone the repository
git clone https://github.com/arcaputo3/mcp-server-whisper.git
cd mcp-server-whisper

# Install dependencies using uv
uv sync

# Set up pre-commit hooks
uv run pre-commit install
```
Create a `.env` file with the following variables:

```bash
OPENAI_API_KEY=your_openai_api_key
AUDIO_FILES_PATH=/path/to/your/audio/files
```
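As a rough sketch of what the server likely does with these variables at startup (the function and key names here are illustrative assumptions, not the project's actual code):

```python
# Hypothetical config loader: read the two required environment
# variables and fail fast with a clear error if either is missing.
import os

def load_config() -> dict:
    config = {
        "api_key": os.environ.get("OPENAI_API_KEY"),
        "audio_path": os.environ.get("AUDIO_FILES_PATH"),
    }
    missing = [name for name, value in config.items() if not value]
    if missing:
        raise RuntimeError(f"Missing required settings: {missing}")
    return config
```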
To run the MCP server in development mode:

```bash
mcp dev src/mcp_server_whisper/server.py
```
To install the server for use with Claude Desktop or other MCP clients:

```bash
mcp install src/mcp_server_whisper/server.py [--env-file .env]
```
`list_audio_files` - Lists audio files with comprehensive filtering and sorting options:
- Filter by regex pattern matching on filenames
- Filter by file size, duration, modification time, or format
- Sort by name, size, duration, modification time, or format
- All operations support parallelized batch processing
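The filter-then-sort logic can be sketched as follows; the parameter names (`pattern`, `min_size`, `sort_by`) are assumptions for illustration and may not match the tool's real signature:

```python
# Illustrative sketch of regex + metadata filtering with attribute sorting.
import re
from dataclasses import dataclass

@dataclass
class AudioFile:
    name: str
    size: int          # bytes
    duration: float    # seconds

def list_audio_files(files, pattern=None, min_size=0, sort_by="name"):
    """Keep files matching the regex and size floor, sorted by the given attribute."""
    matched = [
        f for f in files
        if f.size >= min_size and (pattern is None or re.search(pattern, f.name))
    ]
    return sorted(matched, key=lambda f: getattr(f, sort_by))
```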
`get_latest_audio` - Gets the most recently modified audio file with model support info

`convert_audio` - Converts audio files to supported formats (mp3 or wav)

`compress_audio` - Compresses audio files that exceed size limits

`transcribe_audio` - Basic transcription using OpenAI's Whisper model

`transcribe_with_llm` - Transcription with custom prompts using GPT-4o

`transcribe_with_enhancement` - Enhanced transcription with specialized templates:
- `detailed` - Includes tone, emotion, and background details
- `storytelling` - Transforms the transcript into a narrative form
- `professional` - Creates formal, business-appropriate transcriptions
- `analytical` - Adds analysis of speech patterns and key points
`create_claudecast` - Generates text-to-speech audio using OpenAI's TTS API:
- Supports different models (`tts-1`, `tts-1-hd`)
- Multiple voice options (alloy, ash, coral, echo, fable, onyx, nova, sage, shimmer)
- Customizable output file paths
- Handles texts of any length by automatically splitting and joining audio segments
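The split-then-join approach for long scripts might look like the sketch below. The 4096-character limit matches OpenAI's TTS input cap; splitting on sentence boundaries is an illustrative assumption about the strategy, not the project's actual code:

```python
# Hypothetical text splitter: greedily pack whole sentences into chunks
# that each stay under the TTS character limit.
def split_text(text: str, limit: int = 4096) -> list[str]:
    sentences = text.replace("\n", " ").split(". ")
    chunks, current = [], ""
    for s in sentences:
        piece = s if s.endswith(".") else s + "."
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and len(current) + len(piece) + 1 > limit:
            chunks.append(current.strip())
            current = ""
        current += piece + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Each chunk would then be synthesized separately and the resulting audio segments concatenated (e.g. with pydub) into one output file.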
| Model   | Supported Formats                     |
|---------|---------------------------------------|
| Whisper | mp3, mp4, mpeg, mpga, m4a, wav, webm  |
| GPT-4o  | mp3, wav                              |
Note: Files larger than 25MB are automatically compressed to meet API limits.
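A minimal sketch of the compression decision, assuming the server picks an mp3 bitrate that fits the file under the limit (the headroom factor and bitrate formula are illustrative assumptions):

```python
# Hypothetical size check and bitrate estimate for the 25 MB upload limit.
API_LIMIT_BYTES = 25 * 1024 * 1024

def needs_compression(size_bytes: int) -> bool:
    """True when a file exceeds the 25 MB API limit."""
    return size_bytes > API_LIMIT_BYTES

def target_bitrate_kbps(duration_seconds: float, headroom: float = 0.95) -> int:
    """Pick an mp3 bitrate (kbps) so the re-encoded file fits with some headroom."""
    bits_available = API_LIMIT_BYTES * 8 * headroom
    return max(32, int(bits_available / duration_seconds / 1000))
```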
Basic Audio Transcription
Claude, please transcribe my latest audio file with detailed insights.
Claude will automatically:
- Find the latest audio file using `get_latest_audio`
- Determine the appropriate transcription method
- Process the file with `transcribe_with_enhancement` using the "detailed" template
- Return the enhanced transcription
Advanced Audio File Search and Filtering
Claude, list all my audio files that are longer than 5 minutes and were created after January 1st, 2024, sorted by size.
Claude will:
- Convert the date to a timestamp
- Use `list_audio_files` with appropriate filters:
  - `min_duration_seconds: 300` (5 minutes)
  - `min_modified_time: <timestamp for Jan 1, 2024>`
  - `sort_by: "size"`
- Return a sorted list of matching audio files with comprehensive metadata
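The date-to-timestamp step above is plain epoch conversion; a sketch (assuming the filter expects Unix epoch seconds, which is an assumption about the tool's API):

```python
# Convert a calendar date (UTC midnight) to Unix epoch seconds,
# the value passed as min_modified_time.
from datetime import datetime, timezone

def to_timestamp(year: int, month: int, day: int) -> float:
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()
```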
Batch Processing Multiple Files
Claude, find all MP3 files with "interview" in the filename and create professional transcripts for each one.
Claude will:
- Search for files using `list_audio_files` with:
  - `pattern: ".*interview.*\\.mp3"`
  - `format: "mp3"`
- Process all matching files in parallel using `transcribe_with_enhancement` with `enhancement_type: "professional"`
- Return all transcriptions in a well-formatted output
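The parallel step follows the standard asyncio fan-out pattern; `transcribe_one` below is a stand-in for the real per-file API call, not the project's actual function:

```python
# Sketch of parallel batch transcription: one task per file, gathered in order.
import asyncio

async def transcribe_one(name: str, enhancement: str) -> str:
    await asyncio.sleep(0)  # placeholder for the API round-trip
    return f"{name}:{enhancement}"

async def transcribe_batch(names: list[str], enhancement: str) -> list[str]:
    tasks = [transcribe_one(n, enhancement) for n in names]
    return await asyncio.gather(*tasks)
```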
Generating Text-to-Speech with Claudecast
Claude, create a claudecast with this script: "Welcome to our podcast! Today we'll be discussing artificial intelligence trends in 2025." Use the shimmer voice.
Claude will:
- Use the `create_claudecast` tool with:
  - `text_prompt` containing the script
  - `voice: "shimmer"`
  - `model: "tts-1-hd"` (default high-quality model)
- Generate the audio file and save it to the configured audio directory
- Provide the path to the generated audio file
Add this to your `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "whisper": {
      "command": "uvx",
      "args": [
        "--with",
        "aiofiles",
        "--with",
        "mcp[cli]",
        "--with",
        "openai",
        "--with",
        "pydub",
        "mcp-server-whisper"
      ],
      "env": {
        "OPENAI_API_KEY": "your_openai_api_key",
        "AUDIO_FILES_PATH": "/path/to/your/audio/files"
      }
    }
  }
}
```
- Install Screen Recorder By Omi (free)
- Set `AUDIO_FILES_PATH` to `/Users/<user>/Movies/Omi Screen Recorder`, replacing `<user>` with your username
- As you record audio with the app, you can transcribe large batches directly with Claude
This project uses modern Python development tools including `uv`, `pytest`, `ruff`, and `mypy`.
```bash
# Run tests
uv run pytest

# Run with coverage
uv run pytest --cov=src

# Format code
uv run ruff format src

# Lint code
uv run ruff check src

# Run type checking (strict mode)
uv run mypy --strict src

# Run the pre-commit hooks
pre-commit run --all-files
```
The project uses GitHub Actions for CI/CD:
- Lint & Type Check: Ensures code quality with ruff and strict mypy type checking
- Tests: Runs tests on multiple Python versions (3.10, 3.11)
- Build: Creates distribution packages
- Publish: Automatically publishes to PyPI when a new version tag is pushed
To create a new release version:
```bash
git checkout main

# Make sure everything is up to date
git pull

# Create a new version tag
git tag v0.1.1

# Push the tag
git push origin v0.1.1
```
For detailed architecture information, see Architecture Documentation.
MCP Server Whisper is built on the Model Context Protocol, which standardizes how AI models interact with external tools and data sources. The server:
- Exposes Audio Processing Capabilities: Through standardized MCP tool interfaces
- Implements Parallel Processing: Using asyncio and batch operations for performance
- Manages File Operations: Handles detection, validation, conversion, and compression
- Provides Rich Transcription: Via different OpenAI models and enhancement templates
- Optimizes Performance: With caching mechanisms for repeated operations
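The caching point above can be sketched with a memoization pattern keyed on path and modification time, so cached results invalidate when a file changes; the key shape and function names here are assumptions, not the server's actual implementation:

```python
# Hypothetical metadata cache: re-probe a file only when its mtime changes.
from functools import lru_cache

@lru_cache(maxsize=128)
def _cached_metadata(path: str, mtime: float) -> dict:
    # The expensive probe (duration, format, size) would run here.
    return {"path": path, "probed_at_mtime": mtime}

def get_metadata(path: str, mtime: float) -> dict:
    return _cached_metadata(path, mtime)
```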
Under the hood, it uses:
- `pydub` for audio file manipulation
- `asyncio` for concurrent processing
- OpenAI's Whisper API for base transcription
- GPT-4o for enhanced audio understanding
- FastMCP for simplified MCP server implementation
- Type hints and strict mypy validation throughout the codebase
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a new branch for your feature (`git checkout -b feature/amazing-feature`)
- Make your changes
- Run the tests and linting (`uv run pytest && uv run ruff check src && uv run mypy --strict src`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Model Context Protocol (MCP) - For the protocol specification
- pydub - For audio processing
- OpenAI Whisper - For audio transcription
- FastMCP - For MCP server implementation
- Anthropic Claude - For natural language interaction