Add cloud-based Whisper support for transcription #111

Open · wants to merge 6 commits into `main`
README.md: 33 additions, 0 deletions
@@ -15,6 +15,7 @@ Manim Voiceover is a [Manim](https://manim.community) plugin for all things voiceover:
- Record voiceovers with your microphone during rendering with a simple command line interface.
- Develop animations with auto-generated AI voices from various free and proprietary services.
- Per-word timing of animations, i.e. trigger animations at specific words in the voiceover, even for the recordings. This works thanks to [OpenAI Whisper](https://github.com/openai/whisper).
- **NEW**: Supports cloud-based Whisper in addition to the local model, which is useful on ARM64 architectures (like Apple Silicon) where the local model may not work.

Here is a demo:

@@ -41,6 +42,38 @@ Currently supported TTS services (aside from the CLI that allows you to record your own voice):

[Check out the example gallery to get inspired.](https://voiceover.manim.community/en/latest/examples.html)

## Cloud Whisper Support

For ARM64 architectures (like Apple Silicon Macs) or systems where installing the local Whisper model is problematic, you can now use OpenAI's cloud-based Whisper API for speech-to-text alignment:

```bash
# Run with the provided script
python manim_cloud_whisper.py -pql examples/cloud_whisper_demo.py CloudWhisperDemo
```

Or enable it programmatically:

```python
service = OpenAIService(
    voice="alloy",
    model="tts-1",
    transcription_model="base",
    use_cloud_whisper=True  # This enables cloud-based Whisper
)
```

You can also set an environment variable to enable cloud-based Whisper:

```bash
# Set the environment variable
export MANIM_VOICEOVER_USE_CLOUD_WHISPER=1

# Run Manim normally
manim -pql examples/cloud_whisper_demo.py CloudWhisperDemo
```

[Learn more about cloud-based Whisper in the documentation.](https://voiceover.manim.community/en/latest/cloud_whisper.html)

## Translate

Manim Voiceover can use machine translation services like [DeepL](https://www.deepl.com/) to translate voiceovers into other languages. [Check out the docs for more details.](https://voiceover.manim.community/en/latest/translate.html)
demo_openai_cloud_whisper.py: 110 additions, 0 deletions
@@ -0,0 +1,110 @@
from manim import *
from manim_voiceover.voiceover_scene import VoiceoverScene
from manim_voiceover.services.openai import OpenAIService

class OpenAICloudWhisperDemo(VoiceoverScene):
    def construct(self):
        # Print the cloud whisper setting
        print(f"Cloud Whisper enabled: {config.use_cloud_whisper}")

        # Initialize OpenAI speech service with cloud whisper
        service = OpenAIService(
            voice="alloy",  # Available voices: alloy, echo, fable, onyx, nova, shimmer
            model="tts-1",  # tts-1 or tts-1-hd
            transcription_model="base",
            use_cloud_whisper=True  # Use cloud-based Whisper
        )
        self.set_speech_service(service)

        # Create a title
        title = Text("OpenAI TTS + Cloud Whisper Demo", font_size=48)
        self.play(Write(title))
        self.wait(1)

        # Move title to top
        self.play(title.animate.to_edge(UP))

        # Create a subtitle
        subtitle = Text("Word-level alignment on ARM64 architectures",
                        font_size=36,
                        color=BLUE)
        subtitle.next_to(title, DOWN)
        self.play(FadeIn(subtitle))

        # Demonstrate voiceover with bookmarks
        with self.voiceover(
            """This demonstration uses OpenAI's text-to-speech service
            with <bookmark mark='cloud_point'/> cloud-based Whisper for
            word-level <bookmark mark='alignment_point'/> alignment."""
        ) as tracker:
            # Wait until the first bookmark
            self.wait_until_bookmark("cloud_point")

            # Create and animate the cloud text
            cloud_text = Text("☁️ Cloud-based Whisper", color=BLUE, font_size=36)
            cloud_text.next_to(subtitle, DOWN, buff=1)
            self.play(FadeIn(cloud_text))

            # Wait until the second bookmark
            self.wait_until_bookmark("alignment_point")

            # Create and animate the alignment text
            alignment_text = Text("Perfect Word Timing", color=GREEN, font_size=36)
            alignment_text.next_to(cloud_text, DOWN, buff=0.5)
            self.play(FadeIn(alignment_text))

        # Continue with demonstration
        self.wait(1)

        # Show ARM64 compatibility
        arm_title = Text("Works on Apple Silicon!", color=RED, font_size=36)
        arm_title.next_to(alignment_text, DOWN, buff=1)

        with self.voiceover(
            "This feature is especially useful for ARM64 architectures like your M4 Pro."
        ):
            self.play(FadeIn(arm_title))

        # Final animation
        self.wait(1)

        with self.voiceover(
            "No local Whisper model required. Everything happens in the cloud!"
        ):
            # Create a final animation
            final_group = VGroup(title, subtitle, cloud_text, alignment_text, arm_title)
            self.play(
                final_group.animate.scale(0.8).to_edge(UP),
            )

            # Create a cloud icon
            cloud = Text("☁️", font_size=120)
            self.play(FadeIn(cloud))

            # Add some particles around the cloud
            particles = VGroup(*[
                Dot(radius=0.05, color=BLUE).move_to(
                    cloud.get_center() + np.array([
                        np.random.uniform(-3, 3),
                        np.random.uniform(-2, 2),
                        0
                    ])
                )
                for _ in range(20)
            ])
            self.play(FadeIn(particles))

            # Animate the particles
            self.play(
                *[
                    p.animate.shift(np.array([
                        np.random.uniform(-1, 1),
                        np.random.uniform(-1, 1),
                        0
                    ]))
                    for p in particles
                ],
                run_time=2
            )

        self.wait(2)
direct_openai_test.py: 107 additions, 0 deletions
@@ -0,0 +1,107 @@
import os
import json
from pathlib import Path
from dotenv import load_dotenv
import openai

# Load environment variables from .env file
load_dotenv()

# Create a temporary directory for audio files
temp_dir = Path("./temp_direct_test")
temp_dir.mkdir(exist_ok=True)

# Constants for audio offset resolution (same as in manim-voiceover)
AUDIO_OFFSET_RESOLUTION = 1000 # 1000 = milliseconds

print("=== Direct OpenAI API Test ===")

# First, generate speech using OpenAI TTS
print("\nGenerating speech from text...")
text = "This is a test of the cloud-based Whisper feature."

# Generate speech using OpenAI TTS
response = openai.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=text
)

audio_path = temp_dir / "direct_test.mp3"
response.stream_to_file(str(audio_path))

print(f"Speech generated and saved to {audio_path}")

# Now, transcribe the audio using OpenAI Whisper API
print("\nTranscribing audio with word-level timestamps...")
with open(audio_path, "rb") as audio_file:
    transcription = openai.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"]
    )

# Print the raw response structure
print("\nRaw API Response Structure:")
print(f"Response type: {type(transcription)}")
print(f"Response attributes: {dir(transcription)}")
print(f"Has 'words' attribute: {hasattr(transcription, 'words')}")

if hasattr(transcription, 'words'):
    print(f"Words type: {type(transcription.words)}")
    print(f"Words count: {len(transcription.words)}")

    # Try to access the first word
    if len(transcription.words) > 0:
        first_word = transcription.words[0]
        print(f"First word type: {type(first_word)}")
        print(f"First word attributes: {dir(first_word)}")
        print(f"First word: {first_word.word if hasattr(first_word, 'word') else 'No word attribute'}")
        print(f"First word start: {first_word.start if hasattr(first_word, 'start') else 'No start attribute'}")

# Convert to word boundaries format used by manim-voiceover
print("\nConverting to word boundaries format...")
word_boundaries = []
current_text_offset = 0

if hasattr(transcription, 'words'):
    for word_obj in transcription.words:
        try:
            word = word_obj.word
            start_time = word_obj.start

            # Create a word boundary entry
            word_boundary = {
                "audio_offset": int(start_time * AUDIO_OFFSET_RESOLUTION),
                "text_offset": current_text_offset,
                "word_length": len(word),
                "text": word,
                "boundary_type": "Word",
            }

            word_boundaries.append(word_boundary)
            current_text_offset += len(word) + 1  # +1 for space

            print(f"Added word boundary: {word} at {start_time}s")
        except Exception as e:
            print(f"Error processing word: {e}")

print(f"\nCreated {len(word_boundaries)} word boundaries")

# Create a cache file that manim-voiceover can use
cache_data = {
    "input_text": text,
    "input_data": {"input_text": text, "service": "openai"},
    "original_audio": audio_path.name,
    "word_boundaries": word_boundaries,
    "transcribed_text": transcription.text,
    "final_audio": audio_path.name
}

cache_file = temp_dir / "cache.json"
with open(cache_file, "w") as f:
    json.dump([cache_data], f, indent=2)

print(f"\nCreated cache file at {cache_file}")
print("\nTest completed!")
docs/source/cloud_whisper.md: 102 additions, 0 deletions
@@ -0,0 +1,102 @@
# Cloud-based Whisper Transcription

## Overview

Manim-voiceover now supports cloud-based transcription using OpenAI's Whisper API. This is particularly useful for:

- ARM64 architectures (like Apple Silicon Macs) where installing the local Whisper model might be problematic
- Systems where you don't want to install the large Whisper model
- Situations where you need higher-accuracy transcription than the local model provides

## Setup

To use cloud-based Whisper, you'll need:

1. An OpenAI API key
2. The OpenAI Python package

Install the necessary dependencies:

```bash
pip install "manim-voiceover[openai]"
```

## Usage

### Command Line Option

You can enable cloud-based Whisper for any Manim render by using the provided script:

```bash
python manim_cloud_whisper.py -pql examples/cloud_whisper_demo.py CloudWhisperDemo
```

Or by setting an environment variable:

```bash
# Set the environment variable
export MANIM_VOICEOVER_USE_CLOUD_WHISPER=1

# Run Manim normally
manim -pql examples/cloud_whisper_demo.py CloudWhisperDemo
```

### Programmatic Usage

You can also enable cloud-based Whisper programmatically when initializing any speech service:

```python
from manim_voiceover.services.azure import AzureService
from manim_voiceover.voiceover_scene import VoiceoverScene

class MyScene(VoiceoverScene):
    def construct(self):
        # Use cloud-based Whisper for transcription
        service = AzureService(
            voice="en-US-GuyNeural",
            transcription_model="base",  # Still specify a model name
            use_cloud_whisper=True  # This enables cloud-based Whisper
        )
        self.set_speech_service(service)

        # Rest of your scene...
```

## How It Works

When cloud-based Whisper is enabled:

1. The speech service uses OpenAI's hosted Whisper API to transcribe your audio files (see the sketch after this list)
2. Word-level alignment still works for bookmarks and animations
3. Your audio files are sent to OpenAI's servers for transcription
4. An OpenAI API key is required; you'll be prompted to enter one if it is not found

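Under the hood, the transcription request is roughly the following direct call to the OpenAI API. This is a minimal sketch adapted from `direct_openai_test.py` in this pull request; the audio path is a hypothetical placeholder, and the speech service performs the call for you:

```python
import openai

# Transcribe an already generated voiceover file with word-level timestamps.
# "my_voiceover.mp3" is a placeholder; manim-voiceover passes its own cached file.
with open("my_voiceover.mp3", "rb") as audio_file:
    transcription = openai.audio.transcriptions.create(
        model="whisper-1",                 # OpenAI's hosted Whisper model
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"],  # needed for per-word bookmarks
    )

# Each returned word carries its text and start time, which manim-voiceover
# converts into the word boundaries used for bookmark timing.
for word in transcription.words:
    print(word.word, word.start)
```
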
## Pricing

Using cloud-based Whisper incurs costs based on OpenAI's pricing model:

- Audio transcription is billed per minute of audio
- Check [OpenAI's pricing page](https://openai.com/pricing) for the most up-to-date information

## Switching Between Local and Cloud

You can use both local and cloud-based Whisper in the same project:

- Use the `--use-cloud-whisper` flag when you need cloud-based transcription
- Omit the flag to use the local Whisper model (see the sketch below)

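For example, the same render can be switched between the two modes like this (a sketch that reuses the wrapper script and example scene from this pull request):

```bash
# Cloud-based Whisper via the wrapper script
python manim_cloud_whisper.py -pql examples/cloud_whisper_demo.py CloudWhisperDemo

# Cloud-based Whisper via the environment variable
MANIM_VOICEOVER_USE_CLOUD_WHISPER=1 manim -pql examples/cloud_whisper_demo.py CloudWhisperDemo

# Local Whisper (default)
manim -pql examples/cloud_whisper_demo.py CloudWhisperDemo
```
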
## Troubleshooting

### API Key Issues

If you encounter errors related to the API key:

1. Check that you have set the `OPENAI_API_KEY` environment variable
2. Alternatively, create a `.env` file in your project directory with `OPENAI_API_KEY=your_key_here` (see the sketch below)

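Either of the following makes the key available to manim-voiceover (a sketch; substitute your own key for the placeholder):

```bash
# Option 1: export the key in your shell
export OPENAI_API_KEY=your_key_here

# Option 2: keep it in a .env file in the project directory (loaded via python-dotenv)
echo "OPENAI_API_KEY=your_key_here" > .env
```
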
### Response Format Issues

The cloud API might return a different format than expected. If you encounter errors:

1. Check that you're using the latest version of manim-voiceover
2. Try using a different transcription model