Add cloud-based Whisper support for transcription #111

Open · wants to merge 6 commits into `main`
README.md: 33 additions, 0 deletions
@@ -15,6 +15,7 @@ Manim Voiceover is a [Manim](https://manim.community) plugin for all things voiceover:
- Record voiceovers with your microphone during rendering with a simple command line interface.
- Develop animations with auto-generated AI voices from various free and proprietary services.
- Per-word timing of animations, i.e. trigger animations at specific words in the voiceover, even for the recordings. This works thanks to [OpenAI Whisper](https://github.com/openai/whisper).
- **NEW**: Supports cloud-based Whisper in addition to the local model, which is useful on ARM64 architectures (like Apple Silicon) where the local model may not work.

Here is a demo:

@@ -41,6 +42,38 @@ Currently supported TTS services (aside from the CLI that allows you to record your own voice):

[Check out the example gallery to get inspired.](https://voiceover.manim.community/en/latest/examples.html)

## Cloud Whisper Support

For ARM64 architectures (like Apple Silicon Macs) or systems where installing the local Whisper model is problematic, you can now use OpenAI's cloud-based Whisper API for speech-to-text alignment:

```bash
# Run with the provided script
python manim_cloud_whisper.py -pql examples/cloud_whisper_demo.py CloudWhisperDemo
```

Or enable it programmatically:

```python
service = OpenAIService(
    voice="alloy",
    model="tts-1",
    transcription_model="base",
    use_cloud_whisper=True  # This enables cloud-based Whisper
)
```

You can also set an environment variable to enable cloud-based Whisper:

```bash
# Set the environment variable
export MANIM_VOICEOVER_USE_CLOUD_WHISPER=1

# Run Manim normally
manim -pql examples/cloud_whisper_demo.py CloudWhisperDemo
```

[Learn more about cloud-based Whisper in the documentation.](https://voiceover.manim.community/en/latest/cloud_whisper.html)

## Translate

Manim Voiceover can use machine translation services like [DeepL](https://www.deepl.com/) to translate voiceovers into other languages. [Check out the docs for more details.](https://voiceover.manim.community/en/latest/translate.html)
demo_openai_cloud_whisper.py: 110 additions, 0 deletions
@@ -0,0 +1,110 @@
from manim import *
from manim_voiceover.voiceover_scene import VoiceoverScene
from manim_voiceover.services.openai import OpenAIService

class OpenAICloudWhisperDemo(VoiceoverScene):
    def construct(self):
        # Print the cloud whisper setting
        print(f"Cloud Whisper enabled: {config.use_cloud_whisper}")

        # Initialize OpenAI speech service with cloud whisper
        service = OpenAIService(
            voice="alloy",  # Available voices: alloy, echo, fable, onyx, nova, shimmer
            model="tts-1",  # tts-1 or tts-1-hd
            transcription_model="base",
            use_cloud_whisper=True  # Use cloud-based Whisper
        )
        self.set_speech_service(service)

        # Create a title
        title = Text("OpenAI TTS + Cloud Whisper Demo", font_size=48)
        self.play(Write(title))
        self.wait(1)

        # Move title to top
        self.play(title.animate.to_edge(UP))

        # Create a subtitle
        subtitle = Text("Word-level alignment on ARM64 architectures",
                        font_size=36,
                        color=BLUE)
        subtitle.next_to(title, DOWN)
        self.play(FadeIn(subtitle))

        # Demonstrate voiceover with bookmarks
        with self.voiceover(
            """This demonstration uses OpenAI's text-to-speech service
            with <bookmark mark='cloud_point'/> cloud-based Whisper for
            word-level <bookmark mark='alignment_point'/> alignment."""
        ) as tracker:
            # Wait until the first bookmark
            self.wait_until_bookmark("cloud_point")

            # Create and animate the cloud text
            cloud_text = Text("☁️ Cloud-based Whisper", color=BLUE, font_size=36)
            cloud_text.next_to(subtitle, DOWN, buff=1)
            self.play(FadeIn(cloud_text))

            # Wait until the second bookmark
            self.wait_until_bookmark("alignment_point")

            # Create and animate the alignment text
            alignment_text = Text("Perfect Word Timing", color=GREEN, font_size=36)
            alignment_text.next_to(cloud_text, DOWN, buff=0.5)
            self.play(FadeIn(alignment_text))

        # Continue with demonstration
        self.wait(1)

        # Show ARM64 compatibility
        arm_title = Text("Works on Apple Silicon!", color=RED, font_size=36)
        arm_title.next_to(alignment_text, DOWN, buff=1)

        with self.voiceover(
            "This feature is especially useful for ARM64 architectures like your M4 Pro."
        ):
            self.play(FadeIn(arm_title))

        # Final animation
        self.wait(1)

        with self.voiceover(
            "No local Whisper model required. Everything happens in the cloud!"
        ):
            # Create a final animation
            final_group = VGroup(title, subtitle, cloud_text, alignment_text, arm_title)
            self.play(
                final_group.animate.scale(0.8).to_edge(UP),
            )

            # Create a cloud icon
            cloud = Text("☁️", font_size=120)
            self.play(FadeIn(cloud))

            # Add some particles around the cloud
            particles = VGroup(*[
                Dot(radius=0.05, color=BLUE).move_to(
                    cloud.get_center() + np.array([
                        np.random.uniform(-3, 3),
                        np.random.uniform(-2, 2),
                        0
                    ])
                )
                for _ in range(20)
            ])
            self.play(FadeIn(particles))

            # Animate the particles
            self.play(
                *[
                    p.animate.shift(np.array([
                        np.random.uniform(-1, 1),
                        np.random.uniform(-1, 1),
                        0
                    ]))
                    for p in particles
                ],
                run_time=2
            )

        self.wait(2)
direct_openai_test.py: 107 additions, 0 deletions
@@ -0,0 +1,107 @@
import os
import json
from pathlib import Path
from dotenv import load_dotenv
import openai

# Load environment variables from .env file
load_dotenv()

# Create a temporary directory for audio files
temp_dir = Path("./temp_direct_test")
temp_dir.mkdir(exist_ok=True)

# Constants for audio offset resolution (same as in manim-voiceover)
AUDIO_OFFSET_RESOLUTION = 1000 # 1000 = milliseconds

print("=== Direct OpenAI API Test ===")

# First, generate speech using OpenAI TTS
print("\nGenerating speech from text...")
text = "This is a test of the cloud-based Whisper feature."

# Generate speech using OpenAI TTS
response = openai.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=text
)

audio_path = temp_dir / "direct_test.mp3"
response.stream_to_file(str(audio_path))

print(f"Speech generated and saved to {audio_path}")

# Now, transcribe the audio using OpenAI Whisper API
print("\nTranscribing audio with word-level timestamps...")
with open(audio_path, "rb") as audio_file:
    transcription = openai.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"]
    )

# Print the raw response structure
print("\nRaw API Response Structure:")
print(f"Response type: {type(transcription)}")
print(f"Response attributes: {dir(transcription)}")
print(f"Has 'words' attribute: {hasattr(transcription, 'words')}")

if hasattr(transcription, 'words'):
    print(f"Words type: {type(transcription.words)}")
    print(f"Words count: {len(transcription.words)}")

    # Try to access the first word
    if len(transcription.words) > 0:
        first_word = transcription.words[0]
        print(f"First word type: {type(first_word)}")
        print(f"First word attributes: {dir(first_word)}")
        print(f"First word: {first_word.word if hasattr(first_word, 'word') else 'No word attribute'}")
        print(f"First word start: {first_word.start if hasattr(first_word, 'start') else 'No start attribute'}")

# Convert to word boundaries format used by manim-voiceover
print("\nConverting to word boundaries format...")
word_boundaries = []
current_text_offset = 0

if hasattr(transcription, 'words'):
    for word_obj in transcription.words:
        try:
            word = word_obj.word
            start_time = word_obj.start

            # Create a word boundary entry
            word_boundary = {
                "audio_offset": int(start_time * AUDIO_OFFSET_RESOLUTION),
                "text_offset": current_text_offset,
                "word_length": len(word),
                "text": word,
                "boundary_type": "Word",
            }

            word_boundaries.append(word_boundary)
            current_text_offset += len(word) + 1  # +1 for space

            print(f"Added word boundary: {word} at {start_time}s")
        except Exception as e:
            print(f"Error processing word: {e}")

print(f"\nCreated {len(word_boundaries)} word boundaries")

# Create a cache file that manim-voiceover can use
cache_data = {
    "input_text": text,
    "input_data": {"input_text": text, "service": "openai"},
    "original_audio": audio_path.name,
    "word_boundaries": word_boundaries,
    "transcribed_text": transcription.text,
    "final_audio": audio_path.name
}

cache_file = temp_dir / "cache.json"
with open(cache_file, "w") as f:
    json.dump([cache_data], f, indent=2)

print(f"\nCreated cache file at {cache_file}")
print("\nTest completed!")
docs/source/cloud_whisper.md: 102 additions, 0 deletions
@@ -0,0 +1,102 @@
# Cloud-based Whisper Transcription

## Overview

Manim-voiceover now supports cloud-based transcription using OpenAI's Whisper API. This is particularly useful for:

- ARM64 architectures (like Apple Silicon Macs) where installing the local Whisper model might be problematic
- Systems where you don't want to install the large Whisper model
- Situations where you need higher-accuracy transcription than the local model provides

## Setup

To use cloud-based Whisper, you'll need:

1. An OpenAI API key
2. The OpenAI Python package

Install the necessary dependencies:

```bash
pip install "manim-voiceover[openai]"
```

## Usage

### Command Line Option

You can enable cloud-based Whisper for any Manim render by using the provided script:

```bash
python manim_cloud_whisper.py -pql examples/cloud_whisper_demo.py CloudWhisperDemo
```

Or by setting an environment variable:

```bash
# Set the environment variable
export MANIM_VOICEOVER_USE_CLOUD_WHISPER=1

# Run Manim normally
manim -pql examples/cloud_whisper_demo.py CloudWhisperDemo
```

### Programmatic Usage

You can also enable cloud-based Whisper programmatically when initializing any speech service:

```python
from manim_voiceover.services.azure import AzureService
from manim_voiceover.voiceover_scene import VoiceoverScene

class MyScene(VoiceoverScene):
    def construct(self):
        # Use cloud-based Whisper for transcription
        service = AzureService(
            voice="en-US-GuyNeural",
            transcription_model="base",  # Still specify a model name
            use_cloud_whisper=True  # This enables cloud-based Whisper
        )
        self.set_speech_service(service)

        # Rest of your scene...
```

## How It Works

When cloud-based Whisper is enabled:

1. The speech service uses OpenAI's hosted Whisper API to transcribe your audio files (see the sketch after this list)
2. Word-level alignment still works for bookmarks and animations
3. Your audio files are sent to OpenAI's servers for transcription
4. An OpenAI API key is required; you'll be prompted to enter one if it is not found

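Under the hood, the transcription request is roughly the following direct call to the OpenAI API. This is a minimal sketch adapted from `direct_openai_test.py` in this pull request; the audio path is a hypothetical placeholder, and the speech service performs the call for you:

```python
import openai

# Transcribe an already generated voiceover file with word-level timestamps.
# "my_voiceover.mp3" is a placeholder; manim-voiceover passes its own cached file.
with open("my_voiceover.mp3", "rb") as audio_file:
    transcription = openai.audio.transcriptions.create(
        model="whisper-1",                 # OpenAI's hosted Whisper model
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"],  # needed for per-word bookmarks
    )

# Each returned word carries its text and start time, which manim-voiceover
# converts into the word boundaries used for bookmark timing.
for word in transcription.words:
    print(word.word, word.start)
```
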
## Pricing

Using cloud-based Whisper incurs costs based on OpenAI's pricing model:

- Audio transcription is billed per minute of audio
- Check [OpenAI's pricing page](https://openai.com/pricing) for the most up-to-date information

## Switching Between Local and Cloud

You can use both local and cloud-based Whisper in the same project:

- Use the `--use-cloud-whisper` flag when you need cloud-based transcription
- Omit the flag to use the local Whisper model (see the sketch below)

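For example, the same render can be switched between the two modes like this (a sketch that reuses the wrapper script and example scene from this pull request):

```bash
# Cloud-based Whisper via the wrapper script
python manim_cloud_whisper.py -pql examples/cloud_whisper_demo.py CloudWhisperDemo

# Cloud-based Whisper via the environment variable
MANIM_VOICEOVER_USE_CLOUD_WHISPER=1 manim -pql examples/cloud_whisper_demo.py CloudWhisperDemo

# Local Whisper (default)
manim -pql examples/cloud_whisper_demo.py CloudWhisperDemo
```
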
## Troubleshooting

### API Key Issues

If you encounter errors related to the API key:

1. Check that you have set the `OPENAI_API_KEY` environment variable
2. Alternatively, create a `.env` file in your project directory with `OPENAI_API_KEY=your_key_here` (see the sketch below)

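Either of the following makes the key available to manim-voiceover (a sketch; substitute your own key for the placeholder):

```bash
# Option 1: export the key in your shell
export OPENAI_API_KEY=your_key_here

# Option 2: keep it in a .env file in the project directory (loaded via python-dotenv)
echo "OPENAI_API_KEY=your_key_here" > .env
```
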
### Response Format Issues

The cloud API might return a different format than expected. If you encounter errors:

1. Check that you're using the latest version of manim-voiceover
2. Try using a different transcription model