
Whisper-base (voice-to-text) and opus-mt-it-en (text translation) Bare Bones #37

Open
hpssjellis opened this issue Feb 4, 2025 · 3 comments

@hpssjellis

Hi Josh @xenova, me again. I did really good work on the last two Bare Bones; both HTML/JavaScript single-page files work great online and offline.

https://hpssjellis.github.io/my-examples-of-transformersJS/public/deepseek-r1-webgpu/deepseek-r1-webgpu-00.html

and

https://hpssjellis.github.io/my-examples-of-transformersJS/public/janus-pro/janus-pro-to-image-00.html

Now I want to make two more bare-bones examples based on these models:

https://huggingface.co/onnx-community/whisper-base

and

https://huggingface.co/Xenova/opus-mt-it-en

Yes, I will eventually put them both together, but I need the bare bones first.
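For the opus-mt-it-en half, the bare bones may be very small. A minimal, untested sketch, assuming the same @huggingface/transformers CDN import used in the Whisper page further down this thread:

```js
import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/[email protected]';

// opus-mt-it-en is a single-pair model, so no src_lang/tgt_lang options are needed
const translator = await pipeline('translation', 'Xenova/opus-mt-it-en');

const result = await translator('Ciao, come stai?');
console.log(result[0].translation_text); // the English translation
```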

I have seen a full-blown example, but it has so many files that I don't know where to start: https://realtimeaihub-image-606557915178.us-central1.run.app/

Even the Hugging Face Spaces examples are React-based and too complex for me. Any suggestions? I will post what I figure out, but a starting point would be really helpful.

Jer


hpssjellis commented Feb 4, 2025

@xenova Your example is perfect, but here is the issue: I can't use it; it is too confusing. As with the other bare-bones examples I made, I will see what I can do.

https://huggingface.co/spaces/Xenova/realtime-whisper-webgpu

P.S. Hugging Face should find a way to pay me for using my 35 years of coding-teaching experience to actually make these models usable.

Here is what I have so far. It's kind of crummy; the timing of everything seems strange.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Whisper-base Live Transcription</title>
  <script type="module">
    // import { pipeline, read_audio } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers';   // use to test if latest works

    import { pipeline, read_audio } from 'https://cdn.jsdelivr.net/npm/@huggingface/[email protected]';

    window.startRecording = startRecording;
    window.stopRecording = stopRecording;
    window.loadModel = loadModel;

    let asrPipeline;
    let audioContext;
    let mediaStream;
    let processorNode;
    let audioBuffer = [];
    const sampleRate = 16000;
    const chunkDuration = 3; // seconds

    // Load the Whisper model once; language and task are passed per call below
    async function loadModel() {
      asrPipeline = await pipeline("automatic-speech-recognition", "Xenova/whisper-base");
      document.getElementById('loadModelButton').disabled = true;
      document.getElementById('startButton').disabled = false;
      console.log("Model loaded.");
    }

    // Start recording with real-time transcription
    async function startRecording() {
      document.getElementById('startButton').disabled = true;
      document.getElementById('stopButton').disabled = false;

      audioContext = new AudioContext({ sampleRate: sampleRate });
      mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
      const source = audioContext.createMediaStreamSource(mediaStream);

      // ScriptProcessorNode is deprecated (AudioWorklet is the modern replacement),
      // but it keeps this single-file example simple
      processorNode = audioContext.createScriptProcessor(4096, 1, 1);
      source.connect(processorNode);
      processorNode.connect(audioContext.destination);

      processorNode.onaudioprocess = async (event) => {
        const audioData = event.inputBuffer.getChannelData(0);
        audioBuffer.push(...audioData);

        if (audioBuffer.length >= sampleRate * chunkDuration) {
          const chunk = audioBuffer.slice(0, sampleRate * chunkDuration);
          audioBuffer = audioBuffer.slice(sampleRate * chunkDuration);

          // Convert the chunk to the Float32Array of 16 kHz PCM the pipeline expects
          const float32Chunk = new Float32Array(chunk);

          // Language and task are generation options, so pass them on the call itself
          const myLanguage = document.getElementById('myLanguageSelect').value;
          const result = await asrPipeline(float32Chunk, { language: myLanguage, task: 'transcribe' });
          console.log(result);
          // Whisper may pad the marker with whitespace, so match loosely
          if (!result.text.includes('[BLANK_AUDIO]')) {
            document.getElementById('transcription').innerText += result.text;
          }
        }
      };

      console.log("Live transcription started...");
    }

    // Stop recording
    function stopRecording() {
      document.getElementById('startButton').disabled = false;
      document.getElementById('stopButton').disabled = true;

      processorNode.disconnect();
      mediaStream.getTracks().forEach(track => track.stop());
      audioContext.close();

      console.log("Recording stopped.");
    }
  </script>
</head>
<body>
  <h1>Live Whisper-base Transcription</h1>
  <button id="loadModelButton" onclick="loadModel()">Load Model</button>
  <button id="startButton" onclick="startRecording()" disabled>Start Live Transcription</button>
  <button id="stopButton" onclick="stopRecording()" disabled>Stop</button><br>
  
  <select id="myLanguageSelect">
    <option value="af">Afrikaans</option>
    <option value="ar">Arabic</option>
    <option value="bn">Bengali</option>
    <option value="bg">Bulgarian</option>
    <option value="zh">Chinese</option>
    <option value="hr">Croatian</option>
    <option value="cs">Czech</option>
    <option value="da">Danish</option>
    <option value="nl">Dutch</option>
    <option value="en" selected>English</option>
    <option value="et">Estonian</option>
    <option value="fi">Finnish</option>
    <option value="fr">French</option>
    <option value="de">German</option>
    <option value="el">Greek</option>
    <option value="gu">Gujarati</option>
    <option value="he">Hebrew</option>
    <option value="hi">Hindi</option>
    <option value="hu">Hungarian</option>
    <option value="is">Icelandic</option>
    <option value="id">Indonesian</option>
    <option value="it">Italian</option>
    <option value="ja">Japanese</option>
    <option value="kn">Kannada</option>
    <option value="ko">Korean</option>
    <option value="lv">Latvian</option>
    <option value="lt">Lithuanian</option>
    <option value="ml">Malayalam</option>
    <option value="mr">Marathi</option>
    <option value="ne">Nepali</option>
    <option value="no">Norwegian</option>
    <option value="fa">Persian</option>
    <option value="pl">Polish</option>
    <option value="pt">Portuguese</option>
    <option value="pa">Punjabi</option>
    <option value="ro">Romanian</option>
    <option value="ru">Russian</option>
    <option value="sr">Serbian</option>
    <option value="sk">Slovak</option>
    <option value="sl">Slovenian</option>
    <option value="es">Spanish</option>
    <option value="sw">Swahili</option>
    <option value="sv">Swedish</option>
    <option value="ta">Tamil</option>
    <option value="te">Telugu</option>
    <option value="th">Thai</option>
    <option value="tr">Turkish</option>
    <option value="uk">Ukrainian</option>
    <option value="ur">Urdu</option>
    <option value="vi">Vietnamese</option>
    <option value="cy">Welsh</option>
    <option value="xh">Xhosa</option>
    <option value="zu">Zulu</option>
  </select>

  <h2>Transcription:</h2>
  <p id="transcription">...</p>
</body>
</html>
```
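For eventually putting the two together, the chaining step can stay tiny. A rough, untested sketch: it assumes a `translator` created with `pipeline('translation', 'Xenova/opus-mt-it-en')` alongside the `asrPipeline` above, and `transcribeAndTranslate` is just an illustrative name, not an existing API.

```js
// Transcribe an Italian audio chunk, then translate the text to English
async function transcribeAndTranslate(float32Chunk) {
  const asr = await asrPipeline(float32Chunk, { language: 'it', task: 'transcribe' });
  if (!asr.text || asr.text.includes('[BLANK_AUDIO]')) return '';
  const [translated] = await translator(asr.text); // returns [{ translation_text: ... }]
  return translated.translation_text;
}
```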

@geronimi73

dude there's a perfect minimal example called realtime-whisper-webgpu in this repo, 35 years of coding should be enough to understand it


hpssjellis commented Feb 5, 2025

> dude there's a perfect minimal example called realtime-whisper-webgpu in this repo, 35 years of coding should be enough to understand it

@geronimi73
I said I have coded for many years, not that I was really smart, LOL. Actually, I have looked at the example; it is just too complex. Any suggestions?

The present issue is how to stop the transcribing (not just the Web Audio side), and how to record and transcribe at the same time. My example above is really chunky. Even determining the language seems confusing.
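One possible way to handle both, sketched against the snippet above (untested; it reuses asrPipeline, audioBuffer, sampleRate, and chunkDuration from that code): a stopRequested flag so the stop button also discards any in-flight result, plus a busy guard so pipeline calls never overlap.

```js
let stopRequested = false;
let busy = false; // true while a chunk is being transcribed

processorNode.onaudioprocess = async (event) => {
  audioBuffer.push(...event.inputBuffer.getChannelData(0));
  // Skip this callback while a transcription is still running
  if (busy || audioBuffer.length < sampleRate * chunkDuration) return;

  busy = true;
  const chunk = new Float32Array(audioBuffer.slice(0, sampleRate * chunkDuration));
  audioBuffer = audioBuffer.slice(sampleRate * chunkDuration);
  const result = await asrPipeline(chunk, { language: 'en', task: 'transcribe' }); // or read the <select> as above
  busy = false;

  // Honor the stop button even if this chunk was still in flight
  if (!stopRequested && !result.text.includes('[BLANK_AUDIO]')) {
    document.getElementById('transcription').innerText += result.text;
  }
};

function stopRecording() {
  stopRequested = true;        // discard any in-flight result
  processorNode.disconnect();  // stop feeding new audio
  mediaStream.getTracks().forEach(t => t.stop());
  audioContext.close();
}
```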
