Transcribing pre-recorded Japanese file and my transcript output is '\u3053\u308c\u3042\u308a\u304c\u3068\u3046\u3054' instead of Japanese. #165

JackBarker21 · 2023-05-26T12:38:22Z

Which Deepgram product are you using?

Deepgram API

Details

    from deepgram import Deepgram
    import asyncio
    import json

    # Replace with your Deepgram API key
    DEEPGRAM_API_KEY = 'my_key'

    # Replace with your file path and audio mimetype
    PATH_TO_FILE = 'path'
    filename = PATH_TO_FILE.split('/')[-1]
    MIMETYPE = 'audio/wav'

    async def main():
        # Initializes the Deepgram SDK
        dg_client = Deepgram(DEEPGRAM_API_KEY)

        # Opens the audio file
        with open(PATH_TO_FILE, 'rb') as audio:
            source = {'buffer': audio, 'mimetype': MIMETYPE}

            # Specifies the transcription options
            options = {
                'punctuate': True,
                'diarize': True,
                'paragraphs': True,
                'model': 'general',
                'tier': 'enhanced',
                'language': 'ja'
            }

            # Transcribes the audio file
            response = await dg_client.transcription.prerecorded(source, options)

            # Extracts the transcript from the response
            #transcript = response['results']['channels'][0]['alternatives'][0]['paragraphs']['transcript']

            save_file = open(f"path/{filename}.json", "w")
            json.dump(response, save_file, indent = 6)
            save_file.close()

    if __name__ == '__main__':
        asyncio.run(main())

This is my code.
It works for English, but after changing the language to 'ja' I'm not getting a Japanese transcript for some reason.

If you are making a request to the Deepgram API, what is the full Deepgram URL you are making a request to?

No response

If you are making a request to the Deepgram API and have a request ID, please paste it below:

No response

If possible, please attach your code or paste it into the text box.

No response

If possible, please attach an example audio file to reproduce the issue.

No response

Answered by jjmaldonis

jjmaldonis · 2023-05-26T14:00:48Z

jjmaldonis
May 26, 2023
Maintainer

Hey @JackBarker21, I think this is an encoding issue with how Python saves JSON data.

When I print the transcript in your code, I see Japanese characters, but the saved file contains Unicode escape sequences (which start with \u). That is because `json.dump` escapes all non-ASCII characters by default (`ensure_ascii=True`). To fix how the data is saved, make two small changes when saving the transcript: open the file with `encoding="utf8"` and pass `ensure_ascii=False` to `json.dump`:

    with open(f"path/{filename}.json", "w", encoding="utf8") as save_file:
        json.dump(response, save_file, indent=6, ensure_ascii=False)
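
To see the difference in isolation, here is a minimal sketch that needs no API call. The `sample` dictionary is a made-up stand-in for the Deepgram response, not real API output; its text is simply the escaped string from the title of this discussion.

```python
import json

# Hypothetical stand-in for a response; the text is the string from the title.
sample = {"transcript": "\u3053\u308c\u3042\u308a\u304c\u3068\u3046\u3054"}

escaped = json.dumps(sample)                       # default: ensure_ascii=True
readable = json.dumps(sample, ensure_ascii=False)  # keep non-ASCII characters as-is

print(escaped)   # {"transcript": "\u3053\u308c\u3042\u308a\u304c\u3068\u3046\u3054"}
print(readable)  # {"transcript": "これありがとうご"}

# Both strings are valid JSON and decode to the same data.
assert json.loads(escaped) == json.loads(readable) == sample
```

The escaped form is not corrupted data, only a different spelling of the same characters, which is why printing the transcript in Python already shows Japanese.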

Does this solve the issue for you?

Below is the exact code I used (99% yours) and attached is an example audio file in Japanese:

    from deepgram import Deepgram  # pip install deepgram-sdk
    import asyncio
    import json
    import os

    # Initialize the Deepgram SDK.
    DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"]  # Your Deepgram API Key
    deepgram = Deepgram(DEEPGRAM_API_KEY)


    PATH_TO_FILE = "./test-audio-files/japanese_kyoshitsu4.mp3"
    filename = PATH_TO_FILE.split("/")[-1]
    MIMETYPE = "audio/mpeg"


    async def main():
        # Initializes the Deepgram SDK
        dg_client = Deepgram(DEEPGRAM_API_KEY)

        # Opens the audio file
        with open(PATH_TO_FILE, "rb") as audio:
            source = {"buffer": audio, "mimetype": MIMETYPE}

            # Specifies the transcription options
            options = {
                "punctuate": True,
                "diarize": True,
                "paragraphs": True,
                "model": "general",
                "tier": "enhanced",
                "language": "ja",
            }

            # Transcribes the audio file
            response = await dg_client.transcription.prerecorded(source, options)

            # Extracts the transcript from the response
            transcript = response['results']['channels'][0]['alternatives'][0]['paragraphs']['transcript']
            print(transcript)

            with open(f"test-audio-files/{filename}.json", "w", encoding="utf8") as save_file:
                json.dump(response, save_file, indent=6, ensure_ascii=False)


    if __name__ == "__main__":
        asyncio.run(main())

[japanese_kyoshitsu4.mp3.zip](https://github.com/deepgram/community/files/11575958/japanese_kyoshitsu4.mp3.zip)
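
A file that was already saved with `\u` escapes should not need to be transcribed again: the escapes are valid JSON, so `json.load` turns them back into Japanese characters and the data can be re-saved in readable form. The following is a minimal sketch under that assumption; `rewrite_readable` and the file names are hypothetical, and the demo uses a throwaway file rather than a real Deepgram response.

```python
import json
import os
import tempfile


def rewrite_readable(src_path, dst_path):
    """Load a JSON file and save it again with non-ASCII characters unescaped."""
    with open(src_path, "r", encoding="utf8") as src:
        data = json.load(src)  # \uXXXX escapes are decoded here
    with open(dst_path, "w", encoding="utf8") as dst:
        json.dump(data, dst, indent=6, ensure_ascii=False)
    return data


# Demo with a throwaway file that mimics the escaped output from the question.
workdir = tempfile.mkdtemp()
escaped_path = os.path.join(workdir, "escaped.json")    # hypothetical name
readable_path = os.path.join(workdir, "readable.json")  # hypothetical name

with open(escaped_path, "w", encoding="utf8") as f:
    json.dump({"transcript": "これありがとうご"}, f)  # default escaping

data = rewrite_readable(escaped_path, readable_path)

with open(readable_path, encoding="utf8") as f:
    text = f.read()  # now contains the Japanese characters themselves
```

Writing to a separate destination path keeps the original file untouched in case the conversion needs to be checked first.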

1 reply

JackBarker21 May 27, 2023
Author

That worked great!
Thanks :)
