Transcribing pre-recorded Japanese file and my transcript output is '\u3053\u308c\u3042\u308a\u304c\u3068\u3046\u3054' instead of Japanese. #165
Which Deepgram product are you using?
Deepgram API

Details
from deepgram import Deepgram
# Replace with your Deepgram API key
DEEPGRAM_API_KEY = 'my_key'
# Replace with your file path and audio mimetype
PATH_TO_FILE = 'path'
async def main():
if __name__ == '__main__':

This is my code.

If you are making a request to the Deepgram API, what is the full Deepgram URL you are making a request to?
No response

If you are making a request to the Deepgram API and have a request ID, please paste it below:
No response

If possible, please attach your code or paste it into the text box.
No response

If possible, please attach an example audio file to reproduce the issue.
No response
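Hey @JackBarker21, I think this is an encoding issue with how Python saves JSON data.

When I print the transcript in your code, I see the output as Japanese characters, but when the data is saved to file it contains Unicode-escaped characters (which start with \u). To fix how the data is saved, there are two small changes that you can make when saving the transcript: open the file with encoding="utf8" and pass ensure_ascii=False to json.dump:

with open(f"path/{filename}.json", "w", encoding="utf8") as save_file:
    json.dump(response, save_file, indent=6, ensure_ascii=False)

Does this solve the issue for you?

Below is the exact code I used (99% yours) and attached is an example audio file in Japanese:

from deepgram import Deepgram  # pip install deepgram-sdk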
import asyncio
import json
import os
# Initialize the Deepgram SDK.
DEEPGRAM_API_KEY = os.environ["DEEPGRAM_API_KEY"] # Your Deepgram API Key
deepgram = Deepgram(DEEPGRAM_API_KEY)
PATH_TO_FILE = "./test-audio-files/japanese_kyoshitsu4.mp3"
filename = PATH_TO_FILE.split("/")[-1]
MIMETYPE = "audio/mpeg"

async def main():
    # Initializes the Deepgram SDK
    dg_client = Deepgram(DEEPGRAM_API_KEY)
    # Opens the audio file
    with open(PATH_TO_FILE, "rb") as audio:
        source = {"buffer": audio, "mimetype": MIMETYPE}
        # Specifies the transcription options
        options = {
            "punctuate": True,
            "diarize": True,
            "paragraphs": True,
            "model": "general",
            "tier": "enhanced",
            "language": "ja",
        }
        # Transcribes the audio file
        response = await dg_client.transcription.prerecorded(source, options)
    # Extracts the transcript from the response
    transcript = response['results']['channels'][0]['alternatives'][0]['paragraphs']['transcript']
    print(transcript)
    # Saves the full response as UTF-8 JSON without \u escapes
    with open(f"test-audio-files/{filename}.json", "w", encoding="utf8") as save_file:
        json.dump(response, save_file, indent=6, ensure_ascii=False)

if __name__ == "__main__":
    asyncio.run(main())

[japanese_kyoshitsu4.mp3.zip](https://github.com/deepgram/community/files/11575958/japanese_kyoshitsu4.mp3.zip)
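
If it helps to see the difference in isolation, here is a minimal sketch that does not call the Deepgram API at all. The `response` dictionary is a hypothetical stand-in for the real response (which has many more fields), and the string in it is simply the decoded form of the escaped text from the thread title. It uses only the standard-library `json`, `os`, and `tempfile` modules, and the file name `transcript.json` is made up for the example.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the Deepgram response; the real one has many more fields.
response = {"transcript": "これありがとうご"}

# Default behaviour (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(response)

# ensure_ascii=False writes the characters as they are.
readable = json.dumps(response, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms are valid JSON and decode to the same data.
assert json.loads(escaped) == json.loads(readable) == response

# When writing to disk, pass an explicit encoding so the result
# does not depend on the platform's default encoding.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "transcript.json")
    with open(out_path, "w", encoding="utf-8") as save_file:
        json.dump(response, save_file, indent=2, ensure_ascii=False)
    with open(out_path, encoding="utf-8") as saved:
        file_text = saved.read()

print(file_text)
```

Note that a file full of \u escapes is still valid JSON: loading it with `json.load` gives back the Japanese text. The two changes only affect how the saved file looks when you open it in an editor.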