-
Small correction, if we are talking about the complete audio duration.
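For a segment-based estimate of the whole file, the end time of the last segment is closer than the first segment's end minus its start. Here is a minimal sketch. It assumes `result` is the dict returned by Whisper's `transcribe()`, where `"segments"` is a list of dicts with `"start"` and `"end"` in seconds. The helper name `duration_from_segments` is hypothetical.

```python
def duration_from_segments(result: dict) -> float:
    """Estimate the audio length (seconds) from a Whisper transcription result.

    Assumes `result` is the dict returned by `model.transcribe(...)`, whose
    "segments" list holds dicts with "start" and "end" times in seconds.
    """
    segments = result.get("segments") or []
    if not segments:
        raise ValueError("transcription result has no segments")
    # segments[0]["end"] - segments[0]["start"] only measures the first segment;
    # the end of the last segment is the closest segment-based estimate of the
    # whole file, though it ignores any trailing silence.
    return float(segments[-1]["end"])
```

This is only an estimate. As the samples further down show, the last segment's end time can overshoot the real length.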
-
In practice, the Whisper segments do not seem to match the actual audio duration exactly. Another option is to use ffmpeg directly for this purpose. For example:
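One way to do this is to ask `ffprobe`, which ships with ffmpeg, for the container duration without decoding the audio. This is a sketch that assumes `ffprobe` is on `PATH`. The helper names are hypothetical and not part of Whisper.

```python
import subprocess
from typing import List


def ffprobe_duration_command(path: str) -> List[str]:
    """Build an ffprobe command that prints only the format-level duration."""
    return [
        "ffprobe",
        "-v", "error",                                # hide banner and info logs
        "-show_entries", "format=duration",           # report only the duration
        "-of", "default=noprint_wrappers=1:nokey=1",  # print the bare number
        path,
    ]


def parse_ffprobe_duration(output: str) -> float:
    """Convert ffprobe's printed duration (e.g. '13.000000\\n') to seconds."""
    return float(output.strip())


def audio_duration(path: str) -> float:
    """Return the file's duration in seconds as reported by ffprobe."""
    completed = subprocess.run(
        ffprobe_duration_command(path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if ffprobe fails
    )
    return parse_ffprobe_duration(completed.stdout)
```

Some inputs have no duration in the container metadata. For those, ffprobe can print `N/A`, and `float()` will then raise `ValueError`.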
-
See lines 22 to 49 in f680570. But this will, of course, be slower than a more direct method like the one @glangford suggested.
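The slower route decodes the file the way Whisper does and counts samples. `whisper.load_audio()` returns a mono float32 array resampled to `whisper.audio.SAMPLE_RATE` (16000 Hz), so the duration is the sample count divided by that rate. This is a sketch that assumes the `openai-whisper` package and ffmpeg are installed. The helper names are hypothetical.

```python
SAMPLE_RATE = 16000  # Whisper resamples all input to 16 kHz mono


def duration_from_samples(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Duration in seconds of a mono signal with `num_samples` samples."""
    return num_samples / sample_rate


def whisper_audio_duration(path: str) -> float:
    """Decode `path` exactly as Whisper does and measure its length.

    Slower than reading metadata, because the whole file is decoded into memory.
    """
    import whisper  # imported lazily so duration_from_samples works without it

    audio = whisper.load_audio(path)  # float32 NumPy array at 16 kHz
    return duration_from_samples(len(audio), whisper.audio.SAMPLE_RATE)
```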
-
In fact, Whisper does not always produce the right durations. Whisper shows the following details:
# sample: 13 seconds
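[00:00:00,000 --> 00:00:07,000] The project's property management: Fuhuaxing Property, which manages the Chang'an Club
[00:00:07,000 --> 00:00:37,000] The Chang'an Club is the first of the top ten clubs; the property fee is 24 yuan a bottle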
# sample: 10 seconds
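[00:00.000 --> 00:02.760] The project comes with a 2000-ping clubhouse
[00:02.760 --> 00:06.160] Inside, it combines swimming, fitness, and recuperation in one
[00:06.160 --> 00:34.160] There is also a reception room, thought, a banquet hall, and so on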
# sample: 21 seconds
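[00:00.000 --> 00:04.100] Within 200 meters of the project are the three Wangfujing circles
[00:04.100 --> 00:05.900] Unique in Beijing
[00:05.900 --> 00:08.600] Wangfu Zhonghuan APM
[00:08.600 --> 00:10.300] Dongying IM88
[00:10.300 --> 00:11.800] Oriental Plaza
[00:11.800 --> 00:14.000] Wangfujing Department Store
[00:14.000 --> 00:15.600] Jinbaohui
[00:15.600 --> 00:18.400] Hamleys toy store
[00:18.400 --> 00:20.800] Bringing together top-tier luxury brands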
[00:20.800 --> 00:30.800] Enough to meet everyday needs such as shopping and leisure
As you can see, the end time of the last segment is wrong in every sample. How does the VAD function work?
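In these samples, the last end times overshoot the real lengths: 37 s vs 13 s, 34.16 s vs 10 s, and 30.8 s vs 21 s. One possible workaround, which does not explain the cause, is to cap every segment time at a duration measured independently (for example with ffprobe). Here is a sketch that assumes Whisper-style segment dicts with `"start"` and `"end"` keys. The helper `clamp_segment_ends` is hypothetical.

```python
def clamp_segment_ends(segments: list, duration: float) -> list:
    """Return copies of Whisper-style segments with times capped at `duration`."""
    clamped = []
    for seg in segments:
        fixed = dict(seg)  # shallow copy so the original result is left untouched
        fixed["start"] = min(fixed["start"], duration)
        fixed["end"] = min(fixed["end"], duration)
        clamped.append(fixed)
    return clamped
```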
-
Hi,
I am trying to check whether I can get the audio file duration from any of Whisper's methods.
This seems close to the audio duration:
transcription["segments"][0]["end"]-transcription["segments"][0]["start"]
What would be the recommended way to do it?
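If the input is an uncompressed PCM WAV file, the Python standard library can read the duration from the header, with no Whisper or ffmpeg involved. The duration is the frame count divided by the frame rate. This is a sketch that only covers WAV. Formats like MP3 or M4A still need ffprobe or full decoding. The helper name `wav_duration` is hypothetical.

```python
import wave


def wav_duration(path: str) -> float:
    """Duration in seconds of an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        # number of frames / frames per second = seconds
        return wav.getnframes() / wav.getframerate()
```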