GPT is very prone to merging lines, at least in 3.5 - it took quite a few iterations to arrive at the current prompt, which more or less eliminates desyncs. Feeding it lines with a clear indication of where it should fill in the translation helps to keep it on track (it just has to fill in the blanks). Validation/retry could probably fix desyncs even with a looser format... but it more than doubles the token count for the batch, since the whole message chain has to be resent, so it is unlikely to be a net win on that front! :-) My long-term goal is to allow GPT to merge lines when that helps it produce a more fluent translation, then fix up the timings afterwards... it doesn't seem able to do that itself, unfortunately - GPT-4 might be able to, but it's so much more expensive that I haven't experimented with it much.
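A minimal sketch of the "fill in the blanks" idea described above - numbered slots with an explicit marker where the translation goes, plus a cheap count check that catches merged or dropped lines before accepting a batch. The marker names are hypothetical, not the actual gpt-subtrans prompt:

```python
import re

def build_prompt(lines):
    # One numbered slot per subtitle line; the model is asked to fill in
    # the blank after each "Translation>" marker. (Hypothetical format.)
    parts = []
    for i, text in enumerate(lines, start=1):
        parts.append(f"#{i}\nOriginal>\n{text}\nTranslation>")
    return "\n\n".join(parts)

def count_slots(response):
    # A desync shows up as a missing or merged slot: the number of
    # numbered markers in the reply no longer matches the batch size.
    return len(re.findall(r"^#\d+$", response, re.M))
```

If `count_slots(reply) != len(lines)`, the batch can be rejected and retried - the validation/retry path mentioned above.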
I got this idea from Subtitle Edit's "Auto-translate via copy-paste" function, which processes the .srt file so that it ends up like this:
Then you can just toss it into a translator like DeepL and get:
and the software maps the translation back to the original timestamps since it's a 1-1 mapping separated by each asterisk.
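The round trip can be sketched like this (a toy illustration of the asterisk-separated copy-paste idea, not Subtitle Edit's actual implementation):

```python
def srt_to_blocks(entries):
    # entries: list of (timestamp, text) pairs. Emit only the text, with
    # an asterisk line between entries, so order is the implicit key.
    return "\n*\n".join(text for _, text in entries)

def blocks_to_srt(entries, translated):
    # Split the translated blob on the asterisk separators and zip it
    # back onto the original timestamps (relies on a strict 1-1 mapping).
    parts = [p.strip() for p in translated.split("\n*\n")]
    if len(parts) != len(entries):
        raise ValueError("desync: translator merged or dropped a block")
    return [(ts, new) for (ts, _), new in zip(entries, parts)]
```

If the translator merges text across an asterisk, the split yields fewer parts than entries and the mapping fails - which is exactly the desync problem.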
I've been experimenting with this approach in ChatGPT. The translation is often flawless, but the problem is that it often combines lines across the asterisks, which desyncs the mapping back. Could subtrans's functionality - batching, validation, re-translation - help enough that this becomes a non-problem?
If this works, token consumption would be greatly reduced.