Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use bytestrings #84

Closed
wants to merge 4 commits into from
Closed

use bytestrings #84

wants to merge 4 commits into from

Conversation

milahu
Copy link

@milahu milahu commented Mar 10, 2024

fix #43

simply ignore text encoding, and use raw bytestrings
dealing with text encoding is deferred to the user

this allows handling "broken" files with multiple encodings

probably this is too much change, so i merged this into pysubs2bytes

import pysubs2

subfile_path = "test.srt"

ssa_event_list = pysubs2.load(subfile_path)

for ssa_event in ssa_event_list:
    text = ssa_event.text
    if type(text) == bytes:
        text = text.decode("utf8")
    print("text", repr(text))

@milahu milahu force-pushed the bytestrings branch 2 times, most recently from b9ca41e to 319ad18 Compare March 10, 2024 18:03
@tkarabela tkarabela self-requested a review May 5, 2024 13:54
@tkarabela
Copy link
Owner

I think this is too user-hostile for the use-cases where you actually want to modify the subtitle texts. I would like to get the same effect with Unicode surrogate escapes, which should improve the situation ie. prevent some hard errors when reading subtitle files that happen today, without burdening everyone with bytes all over the place.

@tkarabela tkarabela closed this May 5, 2024
@milahu
Copy link
Author

milahu commented May 5, 2024

mixed-encoding files are a rare problem
for example caused by inserting utf8 advertisments into non-utf8 subtitles

handling mixed-encoding files requires to detect boundaries between different encodings
and for subtitles, these boundaries are the boundaries between text blocks

returning bytes instead of str could be a fallback
when the file encoding was not autodetected (#43)
or when the file has multiple encodings

but pysubs could still return strings
with a more fine-grained encoding detection
(guess encoding per textblock, not per file)

strings would be more user-friendly but less performant
for my app, im ignoring text-encoding completely
with the tradeoff of more complexity in other parts of my code (bytes regex)

Unicode surrogate escapes

that just ignores the problem of mixed-encoding files

the output will also be a mixed-encoding file
but maybe the user wants a single-encoding file

>>> "ö".encode("latin1")
b'\xf6'

>>> "ö".encode("latin1").decode("utf8", errors="surrogateescape")
'\udcf6'

>>> "ö".encode("latin1").decode("utf8", errors="surrogateescape").encode("utf8", errors="surrogateescape")
b'\xf6'

>>> print("ö".encode("latin1").decode("utf8", errors="surrogateescape"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 0: surrogates not allowed

@tkarabela
Copy link
Owner

I hope that mixed-encoding files are rare, since they sound like hell to deal with :) I'd like if it was possible to use the library to parse them, at least in principle, but I'm not sure how much support the library itself should provide. I think being able to get the raw bytes of the individual subtitle texts is enough - this is now possible with the latest version of my code, the user can .encode("utf8", errors="surrogateescape") the individual subtitles and then re-interpret them with a different encoding.

The performance aspect is something I haven't really thought about, and isn't my goal for the library. Your bytes-oriented approach will surely be more efficient. But at that point, using a C++/Rust/etc. library would surely be order of magnitude faster still. (Though, I don't know what libraries are available. I've thought about porting this library to Rust as a learning excercise, but I really don't have any use for it, so I haven't ^^;)

>>> print("ö".encode("latin1").decode("utf8", errors="surrogateescape"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 0: surrogates not allowed

Thanks for pointing this error, that's a bit nasty. I should note this in the documentation, if the next release is going to have surrogate escapes as default.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better handling of files with unknown character encoding
2 participants