Better handling of files with unknown character encoding #43
Comments
context: I'm parsing millions of subtitles from opensubtitles.org
something like...
diff --git a/pysubs2/ssafile.py b/pysubs2/ssafile.py
index 1202a46..ee22ea9 100644
--- a/pysubs2/ssafile.py
+++ b/pysubs2/ssafile.py
@@ -53,7 +53,7 @@ class SSAFile(MutableSequence):
# ------------------------------------------------------------------------
@classmethod
- def load(cls, path: str, encoding: str="utf-8", format_: Optional[str]=None, fps: Optional[float]=None, **kwargs) -> "SSAFile":
+ def load(cls, path: str, encoding: Optional[str]="utf8", format_: Optional[str]=None, fps: Optional[float]=None, **kwargs) -> "SSAFile":
"""
Load subtitle file from given path.
@@ -67,7 +67,8 @@ class SSAFile(MutableSequence):
Arguments:
path (str): Path to subtitle file.
encoding (str): Character encoding of input file.
- Defaults to UTF-8, you may need to change this.
+ Default is "utf8".
+ Set to None to autodetect the encoding.
format_ (str): Optional, forces use of specific parser
(eg. `"srt"`, `"ass"`). Otherwise, format is detected
automatically from file contents. This argument should
@@ -100,6 +101,13 @@ class SSAFile(MutableSequence):
>>> subs3 = pysubs2.load("subrip-subtitles-with-fancy-tags.srt", keep_unknown_html_tags=True)
"""
+ if encoding is None:
+ # guess encoding
+ import charset_normalizer
+ with open(path, "rb") as fp:
+ content_bytes = fp.read()
+ charset_matches = charset_normalizer.from_bytes(content_bytes)
+ encoding = str(charset_matches.best().encoding)
with open(path, encoding=encoding) as fp:
return cls.from_file(fp, format_, fps=fps, **kwargs)
edit:
- encoding = charset_matches.encoding
+ encoding = str(charset_matches.best().encoding)
also |
push this is such an easy fix...
related: |
@milahu I see your point, but I also don't like having |
This addresses the long-standing ergonomic issue #43 concerning files with varying or unknown character encoding. Previously, the library assumed that both input and output files were UTF-8 and failed when this was not the case, forcing the user to provide the appropriate character encoding.
After this commit, UTF-8 is still the default input/output encoding, but the default error handling changed from "strict" to "surrogateescape", i.e. bytes that are not valid UTF-8 are read as lone Unicode surrogate code points, which are turned back into the original bytes on output. To get the previous behaviour, use `SSAFile.load(..., errors=None)` and `SSAFile.save(..., errors=None)`.
For text processing, you should still specify the encoding explicitly; otherwise you will get surrogate code points instead of non-ASCII characters when inspecting the SSAFile. Note that multi-byte encodings may still break the parser; parsing with surrogate escapes will work best with ASCII-like encodings.
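To illustrate the "surrogateescape" behaviour described above, here is a minimal sketch using only Python's built-in codecs, not pysubs2 itself. The sample bytes and variable names are made up for illustration.

```python
# "café" encoded as Windows-1252: the byte 0xE9 is not valid UTF-8 here.
raw = "café\n".encode("cp1252")

# With strict error handling, decoding as UTF-8 fails.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape, the undecodable byte 0xE9 becomes the lone
# surrogate U+DCE9 instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")

# For actual text processing, the real encoding is still needed.
proper = raw.decode("cp1252")

print(repr(text), restored == raw, proper.strip())
```

The round trip preserves the file contents, but `text` holds a surrogate where the "é" should be, which is why specifying the encoding explicitly is still recommended for text processing.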
As of 1.2.0, we default to UTF-8 encoding. If this is not correct, the user has to specify the proper encoding manually. To improve UX, we could try some autodetection before bailing out.
This is already something that users are dealing with, see:
Consider adding https://github.com/chardet/chardet as (optional?) dependency.
(This is another idea from the original pysubs library.)
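As a rough sketch of what "try some autodetection before bailing out" could look like without any extra dependency, the snippet below tries a short list of candidate encodings in order. The helper `guess_encoding` and its candidate list are hypothetical, not part of pysubs2 or chardet, and a statistical detector such as chardet would be much more reliable on real-world files.

```python
from typing import Iterable


def guess_encoding(data: bytes,
                   candidates: Iterable[str] = ("utf-8", "cp1250", "latin-1")) -> str:
    """Return the first candidate encoding that decodes `data` without errors.

    Hypothetical helper: the candidate list is an assumption and should be
    adapted to the languages expected in the input files.
    """
    for name in candidates:
        try:
            data.decode(name)  # strict decoding; raises on invalid bytes
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("none of the candidate encodings could decode the data")


sample = "příliš žluťoučký kůň"
print(guess_encoding(sample.encode("utf-8")))   # valid UTF-8 is accepted first
print(guess_encoding(sample.encode("cp1250")))  # falls through to cp1250
```

The returned name could then be passed as the `encoding` argument of `SSAFile.load`. Note that single-byte encodings such as latin-1 accept any byte sequence, so the order of candidates matters and a wrong guess yields garbled text rather than an error.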