-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH adding write_html
to TableReport
#1190
base: main
Are you sure you want to change the base?
Conversation
Co-authored-by: Guillaume Lemaitre <[email protected]>
report = TableReport(df) | ||
|
||
with TemporaryDirectory() as td: | ||
f_name = Path(td) / Path("report.html") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's call it if filename
f_name = Path(td) / Path("report.html") | ||
|
||
if filename_path == "str": | ||
report.write_html(f_name.absolute()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to pass .absolute
here?
if filename_path == "str": | ||
report.write_html(f_name.absolute()) | ||
|
||
if filename_path == "Path": | ||
report.write_html(f_name) | ||
|
||
if filename_path == "file_object": | ||
file_object = open(f_name, "w", encoding="utf-8") | ||
report.write_html(file_object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer something like:
tmp_file_path = ...
if filename_type == "str":
filename = str(tmp_file_path)
elif filename_type == "file_object":
filename = open(tmp_file_path, "w", encoding="utf-8")
else:
filename = tmp_file_path
report.write_html(filename)
assert tmp_file_path.exists()
Co-authored-by: Guillaume Lemaitre <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mrastgoo , thanks a lot for working on this! It's a great addition 🎉 🚀 . Even in the skrub tests and documentation scripts there are a few places we could use it :)
also the fact that the reports are self-contained and can be seen after the python process finishes is one of their main advantages (which required some compromises) and by adding this method you'll make that feature more visible and help users take advantage of it :)
skrub/_reporting/_table_report.py
Outdated
@@ -197,6 +198,30 @@ def _repr_mimebundle_(self, include=None, exclude=None): | |||
def _repr_html_(self): | |||
return self._repr_mimebundle_()["text/html"] | |||
|
|||
def write_html(self, filename): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! In the issue I forgot to mention a couple of things:
- Sometimes generating the report can fail (unfortunately!). We want to make sure we close the file (if we opened it) in this case; also we may want to avoid truncating or creating it before we know we have something to write.
- The report will always set the charset to utf-8 in the page's metadata. So if the user passes an open file we may optionally check the file's encoding if we want to be extra cautious. Also note unless the doctring we write forbids it the user could pass a binary stream so we may want to handle that case, too.
- I'm not sure we need to enforce the filename suffix. I could imagine someone wanting to save to "report.htm" or "report.html.tmp" or something so I'd rather let the user name the file however they want.
- If we allow both a path and a file object maybe the parameter could be just "file" instead of "filename"?
So taking those details into account I guess the function could be modified slightly to look similar to this:
def write_html(self, file):
html = self.html()
if isinstance(file, (str, Path)):
with open(file, "w", encoding="utf8") as stream:
stream.write(html)
return
try:
file.write(html.encode("utf-8"))
return
except TypeError:
pass
if (encoding := getattr(file, "encoding", None)) is not None:
try:
assert codecs.lookup(encoding).name == "utf-8"
except (AssertionError, LookupError):
raise ValueError(
f"If `file` is a text file it should use utf-8 encoding; got: {encoding!r}"
)
file.write(html)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you comments @jeromedockes , I have couple of questions.
- What are the reasons behind
TypeError
- if the encoding is None it seems the locale.encoding will be used, so we want to check for that too and raise an error if it is not utf-8 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @mrastgoo ,
- What are the reasons behind TypeError
Trying to write bytes and catching the TypeError
is an easy way to detect if we are dealing with a text stream or a binary stream.
A file object (a.k.a. stream) can be:
- In binary mode, in which case it expects bytes that can be written to a file without transformation. If we want to write text we must encode it into bytes ourselves. Passing a
str
towrite
raises aTypeError
. - In text mode, in which case it expects strings and it will transparently encode them into bytes (and optionally translate newline characters) before writing the resulting bytes to the underlying binary buffer. Passing a
bytes
towrite
raises aTypeError
(the one we catch).
>>> 'é'
'é'
>>> 'é'.encode('utf-8')
b'\xc3\xa9'
Binary mode:
>>> binary_stream = open('thefile', 'wb') # 'wb' = 'write + binary'
>>> binary_stream.write('é'.encode('utf-8')) # we can write bytes
2
>>> binary_stream.write(b'\xc3\xa9')
2
>>> binary_stream.write('é') # trying to write a string directly is a TypeError
Traceback (most recent call last):
...
TypeError: a bytes-like object is required, not 'str'
>>> binary_stream.close()
Conversely, if we use text mode (the default):
>>> text_stream = open('thefile', 'wt', encoding='utf-8') # 'write + text', or 'w' as text is the default
>>> text_stream.write('é') # we can write strings
1
>>> text_stream.write(b'\xc3\xa9'.decode('utf-8'))
1
and this is the error we catch in the current version of TableReport.write_html
:
>>> text_stream.write(b'\xc3\xa9') # trying to write bytes to a text stream is a TypeError
Traceback (most recent call last):
...
TypeError: write() argument must be str, not bytes
>>> text_stream.close()
So we need to know if the file object the user gave us is a text or binary stream. If we knew it was created with open
we could inspect its mode:
>>> with open('thefile', 'w') as stream:
... print(stream.mode)
w
>>> with open('thefile', 'wb') as stream:
... print(stream.mode)
wb
But that may not always work with other kinds of file objects:
>>> import io
>>> stream = io.BytesIO()
>>> stream.write(b'\xc3\xa9')
2
>>> stream.mode
Traceback (most recent call last):
...
AttributeError: '_io.BytesIO' object has no attribute 'mode'
All well-behaved file objects however will respect the rule of raising a TypeError if we give them the wrong type of data. So what we can do is first suppose it is in binary mode and try to write bytes, and if we get a TypeError try again with a string (or we could do it the other way around, too).
- if the encoding is None it seems the locale.encoding will be used, so we want to check for that too and raise an error if it is not utf-8 ?
If we pass encoding=None
to open
it will default to using locale.encoding
, but AFAIK the resulting file object will always have an explicit encoding (not None
).
>>> with open('thefile', 'w', encoding=None) as f:
... print(f.encoding)
UTF-8
So I think it should be ok to inspect the encoding of the file object in this case.
Some other kinds of text stream may not have an explicit encoding though. For example io.StringIO
writes to an in-memory python string, so it accepts strings but does not need to encode them (it doesn't need to convert to bytes because the backing storage is a unicode string)
>>> stream = io.StringIO()
>>> stream.encoding is None
True
>>> stream.write('é') # ok
1
In the case of StringIO
it's fine to copy the report string to the stream's buffer even if we couldn't figure out what encoding it uses (because it doesn't use any). In some other cases I guess looking at the encoding
attribute may not give a result even if a (possibly wrong) encoding is being used. But I guess at this point we have to trust the user to know what they are doing and at least we have made a reasonable effort to prevent the most likely issue, which is doing this on windows:
>>> with open('thefile', 'w') as f:
... report.write_html(f)
(Even that will go away in python 3.15 when utf-8 becomes the default).
So my suggestion is that if we easily detect a wrong encoding is being used by checking the encoding
attribute, we raise an error because we are sure that writing will result in a file with an incorrect <meta charset='utf-8'>
tag which is likely to cause it to display incorrectly in a browser. If we cannot figure out for sure that a wrong encoding was used we just go ahead and write the report and hope for the best.
sorry for the long answer, hope this makes sense!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fail to see the use case for allowing the user to pass a file instead of a filename. According to some path, this feature should address the simple use case of writing an HTML file to disk. Can you motivate this a little bit? It looks like plain over-engineering to me TBH.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it was a suggestion from @glemaitre .
but if you look at functions offering to save some object as text such as matplotlib.savefig, pandas or polars to_csv etc, xml.etree.write, plotly write_html and others it's hard to find an example that does not allow passing a file object.
one example use case would be using a temporary file, testing, situations in which you may need to write either to an actual file or an in-memory buffer, ...
from personal experience a while ago in a completely unrelated context NiftiImages from the nibabel library could only be saved to a filesystem path and I can't remember what were the exact situation where it was a problem for me but I remember it was annoying
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jeromedockes for the explanation. it helped a lot.
@@ -123,6 +127,42 @@ def test_duration(df_module): | |||
assert re.search(r"2(\.0)?\s+days", TableReport(df).html()) | |||
|
|||
|
|||
@pytest.mark.parametrize("filename_type", ["str", "Path", "file_object"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we want to be super exhaustive maybe we could test files both in text and binary modes?
df = pd_module.make_dataframe({"a": [1, 2], "b": [3, 4]}) | ||
report = TableReport(df) | ||
|
||
with TemporaryDirectory() as td: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we could use the pytest tmp_path
fixture?
filename = tmp_file_path | ||
|
||
report.write_html(filename) | ||
assert tmp_file_path.exists() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need to check the full content but maybe we could see that the file contains "</html>
" to see that the full report has been written
Users will be interested in this feature so we also need a changelog entry :) thanks! |
I am block to make the test for platform encoding. The following test doesn't creates the error message for the moment. def test_write_html_platform_encoding_not_utf8(tmp_path, pd_module, monkeypatch):
df = pd_module.make_dataframe({"a": [1, 2], "b": [3, 4]})
report = TableReport(df)
monkeypatch.setattr(locale, "getencoding", lambda: "ascii")
filename = open(tmp_path / Path("report.html"), "w", encoding=None)
err_msg = f"Platform encoding is not utf-8; got {locale.getencoding()}"
with pytest.raises(ValueError, match=err_msg):
report.write_html(filename) In addition the CI fails for the binary mode. Any suggestion @jeromedockes |
I don't think we can set the default encoding just by patching >>> import locale
>>> locale.getencoding = lambda: 'ascii'
>>> with open("thefile", "w", encoding=None) as f:
... print(f.encoding)
UTF-8 I think it's fine not to test this, just testing when we open a file for which we explicitly set a wrong encoding is enough IMO.
from the log it seems it fails when trying to read back the report to check the content, right? I think it is because we saved the report in utf-8 (as we should) but when we open it for reading we don't set an explicit encoding so on windows it ends up using the incorrect default encoding. I think if we replace with open(tmp_file_path, "r") as file: with with open(tmp_file_path, "r", encoding="utf-8") as file: it should fix the issue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all the tests are passing 🎉 a couple more details but we are almost there :)
"filename_type", | ||
["str", "Path", "file_object", "binary_mode"], | ||
) | ||
def test_write_html(tmp_path, pd_module, filename_type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test is looking great! the last thing we need to take care of is to make sure we close the file if we opened it (otherwise we "leak" a resource: the opened file handle. that could be a problem for example to clean up the temp directory on windows as it refuses to remove files that have an open file handle). Usually we ensure that with a simple context manager like:
with open(tmp_file_path, 'w', encoding='utf-8') as file:
file.write('hello')
but here we have a tricky situation because in some cases we have a string or path (which require no closing) and sometimes we have a file object which does require closing.
The standard library module contextlib
provides 2 ways to deal with that situation easily. The first is ExitStack
: it creates a context and we can push as many context managers as we want to its stack; when it exits it unwinds the stack, calling each manager's __exit__
when it is popped. So we could use it like:
with contextlib.ExitStack() as stack:
if file_type == 'str':
file = str(tmp_file_path)
elif file_type == 'text_file_object':
file = stack.enter_context(open(tmp_file_path, 'w', encoding='utf-8'))
# ...
report.write_html(file)
# if we opened it the file is closed here when we exit the `with` block
This option using ExitStack is my favorite because the file is being managed by a context manager as soon as it is opened.
Another way is to use nullcontext
in the cases where we do not open the file, so that later we can treat all options as if they were open files that implement the context manager protocol. nullcontext
returns an object that implements the context manager protocol but whose __enter__
just returns the object we gave it and __exit__
does nothing:
if file_type == 'str':
file = contextlib.nullcontext(str(tmp_file_path))
elif file_type == 'text_file_object':
file = open(tmp_file_path, 'w', encoding='utf-8')
with file:
report.write_html(file)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow contextlib.ExitStack() is very nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you could add a comment on L139 to explain why it's a good idea to use contextlib here?
df = pd_module.make_dataframe({"a": [1, 2], "b": [3, 4]}) | ||
report = TableReport(df) | ||
|
||
filename = open(tmp_path / Path("report.html"), "w", encoding="latin-1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here also we want to use a context manager to close the file
report.write_html(filename) | ||
assert not filename.exists() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
report.write_html(filename) | |
assert not filename.exists() | |
report.write_html(filename) | |
assert not filename.exists() |
otherwise the assert is never executed because of the exception raised on the line above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually change this to check if the file doesn't contain html
to make sure we don't modify it.
sorry I forgot to answer that. The reason is that every time we generate the report with actually I'm thinking now that we don't need to vary the id in the |
Co-authored-by: Jérôme Dockès <[email protected]>
Co-authored-by: Jérôme Dockès <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! thanks so much for this very nice addition @mrastgoo 🚀
@glemaitre and @Vincent-Maladiere LMK if you want to have another look otherwise I'll merge it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a few lines of comment to explain the rational behind the steps of the function, from what has been discussed here
f" {encoding!r}" | ||
) | ||
|
||
file.write(html) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment explaining what html
is expected (or not expected) to be at line 230?
Additionally, light inline documentation/comments on the steps above would help readability :)
"filename_type", | ||
["str", "Path", "file_object", "binary_mode"], | ||
) | ||
def test_write_html(tmp_path, pd_module, filename_type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you could add a comment on L139 to explain why it's a good idea to use contextlib here?
I added some comments in the function, Let me know if this is ok |
closes #1174
Adding
write_html
toTableReport
.The added test only checks for the raise error and if the file exist
The following code for checking the content failed and needs further investigation