Use html5-parser instead of lxml #410
Conversation
Replace all `BeautifulSoup(html, 'lxml')` with `html5_parser.parse(html, treebuilder='soup', return_root=False)`. The aim is to improve robustness and speed.
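For illustration, a minimal sketch of the swap (the function name here is a placeholder, not code from the actual diff):

```python
import html5_parser


def parse_html(html):
    # Previously: BeautifulSoup(html, 'lxml')
    # html5-parser now does the parsing and, with treebuilder='soup', still
    # hands back a BeautifulSoup document; return_root=False returns the
    # soup object itself rather than its root element.
    return html5_parser.parse(html, treebuilder='soup', return_root=False)
```

Because the result is still a BeautifulSoup document, the rest of the diff code can keep working with it exactly as it did before.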
I'm concerned about this note in the html5-parser docs (https://html5-parser.readthedocs.io/en/latest/): “It is important that lxml is installed with the --no-binary flag. This is because without it, lxml uses a static copy of libxml2. For html5-parser to work it must use the same libxml2 implementation as lxml. This is only possible if libxml2 is loaded dynamically.” I couldn't find a way to include `--no-binary` in `setup.py`.
This looks good, except for the fact that we are tripping over a problematic bug in html5-parser that eventually causes non-body parts of a page to get moved to the body. We need to wait for that to get fixed upstream or apply a workaround here (possibly by re-implementing `html5_parser.soup.parse` here).
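As an aside, a quick way to see whether a given html5-parser build exhibits that relocation bug (purely illustrative, not part of this PR) is to parse a well-formed page and check that head content stays put:

```python
import html5_parser

# Hypothetical sanity check: head content should remain in <head>,
# not get relocated into <body>.
doc = ('<html><head><meta charset="utf-8"><title>t</title></head>'
       '<body><p>hi</p></body></html>')
soup = html5_parser.parse(doc, treebuilder='soup', return_root=False)

assert soup.head.find('title') is not None, 'title missing from <head>'
assert soup.body.find('title') is None, 'title was moved into <body>'
```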
> I couldn't find a way to include `--no-binary` in `setup.py`.
The current lxml docs talk about needing to set `STATIC_DEPS=true` in order to get what sounds like the problematic behavior the html5-parser docs describe. I wonder if this has changed since that note was written (it’s two years old). In any case, everything seems to be working fine for me without specifying `--no-binary`.
^^ @danielballan do you have any experience with/thoughts about the `--no-binary` issue?
Ah, it looks like we need these lines: web-monitoring-processing/Dockerfile, Lines 5 to 6 in 76a3710.
You also need:
Required to install `html5-parser` and `libxml2`.
Huh, you’re right, that’s really clearly documented. I’m surprised building the Docker image didn’t give me a compiler error on the…
This looks promising, though I have never encountered it in the wild myself.
I confirmed that

`python -m venv ~/venv/test-no-binary`
`echo "lxml --no-binary lxml" > req.txt`
`pip install -r req.txt`

compiles from source, circumventing the binary wheel, per the link above, but this does fall in the category of things that are valid in `requirements.txt` but not in `install_requires`.
Alright, so the real question is: is that OK? It’s certainly not ideal. I lean towards “yes”. Agree/disagree?
All issues are addressed now.
Not a big deal (I don’t think it would ever impact functionality), but we should probably change `links_diff._create_empty_soup` so that we are consistent in how we parse across the whole `links_diff` module.
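A rough sketch of that change (the helper’s signature here is assumed, not taken from the actual module):

```python
import html5_parser


def _create_empty_soup(title=''):
    # Sketch: build the empty document with html5-parser too, so the whole
    # links_diff module parses HTML the same way instead of falling back to
    # BeautifulSoup(..., 'lxml') in this one helper.
    return html5_parser.parse(
        f'<!DOCTYPE html><html><head><meta charset="utf-8">'
        f'<title>{title}</title></head><body></body></html>',
        treebuilder='soup',
        return_root=False)
```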
requirements.txt (Outdated)
@@ -5,6 +5,7 @@ diff_match_patch_python ~=1.0.2
docopt ~=0.6.2
git+https://github.com/anastasia/htmldiffer@develop
git+https://github.com/danielballan/htmltreediff@customize
+html5-parser ~=0.4.7
Per @danielballan’s comment, I think this still needs `--no-binary lxml` on the end.
@danielballan that’s how this should be set up, right?
@Mr0grog when I do that, I get an error. Check out:
https://travis-ci.org/edgi-govdata-archiving/web-monitoring-processing/builds/541947661?utm_source=github_status&utm_medium=notification
Seems like you’ll need to improve the parsing in `setup.py` to support this:

web-monitoring-processing/setup.py, Lines 22 to 25 in 19d9de2

# Requirements not on PyPI can't be installed through `install_requires`.
# They have to be installed manually or with `pip install -r requirements.txt`.
requirements = [r for r in read('requirements.txt').splitlines()
                if not r.startswith('git+https://')]
@Mr0grog I'm sorry, but I'm a bit confused. What do you suggest we do here? Remove `--no-binary lxml`?
Well, the format of lines in a pip requirements file is:

`<requirement specifier> [; markers] [[--option]...]`

Whereas `install_requires` in `setup.py` can only have `<requirement specifier>`. So it seems like the right move is to remove anything after the specifier.

It’s not perfectly accurate, but the easiest thing might be to just look for `;` or `--` on the line and, if present, remove that string and everything after it. That would probably do the job well enough for our purposes, at least.
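A minimal sketch of that approach (the helper name is made up; the real change would sit next to the `requirements` list quoted above):

```python
def strip_requirement_options(line):
    # Keep only the requirement specifier: drop environment markers
    # (everything after ';') and pip options such as '--no-binary'
    # (everything after '--'), since install_requires accepts neither.
    for separator in (';', '--'):
        if separator in line:
            line = line.split(separator, 1)[0]
    return line.strip()


# e.g. 'lxml --no-binary lxml'    ->  'lxml'
#      'html5-parser ~=0.4.7'     ->  'html5-parser ~=0.4.7'
```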
OK, it's done! Thank you for the explanation, @Mr0grog. Could you please merge now?
Looks like there are still some linting errors, but I will take a look and see about merging when I am off work tonight or tomorrow.
Linting error fixed!
Noted a couple of tiny issues, otherwise this looks good to me. Thanks, @vbanos!
Use `html5_parser.parse` instead of `BeautifulSoup('...', 'lxml')` in `web_monitoring.links_diff._create_empty_soup` so that we are consistent in how we parse across the whole links_diff module.
Look for `;` or `--` on each requirements line and, if present, remove that string and everything after it.
Remove duplicate import
👍 Looks great!
I had tested a previous version of the Dockerfile in #410, but not the final one, which failed to build properly. It was missing libz. :(
In CI, this splits the image building part of the `publish_docker` job off into a `build_docker` job, and then runs that job alongside our normal tests. That means all our PRs will try to build (but not publish) Docker images, and we’ll know if there’s a problem with changes to the Dockerfile or some other part of the build process. In #410, I did a bad job testing the Docker build (I tested an earlier version, but forgot to test the final version that we merged, which failed). That led to a hotfix (#419), which is not fun. This should help prevent that kind of situation.
Use html5-parser instead of lxml when parsing HTML for diffs. (It still builds BeautifulSoup documents at the end of the day, so the code impact is minimal.) We think this should parse slightly faster, but more importantly, will do a better job parsing various malformed bits of HTML we encounter. Because this needs to link with the version of lxml we use, installation gets a bit more complicated (See https://html5-parser.readthedocs.io/en/latest/). At the end of the day, pip can still automate most of it, but we now have some special instructions in the pipfile. If we ever distribute the diff code as a package, we’ll have to make special note of this because `setup.py` doesn’t support proper installation.