
Use html5-parser instead of lxml #410

Merged: 10 commits merged into master on Jun 10, 2019

Conversation

@vbanos (Contributor) commented May 20, 2019

Replace all `BeautifulSoup(html, 'lxml')` with
`html5_parser.parse(html, treebuilder='soup', return_root=False)`.

The aim is to improve robustness and speed.
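
For context, the change at each call site looks roughly like this (a minimal sketch; `html` stands for any HTML string to parse):

html = '<p>An example document</p>'  # any HTML string

# Before: BeautifulSoup drives the parse through its lxml treebuilder.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# After: html5-parser does the parsing (in C) and builds the same kind
# of BeautifulSoup document, so downstream code is unchanged.
import html5_parser
soup = html5_parser.parse(html, treebuilder='soup', return_root=False)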
@vbanos mentioned this pull request May 20, 2019
@vbanos (Contributor, Author) commented May 20, 2019

I'm concerned about html5-parser installation.
You need to install it using `pip install --no-binary lxml html5-parser` because of the following, from https://html5-parser.readthedocs.io/en/latest/:

"It is important that lxml is installed with the --no-binary flag. This is because without it, lxml uses a static copy of libxml2. For html5-parser to work it must use the same libxml2 implementation as lxml. This is only possible if libxml2 is loaded dynamically."

I couldn't find a way to include --no-binary in setup.py.

@Mr0grog (Member) left a comment

This looks good, except for the fact that we are tripping over a problematic bug in html5-parser that eventually causes non-body parts of a page to get moved to the body. We need to wait for that to get fixed upstream or apply a workaround here (possibly by re-implementing html5_parser.soup.parse here).

I couldn't find a way to include --no-binary in setup.py.

The current lxml docs talk about needing to set STATIC_DEPS=true in order to get what sounds like the problematic behavior the html5-parser docs describe. I wonder if this has changed since that note was written (it’s two years old). In any case, everything seems to be working fine for me without specifying --no-binary.

^^ @danielballan do you have any experience with/thoughts about the --no-binary issue?

@Mr0grog (Member) commented May 21, 2019

Ah, it looks like we need to `apt-get install pkg-config libxml2-dev` in the Dockerfile now as well:

RUN apt-get update && apt-get install -y --no-install-recommends \
    git gcc g++

@vbanos (Contributor, Author) commented May 22, 2019

You also need libxslt-dev in addition to libxml2-dev, I think; see https://lxml.de/installation.html.
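
Putting these two comments together, the Dockerfile line quoted above would presumably grow to something like the following (a sketch; the exact Debian package names are assumptions based on this thread and the lxml install docs):

# Sketch: build deps for compiling lxml and html5-parser from source
# (package names assumed from the discussion above).
RUN apt-get update && apt-get install -y --no-install-recommends \
    git gcc g++ pkg-config libxml2-dev libxslt-dev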

Vangelis Banos added 2 commits May 22, 2019 20:35
@Mr0grog (Member) commented May 22, 2019

Huh, you’re right, that’s really clearly documented. I’m surprised building the Docker image didn’t give me a compiler error on the pip install step!

@danielballan (Contributor) commented

This looks promising

pypa/pip#4307 (comment)

though I have never encountered it in the wild myself.

@danielballan (Contributor) commented

I confirmed that

python -m venv ~/venv/test-no-binary
source ~/venv/test-no-binary/bin/activate
echo "lxml --no-binary lxml" > req.txt
pip install -r req.txt

compiles from source, circumventing the binary wheel, per the link above, but this does fall in the category of things that are valid in requirements.txt but not valid in the install_requires block of setup.py.

@Mr0grog (Member) commented May 30, 2019

this does fall in the category of things that are valid in requirements.txt but not valid in the install_requires block of setup.py.

Alright, so the real question is: is that OK? It’s certainly not ideal.

I lean towards “yes”:

  • Our local install instructions suggest using requirements.txt, so it shouldn’t matter for that case.
  • We don’t publish this on PyPI, and I think (?) we plan to do #206 (“Radical suggestion: Deprecate this repo and break up pieces”) before we do, so the set of things this might affect hopefully isn’t big.
  • I think we are getting a pretty huge improvement in parsing accuracy and performance here, and I feel like a slightly more complicated install is probably worth that.
    • I’m taking the fact that @vbanos submitted this PR instead of a comment on the issue (#138, “Refactor on top of html5-parser”) as an indicator that Internet Archive sees this similarly. Let me know if I’m misunderstanding.

Agree/disagree?

@vbanos (Contributor, Author) commented Jun 4, 2019

All issues are addressed now.
If you decide the installation process is acceptable, that would be great.

@Mr0grog (Member) left a comment

Not a big deal (I don’t think it would ever impact functionality), but we should probably change links_diff._create_empty_soup so that we are consistent in how we parse across the whole links_diff module.
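
For illustration, the consistent version might look something like this (a hypothetical sketch; the real function’s signature and template string may differ):

import html5_parser

def _create_empty_soup():
    # Hypothetical: build an empty document with the same parser used
    # everywhere else in links_diff, instead of BeautifulSoup(..., 'lxml').
    return html5_parser.parse('<html><head></head><body></body></html>',
                              treebuilder='soup',
                              return_root=False)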

requirements.txt (outdated diff)
@@ -5,6 +5,7 @@
 diff_match_patch_python ~=1.0.2
 docopt ~=0.6.2
 git+https://github.com/anastasia/htmldiffer@develop
 git+https://github.com/danielballan/htmltreediff@customize
+html5-parser ~=0.4.7
@Mr0grog (Member):
Per @danielballan’s comment, I think this still needs --no-binary lxml on the end.

@Mr0grog (Member):
@danielballan that’s how this should be set up, right?


@Mr0grog (Member) commented Jun 6, 2019:

Seems like you’ll need to improve the parsing in setup.py to support this:

# Requirements not on PyPI can't be installed through `install_requires`.
# They have to be installed manually or with `pip install -r requirements.txt`.
requirements = [r for r in read('requirements.txt').splitlines()
                if not r.startswith('git+https://')]

@vbanos (Contributor, Author):

@Mr0grog I'm sorry, but I'm a bit confused.
What do you suggest doing here? Remove --no-binary lxml?

@Mr0grog (Member):

Well, the format of lines in a pip requirements file is:

<requirement specifier> [; markers] [[--option]...]

Whereas install_requires in setup.py can only have <requirement specifier>. So it seems like the right move is to remove anything after the specifier.

It’s not perfectly accurate, but the easiest thing might be to just look for ; or -- on the line and, if present, remove that string and everything after it. That would probably do the job well enough for our purposes, at least.
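
Building on the setup.py snippet quoted above, that cleanup could look something like this (a sketch; `_bare_specifier` is a hypothetical helper name, and `read` is the helper from the existing snippet):

import re

def _bare_specifier(line):
    # Hypothetical helper: cut the line at the first ';' (environment
    # marker) or '--' (pip-only option such as --no-binary) and keep
    # only the bare requirement specifier.
    return re.split(r';|--', line, maxsplit=1)[0].strip()

requirements = [_bare_specifier(r)
                for r in read('requirements.txt').splitlines()
                if not r.startswith('git+https://')]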

@vbanos (Contributor, Author):

OK, it's done! Thank you for the explanation, @Mr0grog. Could you please merge now?

@Mr0grog (Member):

Looks like there are still some linting errors, but I'll take a look and see about merging when I'm off work tonight or tomorrow.

@vbanos (Contributor, Author):

Linting error fixed!

@Mr0grog (Member) commented Jun 5, 2019

Noted a couple tiny issues, otherwise this looks good to me. Thanks, @vbanos!

Vangelis Banos and others added 5 commits June 5, 2019 20:13:

  • Use `html5_parser.parse` instead of `BeautifulSoup('...', 'lxml')` in `web_monitoring.links_diff._create_empty_soup` so that we are consistent in how we parse across the whole links_diff module.
  • Look for `;` or `--` on each requirements line and, if present, remove that string and everything after it.
  • Remove duplicate import.
@Mr0grog (Member) commented Jun 10, 2019

👍 Looks great!

@Mr0grog merged commit 8d65ff4 into edgi-govdata-archiving:master on Jun 10, 2019
Mr0grog added a commit that referenced this pull request Jun 10, 2019
Mr0grog added a commit that referenced this pull request Jun 10, 2019:
I had tested a previous version of the dockerfile in #410, but not the final one, which fails to build properly. It was missing libz. :(
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this pull request Jun 10, 2019
Mr0grog added a commit that referenced this pull request Jun 14, 2019:
In CI, this splits the image building part of the `publish_docker` job off into a `build_docker` job, and then runs that job alongside our normal tests. That means all our PRs will try to build (but not publish) Docker images, and we’ll know if there’s a problem with changes to the Dockerfile or some other part of the build process.

In #410, I did a bad job testing the Docker build (I tested an earlier version, but forgot to test the final version that we merged, which failed). That led to a hotfix (#419), which is not fun. This should help prevent that kind of situation.
ftsalamp pushed a commit to ftsalamp/web-monitoring-processing that referenced this pull request Jun 19, 2019:
Use html5-parser instead of lxml when parsing HTML for diffs. (It still builds BeautifulSoup documents at the end of the day, so the code impact is minimal.) We think this should parse slightly faster, but more importantly, will do a better job parsing various malformed bits of HTML we encounter.

Because this needs to link with the version of lxml we use, installation gets a bit more complicated (See https://html5-parser.readthedocs.io/en/latest/). At the end of the day, pip can still automate most of it, but we now have some special instructions in the pipfile. If we ever distribute the diff code as a package, we’ll have to make special note of this because `setup.py` doesn’t support proper installation.
@vbanos deleted the html5parser branch August 4, 2019 10:43