
Use html5-parser instead of lxml #410

Merged: 10 commits merged into master on Jun 10, 2019

Conversation

@vbanos (Contributor) commented May 20, 2019

Replace all `BeautifulSoup(html, 'lxml')` with
`html5_parser.parse(html, treebuilder='soup', return_root=False)`.

The aim is to improve robustness and speed.
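
For context, the change at each call site looks roughly like this (a minimal sketch; `html` stands for any HTML string to parse):

html = '<p>An example document</p>'  # any HTML string

# Before: BeautifulSoup drives the parse through its lxml treebuilder.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# After: html5-parser does the parsing (in C) and builds the same kind
# of BeautifulSoup document, so downstream code is unchanged.
import html5_parser
soup = html5_parser.parse(html, treebuilder='soup', return_root=False)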
@vbanos mentioned this pull request May 20, 2019
@vbanos (Contributor, Author) commented May 20, 2019

I'm concerned about html5-parser installation.
You need to install it using `pip install --no-binary lxml html5-parser` because of the following, from https://html5-parser.readthedocs.io/en/latest/:

"It is important that lxml is installed with the --no-binary flag. This is because without it, lxml uses a static copy of libxml2. For html5-parser to work it must use the same libxml2 implementation as lxml. This is only possible if libxml2 is loaded dynamically."

I couldn't find a way to include --no-binary in setup.py.

@Mr0grog (Member) left a comment

This looks good, except for the fact that we are tripping over a problematic bug in html5-parser that eventually causes non-body parts of a page to get moved to the body. We need to wait for that to get fixed upstream or apply a workaround here (possibly by re-implementing html5_parser.soup.parse here).

I couldn't find a way to include --no-binary in setup.py.

The current lxml docs talk about needing to set STATIC_DEPS=true in order to get what sounds like the problematic behavior the html5-parser docs describe. I wonder if this has changed since that note was written (it’s two years old). In any case, everything seems to be working fine for me without specifying --no-binary.

^^ @danielballan do you have any experience with/thoughts about the --no-binary issue?

@Mr0grog (Member) commented May 21, 2019

Ah, it looks like we need to `apt-get install pkg-config libxml2-dev` in the Dockerfile now as well:

RUN apt-get update && apt-get install -y --no-install-recommends \
    git gcc g++

@vbanos (Contributor, Author) commented May 22, 2019

You also need libxslt-dev in addition to libxml2-dev, I think; see https://lxml.de/installation.html.
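
Putting these two comments together, the Dockerfile line quoted above would presumably grow to something like the following (a sketch; the exact Debian package names are assumptions based on this thread and the lxml install docs):

# Sketch: build deps for compiling lxml and html5-parser from source
# (package names assumed from the discussion above).
RUN apt-get update && apt-get install -y --no-install-recommends \
    git gcc g++ pkg-config libxml2-dev libxslt-dev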

Vangelis Banos added 2 commits May 22, 2019 20:35
@Mr0grog (Member) commented May 22, 2019

Huh, you’re right, that’s really clearly documented. I’m surprised building the Docker image didn’t give me a compiler error on the pip install step!

@danielballan (Contributor) commented

This looks promising

pypa/pip#4307 (comment)

though I have never encountered it in the wild myself.

@danielballan (Contributor) commented

I confirmed that

python -m venv ~/venv/test-no-binary
source ~/venv/test-no-binary/bin/activate
echo "lxml --no-binary lxml" > req.txt
pip install -r req.txt

compiles from source, circumventing the binary wheel, per the link above, but this does fall in the category of things that are valid in requirements.txt but not valid in the install_requires block of setup.py.

@Mr0grog (Member) commented May 30, 2019

this does fall in the category of things that are valid in requirements.txt but not valid in the install_requires block of setup.py.

Alright, so the real question is: is that OK? It’s certainly not ideal.

I lean towards “yes”:

  • Our local install instructions suggest using requirements.txt, so it shouldn’t matter for that case.
  • We don’t publish this on PyPI, and I think (?) we plan to do #206 (“Radical suggestion: Deprecate this repo and break up pieces”) before we do, so the set of things this might affect hopefully isn’t big.
  • I think we are getting a pretty huge improvement in parsing accuracy and performance here, and I feel like a slightly more complicated install is probably worth that.
    • I’m taking the fact that @vbanos submitted this PR instead of a comment on the issue (#138, “Refactor on top of html5-parser”) as an indicator that Internet Archive sees this similarly. Let me know if I’m misunderstanding.

Agree/disagree?

@vbanos (Contributor, Author) commented Jun 4, 2019

All issues are addressed now.
If you decide the installation process is acceptable, that would be great.

@Mr0grog (Member) left a comment

Not a big deal (I don’t think it would ever impact functionality), but we should probably change links_diff._create_empty_soup so that we are consistent in how we parse across the whole links_diff module.
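
For illustration, the consistent version might look something like this (a hypothetical sketch; the real function’s signature and template string may differ):

import html5_parser

def _create_empty_soup():
    # Hypothetical: build an empty document with the same parser used
    # everywhere else in links_diff, instead of BeautifulSoup(..., 'lxml').
    return html5_parser.parse('<html><head></head><body></body></html>',
                              treebuilder='soup',
                              return_root=False)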

requirements.txt (outdated diff)
@@ -5,6 +5,7 @@
 diff_match_patch_python ~=1.0.2
 docopt ~=0.6.2
 git+https://github.com/anastasia/htmldiffer@develop
 git+https://github.com/danielballan/htmltreediff@customize
+html5-parser ~=0.4.7
@Mr0grog (Member):
Per @danielballan’s comment, I think this still needs --no-binary lxml on the end.

@Mr0grog (Member):
@danielballan that’s how this should be set up, right?


@Mr0grog (Member) commented Jun 6, 2019:

Seems like you’ll need to improve the parsing in setup.py to support this:

# Requirements not on PyPI can't be installed through `install_requires`.
# They have to be installed manually or with `pip install -r requirements.txt`.
requirements = [r for r in read('requirements.txt').splitlines()
                if not r.startswith('git+https://')]

@vbanos (Contributor, Author):

@Mr0grog I'm sorry, but I'm a bit confused.
What do you suggest doing here? Remove --no-binary lxml?

@Mr0grog (Member):

Well, the format of lines in a pip requirements file is:

<requirement specifier> [; markers] [[--option]...]

Whereas install_requires in setup.py can only have <requirement specifier>. So it seems like the right move is to remove anything after the specifier.

It’s not perfectly accurate, but the easiest thing might be to just look for ; or -- on the line and, if present, remove that string and everything after it. That would probably do the job well enough for our purposes, at least.
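
Building on the setup.py snippet quoted above, that cleanup could look something like this (a sketch; `_bare_specifier` is a hypothetical helper name, and `read` is the helper from the existing snippet):

import re

def _bare_specifier(line):
    # Hypothetical helper: cut the line at the first ';' (environment
    # marker) or '--' (pip-only option such as --no-binary) and keep
    # only the bare requirement specifier.
    return re.split(r';|--', line, maxsplit=1)[0].strip()

requirements = [_bare_specifier(r)
                for r in read('requirements.txt').splitlines()
                if not r.startswith('git+https://')]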

@vbanos (Contributor, Author):

OK, it's done! Thank you for the explanation, @Mr0grog. Could you please merge now?

@Mr0grog (Member):

Looks like there are still some linting errors, but I'll take a look and see about merging when I'm off work tonight or tomorrow.

@vbanos (Contributor, Author):

Linting error fixed!

@Mr0grog (Member) commented Jun 5, 2019

Noted a couple tiny issues, otherwise this looks good to me. Thanks, @vbanos!

Vangelis Banos and others added 5 commits June 5, 2019 20:13:

  • Use `html5_parser.parse` instead of `BeautifulSoup('...', 'lxml')` in `web_monitoring.links_diff._create_empty_soup` so that we are consistent in how we parse across the whole links_diff module.
  • Look for `;` or `--` on each requirements line and, if present, remove that string and everything after it.
  • Remove duplicate import.
@Mr0grog (Member) commented Jun 10, 2019

👍 Looks great!

@Mr0grog merged commit 8d65ff4 into edgi-govdata-archiving:master on Jun 10, 2019
Mr0grog added a commit that referenced this pull request Jun 10, 2019
Mr0grog added a commit that referenced this pull request Jun 10, 2019:
I had tested a previous version of the dockerfile in #410, but not the final one, which fails to build properly. It was missing libz. :(
Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-ops that referenced this pull request Jun 10, 2019
Mr0grog added a commit that referenced this pull request Jun 14, 2019:
In CI, this splits the image building part of the `publish_docker` job off into a `build_docker` job, and then runs that job alongside our normal tests. That means all our PRs will try to build (but not publish) Docker images, and we’ll know if there’s a problem with changes to the Dockerfile or some other part of the build process.

In #410, I did a bad job testing the Docker build (I tested an earlier version, but forgot to test the final version that we merged, which failed). That led to a hotfix (#419), which is not fun. This should help prevent that kind of situation.
ftsalamp pushed a commit to ftsalamp/web-monitoring-processing that referenced this pull request Jun 19, 2019:
Use html5-parser instead of lxml when parsing HTML for diffs. (It still builds BeautifulSoup documents at the end of the day, so the code impact is minimal.) We think this should parse slightly faster, but more importantly, will do a better job parsing various malformed bits of HTML we encounter.

Because this needs to link with the version of lxml we use, installation gets a bit more complicated (See https://html5-parser.readthedocs.io/en/latest/). At the end of the day, pip can still automate most of it, but we now have some special instructions in the pipfile. If we ever distribute the diff code as a package, we’ll have to make special note of this because `setup.py` doesn’t support proper installation.
@vbanos deleted the html5parser branch August 4, 2019 10:43