Return an error when trying to decompose a node with html tag #87

Open
HugoLaurencon opened this issue Apr 28, 2023 · 4 comments

@HugoLaurencon

Hello, I believe it would be beneficial to raise an error when node.decompose() is called on a node that has the html tag. Currently, the call hangs and does not produce any error message.

When processing a large number of HTML files, a few atypical ones carry the attribute we use to select nodes for removal directly on the main html node. Without any error message, it is hard to identify the source of the problem, which makes debugging difficult.

@rushter
Owner

rushter commented Apr 29, 2023

Please provide an example that hangs.

import selectolax.parser

html = "<body><div></div></body>"
html_parser = selectolax.parser.HTMLParser(html)
print(html_parser.root.decompose())

Works fine for me (.root is the <html> element).

@HugoLaurencon
Author

Sure, here is my example, which results in an infinite loop:

from selectolax.parser import HTMLParser

html_str = """
<!DOCTYPE html>
<html class="site-info">
</html>
"""

def _remove_nodes_matching_css_rules(selectolax_tree):
    modification = True
    while modification:
        found_a_node = False
        for node in selectolax_tree.css("[class~='site-info']"):
            node.decompose()
            found_a_node = True
            break
        if not found_a_node:
            modification = False
    return selectolax_tree


selectolax_tree = HTMLParser(html_str)

selectolax_tree = _remove_nodes_matching_css_rules(
    selectolax_tree=selectolax_tree,
)

Actually, you're right that we can decompose the html node. The infinite loop happens afterwards: I think the attributes of the html node are kept after calling decompose(), so the selector keeps matching it.
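
In case it helps anyone who hits the same hang, here is a minimal sketch of a defensive variant of the loop above. It assumes only the css_first(), tag, and decompose() members that selectolax exposes; the function name remove_nodes_matching and the max_passes cap are hypothetical, not part of the library. It fails loudly when the selector hits the root node instead of spinning forever.

```python
def remove_nodes_matching(tree, selector, max_passes=10_000):
    """Decompose every node matching `selector`, re-querying after each removal.

    `tree` is expected to behave like a selectolax HTMLParser: it must offer
    css_first(selector), and the returned nodes must offer .tag and .decompose().
    """
    for _ in range(max_passes):
        node = tree.css_first(selector)
        if node is None:
            return tree  # nothing left to remove
        if node.tag == "html":
            # Per the report above, the root stays matchable after decompose(),
            # so raise instead of looping on it.
            raise ValueError(f"selector {selector!r} matches the root <html> node")
        node.decompose()
    # Safety net (hypothetical cap) for any other case where removal has no effect.
    raise RuntimeError(f"still matching {selector!r} after {max_passes} passes")
```

Called as remove_nodes_matching(HTMLParser(html_str), "[class~='site-info']") on the document above, this should raise ValueError rather than hang, which makes the offending file easy to spot in a large batch.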

@rushter
Owner

rushter commented Apr 30, 2023

Your code will work if you switch to the lexbor backend.

Why do you need to remove the html tag? It's essential for any document and gets automatically created even if you don't provide it.

@lexborisov What's the best way to handle this? I'm using myhtml_tree_node_remove. Can I just set myhtml_tree_t->node_html manually? The result of remove is not propagated to the myhtml_tree_t structure because we are trying to remove the root node and myhtml_tree_t points to the root node already.

@lexborisov

@rushter

> What's the best way to handle this?

I don't understand why I have to remove the html node at all?
But never mind.

> Can I just set myhtml_tree_t->node_html manually?

Yes, you can.
