Return an error when trying to decompose a node with `html` tag #87

HugoLaurencon · 2023-04-28T00:31:47Z

Hello, I believe it would be beneficial to generate an error message when we execute the command node.decompose() on a node that has the html tag. Currently, the procedure freezes and does not produce any error messages.

When processing a large number of HTML files, some atypical ones carry the attribute we use to select nodes for removal directly on the root `html` node. Without an error message, it is hard to identify the source of the problem, which makes debugging difficult.

rushter · 2023-04-29T16:42:53Z

Please provide an example that hangs.

import selectolax.parser

html = "<body><div></div></body>"
html_parser = selectolax.parser.HTMLParser(html)
print(html_parser.root.decompose())

Works fine for me (.root is the <html> element).

HugoLaurencon · 2023-04-29T22:48:10Z

Sure, here is my example, which results in an infinite loop:

from selectolax.parser import HTMLParser

html_str = """
<!DOCTYPE html>
<html class="site-info">
</html>
"""

def _remove_nodes_matching_css_rules(selectolax_tree):
    modification = True
    while modification:
        found_a_node = False
        for node in selectolax_tree.css("[class~='site-info']"):
            node.decompose()
            found_a_node = True
            break
        if not found_a_node:
            modification = False
    return selectolax_tree


selectolax_tree = HTMLParser(html_str)

selectolax_tree = _remove_nodes_matching_css_rules(
    selectolax_tree=selectolax_tree,
)

Actually, you're right that we can decompose the html node. The infinite loop comes afterwards: I think the attributes of the html node are kept after calling decompose, so the selector keeps matching it.
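Until the library handles this case, the error the issue asks for can be raised in user code. The sketch below is a hypothetical helper (the name `remove_matching_nodes` is not part of selectolax). It assumes only that the tree exposes `css_first()` and that nodes expose `.tag` and `.decompose()`, as selectolax's `HTMLParser` and `Node` do. It keeps the re-query-after-each-removal pattern from the example above, but raises when the selector matches the root `html` node instead of looping.

```python
def remove_matching_nodes(tree, selector):
    """Decompose every node matching `selector`, one at a time.

    Hypothetical helper, not part of selectolax. `tree` is assumed to
    provide css_first(selector) returning a node or None; nodes are
    assumed to provide .tag and .decompose().
    """
    while True:
        # Re-query after each removal so we never touch a node that was
        # already removed together with its parent.
        node = tree.css_first(selector)
        if node is None:
            return tree
        if node.tag == "html":
            # Decomposing the root node is what led to the endless loop
            # reported in this thread, so fail loudly instead.
            raise ValueError(
                f"selector {selector!r} matches the root <html> node; "
                "refusing to decompose it"
            )
        node.decompose()
```

Called with the tree and selector from the example above, `remove_matching_nodes(selectolax_tree, "[class~='site-info']")`, this should raise a `ValueError` for that document rather than hang, which makes the atypical files easy to spot in a large batch.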

rushter · 2023-04-30T10:26:52Z

Your code will work if you switch to the lexbor backend.

Why do you need to remove the html tag? It's essential for any document and gets automatically created even if you don't provide it.

@lexborisov What's the best way to handle this? I'm using myhtml_tree_node_remove. Can I just set myhtml_tree_t->node_html manually? The result of remove is not propagated to the myhtml_tree_t structure because we are trying to remove the root node and myhtml_tree_t points to the root node already.

lexborisov · 2023-04-30T13:31:49Z

@rushter

> What's the best way to handle this?

I don't understand why I have to remove the html node at all?
But never mind.

> Can I just set myhtml_tree_t->node_html manually?

Yes, you can.
