modified _st function to detect doc type #234

Open: shivamshan wants to merge 1 commit into master

Conversation

shivamshan

This PR modifies the _st function so that it automatically detects the document type when None is passed. It therefore closes #194.


Code Proof

For HTML

>>> from parsel import Selector
>>> selector = Selector(text="""<html>
...     <body>
...         <h1>Hello, Parsel!</h1>
...         <ul>
...             <li><a href="http://example.com">Link 1</a></li>
...             <li><a href="http://scrapy.org">Link 2</a></li>
...         </ul>
...     </body>
...     </html>""")
>>> selector.type
'html'

For XML

>>> from parsel import Selector
>>> selector = Selector(text="""<?xml version = "1.0"?>
... <contactinfo>
...     <address category = "college">
...         <name>G4G</name>
...         <College>Geeksforgeeks</College>
...         <mobile>2345456767</mobile>
...     </address>
... </contactinfo>""")
>>> selector.type
'xml'

Signed-off-by: Shivam Shandilya [email protected]

@Gallaecio
Member

This implementation does not seem very reliable.

For example, this would be interpreted as XML:

<!DOCTYPE html>
<html lang="en">
  <head>...</head>
  <body>...</body>
</html>

While this would be interpreted as HTML:

<?xml version="1.0" encoding="UTF-8"?>
<description>
    <plain>foo</plain>
    <html><![CDATA[<b>f</b>oo]]></html>
</description>

{
    "plain": "foo",
    "html": "<html></html>"
}

Moreover, this breaks existing tests and does not provide new tests to cover the new behavior.
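
The PR's diff is not shown in this conversation, so the following is only a guess at the kind of check that would produce the misclassifications listed above. It is a minimal sketch that assumes the detection is a plain substring test for "<html>"; the function name naive_detect_type is hypothetical and is not part of parsel.

```python
def naive_detect_type(text: str) -> str:
    """Hypothetical stand-in for a substring-based check (not the PR's code)."""
    # Assumption: anything containing the literal "<html>" is treated as HTML,
    # everything else as XML.
    return "html" if "<html>" in text else "xml"


# The three documents from the review comment above.
HTML5_DOC = """<!DOCTYPE html>
<html lang="en">
  <head>...</head>
  <body>...</body>
</html>"""

XML_DOC = """<?xml version="1.0" encoding="UTF-8"?>
<description>
    <plain>foo</plain>
    <html><![CDATA[<b>f</b>oo]]></html>
</description>"""

JSON_DOC = """{
    "plain": "foo",
    "html": "<html></html>"
}"""

results = {
    "html5": naive_detect_type(HTML5_DOC),  # root tag has an attribute, so no "<html>"
    "xml": naive_detect_type(XML_DOC),      # contains an element named <html>
    "json": naive_detect_type(JSON_DOC),    # contains "<html>" inside a string value
}
```

Under that assumption, the HTML5 document is labelled XML because its root tag carries an attribute, while the XML and JSON inputs are labelled HTML only because the substring happens to appear in them.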

@shivamshan
Author

Okay, I will try to come up with something more robust.
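
One possible direction, sketched here only as an illustration and not as what the author went on to implement: look at how the document starts rather than at what it contains. The name guess_type, the two regular expressions, and the fallback to "html" are all assumptions made for this example.

```python
import re

# Assumption: an XML declaration at the very start is the strongest XML signal.
_XML_DECL = re.compile(r"^\s*<\?xml\b", re.IGNORECASE)
# Assumption: an HTML doctype or an <html ...> root tag at the start means HTML.
_HTML_START = re.compile(r"^\s*<(?:!doctype\s+html|html)\b", re.IGNORECASE)


def guess_type(text: str, default: str = "html") -> str:
    """Hypothetical prefix-based heuristic; not part of parsel."""
    if _XML_DECL.match(text):
        return "xml"
    if _HTML_START.match(text):
        return "html"
    # Nothing recognisable at the start: fall back to the caller's default.
    return default


html5 = guess_type('<!DOCTYPE html>\n<html lang="en"><body>...</body></html>')
xml = guess_type(
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<description><html><![CDATA[<b>f</b>oo]]></html></description>"
)
unknown = guess_type('{"html": "<html></html>"}', default="html")
```

This still has gaps: XHTML served with an XML declaration would be reported as XML, and XML without a declaration falls through to the default. Any such heuristic would also need new tests alongside the existing ones.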

Successfully merging this pull request may close these issues.

Detect automatically what is the type of document that will be parsed.