Selectors return corrupted, recursive DOM on some sites #184
Transferred the issue here since it doesn't seem to be a problem with Scrapy specifically but rather with Parsel, the underlying selector library:

```python
In [1]: from parsel import Selector, __version__

In [2]: __version__
Out[2]: '1.5.2'

In [3]: import requests

In [4]: sel = Selector(text=requests.get("https://www.base-search.net/Search/Results?lookfor=graph+visualisation").text)

In [5]: len(sel.css(".record-panel"))
Out[5]: 10

In [6]: len(sel.css(".link-gruen"))
Out[6]: 10

In [7]: len(sel.css(".record-panel").css(".link-gruen"))
Out[7]: 55
```
```python
>>> for panel in sel.css('.record-panel'):
...     print(len(panel.css('.link-gruen')))
...
10
9
8
7
6
5
4
3
2
1
```

Note that 55 = 10 + 9 + … + 1: each panel's subtree apparently contains every panel that follows it, so the same `.link-gruen` elements are matched repeatedly.
There is a bug in the source HTML (a comment terminated with `--!>` instead of `-->`) which browsers manage to fix, but lxml does not.

Workaround:

```python
>>> text = requests.get("https://www.base-search.net/Search/Results?lookfor=graph+visualisation").text
>>> text = text.replace('--!>', '-->')
>>> sel = Selector(text=text)
>>> len(sel.css(".record-panel").css(".link-gruen"))
10
>>> for panel in sel.css('.record-panel'):
...     print(len(panel.css('.link-gruen')))
...
1
1
1
1
1
1
1
1
1
1
```

I suggest we leave this open as a feature request. Hopefully #83 will allow fixing this, but this issue should remain open: if a new parser introduced as part of #83 does not fix this issue, we should look for alternative parsers that do handle it, or get support for this upstream in one of the supported parsers.
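For code that has to scrape such pages in the meantime, the workaround can be wrapped in a small helper. A minimal sketch (the `safe_selector` name is illustrative, not part of the Parsel API):

```python
from parsel import Selector

def safe_selector(html: str) -> Selector:
    # Normalize the malformed comment terminator ("--!>") before
    # parsing, so lxml does not nest the following siblings.
    return Selector(text=html.replace('--!>', '-->'))
```

With `sel = safe_selector(requests.get(url).text)`, the chained `.css()` calls above return the expected counts.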
Description
On a specific site, scrapy selectors (css and xpath) corrupt the DOM recursively and return an incorrect number of items as a result. I've encountered this issue while parsing base-search.net search results, but this bug might occur on other sites as well.
Steps to Reproduce
Example for base-search.net:

1. Run `scrapy shell "https://www.base-search.net/Search/Results?lookfor=graph+visualisation"`.
2. `response.css(".record-panel")`: output should be 10 items.
3. `response.css(".link-gruen")`: output should also be only 10 items.
4. `response.css(".record-panel").css(".link-gruen")`: output now returns 55(!) items, even though it has been determined there are only 10 `.link-gruen` items in the DOM.
5. `response.css(".record-panel .record-panel")`: returns a non-zero number of items, even though no element matching such nesting exists in the original DOM.
6. `response.css(".record-panel").css(".record-panel").css(".link-gruen")` returns 220 items, and `response.css(".record-panel").css(".record-panel").css(".record-panel").css(".link-gruen")` returns 715 items.
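The growth in the last step can be reproduced compactly; a minimal sketch for the same scrapy shell session (expected output taken from the counts above):

```python
# Each additional .css(".record-panel") hop inflates the result set.
sel = response.css(".record-panel")
for _ in range(3):
    print(len(sel.css(".link-gruen")))  # prints 55, 220, 715
    sel = sel.css(".record-panel")
```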
Expected behavior:
Only ten items should be returned in this example.
Actual behavior:
Each selector has a DOM that contains its own `.record-panel`, but also all following `.record-panel` divs, nested recursively. Chaining selectors on this corrupted DOM corrupts it even further, increasing the number of items returned without bound.
Reproduces how often: Always
Versions
Scrapy : 1.8.0
lxml : 4.5.0.0
libxml2 : 2.9.10
cssselect : 1.1.0
parsel : 1.5.2
w3lib : 1.21.0
Twisted : 19.10.0
Python : 3.7.5 (default, Nov 7 2019, 10:50:52) - [GCC 8.3.0]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019)
cryptography : 2.8
Platform : Linux-4.15.0-76-generic-x86_64-with-Ubuntu-18.04-bionic
Additional context
Issue happens on both css and xpath selectors: using equivalent xpath selectors leads to the same result.
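For instance, the standard XPath class-matching idiom behaves the same way as the CSS selectors; a sketch (counts per the CSS results above, on the affected page):

```python
# Chained descendant queries in XPath show the same inflated counts
# as the chained CSS queries.
panels = response.xpath(
    '//*[contains(concat(" ", normalize-space(@class), " "), " record-panel ")]'
)
links = panels.xpath(
    './/*[contains(concat(" ", normalize-space(@class), " "), " link-gruen ")]'
)
print(len(panels), len(links))  # 10 and 55 on the affected page
```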
Notice, by opening `view(response)`, that the DOM scrapy receives for parsing does not contain any recursive items: for example, selecting `.record-panel .record-panel` yields no results in the browser's selector (on the local file, not the live site). However, in scrapy, `response.css(".record-panel .record-panel")` returns 9 items, `response.css(".record-panel .record-panel .record-panel")` returns 8 items, and so on.