-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
support AnyResponse #28
Conversation
a4c67a6
to
29eb601
Compare
1279996
to
300869d
Compare
"Whether to perform extraction using a browser request " | ||
"(browserHtml) or an HTTP request (httpResponseBody)." | ||
), | ||
default=ExtractFrom.browserHtml, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please elaborate, why is this change needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since Zyte API's Auto-Configurator isn't released yet, we can't retrieve the extraction source that was used on a given page. This forces the a default page type to be included in the Zyte API request.
We can revert back to None
when the Auto-Configurator is released soon as we'd have a way to get either browserHtml
or httpResponseBody
that was used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change fixing something, or is it done just in case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's done so that ZyteApiProvider
can properly request the proper extraction source when building the Zyte API Request: https://github.com/scrapy-plugins/scrapy-zyte-api/pull/161/files#diff-67cea92ffa3989fe30ffd7f4bbe868f953810e0f4fdf2ddfe7f18bc54fba505eR125-R135
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would it work if a user changes it back to None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, why would there be a regression, from scrapy-zyte-api point of view?
Sorry, I was referring to zyte-spider-templates.
I assume that the code which uses AnyResponse should work fine both with HttpResponse and BrowserResponse, that's why it's declaring AnyResponse dependency. So, when some code declares that it needs AnyResponse, and this code gets HttpResponse, we don't consider it as a regression - it's the expected behavior.
This is currently the case with scrapy-zyte-api and follows the expected behavior. 👍
If there are no other parameters, or no other parameters which require rendering, we can default to HttpResponse - it shouldn't affect anything else, just make AnyResponse use a cheaper method.
I guess my confusion comes from the fact that the ecommerce spider, uses HeuristicsProductNavigationPage which have the following dependencies:
ProductNavigation
AnyResponse
PageParams
In this case, there are no other parameters which require explicit rendering and so scrapy-zyte-api requests for HttpResponse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, there are no other parameters which require explicit rendering and so scrapy-zyte-api requests for HttpResponse.
But currently ProductNavigation does require rendering, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, if ProductNavigation
is requested alongside AnyResponse
, then it would request the following in Zyte API:
{
"url": url,
"productNavigation": True,
"productNavigationOptions": {"extractFrom": "httpResponseBody"},
"httpResponseBody": True,
"httpResponseHeaders": True,
}
Okay hmmm, I guess I already understand what you mean from the previous comment: "1. For now, we use extractFrom=browserHtml when user doesn't state explicitly the extraction method, and AnyResponse is needed by a strategy."
So in this case, having the ProductNavigation
and AnyResponse
parameters should have:
{
"url": url,
"productNavigation": True,
"productNavigationOptions": {"extractFrom": "browserHtml"},
"browserHtml": True,
}
But if it's only AnyResponse
, then it'd be httpResponseBody
and httpResponseHeaders
that are requested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe even
{
"url": url,
"producNavigation": True,
"browserHtml": True,
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the clarifications @kmike !
Updated scrapy-zyte-api scrapy-plugins/scrapy-zyte-api@a51f961 and zyte-spider-templates 53b9dfd according to this.
Codecov Report
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #28 +/- ##
=======================================
Coverage 98.61% 98.61%
=======================================
Files 12 12
Lines 504 506 +2
=======================================
+ Hits 497 499 +2
Misses 7 7
|
…s into fix-dupe-requests
e3548ec
to
e066993
Compare
Failing |
I cannot reproduce it either, but there should be no caching in CI, so I am quite puzzled here. |
One of the great mysteries of GH 😄 I presume it would resolve itself on master. |
Solved the mystery! Merge the main branch. (CI does not actually run in this branch, but in a virtual merge with the main one) |
…s into fix-dupe-requests
No luck there. I'm tempted to just create another PR once new versions of scrapy-poet and scrapy-zyte-api have been released. It'd be cleaner all together. |
I can actually reproduce the issue now with 2a4af89. |
Aha, I see the issue now. Thanks for the help @Gallaecio ! 🙌 |
Fixes #25.
Related PRs:
TODO:
scrapy-zyte-api
scrapy-poet