support AnyResponse #28

BurnzZ · 2024-01-16T10:40:19Z

Fixes #25.

Related PRs:

TODO:

Use new release of scrapy-zyte-api
Use new release of scrapy-poet

zyte_spider_templates/spiders/ecommerce.py

kmike · 2024-01-17T09:28:03Z

zyte_spider_templates/spiders/base.py

+            "Whether to perform extraction using a browser request "
+            "(browserHtml) or an HTTP request (httpResponseBody)."
+        ),
+        default=ExtractFrom.browserHtml,


Could you please elaborate, why is this change needed?

Since Zyte API's Auto-Configurator isn't released yet, we can't retrieve the extraction source that was used on a given page. This forces a default page type to be included in the Zyte API request.

We can revert to None once the Auto-Configurator is released, since we'd then have a way to tell whether browserHtml or httpResponseBody was used.

Is this change fixing something, or is it done just in case?

It's done so that ZyteApiProvider can request the correct extraction source when building the Zyte API request: https://github.com/scrapy-plugins/scrapy-zyte-api/pull/161/files#diff-67cea92ffa3989fe30ffd7f4bbe868f953810e0f4fdf2ddfe7f18bc54fba505eR125-R135

How would it work if a user changes it back to None?

Hm, why would there be a regression, from scrapy-zyte-api point of view?

Sorry, I was referring to zyte-spider-templates.

I assume that the code which uses AnyResponse should work fine both with HttpResponse and BrowserResponse, that's why it's declaring AnyResponse dependency. So, when some code declares that it needs AnyResponse, and this code gets HttpResponse, we don't consider it as a regression - it's the expected behavior.

This is currently the case with scrapy-zyte-api and follows the expected behavior. 👍

If there are no other parameters, or no other parameters which require rendering, we can default to HttpResponse - it shouldn't affect anything else, just make AnyResponse use a cheaper method.
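The rule described above can be sketched in plain Python. This is only an illustration, not the actual scrapy-zyte-api provider code: the function name `resolve_any_response` and the `RENDERING_DEPENDENCIES` set are hypothetical, and which dependencies count as "requiring rendering" is an assumption made for the example.

```python
from typing import Set

# Hypothetical: dependency names that this sketch treats as requiring
# browser rendering. The real provider may decide this differently.
RENDERING_DEPENDENCIES = {"BrowserResponse", "BrowserHtml"}


def resolve_any_response(dependencies: Set[str]) -> str:
    """Pick the concrete response type used to satisfy AnyResponse.

    If nothing else in the dependency set needs rendering, fall back to
    the cheaper HttpResponse; otherwise reuse the browser response.
    """
    if "AnyResponse" not in dependencies:
        raise ValueError("AnyResponse was not requested")
    if dependencies & RENDERING_DEPENDENCIES:
        return "BrowserResponse"
    return "HttpResponse"
```

Under these assumptions, code that declares only AnyResponse (plus non-rendering dependencies such as PageParams) gets an HttpResponse, which is the expected behavior rather than a regression.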

I guess my confusion comes from the fact that the ecommerce spider uses HeuristicsProductNavigationPage, which has the following dependencies:

ProductNavigation

AnyResponse

PageParams

In this case, there are no other parameters which require explicit rendering and so scrapy-zyte-api requests for HttpResponse.

> In this case, there are no other parameters which require explicit rendering and so scrapy-zyte-api requests for HttpResponse.

But currently ProductNavigation does require rendering, right?

Currently, if ProductNavigation is requested alongside AnyResponse, then it would request the following in Zyte API:

{ "url": url, "productNavigation": True, "productNavigationOptions": {"extractFrom": "httpResponseBody"}, "httpResponseBody": True, "httpResponseHeaders": True, }

Okay hmmm, I guess I already understand what you mean from the previous comment: "1. For now, we use extractFrom=browserHtml when user doesn't state explicitly the extraction method, and AnyResponse is needed by a strategy."

So in this case, having the ProductNavigation and AnyResponse parameters should have:

{ "url": url, "productNavigation": True, "productNavigationOptions": {"extractFrom": "browserHtml"}, "browserHtml": True, }

But if it's only AnyResponse, then it'd be httpResponseBody and httpResponseHeaders that are requested.

Or maybe even

{ "url": url, "productNavigation": True, "browserHtml": True, }
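To make the cases discussed in this thread concrete, here is a minimal sketch of how the Zyte API parameters could be assembled. It is not the scrapy-zyte-api implementation: `build_params` and its keyword arguments are hypothetical names, and the sketch follows the variant that sends `productNavigationOptions` explicitly rather than the shorter form suggested last.

```python
from typing import Any, Dict, Optional


def build_params(
    url: str,
    *,
    product_navigation: bool,
    any_response: bool,
    extract_from: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble Zyte API request parameters (illustrative only)."""
    params: Dict[str, Any] = {"url": url}
    if product_navigation:
        # When the user does not state the extraction method explicitly,
        # default to browserHtml, as agreed in the discussion.
        source = extract_from or "browserHtml"
        params["productNavigation"] = True
        params["productNavigationOptions"] = {"extractFrom": source}
    else:
        # Only AnyResponse is needed: use the cheaper HTTP response.
        source = "httpResponseBody"
    if any_response:
        if source == "browserHtml":
            params["browserHtml"] = True
        else:
            params["httpResponseBody"] = True
            params["httpResponseHeaders"] = True
    return params
```

For example, `build_params(url, product_navigation=True, any_response=True)` yields the browserHtml request shown above, while passing `extract_from="httpResponseBody"` reproduces the earlier httpResponseBody variant.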

Thanks for all the clarifications @kmike !

Updated scrapy-zyte-api scrapy-plugins/scrapy-zyte-api@a51f961 and zyte-spider-templates 53b9dfd according to this.

codecov-commenter · 2024-01-31T11:46:46Z

Codecov Report

Merging #28 (ed384b4) into main (7df8e6c) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head ed384b4 differs from pull request most recent head 3ba3a98. Consider uploading reports for the commit 3ba3a98 to get more accurate results

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #28   +/-   ##
=======================================
  Coverage   98.61%   98.61%           
=======================================
  Files          12       12           
  Lines         504      506    +2     
=======================================
+ Hits          497      499    +2     
  Misses          7        7

Files	Coverage Δ
...r_templates/pages/product_navigation_heuristics.py	`100.00% <100.00%> (ø)`
zyte_spider_templates/spiders/base.py	`100.00% <100.00%> (ø)`
zyte_spider_templates/spiders/ecommerce.py	`98.79% <ø> (-0.10%)`	⬇️

BurnzZ · 2024-02-07T06:05:23Z

Failing docs CI seems to be due to a cache issue in the rst files. I cannot reproduce it locally.

Gallaecio · 2024-02-07T08:54:54Z

I cannot reproduce it either, but there should be no caching in CI, so I am quite puzzled here.

BurnzZ · 2024-02-07T08:56:15Z

One of the great mysteries of GH 😄 I presume it would resolve itself on master.

Gallaecio · 2024-02-07T08:57:12Z

Solved the mystery! Merge the main branch. (CI does not actually run in this branch, but in a virtual merge with the main one)

BurnzZ · 2024-02-07T09:03:17Z

No luck there. I'm tempted to just create another PR once new versions of scrapy-poet and scrapy-zyte-api have been released. It'd be cleaner altogether.

Gallaecio · 2024-02-07T09:22:17Z

I can actually reproduce the issue now with 2a4af89.

BurnzZ · 2024-02-07T10:08:12Z

Aha, I see the issue now. Thanks for the help @Gallaecio ! 🙌

This was referenced Jan 16, 2024

add new AnyResponse scrapinghub/web-poet#195

Merged

support AnyResponse scrapy-plugins/scrapy-zyte-api#161

Merged

BurnzZ changed the title ~~use HttpOrBrowserRespose~~ POC: use HttpOrBrowserRespose Jan 16, 2024

BurnzZ force-pushed the fix-dupe-requests branch from a4c67a6 to 29eb601 Compare January 16, 2024 10:42

BurnzZ commented Jan 16, 2024

BurnzZ force-pushed the fix-dupe-requests branch 2 times, most recently from 1279996 to 300869d Compare January 16, 2024 10:47

kmike reviewed Jan 17, 2024

BurnzZ mentioned this pull request Jan 18, 2024

create a weak_cache in Injector scrapinghub/scrapy-poet#184

Merged

wRAR changed the title ~~POC: use HttpOrBrowserRespose~~ POC: use HttpOrBrowserResponse Jan 23, 2024

BurnzZ changed the title ~~POC: use HttpOrBrowserResponse~~ POC: use AnyResponse Jan 31, 2024

BurnzZ added 7 commits January 31, 2024 20:41

use HttpOrBrowserResponse

a0e9991

use new AnyResponse instead of HttpOrBrowserResponse

802e4f4

move extractFrom from EcommerceSpider to BaseSpider

6203050

fix imports and casting

976340e

use new scrapy-poet with weak_cache

ab5aebd

fix tests and linters

63f408f

Merge branch 'main' of ssh://github.com/zytedata/zyte-spider-template…

e066993

…s into fix-dupe-requests

BurnzZ force-pushed the fix-dupe-requests branch from e3548ec to e066993 Compare February 2, 2024 10:58

BurnzZ changed the title ~~POC: use AnyResponse~~ support AnyResponse Feb 2, 2024

BurnzZ marked this pull request as ready for review February 2, 2024 11:00

return back extract_from=None by default

53b9dfd

Gallaecio added 2 commits February 7, 2024 09:52

Try running the docs with Python 3.11

0e57930

Revert "Try running the docs with Python 3.11"

0f4aeb6

This reverts commit 0e57930.

Merge branch 'main' of ssh://github.com/zytedata/zyte-spider-template…

2a4af89

…s into fix-dupe-requests

Fix the docs

3ba3a98

update deps: scrapy-zyte-api>=0.16.0 and scrapy-poet>=0.21.0

6a6f09a

BurnzZ requested review from wRAR, Gallaecio, kmike and PyExplorer February 8, 2024 12:13

Gallaecio approved these changes Feb 8, 2024

wRAR approved these changes Feb 8, 2024

PyExplorer approved these changes Feb 8, 2024

kmike merged commit 8718a5f into main Feb 9, 2024
10 checks passed

wRAR deleted the fix-dupe-requests branch February 9, 2024 09:05
