-
Notifications
You must be signed in to change notification settings - Fork 454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow spider attr #15
base: master
Are you sure you want to change the base?
Changes from 5 commits
39740cb
9efd145
652fd6e
2e7407d
bed8998
a9b5323
586ab58
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,6 +6,7 @@ | |
from scrapy.exceptions import NotConfigured | ||
|
||
from scrapy import log | ||
from scrapy.http.response.html import HtmlResponse | ||
from scrapy.http.headers import Headers | ||
|
||
|
||
|
@@ -32,6 +33,14 @@ def __init__(self, crawler, splash_base_url, slot_policy): | |
self.splash_base_url = splash_base_url | ||
self.slot_policy = slot_policy | ||
|
||
def get_splash_options(self, request, spider): | ||
if request.meta.get("dont_proxy"): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. after sleeping on it I think it's bad idea to reuse 'dont_proxy' key here, because it's unexpected for developers who are using splash and crawlera middleware together in the project - they would expect it to work only with crawlera mw, so they can enable crawlera for spider (for all requests) and for some specific requests they might want to disable crawlera using 'dont_proxy' and add 'splash' arguments. I think explicit 'dont_splash' would be better here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I'm not sure. But it seems like this is just a question of terminology, "dont_proxy" to my ears sounds like general signal - 'dont use any kind of proxy mechanism for this request' so I thought we could reuse it here. On the other hand "splash" here is not used as proxy so it may be confusing, also |
||
return | ||
|
||
spider_options = getattr(spider, "splash", {}) | ||
request_options = request.meta.get("splash") | ||
return request_options or spider_options | ||
|
||
@classmethod | ||
def from_crawler(cls, crawler): | ||
splash_base_url = crawler.settings.get('SPLASH_URL', cls.default_splash_url) | ||
|
@@ -43,24 +52,26 @@ def from_crawler(cls, crawler): | |
return cls(crawler, splash_base_url, slot_policy) | ||
|
||
def process_request(self, request, spider): | ||
splash_options = request.meta.get('splash') | ||
splash_options = self.get_splash_options(request, spider) | ||
if not splash_options: | ||
return | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @pawelmhm why you changed this? this key replacement looked good enough There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is related to https://github.com/scrapinghub/scrapyjs/pull/15/files#r28138976 (avoiding generating multiple splash requests). If splash options are in request meta as content of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you added check
but after your explanation I think the way it's implemented in PR is better, because someone might expect to have 'splash' key in response.meta, and with current implementation it's missed. @kmike are you okay with that? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It turns out I applied similar changes here: 95f0079. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But maybe your approach is better |
||
|
||
elif request.meta.get("_splash_processed"): | ||
return | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is this for retries? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is because if you return new request from middleware it will go through all middleware chain again, so we need to disable infinite loop from occurring There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. got it. note - I think it would be a bit safer to use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sounds good |
||
|
||
if request.method != 'GET': | ||
log.msg("Currently only GET requests are supported by SplashMiddleware; %s " | ||
"will be handled without Splash" % request, logging.WARNING) | ||
return request | ||
|
||
meta = request.meta | ||
del meta['splash'] | ||
meta['_splash_processed'] = splash_options | ||
|
||
slot_policy = splash_options.get('slot_policy', self.slot_policy) | ||
self._set_download_slot(request, meta, slot_policy) | ||
|
||
args = splash_options.setdefault('args', {}) | ||
args.setdefault('url', request.url) | ||
args['url'] = request.url | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @kmike are you okay with this change. What was usecase of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe I was thinking of allowing user not to use 'url' argument at all. This argument is not required for Splash scripts - e.g. you can render a chunk of HTML using splash:set_content. But the meta syntax doesn't make much sense in this case; something in lines of #12 could be a better fit. |
||
|
||
body = json.dumps(args, ensure_ascii=False) | ||
|
||
if 'timeout' in args: | ||
|
@@ -86,6 +97,7 @@ def process_request(self, request, spider): | |
endpoint = splash_options.setdefault('endpoint', self.default_endpoint) | ||
splash_base_url = splash_options.get('splash_url', self.splash_base_url) | ||
splash_url = urljoin(splash_base_url, endpoint) | ||
meta['_splash_processed'] = True | ||
|
||
req_rep = request.replace( | ||
url=splash_url, | ||
|
@@ -96,18 +108,16 @@ def process_request(self, request, spider): | |
# are not respected. | ||
headers=Headers({'Content-Type': 'application/json'}), | ||
) | ||
|
||
self.crawler.stats.inc_value('splash/%s/request_count' % endpoint) | ||
return req_rep | ||
|
||
def process_response(self, request, response, spider): | ||
splash_options = request.meta.get("_splash_processed") | ||
splash_options = self.get_splash_options(request, spider) | ||
if splash_options: | ||
endpoint = splash_options['endpoint'] | ||
self.crawler.stats.inc_value( | ||
'splash/%s/response_count/%s' % (endpoint, response.status) | ||
) | ||
|
||
return response | ||
|
||
def _set_download_slot(self, request, meta, slot_policy): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like unused import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, this is fixed now