This repository has been archived by the owner on Sep 28, 2022. It is now read-only.

Spider closes on exception #5

Open

samos123 opened this issue May 18, 2013 · 4 comments

Comments

@samos123
Contributor

If an exception is raised in a parse method handling a WebdriverResponse/WebdriverRequest, the whole spider closes/exits and does not continue.

Steps to reproduce:
Raise an exception in any of your parse methods that handles a WebdriverResponse.

Current result:
Scrapy stops crawling.

Expected result:
Scrapy continues crawling the next requests/URLs.

When parsing a normal Scrapy Request/Response, raising an error seems to let the crawl continue. I only did some quick testing on this, so I may be wrong. This is a related error log:

2013-05-18 00:10:43+0800 [xxxxxx] ERROR: Spider error processing <GET http://item.xxxxx.com/>
        Traceback (most recent call last):
          File "/usr/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "/usr/lib/python2.7/site-packages/twisted/internet/task.py", line 607, in _tick
            taskObj._oneWorkUnit()
          File "/usr/lib/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
            result = next(self._iterator)
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
            work = (callable(elem, *args, **named) for elem in iterable)
        --- <exception caught here> ---
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
            yield it.next()
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line $8, in process_spider_output
            for x in result:
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 36, in process_spider_output
            for item_or_request in self._process_requests(result):
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy_webdriver/middlewares.py", line 51, in _process_requests
            for request in iter(items_or_requests):
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line $2, in <genexpr>
            return (_set_referer(r) for r in result or ())
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
            return (r for r in result or () if _filter(r))
          File "/home/samos/.virtualenvs/scrapy/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
            return (r for r in result or () if _filter(r))
          File "/home/samos/workspace/alex-scrapy/crawler/spiders/xxx_spider.py", line 50, in parse_item
            raise Exception("test")
        exceptions.Exception: test
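A plausible reading of this traceback: the spider middlewares chain generator expressions over the callback's output, and a Python generator that raises an exception is finished for good, so every result after the failing callback is lost. This is a minimal stdlib-only sketch of that behavior (no Scrapy involved; the names are illustrative):

```python
def parse_results():
    """Stands in for a spider callback that fails partway through."""
    yield "item-1"
    raise Exception("test")   # simulates the raise in parse_item
    yield "item-2"            # never reached

collected = []
error = None
it = parse_results()
try:
    for x in it:
        collected.append(x)
except Exception as e:
    error = e

# A generator that raised cannot be resumed: it is permanently exhausted,
# so iterating it again yields nothing.
print(collected)        # ['item-1']
print(next(it, None))   # None
```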
@stringertheory

I've been trying to figure this out, and I thought the issue might be that the lock on the webdriver instance was not getting released when there is an exception in the parse method (it is released in the process_spider_output method when the parse method succeeds). However, I tried adding a process_spider_exception method:

def process_spider_exception(self, response, exception, spider):
    # Release the webdriver lock for the failed request so the crawl can go on.
    if isinstance(response.request, WebdriverRequest):
        self.manager.release(response.request.url)
        return None

with no luck. The first exception is clearly getting logged by the handle_spider_error method in https://github.com/scrapy/scrapy/blob/master/scrapy/core/scraper.py, but I can't follow the scrapy source code through all of the callbacks/errbacks well enough to understand.
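The suspected failure mode can be sketched with a toy lock manager standing in for scrapy-webdriver's real one (all names here are illustrative, not the real API): if the release only happens after the callback's output has been fully consumed, an exception in the callback skips it and the lock leaks.

```python
class ToyWebdriverManager:
    """Illustrative stand-in for scrapy-webdriver's per-URL lock manager."""
    def __init__(self):
        self.locked = set()

    def acquire(self, url):
        self.locked.add(url)

    def release(self, url):
        self.locked.discard(url)

manager = ToyWebdriverManager()

def process_spider_output(url, parse):
    # Mirrors the suspected middleware flow: the lock is only released
    # after the parse output has been consumed without error.
    manager.acquire(url)
    result = list(parse(url))   # an exception here skips the release below
    manager.release(url)
    return result

def failing_parse(url):
    raise Exception("test")
    yield  # makes this a generator, like a Scrapy callback

error = None
try:
    process_spider_output("http://item.example.com/", failing_parse)
except Exception as e:
    error = e

# The URL is still locked, so a webdriver crawl would stall on it.
print(manager.locked)  # {'http://item.example.com/'}
```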

@ncadou
Collaborator

ncadou commented Jul 11, 2013

If you could submit a pull request with a failing test case, that'd be awesome.

@stringertheory

I'll add a test case as soon as I can figure out how to do it. My attempts to add one keep failing miserably with ReactorNotRestartable errors. Any suggestions for good resources for understanding Twisted?

@ncadou
Collaborator

ncadou commented Jul 13, 2013

Not that I know of. To make sense of Twisted, I looked at its documentation and at the Scrapy source code, and googled the specific problems I encountered. This blog post was also useful: http://jessenoller.com/blog/2009/02/11/twisted-hello-asynchronous-programming

tonal pushed a commit to tonal/scrapy-webdriver that referenced this issue Apr 14, 2017