- The crawl may fail partway through; when it is started again, how do we prevent already-crawled URLs from being crawled again?
- Could be done with a database -> store each crawled URL, compare new URLs against the stored ones, add only the unseen ones (see the sketch below).
- Or use the job directory (JOBDIR) persistence provided by Scrapy.
- Either way the spider needs some time to shut down gracefully so the crawl state is saved properly.
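A minimal sketch of the database idea (store, compare, add), assuming a local SQLite file and that URLs are compared as exact strings; the `SeenUrls` helper and file names are hypothetical, not part of the current code. Scrapy's own resume support is simpler to try first: run with `scrapy crawl <spider> -s JOBDIR=crawls/run-1`, which persists the scheduler queue and the seen-request fingerprints across restarts as long as the spider is stopped gracefully.

```python
# seen_urls.py -- hypothetical helper, not part of the existing project.
# Persists crawled URLs in SQLite so a restarted crawl can skip them.
import sqlite3


class SeenUrls:
    def __init__(self, path="seen_urls.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)"
        )
        self.conn.commit()

    def is_seen(self, url: str) -> bool:
        # compare: has this URL been crawled before?
        cur = self.conn.execute("SELECT 1 FROM seen WHERE url = ?", (url,))
        return cur.fetchone() is not None

    def mark_seen(self, url: str) -> None:
        # add: record the URL once it has been crawled
        self.conn.execute(
            "INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,)
        )
        self.conn.commit()


if __name__ == "__main__":
    seen = SeenUrls()
    url = "https://example.com/page/1"
    if not seen.is_seen(url):
        # ... crawl the page here ...
        seen.mark_seen(url)
```

In the spider this check would typically happen before yielding each Request, with the URL marked as seen in the parse callback after a successful response.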
- Note: the current structure for traversing through pages may have some inconveniences! Consider revising it.
- We get blocked if crawling too fast, but not with an obvious 404/403 response; it is unclear whether the block is keyed on IP, cookies, or something else. How to deal with that? (Some throttling settings to try are sketched below.)
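Since we don't yet know what the block keys on, Scrapy's built-in throttling settings are probably the first thing to experiment with. The values below are placeholders to tune, not recommendations; disabling cookies and swapping the user agent are only there to help diagnose whether the block is session- or UA-based.

```python
# settings.py (excerpt) -- a sketch of anti-blocking settings to experiment with;
# the specific numbers are guesses, not tuned values.

# Slow down and limit concurrency per domain.
DOWNLOAD_DELAY = 2.0
RANDOMIZE_DOWNLOAD_DELAY = True        # jitter the delay between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let Scrapy adapt the delay to the server's response latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# If the block might be cookie-based, try crawling without cookies.
COOKIES_ENABLED = False

# Retry soft failures a few times before giving up.
RETRY_ENABLED = True
RETRY_TIMES = 3

# A browser-like user agent, in case the default Scrapy UA is being filtered.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
```

If the block persists even at low request rates and with cookies disabled, it is more likely IP-based, and rotating proxies would be the next thing to look at.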