scrapyd-docker

A Scrapyd Docker image that:

  • loads spiders (HTTP crawlers) from local or git Scrapy projects
  • lets you customize spider scheduling with cron syntax
  • exposes the files produced by your exporters over HTTP

Installation

  • Install Docker

Usage

Here's a short explanation of how to use scrapyd-docker:

  • In your Scrapy project's settings.py:

    • define EXPORT_PATH (e.g. EXPORT_PATH = "/tmp/export")
    • configure your exporters to write their files into the EXPORT_PATH folder, reading it at runtime with settings.get('EXPORT_PATH') (see the sketch below)
    • when the project is deployed to scrapyd, the scheduler overrides this setting to make sure all exported files are stored in the correct HTTP folder
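
    A minimal sketch of this wiring is shown below. The pipeline class, module paths, and file names are illustrative assumptions, not something shipped with scrapyd-docker; reading EXPORT_PATH at runtime is what lets the scheduler's override take effect:

      # settings.py (in your Scrapy project)
      EXPORT_PATH = "/tmp/export"   # overridden by the scrapyd-docker scheduler at deploy time
      ITEM_PIPELINES = {"myproject.pipelines.ExportFilePipeline": 300}

      # myproject/pipelines.py (hypothetical pipeline writing JSON lines under EXPORT_PATH)
      import json
      import os

      class ExportFilePipeline:

          def __init__(self, export_path):
              self.export_path = export_path

          @classmethod
          def from_crawler(cls, crawler):
              # Read the setting at runtime so an override at schedule time is honoured
              return cls(crawler.settings.get("EXPORT_PATH", "/tmp/export"))

          def open_spider(self, spider):
              os.makedirs(self.export_path, exist_ok=True)
              self.file = open(os.path.join(self.export_path, f"{spider.name}.jl"), "w")

          def close_spider(self, spider):
              self.file.close()

          def process_item(self, item, spider):
              # Works for dict-like items; adapt if you use Item subclasses or dataclasses
              self.file.write(json.dumps(dict(item)) + "\n")
              return item
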
  • Print the help:

      docker run --rm <gp3t1/scrapyd[:tag]> help
    
  • Run or create the container with the following command:

      docker <create|run [--rm] [-d] ...>
          -v </host/path/to/scrapyd/config>:/etc/scrapyd/
          -v </host/path/to/scrapyd/data>:/var/lib/scrapyd/
          -v </host/path/to/logs>:/var/log/supervisor/
          -e SCRAPYD_USER=<scrapyd_user>
          -p <http_port>:8000
         [-p <scrapyd_api_port>:6800]
          --name <my-scrapyd>
          <gp3t1/scrapyd[:tag]>
          <scrapyd|help>
    
    • <create|run [--rm] [-d] ...>: see docker run reference
    • </host/path/to/scrapyd/config>: your host path mapped to the scrapyd config volume
    • </host/path/to/scrapyd/data>: your host path mapped to the scrapyd data volume (populated with scrapyd logs, scrapy projects, and exported files)
    • </host/path/to/logs>: your host path mapped to the supervisor logs volume
    • <scrapyd_user>: the user scrapyd runs as (default: scrapyd)
    • <http_port>: an available TCP port on your host, mapped to the HTTP server that serves the exported files
    • <scrapyd_api_port>: an available TCP port on your host, mapped to the scrapyd API (exposing it is not recommended; if you do, see the status-check sketch below)
    • --name <my-scrapyd>: name of the container to build
    • <gp3t1/scrapyd[:tag]>: image name (and tag) to build from
    • <scrapyd|help>: the container command. scrapyd launches scrapyd, the scheduler, and the HTTP server; help prints the help
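
    If you did publish the scrapyd API port, a quick sanity check from the host is scrapyd's daemonstatus.json endpoint. This is plain scrapyd behaviour, not something specific to this image; the host/port below assume -p 6800:6800:

      # check_scrapyd.py -- quick status check against the published scrapyd API
      import requests

      resp = requests.get("http://localhost:6800/daemonstatus.json", timeout=5)
      resp.raise_for_status()
      status = resp.json()
      # daemonstatus.json reports e.g. status, running, pending and finished job counts
      print(f"scrapyd is {status['status']}, {status['running']} job(s) running")
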
  • Configure your spiders

    • The spiders configuration file lives at scrapyd data volume/spiders/spiders.json (create it before running the container, or modify it afterwards). Format of the file:

        {
        	"spiders": {
        		"<your-spider-id/name-from-git>":{
        			"giturl":"https://path/to/your/scrapy-repo.git",
        			"cron":"0 * * * *",
        			"custom_settings":{},
        			"custom_args":{}
        		},
        		"<another-spider-id/name-from-local>":{
        			"cron":"0 8 * * 0,6",
        			"custom_settings":{},
        			"custom_args":{}
        		}
        	}
        }
      
      • <your-spider-id/name>: must be a valid Linux folder name. It is used to store/read your spider in scrapyd data volume/spiders/<your-spider-id/name>
      • giturl: omit it if you want scrapyd to load your spider from scrapyd data volume/spiders/<your-spider-id/name> instead of cloning it from git
      • cron: standard cron syntax used to schedule your spider (see crontab.guru for help)
      • custom_settings: passed to scrapyd to override project settings when the job is scheduled
      • custom_args: passed to scrapyd as spider arguments when the job is scheduled (see the sketch after this list)
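
      Conceptually, custom_settings and custom_args end up as extra parameters on scrapyd's schedule.json call when a cron slot fires. How exactly the bundled scheduler builds that request is an assumption here, but the equivalent manual call (with placeholder project, spider, setting, and argument names) would look roughly like this:

        import requests

        # scrapyd's schedule.json API: repeated "setting" fields override project
        # settings; any other extra keys are passed to the spider as arguments.
        requests.post(
            "http://localhost:6800/schedule.json",
            data={
                "project": "my-spider-id",        # placeholder project name
                "spider": "my-spider",            # placeholder spider name
                "setting": ["DOWNLOAD_DELAY=2"],  # one entry per custom_settings key
                "category": "books",              # example custom_args entry
            },
        )
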
  • Administration

    • use docker <start|stop|restart|kill|logs> <my-scrapyd> (see the docker command line reference)
    • the scrapyd configuration can be found in the scrapyd config volume (for stability, some settings, like URLs and paths, are forced during startup)
    • logs can be found in scrapyd data volume/logs and in the supervisor logs volume
    • the HTTP folder containing the exported files can be found in scrapyd data volume/spiders
    • use docker exec -ti <my-scrapyd> /bin/sh to get a terminal in your running container
    • to reload the spiders configuration, you can:
      • restart the container
      • restart the applications of the container with docker kill -s SIGHUP <my-scrapyd>
      • inside the container, execute supervisorctl start initSpiders
    • if you need to schedule a job manually, run su - <scrapyd_user> -c "runp /usr/local/bin/spiders.py crawl:<spider-id/name>" from inside the container

Contributing

  1. Fork it
  2. Create your feature branch: git checkout -b feature/my-new-feature
  3. Commit your changes: git commit -am 'Add some feature'
  4. Push to the branch: git push origin feature/my-new-feature
  5. Submit a pull request

Requirements / Dependencies

Version

0.1.0

License

Copyright (C)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
