
Judgement Document Spider

Table of Contents

  • Installation
  • Config
  • Execution
  • Update
  • Commit
  • TODO List
  • Reference

Installation

Git

Use git clone git@github.com:Thesharing/court-spider.git --recursive to clone this repository, including its submodules, locally.

Python

For Linux or Windows:

Requires Python >= 3.5; the latest Anaconda 3 distribution is recommended.

conda install certifi chardet idna numpy requests six urllib3 werkzeug Flask lxml python-dateutil beautifulsoup4
pip install opencv-python PyExecJS pymongo redis

* If you are using plain Python (without Anaconda), install all of the packages via pip instead of conda.

* You can use virtualenv or the conda create command to create an isolated environment, as sketched below.
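For example, a minimal sketch of creating and activating an isolated conda environment (the environment name court-spider is only an illustration; on older conda versions use source activate instead of conda activate):

conda create -n court-spider python=3.6
conda activate court-spider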

NOTE

For Windows:

Edit site-packages/execjs/_external_runtime.py and replace the _exec_with_pipe function as follows:

Original _exec_with_pipe :

def _exec_with_pipe(self, source):
    cmd = self._runtime._binary()
    p = None
    try:
        p = Popen(cmd, stdin=PIPE, stdout=PIPE, stderr=PIPE, cwd=self._cwd, universal_newlines=True)
        input = self._compile(source)
        if six.PY2:
            input = input.encode(sys.getfilesystemencoding())
        stdoutdata, stderrdata = p.communicate(input=input)
        ret = p.wait()
    finally:
        del p

    self._fail_on_non_zero_status(ret, stdoutdata, stderrdata)
    return stdoutdata

Edited _exec_with_pipe :

def _exec_with_pipe(self, source):
    cmd = self._runtime._binary()

    p = None
    try:
        p = Popen(cmd, stdin=PIPE, stdout=PIPE, stderr=PIPE, cwd=self._cwd, universal_newlines=False)
        input = self._compile(source).encode(sys.getfilesystemencoding())
        stdoutdata, stderrdata = p.communicate(input=input)
        ret = p.wait()
    finally:
        del p

    self._fail_on_non_zero_status(ret, stdoutdata, stderrdata)
    return stdoutdata.decode(sys.getdefaultencoding())

With this change, data is passed through the pipe as bytes and encoded/decoded explicitly, so the execjs library works fine on Windows, where the default text-mode pipe encoding can otherwise cause errors with non-ASCII content.

* Alternatively, install the patched fork with pip install git+https://github.com/Thesharing/PyExecJS.git instead of the original PyExecJS.

Redis

For Windows:

Use Bash on Ubuntu on Windows (WSL) if you are using Windows 10.

Use VMware or VirtualBox if you are using Windows 8 or 7. In VMware you need to configure port forwarding for the NAT network so that Redis can be accessed from Windows.

* Only Redis needs to run in Bash; the other components can still run in Windows.

* Alternatively, you can install a pre-compiled Windows build of Redis; refer to this page: Redis Install | Runoob

For Linux, or Bash on Ubuntu on Windows:

wget http://download.redis.io/releases/redis-4.0.11.tar.gz
tar xzf redis-4.0.11.tar.gz
rm redis-4.0.11.tar.gz
mv redis-4.0.11 redis
cd redis
make
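After make finishes, you can optionally verify the build (a quick check; --daemonize yes starts the server in the background):

src/redis-server --daemonize yes
src/redis-cli ping    # should reply PONG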

Reference to install Redis

MongoDB

For Linux:

Refer to the official installation document.

For Windows:

Download the executable installer, run it, and follow the installer's guide. After the installation finishes, MongoDB starts automatically.

Also refer to the official installation document.

* You can run MongoDB Compass to check that MongoDB is ready, or check it from Python as sketched below.
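A minimal Python sketch to confirm that MongoDB is reachable on the default port (uses pymongo from the installation step; host and port match the example config):

from pymongo import MongoClient

# Connect to the local MongoDB instance and print its server version.
client = MongoClient('localhost', 27017, serverSelectionTimeoutMS=2000)
print(client.server_info()['version'])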

NodeJS

For Linux:

curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -
sudo apt-get install -y nodejs
nodejs --version

For Windows:

Download the latest LTS version of the executable installer and run it.

Reference to install NodeJS

Create New Folders

For Linux or Windows:

cd spider
mkdir content data download log temp

Config

Refer to Thesharing/proxy_pool to modify the Redis configuration used by the proxy component.

Before the first run, copy config.example.json to config.json and modify the settings.

Customize the spider in config.json:

{
  "start": {                          // Start date, district and court of the spider
    "date": "2017-05-11",
    "district": "北京市",
    "court": null
  },
  "search": {                         // Search conditions; these are turned into the text of the search query
    "keyword": "*",                   // Keyword
    "type": null,                     // Case type
    "reason": {                       // Cause of action
      "value": "知识产权与竞争纠纷",     // Name of the cause of action
      "level": 2                      // Level of the cause of action, e.g. a second-level cause
    },
    "court": {                        // Court
      "value": null,                  // Court name
      "level": 0,                     // Court level, e.g. intermediate people's court
      "indicator": false              // Whether to search courts of the current level or of the level below
    },
    "district": null                  // District
  },
  "condition": {                      // Search conditions added to the query as-is
    "法院层级": null,                   // Court level
    "案件类型": "民事案件",              // Case type
    "审判程序": null,                   // Trial procedure
    "文书类型": null                    // Document type
  },
  "config": {                         // Spider settings
    "max_retry": 10,                  // Maximum number of retries when crawling fails
    "proxy": true,                    // Whether to crawl through a proxy
    "timeout": 60                     // Crawling timeout
  },
  "database": {                       // Database settings
    "redis": {                        // Redis
      "host": "localhost",            // Host
      "port": 6379                    // Port
    },
    "mongodb": {                      // MongoDB
      "host": "localhost",            // Host
      "port": 27017,                  // Port
      "database": "spider"            // Database name
    }
  },
  "log": {                            // Log settings
    "level": "INFO"                   // Log level: one of
                                      // DEBUG, INFO, WARNING, ERROR, CRITICAL
  },
  "multiprocess": {                   // Multi-process mode settings
    "total": 8,                       // Total number of processes
    "spider": 0,                      // Number of spiders
    "downloader": 8,                  // Number of downloaders
    "notifier": 0                     // Number of notifiers
  },
  "notifier": {                       // Notifier settings
    "type": "wechat",                 // Type
    "period": 10,                     // Monitoring period
    "wechat": {                       // WeChat mode settings
      "receiver": "filehelper",       // WeChat name of the receiver
      "cmd": false                    // Show the QR code in the terminal or save it as an image
    },
    "email": {                        // Email mode settings
      "sender": "aaa@bbb.ccc",        // Sender email address
      "password": "aaabbbccc",        // Password of the sender email account
      "server_addr": "smtp.bbb.ccc",  // SMTP address of the sender email server
      "receiver": "ddd@eee.fff",      // Email address that receives the notifications
      "ssl": false                    // Whether to connect via SSL
    }
  }
}

If you want to resume from a breakpoint, set the start section to the point where the spider stopped, like this:

{
  "start": {
    "date": "2017-05-11",
    "district": "北京市",
    "court": null
  },
  "...": "..."
}

Execution

Instructions

First, run Redis:

Use Bash on Ubuntu on Windows (WSL) if you are using Windows 10.

(In the Redis folder)

src/redis-server

Second, run MongoDB:

For Windows:

If you installed MongoDB as a service, you can start it from services.msc. Usually it starts automatically.

For Linux:

sudo service mongod start

Third, run the proxy.

(In the court-spider/proxy/Run folder)

python main.py

Access http://127.0.0.1:5010/get_status to check the status of the proxy pool.

raw_proxy is the number of unvalidated proxies, and useful_proxy is the number of validated proxies.

You can change the website used to validate the proxies in Util/utilFunction.py.

Make sure useful_proxy is greater than 0, so that the spider can run with available proxies.
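You can also check the status endpoint from a script (a minimal sketch using requests; the exact response format depends on the proxy pool version):

import requests

# Query the status endpoint exposed by the proxy pool started from proxy/Run/main.py.
resp = requests.get('http://127.0.0.1:5010/get_status')
print(resp.text)  # counts of raw_proxy and useful_proxy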

Finally, run the spider with the functions you need, as described below.

(In the court-spider/spider folder)

Functions

Spider

To run the spider, run the command below:

python main.py spider --date

Downloader

To run the downloader, run the command below:

python main.py downloader --download

Before you run the downloader, you may need to extract all DocIDs from the local files stored in ./temp into Redis. If you want to download documents whose DocIDs are stored in ./data, move the files from ./data to ./temp manually, and move them back to ./data after they have been read. Then run the command below:

python main.py downloader --read

Specify config file

If you are running two or more tasks at the same time, you can specify a different config file for each, like this:

python main.py --config config.json
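For example, a hypothetical setup with one config file per task (the file names config-spider.json and config-downloader.json are only illustrations, and the exact argument order may depend on the CLI):

python main.py spider --date --config config-spider.json
python main.py downloader --download --config config-downloader.json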

Multi-Process Mode

To enable multi-process mode, specify the multi-process config in config.json like this:

{
  "...": "...",
  "multiprocess": {
    "total": 8,
    "spider": 0,
    "downloader": 8,
    "notifier": 0
  },
  "...": "..."
}

total is the total number of processes used to run spiders and downloaders; it should equal the number of logical cores of the CPU on your machine (you can see this number in Task Manager on Windows, or run grep -c ^processor /proc/cpuinfo on Linux).
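A portable alternative is to ask Python for the core count (standard library only):

python -c "import multiprocessing; print(multiprocessing.cpu_count())"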

spider, downloader and notifier are the numbers of spider, downloader and notifier processes, respectively (at present the multi-process spider is not supported).

Then run it without any of the spider, downloader and notifier arguments, like this:

python main.py

Argument Lists

Optional arguments:

Argument name            Short name           Explanation
spider --district        s --district         Start a spider to crawl data by district
downloader --clean       d --clean            Delete all the data in Redis
downloader --read        d --read             Read content from doc-id files
downloader --download    d --download         Start a downloader
notifier                 n                    Start a notifier
--config [config.json]   -c [config.json]     Specify the filename of config
--help                   -h                   Show the help message

spider, downloader and notifier should not be used at the same time.

Multi-process mode is enabled when none of spider, downloader and notifier is used.

Update

Run the commands below in the root folder of this project to fetch the newest updates.

git pull
git submodule update --recursive --remote

To sync commits with upstream repos, refer to this doc: Syncing a fork | GitHub.

Commit

If you want to commit changes to this repo, please first push the changes in the submodule, then go back to the root folder of this project and run the commands below.

git submodule update --remote
git add . 
git commit -m '<Commit Message>'
git push

TODO List

  • Downloader

  • Multi-process support for Document Downloader

  • Notifier

  • Extractor [ONGOING]

    • Dividing points of dispute (争议焦点划分)

    • Linking related trial procedures (审判程序之间的关联)

    • Splitting litigation participant information (诉讼参与人信息拆分)

    • Segmenting full text paragraphs (整段文段的划分)

  • Crawl by date

  • Multi-process support for Task Distributor and Content List Downloader

Reference

sixs/wenshu_spider

jhao104/proxy_pool

doloopwhile/PyExecJS


_________                   _____     ________       ______________            
__  ____/_________  __________  /_    __  ___/__________(_)_____  /____________
_  /    _  __ \  / / /_  ___/  __/    _____ \___  __ \_  /_  __  /_  _ \_  ___/
/ /___  / /_/ / /_/ /_  /   / /_      ____/ /__  /_/ /  / / /_/ / /  __/  /    
\____/  \____/\__,_/ /_/    \__/      /____/ _  .___//_/  \__,_/  \___//_/     
                                             /_/