This repository has been archived by the owner on Jun 10, 2024. It is now read-only.

Fix typos #977

Open · wants to merge 1 commit into base: master

4 changes: 2 additions & 2 deletions docs/Deployment-demo.pyspider.org.md
@@ -112,15 +112,15 @@ With the config, you can change the scale by `docker-compose scale phantomjs=2 p

#### load balance

-phantomjs-lb, fetcher-lb, webui-lb are automaticlly configed haproxy, allow any number of upstreams.
+phantomjs-lb, fetcher-lb, webui-lb are automatically configured haproxy, allow any number of upstreams.

#### phantomjs

phantomjs have memory leak issue, memory limit applied, and it's recommended to restart it every hour.

#### fetcher

-fetcher is implemented with aync IO, it supportes 100 concurrent connections. If the upstream queue are not choked, one fetcher should be enough.
+fetcher is implemented with aync IO, it supports 100 concurrent connections. If the upstream queue are not choked, one fetcher should be enough.
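
As a point of reference, here is a minimal sketch of how a tornado-based fetcher caps concurrency. `max_clients` is tornado's real knob and 100 mirrors the figure above, but treat this as an illustration rather than pyspider's exact setup:

```
from tornado.httpclient import AsyncHTTPClient

# Cap the shared HTTP client at 100 concurrent connections; requests
# beyond that queue internally instead of opening new sockets.
AsyncHTTPClient.configure(None, max_clients=100)
client = AsyncHTTPClient()
```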

#### processor

2 changes: 1 addition & 1 deletion docs/Frequently-Asked-Questions.md
@@ -56,4 +56,4 @@ You can have only have one scheduler, and multiple fetcher/processor/result_work

For example, the number between scheduler and fetcher indicate the queue size of scheduler to fetchers, when it's hitting 100 (default maximum queue size), fetcher might crashed, or you should considered adding more fetchers.

-The number `0+0` below fetcher indicate the queue size of new tasks and status packs between processors and schduler. You can put your mouse over the numbers to see the tips.
+The number `0+0` below fetcher indicate the queue size of new tasks and status packs between processors and scheduler. You can put your mouse over the numbers to see the tips.
2 changes: 1 addition & 1 deletion docs/Quickstart.md
@@ -18,7 +18,7 @@ to install binary packages first.

please install PhantomJS if needed: http://phantomjs.org/build.html

-note that PhantomJS will be enabled only if it is excutable in the `PATH` or in the System Environment
+note that PhantomJS will be enabled only if it is executable in the `PATH` or in the System Environment
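
A quick way to verify that condition yourself, sketched with the stdlib (`shutil.which` is illustrative here, not necessarily the check pyspider performs internally):

```
import shutil

# PhantomJS support is enabled only when the binary is discoverable on PATH
if shutil.which('phantomjs'):
    print('phantomjs found; the JavaScript fetcher can be enabled')
else:
    print('phantomjs missing; the JavaScript fetcher stays disabled')
```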

**Note:** `pyspider` command is running pyspider in `all` mode, which running components in threads or subprocesses. For production environment, please refer to [Deployment](Deployment).

4 changes: 2 additions & 2 deletions docs/tutorial/Render-with-PhantomJS.md
@@ -3,7 +3,7 @@ Level 3: Render with PhantomJS

Sometimes web page is too complex to find out the API request. It's time to meet the power of [PhantomJS].

-To use PhantomJS, you should have PhantomJS [installed](http://phantomjs.org/download.html). If you are running pyspider with `all` mode, PhantomJS is enabled if excutable in the `PATH`.
+To use PhantomJS, you should have PhantomJS [installed](http://phantomjs.org/download.html). If you are running pyspider with `all` mode, PhantomJS is enabled if executable in the `PATH`.

Make sure phantomjs is working by running it on the command line.

@@ -43,7 +43,7 @@ Running JavaScript on Page

We will try to scrape images from [http://www.pinterest.com/categories/popular/](http://www.pinterest.com/categories/popular/) in this section. Only 25 images is shown at the beginning, more images would be loaded when you scroll to the bottom of the page.

-To scrape images as many as posible we can use a [`js_script` parameter](/apis/self.crawl/#enable-javascript-fetcher-need-support-by-fetcher) to set some function wrapped JavaScript codes to simulate the scroll action:
+To scrape images as many as possible we can use a [`js_script` parameter](/apis/self.crawl/#enable-javascript-fetcher-need-support-by-fetcher) to set some function wrapped JavaScript codes to simulate the scroll action:

```
class Handler(BaseHandler):
```
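
For context, the handler in this hunk continues roughly as below. `fetch_type='js'` and `js_script` are documented `self.crawl` parameters, but the script body and `index_page` callback are a reconstruction, not the verbatim docs:

```
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://www.pinterest.com/categories/popular/',
                   fetch_type='js',
                   js_script='''
                   function() {
                       window.scrollTo(0, document.body.scrollHeight);
                   }''',
                   callback=self.index_page)

    def index_page(self, response):
        # collect the image URLs loaded after the simulated scroll
        return [img.attr.src for img in response.doc('img').items()]
```
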
2 changes: 1 addition & 1 deletion pyspider/database/__init__.py
@@ -185,7 +185,7 @@ def _connect_sqlalchemy(parsed, dbtype,url, other_scheme):


def _connect_elasticsearch(parsed, dbtype):
-# in python 2.6 url like "http://host/?query", query will not been splitted
+# in python 2.6 url like "http://host/?query", query will not been split
if parsed.path.startswith('/?'):
index = parse_qs(parsed.path[2:])
else:
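
To see why the `parsed.path.startswith('/?')` branch exists, a small demonstration of the difference the comment describes (illustrative values):

```
from six.moves.urllib.parse import urlparse, parse_qs

# Modern Pythons split the query off into .query ...
parsed = urlparse('http://host/?index=test')
print(parsed.path, parsed.query)   # '/' 'index=test'

# ... but Python 2.6 left it inside .path as '/?index=test',
# which is why the code falls back to parse_qs(parsed.path[2:])
print(parse_qs('index=test'))      # {'index': ['test']}
```
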
2 changes: 1 addition & 1 deletion pyspider/database/basedb.py
@@ -19,7 +19,7 @@ class BaseDB:
'''
BaseDB

-dbcur should be overwirte
+dbcur should be overwrite
'''
__tablename__ = None
placeholder = '%s'
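
The contract that docstring describes, sketched for a hypothetical subclass (the names here are illustrative, not an actual pyspider backend):

```
import sqlite3

class ExampleDB(BaseDB):          # BaseDB as defined above
    __tablename__ = 'example'
    placeholder = '?'             # sqlite parameters use ? instead of %s

    def __init__(self, path=':memory:'):
        self.conn = sqlite3.connect(path)

    @property
    def dbcur(self):
        # concrete backends override dbcur to hand BaseDB a live cursor
        return self.conn.cursor()
```
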
2 changes: 1 addition & 1 deletion pyspider/fetcher/tornado_fetcher.py
@@ -265,7 +265,7 @@ def pack_tornado_request_parameters(self, url, task):
_t = track_headers.get('etag')
if _t and 'If-None-Match' not in fetch['headers']:
fetch['headers']['If-None-Match'] = _t
-# last modifed
+# last modified
if task_fetch.get('last_modified', task_fetch.get('last_modifed', True)):
last_modified = task_fetch.get('last_modified', task_fetch.get('last_modifed', True))
_t = None
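
Both the etag branch above and the last-modified branch that follows implement standard HTTP revalidation; a compact, self-contained sketch of the pattern (all values hypothetical):

```
fetch = {'headers': {}}
etag = '"abc123"'                                # from a previous response
last_modified = 'Tue, 01 Jan 2019 00:00:00 GMT'  # likewise remembered

if etag and 'If-None-Match' not in fetch['headers']:
    fetch['headers']['If-None-Match'] = etag
if last_modified and 'If-Modified-Since' not in fetch['headers']:
    fetch['headers']['If-Modified-Since'] = last_modified
# a 304 Not Modified response then means the cached copy is still valid
```
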
4 changes: 2 additions & 2 deletions pyspider/libs/base_handler.py
@@ -440,7 +440,7 @@ def _on_cronjob(self, response, task):

# When triggered, a '_on_cronjob' task is sent from scheudler with 'tick' in
# Response.save. Scheduler may at least send the trigger task every GCD of the
-# inverval of the cronjobs. The method should check the tick for each cronjob
+# interval of the cronjobs. The method should check the tick for each cronjob
# function to confirm the execute interval.
for cronjob in self._cron_jobs:
if response.save['tick'] % cronjob.tick != 0:
@@ -449,7 +449,7 @@ def _on_cronjob(self, response, task):
self._run_func(function, response, task)

def _on_get_info(self, response, task):
"""Sending runtime infomation about this script."""
"""Sending runtime information about this script."""
for each in response.save or []:
if each == 'min_tick':
self.save[each] = self._min_tick
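
The scheduling argument in that comment, in miniature (the intervals are made up; the point is that one trigger stream at the GCD serves every cronjob):

```
import math
from functools import reduce

intervals = [60, 90]                    # hypothetical cronjob intervals
trigger = reduce(math.gcd, intervals)   # scheduler ticks every 30 seconds

for tick in range(trigger, 181, trigger):
    due = [i for i in intervals if tick % i == 0]
    print(tick, due)   # each job fires exactly on its own interval
```
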
2 changes: 1 addition & 1 deletion pyspider/libs/response.py
@@ -156,7 +156,7 @@ def etree(self):
except LookupError:
# lxml would raise LookupError when encoding not supported
# try fromstring without encoding instead.
-# on windows, unicode is not availabe as encoding for lxml
+# on windows, unicode is not available as encoding for lxml
self._elements = lxml.html.fromstring(self.content)
if isinstance(self._elements, lxml.etree._ElementTree):
self._elements = self._elements.getroot()
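
A self-contained sketch of the fallback this hunk documents ('no-such-encoding' stands in for an encoding lxml cannot look up, such as 'unicode' on Windows):

```
import lxml.html

content = b'<html><body><p>hi</p></body></html>'
try:
    parser = lxml.html.HTMLParser(encoding='no-such-encoding')
    elements = lxml.html.fromstring(content, parser=parser)
except LookupError:
    # lxml signals an unsupported encoding with LookupError;
    # retry without one and let lxml detect the encoding itself
    elements = lxml.html.fromstring(content)
print(elements.tag)   # 'html'
```
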
8 changes: 4 additions & 4 deletions pyspider/message_queue/rabbitmq.py
@@ -68,11 +68,11 @@ def __init__(self, name, amqp_url='amqp://guest:guest@localhost:5672/%2F',
amqp_url: https://www.rabbitmq.com/uri-spec.html
maxsize: an integer that sets the upperbound limit on the number of
items that can be placed in the queue.
-lazy_limit: as rabbitmq is shared between multipul instance, for a strict
+lazy_limit: as rabbitmq is shared between multiple instance, for a strict
limit on the number of items in the queue. PikaQueue have to
update current queue size before every put operation. When
`lazy_limit` is enabled, PikaQueue will check queue size every
-max_size / 10 put operation for better performace.
+max_size / 10 put operation for better performance.
"""
self.name = name
self.amqp_url = amqp_url
@@ -201,11 +201,11 @@ def __init__(self, name, amqp_url='amqp://guest:guest@localhost:5672/%2F',
amqp_url: https://www.rabbitmq.com/uri-spec.html
maxsize: an integer that sets the upperbound limit on the number of
items that can be placed in the queue.
-lazy_limit: as rabbitmq is shared between multipul instance, for a strict
+lazy_limit: as rabbitmq is shared between multiple instance, for a strict
limit on the number of items in the queue. PikaQueue have to
update current queue size before every put operation. When
`lazy_limit` is enabled, PikaQueue will check queue size every
-max_size / 10 put operation for better performace.
+max_size / 10 put operation for better performance.
"""
self.name = name
self.amqp_url = amqp_url
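
The lazy-limit strategy from those docstrings, reduced to its core (a sketch; `remote_qsize` stands in for the broker round-trip, e.g. a passive `queue_declare`):

```
class LazyLimitedPutter(object):
    def __init__(self, maxsize=100):
        self.maxsize = maxsize
        self.qsize_diff = 0                         # puts since last sync
        self.qsize_diff_limit = int(maxsize * 0.1)  # re-check every maxsize/10

    def remote_qsize(self):
        return 0   # hypothetical broker call to fetch the real queue depth

    def put(self, item):
        self.qsize_diff += 1
        if self.qsize_diff > self.qsize_diff_limit:
            # only now pay for a round-trip to learn the true queue size
            if self.remote_qsize() + self.qsize_diff >= self.maxsize:
                raise RuntimeError('queue full')
            self.qsize_diff = 0
        # ... publish item to the broker here
```
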
6 changes: 3 additions & 3 deletions pyspider/processor/processor.py
@@ -21,7 +21,7 @@


class ProcessorResult(object):
"""The result and logs producted by a callback"""
"""The result and logs produced by a callback"""

def __init__(self, result=None, follows=(), messages=(),
logs=(), exception=None, extinfo=None, save=None):
@@ -45,7 +45,7 @@ def logstr(self):
"""handler the log records to formatted string"""

result = []
-formater = LogFormatter(color=False)
+formatter = LogFormatter(color=False)
for record in self.logs:
if isinstance(record, six.string_types):
result.append(pretty_unicode(record))
@@ -54,7 +54,7 @@ def logstr(self):
a, b, tb = record.exc_info
tb = hide_me(tb, globals())
record.exc_info = a, b, tb
-result.append(pretty_unicode(formater.format(record)))
+result.append(pretty_unicode(formatter.format(record)))
result.append(u'\n')
return u''.join(result)

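
The pattern `logstr()` implements, with the stdlib's `logging.Formatter` standing in for pyspider's `LogFormatter` (a hedged sketch):

```
import logging

formatter = logging.Formatter('[%(levelname)s] %(message)s')
records = [
    logging.LogRecord('demo', logging.INFO, __file__, 1,
                      'processed %d items', (3,), None),
]
print(u'\n'.join(formatter.format(r) for r in records))
```
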
4 changes: 2 additions & 2 deletions pyspider/result/result_worker.py
@@ -38,7 +38,7 @@ def on_result(self, task, result):
result=result
)
else:
-logger.warning('result UNKNOW -> %.30r' % result)
+logger.warning('result UNKNOWN -> %.30r' % result)
return

def quit(self):
@@ -83,5 +83,5 @@ def on_result(self, task, result):
'updatetime': time.time()
}))
else:
-logger.warning('result UNKNOW -> %.30r' % result)
+logger.warning('result UNKNOWN -> %.30r' % result)
return
4 changes: 2 additions & 2 deletions pyspider/scheduler/scheduler.py
@@ -245,7 +245,7 @@ def _update_project(self, project):
},
})

-# load task queue when project is running and delete task_queue when project is stoped
+# load task queue when project is running and delete task_queue when project is stopped
if project.active:
if not project.task_loaded:
self._load_tasks(project)
@@ -989,7 +989,7 @@ def on_task_failed(self, task):

def on_select_task(self, task):
'''Called when a task is selected to fetch & process'''
-# inject informations about project
+# inject information about project
logger.info('select %(project)s:%(taskid)s %(url)s', task)

project_info = self.projects.get(task['project'])
2 changes: 1 addition & 1 deletion tests/test_message_queue.py
@@ -73,7 +73,7 @@ def setUpClass(self):
self.q3 = connect_message_queue('test_queue_for_threading_test')


-#@unittest.skipIf(six.PY3, 'pika not suport python 3')
+#@unittest.skipIf(six.PY3, 'pika not support python 3')
@unittest.skipIf(os.environ.get('IGNORE_RABBITMQ') or os.environ.get('IGNORE_ALL'), 'no rabbitmq server for test.')
class TestPikaRabbitMQ(TestMessageQueue, unittest.TestCase):
