The main idea of py-robot is to simplify the code of web crawlers and improve their performance.
pip install ciag-robot
Below is a simple example of a crawler that needs to get a page and then, for each item on it, get another page. Because it was written without async requests, it makes one request at a time and starts the next one only when the previous one has finished.
# examples/
import requests
import json
from lxml import html
from pyquery.pyquery import PyQuery as pq
page = requests.get('')
dom = pq(html.fromstring(page.content.decode()))
result = []
for link in dom.find('.theiaStickySidebar ul li'):
    news = {
        'category': pq(link).find('span').text(),
        'url': pq(link).find('a[href]').attr('href'),
    }
    # Fetch the linked page; this blocks until the response arrives.
    news_page = requests.get(news['url'])
    news_dom = pq(news_page.content.decode())
    news['body'] = news_dom.find('p').text()
    news['title'] = news_dom.find('').text()
    result.append(news)
print(json.dumps(result, indent=4))
We can rewrite it using py-robot, and it will look like this:
# examples/
import json
from robot import Robot
from robot.collector.shortcut import *
import logging
collector = pipe(
    css('.theiaStickySidebar ul li'),
    css('a[href]'), attr('href'), any(),
    body=pipe(css('p'), as_text()),
    title=pipe(css(''), as_text()),
    category=pipe(css('span'), as_text()),
    url=pipe(css('a[href]'), attr('href'), any(), url())
)
with Robot() as robot:
    result = robot.sync_run(collector)
print(json.dumps(result, indent=4))
Now all the requests are async: the requests for all items start at the same time instead of one after another, which improves the performance of the crawler.
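To illustrate why starting the requests together helps, here is a minimal sketch using only the standard library's asyncio. It is not how py-robot is implemented. `fake_fetch`, `crawl_sequential`, `crawl_concurrent`, `timed` and the `DELAY` value are hypothetical names introduced for this example, and the network request is simulated with a sleep so the script runs offline.

```python
import asyncio
import time

DELAY = 0.05  # simulated latency per request, in seconds (assumed value)


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP request: waits, then returns a fake body."""
    await asyncio.sleep(DELAY)
    return f"body of {url}"


async def crawl_sequential(urls):
    # Each request starts only after the previous one has finished.
    return [await fake_fetch(u) for u in urls]


async def crawl_concurrent(urls):
    # All requests start at once; gather returns the results in input order.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def timed(coro_fn, urls):
    """Run one of the crawl functions and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(urls))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"page-{i}" for i in range(10)]
    _, seq = timed(crawl_sequential, urls)
    _, con = timed(crawl_concurrent, urls)
    print(f"sequential: {seq:.2f}s, concurrent: {con:.2f}s")
```

With these assumptions, the sequential version takes roughly the number of pages times the delay, while the concurrent version takes roughly one delay, because the waits overlap.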