capimichi/crawler1


Crawler

This is an easy-to-use, fast, configurable crawler.

Introduction

It is based on two main objects:

  • Archive (the page containing the list of products and the pagination links)
  • Single (the page containing a single product, with, for example, a title, an image and any other fields you need)

Installation

composer require capimichi/crawler

Getting started

The crawler is quick to set up. All you need to do is include the autoload.php file:

include_once '/path/to/src/autoload.php';

then you can create your crawler like this:

use Crawler\CrawlerBuilder;
use Crawler\Selector;
use Crawler\SelectorTypes;
use Crawler\Single\Fields\FieldBuilder;
use Crawler\Single\Fields\FieldTypes;

$crawlerBuilder = (new CrawlerBuilder())->addStartingUrl("http://example-archive.com/products")
            ->addItemSelector(new Selector(SelectorTypes::CLASSNAME, "product-item-info"))
            ->addItemSelector(new Selector(SelectorTypes::CLASSNAME, "product"))
            ->addNextpageSelector(new Selector(SelectorTypes::CLASSNAME, "pages-items"))
            ->addNextpageSelector(new Selector(SelectorTypes::CLASSNAME, "next"));
            
$field = (new FieldBuilder(FieldTypes::STRING))->setName("title")
            ->setMultiple(false)
            ->addSelector(new Selector(SelectorTypes::TAGNAME, "h1"))->build();
$crawlerBuilder->addField($field);
$crawler = $crawlerBuilder->build();
$archives = $crawler->getArchives();
$items = $crawler->getItems();
$export = array();
foreach($items as $item){
    array_push($export, $item->getExport());
}

Here is an explanation of the code.
CrawlerBuilder lets you add starting URLs to your crawler with the method:

addStartingUrl($url)

Then you can add item selectors. In our example, the archive page looked like this:

<ul class="products">
    <li class="product-item-info"><a class="product" href="/url-to-item"></a>...</li>
    <li class="product-item-info"><a class="product" href="/url-to-item"></a>...</li>
    <li class="product-item-info"><a class="product" href="/url-to-item"></a>...</li>
</ul>

so we added these selectors:

addItemSelector(new Selector(SelectorTypes::CLASSNAME, "product-item-info"))
addItemSelector(new Selector(SelectorTypes::CLASSNAME, "product"))
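
To make the idea of chained selectors concrete, here is a minimal sketch that uses PHP's built-in DOM extension rather than the crawler itself. It assumes that each selector is searched inside the match of the previous one; the README does not spell this out, so treat it as an illustration and not as the library's actual implementation. The helper names `classPredicate` and `selectByClassChain`, and the sample URLs, are hypothetical.

```php
<?php
// Illustration only: plain DOMDocument/DOMXPath, not the crawler's internals.
$html = <<<HTML
<ul class="products">
    <li class="product-item-info"><a class="product" href="/item-1">One</a></li>
    <li class="product-item-info"><a class="product" href="/item-2">Two</a></li>
    <li class="other"><a class="product" href="/ignored">Ignored</a></li>
</ul>
HTML;

// Hypothetical helper: XPath predicate matching a single class token.
function classPredicate(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Hypothetical helper: apply class names in order, each one searched
// among the descendants of the previous match, and collect the href values.
function selectByClassChain(string $html, array $classes): array
{
    libxml_use_internal_errors(true); // keep parser warnings quiet for fragments
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $query = '';
    foreach ($classes as $class) {
        $query .= '//*[' . classPredicate($class) . ']';
    }

    $urls = [];
    foreach ((new DOMXPath($doc))->query($query) as $node) {
        $urls[] = $node->getAttribute('href');
    }
    return $urls;
}

$itemUrls = selectByClassChain($html, ['product-item-info', 'product']);
print_r($itemUrls);
```

Only the links inside a `product-item-info` element are returned, which is why the second, narrower selector is added after the first one.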

Our page had pagination like this:

<ul class="pages-items">
    <li class="item"><a class="page" href="/url-to-item">1</a></li>
    <li class="item"><a class="page" href="/url-to-item">2</a></li>
    <li class="item"><a class="next" href="/url-to-item">Next page</a></li>
</ul>

so we added these selectors:

addNextpageSelector(new Selector(SelectorTypes::CLASSNAME, "pages-items"))
addNextpageSelector(new Selector(SelectorTypes::CLASSNAME, "next"))

Note: The order of the selectors is important, and each selector should be as specific as possible.
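
The following sketch shows why the order can matter, again with PHP's built-in DOM extension and under the assumption that selectors are applied from the outermost element inwards. The function `findHrefs` and the `/products?p=…` URLs are made up for this example and are not part of the crawler's API.

```php
<?php
// Illustration only: plain DOMDocument/DOMXPath, not the crawler's internals.
$html = <<<HTML
<ul class="pages-items">
    <li class="item"><a class="page" href="/products?p=1">1</a></li>
    <li class="item"><a class="page" href="/products?p=2">2</a></li>
    <li class="item"><a class="next" href="/products?p=2">Next page</a></li>
</ul>
HTML;

// Hypothetical helper: each class is looked up inside the previous match.
function findHrefs(string $html, array $classes): array
{
    libxml_use_internal_errors(true); // keep parser warnings quiet for fragments
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $query = '';
    foreach ($classes as $class) {
        $query .= "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";
    }

    $hrefs = [];
    foreach ((new DOMXPath($doc))->query($query) as $node) {
        $hrefs[] = $node->getAttribute('href');
    }
    return $hrefs;
}

// Outer container first, then the link inside it: one match.
$outerThenInner = findHrefs($html, ['pages-items', 'next']);

// Reversed order: nothing with class "pages-items" lives inside the "next" link.
$innerThenOuter = findHrefs($html, ['next', 'pages-items']);

var_dump($outerThenInner, $innerThenOuter);
```

With this nesting model, the reversed order finds no link at all, so the container selector has to come before the link selector.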

The single product pages in our example looked like this:

<div class="product">
    <h1>Title of our product</h1>
    <img src="/image.png">
</div> 

We wanted to grab only the title of the product, so we added this field:

$field = (new FieldBuilder(FieldTypes::STRING))->setName("title")
            ->setMultiple(false)
            ->addSelector(new Selector(SelectorTypes::TAGNAME, "h1"))->build();
$crawlerBuilder->addField($field);
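
As a rough picture of what such a field does, the sketch below pulls the text of the `h1` tag out of the sample page with PHP's built-in DOM extension. It assumes that `setMultiple(false)` means "keep only the first match", which the README does not confirm; `extractByTag` is a hypothetical helper, not a function of the library.

```php
<?php
// Illustration only: plain DOMDocument, not the crawler's internals.
$html = <<<HTML
<div class="product">
    <h1>Title of our product</h1>
    <img src="/image.png">
</div>
HTML;

// Hypothetical helper: returns every match as an array when $multiple is true,
// otherwise the first match as a string (or null when nothing matches).
function extractByTag(string $html, string $tagName, bool $multiple)
{
    libxml_use_internal_errors(true); // keep parser warnings quiet for fragments
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $values = [];
    foreach ($doc->getElementsByTagName($tagName) as $node) {
        $values[] = trim($node->textContent);
    }

    if ($multiple) {
        return $values;
    }
    return $values[0] ?? null;
}

$title = extractByTag($html, 'h1', false);
var_dump($title);
```

A field for something that can occur several times on a page, such as gallery images, would presumably use the multiple mode instead.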

Then we created the crawler with the build method:

$crawler = $crawlerBuilder->build();

and now we can get all the information we need:

$archives = $crawler->getArchives();
$items = $crawler->getItems();
$export = array();
foreach($items as $item){
    array_push($export, $item->getExport());
}
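
Once the `$export` array is filled, it can be serialized like any other PHP array. The sketch below assumes that each `getExport()` call returns an associative array keyed by field name (the README does not show its return value), so it uses a hand-written stand-in for `$export` with made-up rows.

```php
<?php
// Stand-in for the $export array built in the loop above.
// Assumption: one associative array per item, keyed by field name.
$export = [
    ['title' => 'Title of our product'],
    ['title' => 'Another product'],
];

// Serialize to readable JSON; json_encode returns false on failure.
$json = json_encode($export, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
if ($json === false) {
    throw new RuntimeException('JSON encoding failed: ' . json_last_error_msg());
}

echo $json, PHP_EOL;
```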
