Crawler

This is a configurable crawler that is quick to set up and use.

Introduction

It is based on two main objects:

Archive (the page containing the list of products and the pagination links)
Single (the page containing a single product, with a hypothetical title, image and any other fields you need)

Installation

composer require capimichi/crawler

Getting started

The crawler is quick to set up. All you need to do is include the autoload.php file:

include_once '/path/to/src/autoload.php';

Then you can create your crawler like this:

use Crawler\CrawlerBuilder;
use Crawler\Selector;
use Crawler\SelectorTypes;
use Crawler\Single\Fields\FieldBuilder;
use Crawler\Single\Fields\FieldTypes;

$crawlerBuilder = (new CrawlerBuilder())->addStartingUrl("http://example-archive.com/products")
            ->addItemSelector(new Selector(SelectorTypes::CLASSNAME, "product-item-info"))
            ->addItemSelector(new Selector(SelectorTypes::CLASSNAME, "product"))
            ->addNextpageSelector(new Selector(SelectorTypes::CLASSNAME, "pages-items"))
            ->addNextpageSelector(new Selector(SelectorTypes::CLASSNAME, "next"));
            
$field = (new FieldBuilder(FieldTypes::STRING))->setName("title")
            ->setMultiple(false)
            ->addSelector(new Selector(SelectorTypes::TAGNAME, "h1"))->build();
$crawlerBuilder->addField($field);
$crawler = $crawlerBuilder->build();
$archives = $crawler->getArchives();
$items = $crawler->getItems();
$export = array();
foreach($items as $item){
    array_push($export, $item->getExport());
}

Here is an explanation of the code.

CrawlerBuilder lets you add starting URLs to your crawler with the method:

addStartingUrl($url)

Then you can add item selectors. In our example we had an archive like this:

<ul class="products">
    <li class="product-item-info"><a class="product" href="/url-to-item"></a>...</li>
    <li class="product-item-info"><a class="product" href="/url-to-item"></a>...</li>
    <li class="product-item-info"><a class="product" href="/url-to-item"></a>...</li>
</ul>

so we added these selectors:

addItemSelector(new Selector(SelectorTypes::CLASSNAME, "product-item-info"))
addItemSelector(new Selector(SelectorTypes::CLASSNAME, "product"))
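
To make the idea of chained selectors more concrete, here is a minimal sketch that resolves the same two class names against the archive markup using PHP's built-in DOM extension. It is not the library's implementation: the helper classChainToXPath, the numbered hrefs and the variable names are assumptions made for illustration only.

```php
<?php
// Illustrative only: this is NOT how the library is known to work internally.
// It shows one way two chained class-name selectors could resolve item links.

$html = <<<HTML
<ul class="products">
    <li class="product-item-info"><a class="product" href="/url-to-item-1"></a></li>
    <li class="product-item-info"><a class="product" href="/url-to-item-2"></a></li>
</ul>
HTML;

/**
 * Hypothetical helper: build an XPath expression that descends through
 * each class name in the order given (outer element first).
 */
function classChainToXPath(array $classNames): string
{
    $xpath = '';
    foreach ($classNames as $className) {
        // Match the class as a whole word inside the class attribute,
        // so "product" does not match "products".
        $xpath .= sprintf(
            "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
            $className
        );
    }
    return $xpath;
}

$document = new DOMDocument();
$document->loadHTML($html);

$finder = new DOMXPath($document);
$itemUrls = [];
foreach ($finder->query(classChainToXPath(['product-item-info', 'product'])) as $node) {
    $itemUrls[] = $node->getAttribute('href');
}

print_r($itemUrls);
```

Swapping the two class names in the array yields no matches in this sketch, because no element with class "product-item-info" sits inside an element with class "product".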

Our page had pagination like this:

<ul class="pages-items">
    <li class="item"><a class="page" href="/url-to-item">1</a></li>
    <li class="item"><a class="page" href="/url-to-item">2</a></li>
    <li class="item"><a class="next" href="/url-to-item">Next page</a></li>
</ul>

so we added these selectors:

addNextpageSelector(new Selector(SelectorTypes::CLASSNAME, "pages-items"))
addNextpageSelector(new Selector(SelectorTypes::CLASSNAME, "next"))

Note: The order of the selectors is important (list them from the outer element to the inner one, as in the examples above), and each selector should be as specific as possible.
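
As a rough illustration of how a "next page" selector pair could drive pagination, the sketch below follows the "next" link inside "pages-items" until none is left. It is a hypothetical walk written with PHP's DOM extension, not the crawler's own code: the in-memory $pages map, the page URLs and the helper findNextPageUrl are all assumptions, chosen so the example runs without network access.

```php
<?php
// Illustrative only: a hypothetical pagination walk, not the library's code.
// Pages are kept in memory so the sketch runs without network access.

$pages = [
    '/products?page=1' => '<ul class="pages-items"><li class="item"><a class="next" href="/products?page=2">Next page</a></li></ul>',
    '/products?page=2' => '<ul class="pages-items"><li class="item"><a class="page" href="/products?page=1">1</a></li></ul>',
];

/**
 * Hypothetical helper: return the href of the first element with class "next"
 * found inside an element with class "pages-items", or null when there is none.
 */
function findNextPageUrl(string $html): ?string
{
    $document = new DOMDocument();
    $document->loadHTML($html);
    $finder = new DOMXPath($document);
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' pages-items ')]"
           . "//*[contains(concat(' ', normalize-space(@class), ' '), ' next ')]";
    $nodes = $finder->query($query);
    return $nodes->length > 0 ? $nodes->item(0)->getAttribute('href') : null;
}

$visited = [];
$url = '/products?page=1';
// Stop when there is no next link, the page is unknown, or a page repeats.
while ($url !== null && !in_array($url, $visited, true) && isset($pages[$url])) {
    $visited[] = $url;
    $url = findNextPageUrl($pages[$url]);
}

print_r($visited);
```

The check against $visited is a defensive choice in this sketch: it keeps the loop from running forever if a "next" link ever points back to a page that was already seen.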

Each single product page in our example looked like this:

<div class="product">
    <h1>Title of our product</h1>
    <img src="/image.png">
</div>

We wanted to grab only the title of the product, so we added this field:

$field = (new FieldBuilder(FieldTypes::STRING))->setName("title")
            ->setMultiple(false)
            ->addSelector(new Selector(SelectorTypes::TAGNAME, "h1"))->build();
$crawlerBuilder->addField($field);
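
For intuition, the following sketch shows what a non-multiple string field bound to the h1 tag might yield for the product page above. It uses PHP's DOM extension directly and only approximates the behaviour described here. In particular, it assumes that setMultiple(false) means "keep the first match only".

```php
<?php
// Illustrative only: mimics what a non-multiple string field on "h1" might yield.

$html = '<div class="product"><h1>Title of our product</h1><img src="/image.png"></div>';

$document = new DOMDocument();
$document->loadHTML($html);

$headings = $document->getElementsByTagName('h1');

// Assumption: with "multiple" set to false, only the first match is kept.
$title = $headings->length > 0 ? trim($headings->item(0)->textContent) : null;

echo $title, PHP_EOL;
```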

Then we created the crawler with build method:

$crawler = $crawlerBuilder->build();

and now we can get all the information we need:

$archives = $crawler->getArchives();
$items = $crawler->getItems();
$export = array();
foreach($items as $item){
    array_push($export, $item->getExport());
}
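
Once $export is filled, it can be serialized like any other PHP array. The sketch below encodes it as JSON. It assumes that each entry returned by getExport() is an associative array keyed by field name, which the text above does not confirm, and the sample data is made up for illustration.

```php
<?php
// Assumption: each entry returned by getExport() is an associative array
// keyed by field name. The data below is made up for illustration.
$export = [
    ['title' => 'Title of our product'],
    ['title' => 'Another product'],
];

$json = json_encode($export, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES);
if ($json === false) {
    // json_encode returns false on failure; report why.
    throw new RuntimeException('Could not encode export: ' . json_last_error_msg());
}

echo $json, PHP_EOL;
```

The same string could be written to disk with file_put_contents if a file is more convenient than standard output.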
