💫 Spider is a PHP library with easy module integration for crawling websites and scraping information.
Spider is a modular website crawler written in PHP. The tool lets you retrieve information and execute code on the pages of a website. It can be useful for SEO or security audits. You can use modules created by the community or create your own (written in PHP via a web interface).
A crawler is an indexing robot that automatically explores the pages of a website. A crawler can serve several purposes:
- Information search & retrieval
- SEO validation of your website
- Integration testing
- Execution of PHP code on several pages in an automated way
- Get all links from a website
- Check HTTP responses
- Create your own Modules (Crawl & execute your PHP code)
- No database, Pure PHP
- JSON file output
- Use default modules from the kernel for basic SEO audit. (Metadata, Images, HttpCode, Links...)
- PHP class autoloader for easy code integration: mediashare/modules-provider
- Website crawler bot: mediashare/crawler
- Scraper with DomCrawler integration: mediashare/scraper
I would be happy to receive your ideas and contributions to the project 😃
Use the Spider library in your project and create your own modules.
composer require mediashare/spider
<?php
// ./index.php
require 'vendor/autoload.php';
use Mediashare\Spider\Entity\Config;
use Mediashare\Spider\Entity\Url;
use Mediashare\Spider\Spider;
// Website Config
$config = new Config();
$config->setWebspider(true); // Crawl the whole website
$config->setPathRequires(['/Kernel/']); // Only crawl these paths
$config->setPathExceptions(['/CodeSnippet/']); // Do not crawl these paths
// Modules
$config->setReportsDir(__DIR__.'/reports/'); // Reports path
$config->setModulesDir(__DIR__.'/modules/'); // Modules path
$config->enableDefaultModule(true); // Enable default SEO kernel modules
$config->removeModule('FileDownload'); // Disable a module
// Console output / dump
$config->setVerbose(true); // Print verbose output
$config->setJson(false); // Print JSON output
// Url
$url = new Url('https://mediashare.fr');
// Run Spider
$spider = new Spider($url, $config);
$result = $spider->run();
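To illustrate how the two path options in the configuration above could combine, here is a minimal sketch. The `shouldCrawl()` helper is hypothetical and not part of Spider. It assumes each entry is compared as a plain substring of the URL and that exceptions take priority over requirements; the library's actual matching rules may differ (the values could equally be treated as regular expressions).

```php
<?php
// Hypothetical helper, not part of Spider: decides whether a URL would be crawled.
// Assumption: each entry is matched as a plain substring of the URL.
function shouldCrawl(string $url, array $requires, array $exceptions): bool
{
    // Skip the URL if it contains any excluded fragment.
    foreach ($exceptions as $fragment) {
        if (strpos($url, $fragment) !== false) {
            return false;
        }
    }
    // With no required fragments, every remaining URL is allowed.
    if ($requires === []) {
        return true;
    }
    // Otherwise at least one required fragment must be present.
    foreach ($requires as $fragment) {
        if (strpos($url, $fragment) !== false) {
            return true;
        }
    }
    return false;
}

// Example URLs are made up for illustration.
var_dump(shouldCrawl('https://example.com/Kernel/Doc', ['/Kernel/'], ['/CodeSnippet/']));
var_dump(shouldCrawl('https://example.com/Kernel/CodeSnippet/a', ['/Kernel/'], ['/CodeSnippet/']));
```

Under these assumptions the first call returns `true` and the second returns `false`, because the excluded fragment wins over the required one.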
git clone https://github.com/Mediashare/Spider
cd Spider
composer install
bin/console spider:run https://mediashare.fr
curl -O https://raw.githubusercontent.com/Mediashare/Spider/master/spider.phar
chmod 755 spider.phar
./spider.phar spider:run https://mediashare.fr
Modules are community-created tools that add features when crawling a website. Adding a module to the crawler automates code execution on one or more pages of a website. More information...
- The class name must match the name of the .php file.
- Modules are executed through the run() function, so every module must define a run() function.
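The two rules above can be sketched with a minimal module. The `ImgAlt` class and its file path are hypothetical. The `$dom` property and the `filter()` call follow the pattern of the Href example in this README, on the assumption that Spider sets `$dom` to the page's DomCrawler instance before calling `run()`.

```php
<?php
// ./modules/ImgAlt.php (hypothetical module; the file name matches the class name)
namespace Mediashare\Modules;

class ImgAlt {
    // Assumed to be set by Spider to the page's DomCrawler before run() is called.
    public $dom;

    // Entry point: returns the src of every <img> whose alt attribute is missing or blank.
    public function run() {
        $missing = [];
        foreach ($this->dom->filter('img') as $img) {
            if (trim($img->getAttribute('alt')) === '') {
                $missing[] = $img->getAttribute('src');
            }
        }
        return $missing;
    }
}
```

Because the file is named `ImgAlt.php` and the class exposes `run()`, it satisfies both requirements listed above.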
DomCrawler is a Symfony component for navigating the DOM of HTML and XML documents. See the DomCrawler documentation for details.
bin/console spider:module Href
<?php
// ./modules/Href.php
namespace Mediashare\Modules;
class Href {
    // DOM of the crawled page (DomCrawler)
    public $dom;

    // Entry point: count how many times each href appears on the page
    public function run() {
        $links = [];
        foreach ($this->dom->filter('a') as $link) {
            $href = trim($link->getAttribute('href'));
            if ($href === '') {
                continue; // Skip anchors without an href
            }
            if (isset($links[$href])) {
                $links[$href]['counter']++;
            } else {
                $links[$href]['counter'] = 1;
            }
        }
        return $links;
    }
}
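The array returned by `run()` maps each href to a `counter` entry, so it can be post-processed with plain PHP. As a small sketch, the snippet below ranks links by how often they appear; the sample data is made up and only mirrors the shape returned by the Href module.

```php
<?php
// Sample data in the shape returned by Href::run(); the URLs are made up.
$links = [
    '/about'   => ['counter' => 2],
    '/'        => ['counter' => 5],
    '/contact' => ['counter' => 1],
];

// Sort by counter, highest first, while keeping the href keys.
uasort($links, function (array $a, array $b): int {
    return $b['counter'] <=> $a['counter'];
});

// Print one "href => count" line per link.
foreach ($links as $href => $data) {
    echo $href . ' => ' . $data['counter'] . PHP_EOL;
}
```

`uasort()` is used rather than `usort()` because the hrefs are stored as array keys and need to be preserved.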