Skip to content

Microdata, RDFa Lite & JSON-LD structured data reader for PHP

License

Notifications You must be signed in to change notification settings

brick/structured-data

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

55 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Brick\StructuredData

A PHP library to read Microdata, RDFa Lite & JSON-LD structured data in HTML pages.

This library is a foundation to read schema.org structured data in brick/schema, but may be used with other vocabularies.

Build Status Coverage Status Latest Stable Version License

Installation

This library is installable via Composer:

composer require brick/structured-data

Requirements

This library requires PHP 7.2 or later. It makes use of the following extensions:

These extensions are enabled by default, and should be available in most PHP installations.

Project status & release process

This library is under development. It is likely to change fast in the early 0.x releases. However, the library follows a strict BC break convention:

The current releases are numbered 0.x.y. When a non-breaking change is introduced (adding new methods, fixing bugs, optimizing existing code, etc.), y is incremented.

When a breaking change is introduced, a new 0.x version cycle is always started.

It is therefore safe to lock your project to a given release cycle, such as 0.1.*.

If you need to upgrade to a newer release cycle, check the release history for a list of changes introduced by each further 0.x.0 version.

Introduction

The library unifies reading the 3 supported formats (Microdata, RDFa Lite & JSON-LD) under a common interface:

interface Brick\StructuredData\Reader
{
    /**
     * Reads the items contained in the given document.
     *
     * @param DOMDocument $document The DOM document to read.
     * @param string      $url      The URL the document was retrieved from. This will be used only to resolve relative
     *                              URLs in property values. No attempt will be performed to connect to this URL.
     *
     * @return Item[] The top-level items.
     */
    public function read(DOMDocument $document, string $url) : array;
}

There are 3 implementations of this interface, one for each format:

  • MicrodataReader
  • RdfaLiteReader
  • JsonLdReader

The read() method returns the top-level items found in the document. Every Item consists of:

  • An optional id (itemid in Microdata, resource in RDFa Lite, @id in JSON-LD)
  • An array of zero or more types; each type is a URL, for example http://schema.org/Product
  • An associative array of zero or more properties; each property has a URL as a key, for example http://schema.org/price, and maps to an array of one or more values; values can be plain strings, or nested Item objects

Quickstart

Here is a working example that reads Microdata from a web page. Just change the URL and give it a try:

use Brick\StructuredData\Reader\MicrodataReader;
use Brick\StructuredData\HTMLReader;
use Brick\StructuredData\Item;

// Let's read Microdata here;
// You could also use RdfaLiteReader, JsonLdReader,
// or even use all of them by chaining them in a ReaderChain
$microdataReader = new MicrodataReader();

// Wrap into HTMLReader to be able to read HTML strings or files directly,
// i.e. without manually converting them to DOMDocument instances first
$htmlReader = new HTMLReader($microdataReader);

// Replace this URL with that of a website you know is using Microdata
$url = 'http://www.example.com/';
$html = file_get_contents($url);

// Read the document and return the top-level items found
// Note: the URL is only required to resolve relative URLs; no attempt will be made to connect to it
$items = $htmlReader->read($html, $url);

// Loop through the top-level items
foreach ($items as $item) {
    echo implode(',', $item->getTypes()), PHP_EOL;

    foreach ($item->getProperties() as $name => $values) {
        foreach ($values as $value) {
            if ($value instanceof Item) {
                // We're only displaying the class name in this example; you would typically
                // recurse through nested Items to get the information you need
                $value = '(' . implode(', ', $value->getTypes()) . ')';
            }

            // If $value is not an Item, then it's a plain string

            echo "  - $name: $value", PHP_EOL;
        }
    }
}

Current limitations

  • No support for the itemref attribute in MicroDataReader
  • No support for the prefix attribute in RdfaLiteReader; only predefined prefixes are supported right now
  • No proper support for @context in JsonLdReader; right now, only strings are accepted in @context, and they are considered a vocabulary identifier; this works fine with simple markup like the one used in the examples on schema.org, but may fail with more complex documents.

Note about JSON-LD's @context

While JsonLdReader should be able to handle a proper context object in the future, its goal will never be to be a fully compliant JSON-LD parser; in particular, it will never attempt to fetch a JSON-LD context referenced by a URL.

This is consistent with how indexing robots typically crawl the web, they do not fetch remote contexts, which relieves them from fetching additional documents to extract structured data from a web page.

The aim of JsonLdReader, and the other Reader implementations for that matter, is to be able to parse a document with the same capabilities as Google Structured Data Testing Tool or Yandex Structured data validator, no more, no less. These tools do not load external context files.

About

Microdata, RDFa Lite & JSON-LD structured data reader for PHP

Resources

License

Stars

Watchers

Forks

Sponsor this project

 

Packages

No packages published