A PHP library to read Microdata, RDFa Lite & JSON-LD structured data in HTML pages.
This library is a foundation to read schema.org structured data in brick/schema, but may be used with other vocabularies.
This library is installable via Composer:
composer require brick/structured-data
This library requires PHP 7.2 or later. It makes use of a few PHP extensions that are enabled by default, and should be available in most PHP installations.
This library is under development. It is likely to change fast in the early 0.x releases. However, the library follows a strict BC break convention:
The current releases are numbered 0.x.y. When a non-breaking change is introduced (adding new methods, fixing bugs, optimizing existing code, etc.), y is incremented.
When a breaking change is introduced, a new 0.x version cycle is always started.
It is therefore safe to lock your project to a given release cycle, such as 0.1.*.
If you need to upgrade to a newer release cycle, check the release history for a list of changes introduced by each further 0.x.0 version.
The library unifies reading the 3 supported formats (Microdata, RDFa Lite & JSON-LD) under a common interface:
interface Brick\StructuredData\Reader
{
    /**
     * Reads the items contained in the given document.
     *
     * @param DOMDocument $document The DOM document to read.
     * @param string      $url      The URL the document was retrieved from. This will be used only to resolve relative
     *                              URLs in property values. No attempt will be performed to connect to this URL.
     *
     * @return Item[] The top-level items.
     */
    public function read(DOMDocument $document, string $url) : array;
}
There are 3 implementations of this interface, one for each format:
MicrodataReader
RdfaLiteReader
JsonLdReader
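Because the interface takes a DOMDocument, an HTML string has to be parsed before it can be handed to one of these readers. The following is a minimal sketch of that step using only PHP's DOM extension. The helper name loadHtmlDocument is hypothetical and not part of the library; the commented-out reader call assumes that JsonLdReader lives in the same Brick\StructuredData\Reader namespace as MicrodataReader and can be constructed without arguments.

```php
<?php

declare(strict_types=1);

/**
 * Parses an HTML string into a DOMDocument suitable for Reader::read().
 * Hypothetical helper for illustration; not part of the library.
 */
function loadHtmlDocument(string $html): DOMDocument
{
    $document = new DOMDocument();

    // Real-world HTML is rarely valid: collect libxml warnings instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $document;
}

$html = '<html><body><div itemscope itemtype="http://schema.org/Product">'
      . '<span itemprop="name">Widget</span></div></body></html>';

$document = loadHtmlDocument($html);

// With the library installed, the document can then be passed to a reader
// (assumed namespace and constructor, see above):
// $items = (new Brick\StructuredData\Reader\JsonLdReader())->read($document, 'http://www.example.com/');
```

The URL passed as the second argument is only used to resolve relative URLs, so it does not need to be reachable.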
The read() method returns the top-level items found in the document. Every Item consists of:
- An optional id (itemid in Microdata, resource in RDFa Lite, @id in JSON-LD)
- An array of zero or more types; each type is a URL, for example http://schema.org/Product
- An associative array of zero or more properties; each property has a URL as a key, for example http://schema.org/price, and maps to an array of one or more values; values can be plain strings, or nested Item objects
Here is a working example that reads Microdata from a web page. Just change the URL and give it a try:
use Brick\StructuredData\Reader\MicrodataReader;
use Brick\StructuredData\HTMLReader;
use Brick\StructuredData\Item;
// Let's read Microdata here;
// You could also use RdfaLiteReader, JsonLdReader,
// or even use all of them by chaining them in a ReaderChain
$microdataReader = new MicrodataReader();
// Wrap into HTMLReader to be able to read HTML strings or files directly,
// i.e. without manually converting them to DOMDocument instances first
$htmlReader = new HTMLReader($microdataReader);
// Replace this URL with that of a website you know is using Microdata
$url = 'http://www.example.com/';
$html = file_get_contents($url);
// Read the document and return the top-level items found
// Note: the URL is only required to resolve relative URLs; no attempt will be made to connect to it
$items = $htmlReader->read($html, $url);
// Loop through the top-level items
foreach ($items as $item) {
    echo implode(',', $item->getTypes()), PHP_EOL;

    foreach ($item->getProperties() as $name => $values) {
        foreach ($values as $value) {
            if ($value instanceof Item) {
                // We're only displaying the class name in this example; you would typically
                // recurse through nested Items to get the information you need
                $value = '(' . implode(', ', $value->getTypes()) . ')';
            }

            // If $value is not an Item, then it's a plain string
            echo " - $name: $value", PHP_EOL;
        }
    }
}
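The example above only prints the types of nested items. Recursing through them could look like the sketch below. The function describeItem is a hypothetical helper, not part of the library; it relies only on the getTypes() and getProperties() methods already used above.

```php
<?php

declare(strict_types=1);

use Brick\StructuredData\Item;

/**
 * Renders an Item and all of its nested Items as indented text lines.
 * Hypothetical helper for illustration; not part of the library.
 *
 * @return string[]
 */
function describeItem(Item $item, int $depth = 0): array
{
    $indent = str_repeat('  ', $depth);

    // First line: the types of this item
    $lines = [$indent . implode(', ', $item->getTypes())];

    foreach ($item->getProperties() as $name => $values) {
        foreach ($values as $value) {
            if ($value instanceof Item) {
                // Nested item: print the property name, then recurse one level deeper
                $lines[] = "$indent  - $name:";
                array_push($lines, ...describeItem($value, $depth + 2));
            } else {
                // Plain string value
                $lines[] = "$indent  - $name: $value";
            }
        }
    }

    return $lines;
}
```

With the $items array from the example above, each top-level item could then be printed with echo implode(PHP_EOL, describeItem($item)), PHP_EOL;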
- No support for the itemref attribute in MicrodataReader
- No support for the prefix attribute in RdfaLiteReader; only predefined prefixes are supported right now
- No proper support for @context in JsonLdReader; right now, only strings are accepted in @context, and they are considered a vocabulary identifier; this works fine with simple markup like the one used in the examples on schema.org, but may fail with more complex documents.
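To make the @context limitation concrete, the sketch below builds two JSON-LD snippets: one with a string @context, which is the form described above as supported, and one with an object @context, which is not interpreted yet. The helper toJsonLdScript and the sample data are hypothetical and only illustrate the two shapes of markup; how JsonLdReader reacts to the second form is not specified here.

```php
<?php

declare(strict_types=1);

// Supported today: a string @context, treated as the vocabulary identifier
$supported = [
    '@context' => 'http://schema.org',
    '@type'    => 'Product',
    'name'     => 'Widget',
];

// Not properly supported yet: an object @context (here, a prefix mapping)
$unsupported = [
    '@context'    => ['schema' => 'http://schema.org/'],
    '@type'       => 'schema:Product',
    'schema:name' => 'Widget',
];

/**
 * Wraps JSON-LD data in the script element used to embed it in an HTML page.
 * Hypothetical helper for illustration; not part of the library.
 */
function toJsonLdScript(array $data): string
{
    return '<script type="application/ld+json">'
        . json_encode($data, JSON_UNESCAPED_SLASHES)
        . '</script>';
}

echo toJsonLdScript($supported), PHP_EOL;
echo toJsonLdScript($unsupported), PHP_EOL;
```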
While JsonLdReader should be able to handle a proper context object in the future, its goal will never be to be a fully compliant JSON-LD parser; in particular, it will never attempt to fetch a JSON-LD context referenced by a URL.
This is consistent with how indexing robots typically crawl the web: they do not fetch remote contexts, which relieves them from fetching additional documents to extract structured data from a web page.
The aim of JsonLdReader, and the other Reader implementations for that matter, is to be able to parse a document with the same capabilities as the Google Structured Data Testing Tool or the Yandex Structured data validator, no more, no less. These tools do not load external context files.