A RocksDB-based capture index (CDX) server for web archives.
Features:
- Speaks both OpenWayback (XML) and PyWb (JSON) CDX protocols
- Realtime, incremental updates
- Compressed indexes (varint packing + snappy), typically 1/4 - 1/5 the size of CDX files.
- Access control (experimental, see below)
Things it doesn't do (yet):
- Sharding, replication
- CDXJ
Used in production at the National Library of Australia and British Library with 8-9 billion record indexes.
Build:
mvn package
Run:
java -jar target/outbackcdx*.jar
Command line options:
$ java -jar target/outbackcdx-0.3.2.jar -h
Usage: java outbackcdx.Server [options...]
  -b bindaddr           Bind to a particular IP address
  -d datadir            Directory to store index data under
  -i                    Inherit the server socket via STDIN (for use with systemd, inetd etc)
  -j jwks-url perm-path Use JSON Web Tokens for authorization
  -k url realm clientid Use a Keycloak server for authorization
  -p port               Local port to listen on
  -t count              Number of web server threads
  -u                    Use Undertow as the HTTP server instead of NanoHTTPD
  -v                    Verbose logging
The server supports multiple named indexes as subdirectories. Currently indexes are created automatically when you first write records to them.
OutbackCDX does not include a CDX indexing tool for reading WARC or ARC files. Use the cdx-indexer scripts included with OpenWayback or PyWb.
You can load records into the index by POSTing them in the (11-field) CDX format Wayback uses:
$ cdx-indexer mycrawl.warc.gz > records.cdx
$ curl -X POST --data-binary @records.cdx http://localhost:8080/myindex
Added 542 records
The canonicalized URL (first field) is ignored because OutbackCDX performs its own canonicalization.
Limitation: Loading an extremely large number of CDX records in one POST request can cause an out of memory error. Until this is fixed you may need to break your request up into several smaller ones. Most users send one POST per WARC file.
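One way to work around this limit is to split a large CDX file across several POST requests. The following is a minimal Python sketch and not part of OutbackCDX: the function names and the default batch size of 100,000 lines are arbitrary assumptions, so adjust the size to suit your server's heap.

```python
import urllib.request
from itertools import islice


def batches(lines, size):
    """Yield successive lists of at most `size` items from `lines`."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def post_in_batches(cdx_path, index_url, size=100_000):
    """POST the CDX lines in `cdx_path` to `index_url`, `size` lines per request.

    The batch size is an assumed value, not a documented OutbackCDX limit.
    """
    with open(cdx_path, "rb") as f:
        for chunk in batches(f, size):
            req = urllib.request.Request(
                index_url, data=b"".join(chunk), method="POST")
            with urllib.request.urlopen(req) as resp:
                # Print whatever the server reports for this batch.
                print(resp.read().decode().strip())
```

Calling `post_in_batches("records.cdx", "http://localhost:8080/myindex")` would then send the file in pieces, printing the server's response for each piece.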
Deleting records works the same way as loading them. POST the records you wish to delete to /{collection}/delete:
$ curl -X POST --data-binary @records.cdx http://localhost:8080/myindex/delete
Deleted 542 records
When deleting, OutbackCDX does not check whether the records actually existed in the index. Deleting non-existent records has no effect and will not cause an error.
Records can be queried in CDX format:
$ curl 'http://localhost:8080/myindex?url=example.org'
org,example)/ 20030402160014 http://example.org/ text/html 200 MOH7IEN2JAEJOHYXIEPEEGHOHG5VI=== - - 2248 396 mycrawl.warc.gz
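Client code usually wants these space-separated fields by name. The sketch below splits one 11-field CDX line into a named tuple. The field names are assumptions chosen for illustration (following the usual 11-field CDX column order), not identifiers defined by OutbackCDX, and all values are kept as strings.

```python
from typing import NamedTuple


class Capture(NamedTuple):
    """One 11-field CDX record. Field names are illustrative assumptions."""
    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    status: str
    digest: str
    redirect: str
    robotflags: str
    length: str
    offset: str
    filename: str


def parse_cdx_line(line):
    """Split a space-separated CDX line into a Capture."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError("expected 11 fields, got %d" % len(fields))
    return Capture(*fields)
```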
CDX formatted as JSON arrays:
$ curl 'http://localhost:8080/myindex?url=example.org&output=json'
[
[
"org,example)/",
20030402160014,
"http://example.org/",
"text/html",
200,
"MOH7IEN2JAEJOHYXIEPEEGHOHG5VI===",
2248,
396,
"mycrawl.warc.gz"
]
]
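Each JSON row in the example above carries nine values and omits the two `-` placeholder fields of the text format. A small sketch for turning the rows into dictionaries follows. The key names in `FIELDS` are assumptions made for this example, since the JSON output itself is positional.

```python
import json

# Assumed names for the nine positions in each JSON row.
FIELDS = ("urlkey", "timestamp", "original", "mimetype", "status",
          "digest", "length", "offset", "filename")


def rows_to_dicts(body):
    """Convert the JSON array-of-arrays response body into a list of dicts."""
    return [dict(zip(FIELDS, row)) for row in json.loads(body)]
```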
OpenWayback "OpenSearch" XML:
$ curl 'http://localhost:8080/myindex?q=type:urlquery+url:http%3A%2F%2Fexample.org%2F'
<?xml version="1.0" encoding="UTF-8"?>
<wayback>
<request>
<startdate>19960101000000</startdate>
<enddate>20180526162512</enddate>
<type>urlquery</type>
<firstreturned>0</firstreturned>
<url>org,example)/</url>
<resultsrequested>10000</resultsrequested>
<resultstype>resultstypecapture</resultstype>
</request>
<results>
<result>
<compressedoffset>396</compressedoffset>
<compressedendoffset>2248</compressedendoffset>
<mimetype>text/html</mimetype>
<file>mycrawl.warc.gz</file>
<redirecturl>-</redirecturl>
<urlkey>org,example)/</urlkey>
<digest>MOH7IEN2JAEJOHYXIEPEEGHOHG5VI===</digest>
<httpresponsecode>200</httpresponsecode>
<robotflags>-</robotflags>
<url>http://example.org/</url>
<capturedate>20030402160014</capturedate>
</result>
</results>
</wayback>
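If you need to read this XML outside of OpenWayback, Python's standard library is sufficient. The sketch below collects each `<result>` element into a dictionary keyed by the child tag names shown in the response above. The function name is hypothetical and the values are left as strings.

```python
import xml.etree.ElementTree as ET


def parse_results(xml_bytes):
    """Return one dict per <result> element, mapping child tag to text."""
    root = ET.fromstring(xml_bytes)
    return [{child.tag: child.text for child in result}
            for result in root.iter("result")]
```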
Query URLs that match a given URL prefix:
$ curl 'http://localhost:8080/myindex?url=http://example.org/abc&matchType=prefix'
Find the first 5 URLs with a given domain:
$ curl 'http://localhost:8080/myindex?url=example.org&matchType=domain&limit=5'
Find the next 10 URLs in the index starting from the given URL prefix:
$ curl 'http://localhost:8080/myindex?url=http://example.org/abc&matchType=range&limit=10'
Return results in reverse order:
$ curl 'http://localhost:8080/myindex?url=example.org&sort=reverse'
Return results ordered closest to furthest from a given timestamp:
$ curl 'http://localhost:8080/myindex?url=example.org&sort=closest&closest=20030402172120'
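When building these query URLs from code it is safer to let a library do the percent-encoding. A minimal sketch, assuming only the parameter names used in the curl examples above (`url`, `matchType`, `limit`, `sort`, `closest`, `output`); the helper name is hypothetical.

```python
from urllib.parse import urlencode


def cdx_query_url(index_url, url, **options):
    """Build a query URL for an index, percent-encoding every parameter."""
    params = {"url": url, **options}
    return index_url + "?" + urlencode(params)
```

For example, `cdx_query_url("http://localhost:8080/myindex", "example.org", matchType="domain", limit=5)` produces the same URL as the domain query above.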
See the API Documentation for more details about the available options.
Point Wayback at an OutbackCDX index by configuring a RemoteResourceIndex. See the example RemoteCollection.xml shipped with OpenWayback.
<property name="resourceIndex">
  <bean class="org.archive.wayback.resourceindex.RemoteResourceIndex">
    <property name="searchUrlBase" value="http://localhost:8080/myindex" />
  </bean>
</property>
Create a pywb config.yaml file containing:
collections:
  testcol:
    archive_paths: /tmp/warcs/
    #archive_paths: http://remote.example.org/warcs/
    index:
      type: cdx
      api_url: http://localhost:8080/myindex?url={url}&closest={closest}&sort=closest
      # outbackcdx doesn't serve warc records
      # so we blank replay_url to force pywb to read the warc file itself
      replay_url: ""
The ukwa-heritrix project includes some classes that allow OutbackCDX to be used as a source of deduplication data for Heritrix crawls.
Access control is under early development. Experimental support for it can be enabled by setting the following environment variable:
EXPERIMENTAL_ACCESS_CONTROL=1
Rules can be configured through the GUI. Have Wayback or other clients query a particular named access point. For example, to query the 'public' access point:
http://localhost:8080/myindex/ap/public
Alias records group URLs so that their captures are delivered as if they were different snapshots of the same page.
@alias <source-url> <target-url>
For example:
@alias http://legacy.example.org/page-one http://www.example.org/page1
@alias http://legacy.example.org/page-two http://www.example.org/page2
Aliases do not currently work with URL prefix queries. Aliases are resolved after normal canonicalisation rules are applied.
Aliases can be mixed with regular CDX lines, either in the same file or in separate files, and in any order. When an alias is added to the index, any existing records whose canonicalised URL is affected by the alias rule are updated.
Deletion of aliases is not yet implemented.
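Alias lines are simple enough to generate from a mapping of legacy URLs to current ones. The helper below is a hypothetical sketch that emits text in the `@alias <source-url> <target-url>` form described above, ready to be saved to a file and loaded like any other records.

```python
def alias_lines(mapping):
    """Render {source_url: target_url} as @alias records, one per line."""
    lines = []
    for source, target in mapping.items():
        # The record is space-delimited, so URLs must not contain whitespace.
        if len(source.split()) != 1 or len(target.split()) != 1:
            raise ValueError("alias URLs must be non-empty and contain no whitespace")
        lines.append("@alias %s %s\n" % (source, target))
    return "".join(lines)
```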
By default OutbackCDX is unsecured and assumes that some external method of authorization, such as firewall rules or a reverse proxy, is used to secure it. Take care not to expose it to the public internet.
Alternatively one of the following authorization methods can be enabled.
Authorization to modify the index and access control rules can be controlled using JSON Web Tokens. To enable this you will typically use some sort of separate authentication server to sign the JWTs.
OutbackCDX's -j option takes two arguments: a JWKS URL for the public key of the auth server, and a slash-delimited path for where to find the list of permissions in the JWT received as an HTTP bearer token. Refer to your auth server's documentation for what to use.
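To work out which path to pass, it can help to inspect a token issued by your auth server. The sketch below decodes a JWT payload without verifying the signature (for inspection only, never for authorization) and walks a slash-delimited path through the claims. It illustrates one reading of "slash-delimited path" and is not OutbackCDX's own lookup code. The example path `resource_access/outbackcdx/roles` mentioned below is hypothetical.

```python
import base64
import json


def unverified_claims(token):
    """Decode a JWT's payload WITHOUT checking its signature (inspection only)."""
    payload = token.split(".")[1]
    # JWTs use unpadded base64url, so restore the padding before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def lookup(claims, perm_path):
    """Follow a slash-delimited path of keys through nested claims."""
    node = claims
    for key in perm_path.strip("/").split("/"):
        node = node[key]
    return node
```

If `lookup(unverified_claims(token), "resource_access/outbackcdx/roles")` returns the list of permissions you expect, that path is a reasonable candidate for the second argument.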
Currently the OutbackCDX web dashboard does not support generic JWT/OIDC authorization. (Patches welcome.)
OutbackCDX can use Keycloak as an auth server to secure both the API and dashboard.
- In your Keycloak realm's settings create a new client for OutbackCDX with the protocol openid-connect and the URL of your OutbackCDX instance.
- Under the client's roles tab create the following roles:
  - index_edit - can create or delete index records
  - rules_edit - can create, modify or delete access rules
  - policies_edit - can create, modify or delete access policies
- Map your users or service accounts to these client roles as appropriate.
- Run OutbackCDX with this option:
-k https://{keycloak-server}/auth {realm} {client-id}
Note: JWT authentication will be enabled automatically when using Keycloak. You don't need to set the -j option.
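With authorization enabled, clients that modify the index need to present their token as an HTTP bearer token. A minimal sketch of such a request using Python's standard library follows. The helper name is hypothetical, and obtaining the token from your auth server is left out.

```python
import urllib.request


def authorized_post(url, body, token):
    """Build a POST request that carries `token` as an HTTP bearer token."""
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Authorization": "Bearer " + token},
    )
```

The returned request can be sent with `urllib.request.urlopen(...)`, for example to load records into `http://localhost:8080/myindex`.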