Warc tester #59

Open
wumpus opened this issue Dec 29, 2018 · 1 comment

wumpus commented Dec 29, 2018

I built a tool that tests a WARC for standards conformance. The CLI is similar to "warcio check". It's 440 lines of code so far, and will likely be around 1,000 when done.

It will need an extended period of testing and tweaking while it's run against everything in the ecosystem that generates WARCs. Discussion might be ... vigorous. I'm currently labeling findings as "not standard conforming", "following/not following recommendations", and "comments". Hopefully not too many hairs will be split.
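
The three labels above could be modeled roughly as follows. This is a minimal sketch under assumptions, not the code in the tester itself: `Severity`, `Finding`, and `check_headers` are hypothetical names, and the two rules shown are only illustrative, chosen to mirror the "uri must not be within <>" and "missing recommended header WARC-Refers-To" messages in the sample output further down.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    ERROR = "error"                    # not standard conforming
    RECOMMENDATION = "recommendation"  # not following a recommendation
    COMMENT = "comment"                # informational only


@dataclass
class Finding:
    severity: Severity
    message: str
    header: str = ""

    def __str__(self):
        # Render as "<label>: <message> <header>", dropping a trailing space.
        return f"{self.severity.value}: {self.message} {self.header}".rstrip()


def check_headers(headers):
    """Return findings for a dict of WARC header name -> value.

    Hypothetical checks for illustration only.
    """
    findings = []
    uri = headers.get("WARC-Target-URI", "")
    if uri.startswith("<") and uri.endswith(">"):
        findings.append(
            Finding(Severity.ERROR, "uri must not be within <>", "WARC-Target-URI")
        )
    if headers.get("WARC-Type") == "revisit" and "WARC-Refers-To" not in headers:
        findings.append(
            Finding(Severity.RECOMMENDATION, "missing recommended header", "WARC-Refers-To")
        )
    return findings
```

Keeping the severity separate from the message would let a caller decide, for example, to fail a build only on `Severity.ERROR` while still printing recommendations and comments.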

Does this belong in warcio? My hope is that it will be commonly used; with luck that means that the entire web archiving ecosystem will keep warcio installed as part of their testing processes.

wumpus commented Jan 23, 2019

Work in progress, now pull request #66

$ warcio test test/data/*.warc.gz test/data/*.warc
test/data/example-bad-non-chunked.warc.gz
  saw exception 
    ERROR: non-chunked gzip file detected, gzip block continues
    beyond single record.

    This file is probably not a multi-member gzip but a single gzip file.

    To allow seek, a gzipped WARC must have each record compressed into
    a single gzip member and concatenated together.

    This file is likely still valid and can be fixed by running:

    warcio recompress <path/to/file> <path/to/new_file>
  skipping rest of file
test/data/example-resource.warc.gz
  WARC-Record-ID <urn:uuid:6e7f60da-2c7b-11e7-aaf7-0242ac120007>
    WARC-Type resource
    digest pass
    comment: unknown field, no validation performed Warc-Referer https://webrecorder.io/temp-GRWZVUTV/temp/test/record/http://example.com/
    comment: unknown field, no validation performed Warc-User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36
test/data/example.warc.gz
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:e6e395ca-0221-11e7-a18d-0242ac120005>
    WARC-Type revisit
    digest present but not checked
    recommendation: missing recommended header WARC-Refers-To
    comment: field was introduced after this warc version WARC-Refers-To-Target-URI http://example.com/ 1.0
    comment: field was introduced after this warc version WARC-Refers-To-Date 2017-03-06T04:02:06Z 1.0
  WARC-Record-ID <urn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
test/data/example-wget-bad-target-uri.warc.gz
  WARC-Record-ID <urn:uuid:CEF11DC9-8D86-4F4B-9B8C-2235515B4537>
    WARC-Type request
    digest pass
    error: uri must not be within <> warc-target-uri <http://example.com/>
    error: invalid uri scheme, bad character warc-target-uri <http://example.com/>
  WARC-Record-ID <urn:uuid:FD8A6D04-AF8A-4A36-A889-8094487CDF2D>
    WARC-Type response
    payload digest failed sha1:B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A
    error: uri must not be within <> warc-target-uri <http://example.com/>
    error: invalid uri scheme, bad character warc-target-uri <http://example.com/>
  WARC-Record-ID <urn:uuid:E5AC383F-F107-47BC-99B7-176FD8DE6E94>
    WARC-Type metadata
    digest pass
    error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/MANIFEST.txt>
    error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/MANIFEST.txt>
  WARC-Record-ID <urn:uuid:543BCA4F-A305-4383-B511-0BCF23F7AD8D>
    WARC-Type resource
    digest pass
    error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/wget_arguments.txt>
    error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/wget_arguments.txt>
  WARC-Record-ID <urn:uuid:CCD67DB5-13FA-447B-BF05-BF1BDC8ED3A0>
    WARC-Type resource
    digest pass
    error: uri must not be within <> warc-target-uri <metadata://gnu.org/software/wget/warc/wget.log>
    error: invalid uri scheme, bad character warc-target-uri <metadata://gnu.org/software/wget/warc/wget.log>
test/data/example-wrong-chunks.warc.gz
  saw exception Invalid WARC record, first line: <!doctype html>
  skipping rest of file
test/data/post-test.warc.gz
  WARC-Record-ID <urn:uuid:59a6b068-cbc2-4767-9525-33043d2709c7>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:5eb8ee92-cda1-4503-a7a3-c63f1ab6515b>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:c79a62e3-5a4b-450d-a093-3a7fefa09664>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
test/data/example-digest-bad.warc
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    payload digest failed: sha1:1112H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
test/data/example-iana.org-chunked.warc
  WARC-Record-ID <urn:uuid:c46fbf5f-0876-4652-a348-e9b6c322eabb>
    WARC-Type request
    digest pass
    error: WARC-IP-Address should be used for http and https requests
test/data/example-trunc.warc
  WARC-Record-ID <urn:uuid:a9c51e3e-0221-11e7-bf66-0242ac120005>
    WARC-Type response
    block digest failed: sha1:DR5MBP7OD3OPA7RFKWJUD4CTNUQUGFC5
    payload digest failed sha1:G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK
    WARNING: Record not followed by newline, perhaps Content-Length is invalid
    Offset: 2560
    Remainder: b'\x00\x00\r\n'
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
test/data/example.warc
  WARC-Record-ID <urn:uuid:a9c5c23a-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
  WARC-Record-ID <urn:uuid:e6e395ca-0221-11e7-a18d-0242ac120005>
    WARC-Type revisit
    digest present but not checked
    recommendation: missing recommended header WARC-Refers-To
    comment: field was introduced after this warc version WARC-Refers-To-Target-URI http://example.com/ 1.0
    comment: field was introduced after this warc version WARC-Refers-To-Date 2017-03-06T04:02:06Z 1.0
  WARC-Record-ID <urn:uuid:e6e41fea-0221-11e7-8fe3-0242ac120007>
    WARC-Type request
    digest not present
    error: WARC-IP-Address should be used for http and https requests
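
The first message in the output above is about gzip framing: each record is expected to be its own gzip member, with the members concatenated. One way to see the difference is to count the members in a file. The sketch below uses only the Python standard library; `count_gzip_members` is a hypothetical helper and is not part of warcio. It reads the whole input into memory, so it is meant for illustration rather than for large archives.

```python
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the gzip members concatenated in `data`."""
    count = 0
    while data:
        # Adding 16 to wbits tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        d.decompress(data)
        if not d.eof:
            raise ValueError("truncated gzip member")
        count += 1
        # Bytes after the end of this member belong to the next one.
        data = d.unused_data
    return count


# Two records compressed separately and concatenated form two members.
rec_a = gzip.compress(b"WARC/1.0\r\nrecord one\r\n\r\n")
rec_b = gzip.compress(b"WARC/1.0\r\nrecord two\r\n\r\n")
print(count_gzip_members(rec_a + rec_b))
```

A file that holds many records but reports a single member would match the "non-chunked gzip file" case described in the output.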
