Superior SURT. Sensible SURT. Smug SURT.
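Transforms scheme://userinfo@host:port/path?query#fragment into a form that starts with the reversed host name, followed by the port, scheme and userinfo, then the unchanged path, query and fragment.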
Old-school Heritrix SURT is described in the Heritrix documentation.
SSURT resolves the following issues with Heritrix SURTs:
- Matching everything from a domain required a separate rule for each protocol.
- Correct parsing of a SURT is awkward due to ")" being allowed in userinfo but also used to delimit it.
- Not reversible. With SSURT this should be true: unSSURT(SSURT(url)) = url.
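The reversibility property can be illustrated with a minimal Python sketch. The names `ssurt`, `unssurt` and `_URL_RE` are hypothetical and not part of any published API, and the output layout is an assumption read off the example prefixes and indicative grammar in this document. The sketch only accepts URLs with a `//` authority and a domain-name host without commas; IP address hosts and other shapes are rejected rather than guessed at.

```python
import re

# Hypothetical helper: scheme://[userinfo@]host[:port][rest], domain-name hosts only.
_URL_RE = re.compile(
    r"^(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*)://"
    r"(?:(?P<userinfo>[^/?#@]*)@)?"
    r"(?P<host>[^/?#:@]+)"
    r"(?::(?P<port>[0-9]+))?"
    r"(?P<rest>[/?#].*)?$",
    re.DOTALL,
)


def ssurt(url: str) -> str:
    """Rewrite a URL into the SSURT layout assumed by this sketch."""
    m = _URL_RE.match(url)
    if m is None or re.fullmatch(r"[0-9.]+", m["host"]):
        raise ValueError(f"URL shape not handled by this sketch: {url!r}")
    # Reverse the dot-separated host labels and end each one with a comma.
    host = "".join(label + "," for label in reversed(m["host"].split(".")))
    port = m["port"] or ""
    # "@" is emitted only when the URL had a userinfo part (possibly empty).
    userinfo = "" if m["userinfo"] is None else "@" + m["userinfo"]
    rest = m["rest"] or ""
    # No case folding or other canonicalization is applied anywhere.
    return f"{host}//{port}:{m['scheme']}{userinfo}:{rest}"


def unssurt(s: str) -> str:
    """Invert ssurt() for strings produced by it."""
    host_part, sep, tail = s.partition("//")
    port, colon, tail = tail.partition(":")
    # scheme[@userinfo] runs up to the last ":" before the path, query or fragment.
    m = re.match(r"(?P<head>[^/?#]*):(?P<rest>.*)$", tail, re.DOTALL)
    if not sep or not colon or m is None:
        raise ValueError(f"not an SSURT this sketch understands: {s!r}")
    scheme, at, userinfo = m["head"].partition("@")
    # Drop the trailing comma, then restore the original label order.
    host = ".".join(reversed(host_part[:-1].split(",")))
    authority = (userinfo + "@" if at else "") + host + (":" + port if port else "")
    return f"{scheme}://{authority}{m['rest']}"


url = "http://foo:bar@nla.gov.au:8000/path?q=1#frag"
print(ssurt(url))
print(unssurt(ssurt(url)) == url)
```

Because nothing is lowercased, decoded or dropped along the way, the inverse can be computed from the string alone, which is the property the last bullet asks for.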
Example SSURT prefixes:
    au,gov,nla,                     => *.nla.gov.au
    au,gov,nla,//                   => everything on host 'nla.gov.au' regardless of scheme and port
    au,gov,nla,//8000:              => everything on port 8000 on host 'nla.gov.au'
    au,gov,nla,//:http              => http with no port specified on host 'nla.gov.au', any userinfo
    au,gov,nla,//8000:http:         => http on port 8000 on host 'nla.gov.au', blank userinfo
    au,gov,nla,//8000:http@foo:bar: => http on port 8000 on host 'nla.gov.au', userinfo 'foo:bar'
    10.                             => everything in ipv4 subnet 10.0.0.0/8, assuming IP address canonicalization
    [2001:0db8:                     => everything in ipv6 subnet 2001:0db8/32, assuming IP address canonicalization
    [2001:0db8:0000:0042:0000:8a2e:0370:7334]:80:ws:/chat
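Since these are plain string prefixes, rule matching can be sketched as a `startswith` test. The helper name `matches_any` and the sample SSURT strings below are made up for illustration; the strings are written by hand to follow the pattern of the example prefixes.

```python
from typing import Iterable


def matches_any(ssurt_url: str, prefixes: Iterable[str]) -> bool:
    """Return True if the SSURT string starts with at least one prefix."""
    return any(ssurt_url.startswith(p) for p in prefixes)


# Hand-written SSURT strings used only for this illustration.
on_port_8000 = "au,gov,nla,//8000:http:/catalogue"
subdomain = "au,gov,nla,www,//:https:/"
other_host = "au,gov,example,//:http:/"

# One rule covers the whole domain, whatever the scheme or port.
domain_rule = ["au,gov,nla,"]
print(matches_any(on_port_8000, domain_rule))  # True
print(matches_any(subdomain, domain_rule))     # True
print(matches_any(other_host, domain_rule))    # False

# Adding "//" pins the rule to the exact host, so the subdomain drops out.
host_rule = ["au,gov,nla,//"]
print(matches_any(on_port_8000, host_rule))    # True
print(matches_any(subdomain, host_rule))       # False
```

The single `au,gov,nla,` rule covers both the http and https samples, which is the case that needed one rule per protocol with Heritrix SURTs.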
Indicative grammar (grammars alone are not powerful enough to describe URL parsing):
    SSURT = [ [ ssurt_host ] "//" ] [ [ port ] ":" ] [ [ scheme ] [ "@" [ userinfo ] ] ":" ] path [ "?" [ query ] ] [ "#" [ fragment ] ]
    ssurt_host = revdomain "," / IPv4address / "[" IPv6address "]"
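As a rough illustration, the grammar can be mirrored by a regular expression that splits an SSURT string into its components. The names `SSURT_RE` and `split_ssurt` are hypothetical, and the character classes chosen for each component are assumptions; like the grammar itself, this is indicative only and does not validate hosts or handle every edge case.

```python
import re
from typing import Dict, Optional

# One optional group per bracketed element of the indicative grammar.
SSURT_RE = re.compile(
    r"""
    ^
    (?: (?P<host> [^/?#]* ) // )?                # optional host, then two slashes
    (?: (?P<port> [0-9]* ) : )?                  # optional port, then a colon
    (?: (?P<scheme> [A-Za-z][A-Za-z0-9+.-]* )?   # optional scheme
        (?: @ (?P<userinfo> [^/?#]* ) )?         # optional at-sign and userinfo
        : )?                                     # colon closing scheme and userinfo
    (?P<path> [^?#]* )
    (?: \? (?P<query> [^#]* ) )?
    (?: \# (?P<fragment> .* ) )?
    $
    """,
    re.VERBOSE | re.DOTALL,
)


def split_ssurt(s: str) -> Dict[str, Optional[str]]:
    """Split an SSURT string into named components (None when absent)."""
    m = SSURT_RE.match(s)
    if m is None:
        raise ValueError(f"could not split: {s!r}")
    return m.groupdict()


parts = split_ssurt("au,gov,nla,//8000:http@foo:bar:/path?q#f")
for name, value in parts.items():
    print(name, "=", value)
```

The userinfo group is greedy and backtracks to the last colon before the path, so a userinfo such as `foo:bar` stays in one piece. This is one place where a bare grammar is ambiguous and a parser has to pick a rule.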
SSURT does not imply any particular canonicalization. Funky uncanonicalized URLs can be represented in SSURT format.