-
Hello! Our application accepts domain names that are not listed in the public suffix list. In our first implementation we overlooked that the specification requires both the rules and the looked-up domains to be normalized (see https://github.com/publicsuffix/list/wiki/Format#formal-algorithm ). I thought I could use dns.name to normalize both, but I'm running into problems with exception rules, which can start with an exclamation mark that is not considered part of the domain. So in the fake public suffix list we use for testing we may have these entries:
And a first implementation of normalizing rule entry and lookup:

    import os

    from django.conf import settings
    import dns.name


    class PublicDomainSuffixList():

        def __init__(self, data_path=None):
            if data_path is None:
                data_path = settings.SIS_DOMAINS_PUBLIC_SUFFIX_FILE
            self.data_path = data_path
            with open(self.data_path) as data_file:
                self.data = [normalize_rule(line.split(" ")[0]) for line in
                             data_file.read().splitlines()
                             if line.strip() and not line.startswith("//")
                             ]

        def has_domain(self, raw_domain):
            """Checks for an exact match in the list of defined domains"""
            domain = normalize_domain(raw_domain)
            return domain in self.data or (
                "." in domain
                and f"*.{domain.split('.', 1)[1]}" in self.data
                and f"!{domain}" not in self.data)


    def normalize_rule(line):
        domain = line
        prefix = ''
        # dns.name.from_text refuses to handle unicode labels beginning
        # with '!', which to my understanding should be legal in the
        # public suffix list and may even make sense.
        if line.startswith('!'):
            prefix = '!'
            domain = line[1:]
        return prefix + normalize_domain(domain)
        # (Fortunately we do not need a special case for wildcards, as those
        # are complete labels which dns.name just ignores.)


    def normalize_domain(domain):
        return dns.name.from_text(
            domain,
            idna_codec=dns.name.IDNA_2008_Practical).to_text()[:-1]

(The lookup algorithm may not be entirely correct regarding the specification, at least I have my doubts. I didn't write this part. Feedback on this is welcome as well, however the main point is:)

Maybe normalizing via

I'm puzzled why dns.name even accepts

Best regards and thanks
-
Domain names are sequences of labels, and labels can contain any octet value. So !, *, NUL, even embedded "." are ok! You don't see them often, as most domain names you encounter are used as hostnames (or parts of them) and are subject to more restrictions (RFC 1123, which updates RFC 952). IDNA names are also subject to their own, incredibly complicated, sets of rules in two different rule systems (IDNA 2003 and IDNA 2008). Note that IDNA = "Internationalized Domain Names in Applications", and it's the "in Applications" part that is restricting things.

When you apply the IDNA 2008 codec it will enforce those rules, and "!" is disallowed because it is in the DISALLOWED range per RFC 5892 section 3. Section 3 is confusing if you're not a Unicode expert, but the summary in Appendix B.1 is handy for us non-experts, and it makes clear that characters in the range 0x00..0x2C are DISALLOWED, and that range includes "!" (0x21), which is why you get an error for the IDNA name

The public suffix list lists names in IDNA Unicode form, not punycode (though it has the punycode as a comment). I don't see a clear choice of 2003 vs 2008 in the spec, which makes me suspect 2003 is used, though they are the same in most cases.

The PSL code I've written, which I ought to open source some day, strips off the "!" or wildcard prefixes as they are really about how matching should be done. I then just use

This is sufficient canonicalization for my purposes as the whole PSL library is in Python and name comparison in the
But if I wanted to emit canonicalized text for something else to process, then I'd need to call "canonicalize()" too in order to fold case correctly for non-IDNA labels when rendering back to text. E.g.
Sorry for such a long response, but it's an unexpectedly complicated area!
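To illustrate the idea of stripping the "!" or wildcard prefix before normalizing the name, here is a minimal sketch. It is not the PSL code mentioned in this reply: `Rule`, `parse_rule`, `load_rules`, and `simple_normalize` are hypothetical names chosen for the example. `simple_normalize` is also an assumption: it uses Python's built-in `idna` codec (IDNA 2003) plus lowercasing as a stand-in so the sketch runs with the standard library only. A dns.name-based normalizer with whichever IDNA codec you settle on could be passed in its place.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Rule(NamedTuple):
    labels: Tuple[str, ...]  # normalized labels, e.g. ("*", "ck") or ("www", "ck")
    is_exception: bool       # True if the rule was written with a leading "!"


def simple_normalize(name: str) -> str:
    """Stand-in normalizer (assumption): stdlib IDNA 2003 codec plus lowercasing."""
    return name.encode("idna").decode("ascii").lower().rstrip(".")


def parse_rule(text: str, normalize: Callable[[str], str] = simple_normalize) -> Rule:
    """Split a PSL rule into its matching markers and its normalized name."""
    # "!" marks an exception rule; it is not part of the domain name.
    is_exception = text.startswith("!")
    if is_exception:
        text = text[1:]
    # A leading "*." is a wildcard label; keep it out of the normalizer too.
    wildcard = text.startswith("*.")
    if wildcard:
        text = text[2:]
    labels = tuple(normalize(text).split("."))
    if wildcard:
        labels = ("*",) + labels
    return Rule(labels, is_exception)


def load_rules(lines: Iterable[str],
               normalize: Callable[[str], str] = simple_normalize) -> List[Rule]:
    """Parse PSL-style lines, skipping blank lines and // comments."""
    rules = []
    for line in lines:
        fields = line.split()  # a rule is only read up to the first whitespace
        if not fields or fields[0].startswith("//"):
            continue
        rules.append(parse_rule(fields[0], normalize))
    return rules
```

Keeping the markers in separate fields means the normalizer only ever sees a plain domain name, so the choice of IDNA codec does not interact with the "!" prefix at all.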
-
You might find these tests helpful for testing your PSL compliance if you haven't seen them. That directory also has the test in other forms. I agree that 2008 is the right call for DENIC names.
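For checking a lookup against cases of that kind, the formal algorithm linked in the question can be sketched as below. This is only an illustration under stated assumptions: `registrable_domain` is a hypothetical name, the rules and the domain are assumed to be already normalized to the same form (here plain lowercase ASCII), and the small rule list is made up for the example rather than taken from the real list.

```python
from typing import Iterable, Optional


def registrable_domain(domain: str, rules: Iterable[str]) -> Optional[str]:
    """Return the public suffix plus one label, or None if `domain` is itself a suffix."""
    labels = domain.lower().rstrip(".").split(".")
    suffix_len = 1  # if no rule matches, the prevailing rule is "*"
    for rule in rules:
        is_exception = rule.startswith("!")
        rule_labels = (rule[1:] if is_exception else rule).split(".")
        if len(rule_labels) > len(labels):
            continue
        # Compare right to left; "*" in a rule matches any single label.
        tail = labels[-len(rule_labels):]
        if not all(r == "*" or r == d for r, d in zip(rule_labels, tail)):
            continue
        if is_exception:
            # An exception rule always prevails; drop its leftmost label.
            suffix_len = len(rule_labels) - 1
            break
        # Otherwise the matching rule with the most labels prevails.
        suffix_len = max(suffix_len, len(rule_labels))
    if len(labels) <= suffix_len:
        return None
    return ".".join(labels[-(suffix_len + 1):])


# Made-up rule list for illustration.
RULES = ["com", "*.ck", "!www.ck", "jp", "*.kobe.jp", "!city.kobe.jp"]

print(registrable_domain("www.www.ck", RULES))
print(registrable_domain("test.ck", RULES))
```

With these rules, "test.ck" is itself a public suffix because of the wildcard, while "www.ck" is registrable because the exception rule overrides the wildcard.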