Labels containing * or ! #1047

TauPan · 2024-02-13T14:15:32Z


Hello!

Our application accepts domain names that are not mentioned in the public suffix list. In our first implementation we overlooked that the specification requires both the rules and the looked-up domains to be normalized (see https://github.com/publicsuffix/list/wiki/Format#formal-algorithm ).

I thought I could use dns.name to normalize both, but I'm running into problems with the exception rules, which start with an exclamation mark that is not considered part of the domain.

So in the fake public suffix list for testing purposes we may have these entries:

*.wildcard
!exception.wildcard
nörmalizes
!denörmalizes.wildcard

And here is a first implementation that normalizes both the rule entries and the lookups:

import os

from django.conf import settings
import dns.name


class PublicDomainSuffixList():
    def __init__(self, data_path=None):
        if data_path is None:
            data_path = settings.SIS_DOMAINS_PUBLIC_SUFFIX_FILE
        self.data_path = data_path
        with open(self.data_path) as data_file:
            self.data = [normalize_rule(line.split(" ")[0]) for line in
                         data_file.read().splitlines()
                         if line.strip() and not line.startswith("//")
                         ]

    def has_domain(self, raw_domain):
        """Checks for an exact match in the list of defined domains"""
        domain = normalize_domain(raw_domain)
        return domain in self.data or (
            "." in domain
            and f"*.{domain.split('.', 1)[1]}" in self.data
            and f"!{domain}" not in self.data)


def normalize_rule(line):
    domain = line
    prefix = ''
    # dns.name.from_text refuses to handle unicode labels beginning
    # with '!', which to my understanding should be legal in the
    # public suffix list and may even make sense.
    if line.startswith('!'):
        prefix = '!'
        domain = line[1:]
    return prefix + normalize_domain(domain)
# (Fortunately we do not need a special case for wildcards, as those
# are complete labels which dns.name just ignores.)


def normalize_domain(domain):
    return dns.name.from_text(
        domain,
        idna_codec=dns.name.IDNA_2008_Practical).to_text()[:-1]

(The lookup algorithm may not fully conform to the specification; at least I have my doubts. I didn't write this part. Feedback on it is welcome as well, but my main question is the following.)

Maybe normalizing via dns.name.from_text().to_text() is not the best idea? It accepts the leading ! in !exception.wildcard without error, so .has_domain("!exception.wildcard") returns True. Trying to look up !denörmalizes.wildcard, however, raises a dns.name.IDNAException. (This is with dnspython 2.2.1; sorry if it has been fixed since. We're still on Python 3.6 at the moment, although upgrading to 3.11 might soon be possible.)

I'm puzzled that dns.name accepts ! (and apparently * as well) in domain labels at all. Those are not legal, are they? Is this a bug? Or am I misunderstanding something?

Best regards and thanks
Friedel

Answered by rthalley


rthalley (Maintainer) · 2024-02-13T16:46:01Z

Domain names are sequences of labels, and labels can contain any octet value. So !, *, NUL, and even an embedded "." are all OK! You don't see them often because most domain names you encounter are used as hostnames (or parts of them) and are subject to more restrictions (RFC 1123, which updates RFC 952).

IDNA names are also subject to their own, incredibly complicated, sets of rules in two different rule systems (IDNA 2003 and IDNA 2008). Note that IDNA = "Internationalized Domain Names in Applications", and it's the "in Applications" part that is restricting things. When you apply the IDNA 2008 codec it enforces those rules, and "!" is rejected because it is DISALLOWED per RFC 5892 section 3. Section 3 is confusing if you're not a Unicode expert, but the summary in Appendix B.1 is handy for us non-experts: it makes clear that code points in the range 0x00..0x2C are DISALLOWED, and that range includes "!" (0x21). That is why you get an error for the IDNA name !denörmalizes.wildcard but not for the ordinary, all-ASCII domain name !exception.wildcard, which never goes through the codec.

The DNS itself (at the server and protocol levels) and many IDNA-unaware applications never care about any of this, as all IDNA names get punycoded in the DNS, which makes them follow the rules for hostnames.
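To illustrate the point that labels are plain octet strings at the protocol level, here is a minimal sketch of the RFC 1035 wire encoding, where each label is a length octet followed by its raw bytes. It is not dnspython code; the helper name to_wire is hypothetical, and the sketch only checks the 63-octet label limit, not the overall name length.

```python
from typing import Iterable


def to_wire(labels: Iterable[bytes]) -> bytes:
    """Encode labels as length-prefixed octet strings (RFC 1035 wire format)."""
    out = bytearray()
    for label in labels:
        if not 0 < len(label) <= 63:
            raise ValueError("a label must be 1 to 63 octets long")
        out.append(len(label))  # one length octet ...
        out += label            # ... followed by the raw octets, whatever they are
    out.append(0)               # the zero-length root label terminates the name
    return bytes(out)


# "!" is just another octet here, and so are "*", NUL and an embedded ".".
print(to_wire([b"!exception", b"wildcard"]))
print(to_wire([b"a.b\x00*", b"example"]))
```

Because the length octet delimits each label, nothing inside a label needs to be escaped or validated at this level; restrictions such as the hostname rules or IDNA only apply in the layers above.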

The public suffix list lists names in IDNA Unicode form, not punycode (though it has the punycode as a comment). I don't see a clear choice of 2003 vs 2008 in the spec, which makes me suspect 2003 is used, though the two agree in most cases.

The PSL code I've written, which I ought to open-source some day, strips off the "!" or wildcard prefixes, as they are really about how matching should be done. I then just use dns.name.from_text(), which applies the IDNA_2003 codec to any name containing non-ASCII code points. I use the "!" and wildcard info to decide what kind of matching node to create: a "!" becomes an ExceptionNode, the wildcard prefix becomes a WildNode, and anything else is an ExactNode. I put these nodes in a dns.namedict.NameDict and then use the NameDict's get_deepest_match() method to find the best match and apply the particular rule type appropriately.

Parsing with from_text() is sufficient canonicalization for my purposes, as the whole PSL library is in Python: name comparison in the NameDict is done on Name objects, so it is case-insensitive and properly punycoded.
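The prefix-stripping and deepest-match approach described above could be sketched roughly as follows. This is not the unpublished code mentioned in the post: it swaps the NameDict for a plain dict keyed on label tuples, uses the standard library's "idna" codec (IDNA 2003) instead of dns.name, and every function name in it (normalize, parse_rule, build_index, public_suffix) is hypothetical. As simplifying assumptions, it keeps only one rule kind per name and leaves out the PSL's default "*" rule, so it returns None when no listed rule matches.

```python
from typing import Dict, Iterable, Optional, Tuple

Labels = Tuple[str, ...]


def normalize(name: str) -> Labels:
    """Return the name as lowercase IDNA 2003 A-labels, e.g. ('www', 'example')."""
    ascii_name = name.strip(".").encode("idna").decode("ascii").lower()
    return tuple(ascii_name.split("."))


def parse_rule(rule: str) -> Tuple[str, Labels]:
    """Strip the '!' or '*.' prefix and remember it as the kind of match."""
    if rule.startswith("!"):
        return "exception", normalize(rule[1:])
    if rule.startswith("*."):
        return "wildcard", normalize(rule[2:])
    return "exact", normalize(rule)


def build_index(lines: Iterable[str]) -> Dict[Labels, str]:
    """Map normalized rule names to their kind, skipping blanks and comments."""
    index: Dict[Labels, str] = {}
    for raw in lines:
        stripped = raw.strip()
        if not stripped or stripped.startswith("//"):
            continue
        kind, labels = parse_rule(stripped.split()[0])
        index[labels] = kind  # assumption: at most one rule kind per name
    return index


def public_suffix(domain: str, index: Dict[Labels, str]) -> Optional[str]:
    """Return the public suffix chosen by the deepest matching rule, if any."""
    labels = normalize(domain)
    # Try the full name first, then drop labels from the left: the first
    # hit is the deepest (longest) rule name that is a suffix of the domain.
    for start in range(len(labels)):
        kind = index.get(labels[start:])
        if kind is None:
            continue
        if kind == "exception":
            cut = start + 1   # suffix is the rule minus its leftmost label
        elif kind == "wildcard":
            if start == 0:
                continue      # no extra label for "*" to match
            cut = start - 1   # "*" absorbs one more label to the left
        else:
            cut = start
        return ".".join(labels[cut:])
    return None


rules = ["*.wildcard", "!exception.wildcard", "nörmalizes", "!denörmalizes.wildcard"]
index = build_index(rules)
print(public_suffix("a.foo.wildcard", index))
print(public_suffix("www.exception.wildcard", index))
print(public_suffix("Shop.NÖRMALIZES", index))
```

Keeping the rule kind separate from the name means the "!" never reaches the IDNA codec, which sidesteps the exception reported in the question regardless of which IDNA version is used for normalization.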

>>> n1 = dns.name.from_text("denörmalizes.wildcard")
>>> n2 = dns.name.from_text("denörmaLizes.Wildcard")
>>> n3 = dns.name.from_text("xn--denrmalizes-tfb.Wildcard")
>>> n1 == n2
True
>>> n2 == n3
True

But if I wanted to emit canonicalized text for something else to process, I'd also need to call canonicalize() in order to fold case correctly for non-IDNA labels when rendering back to text. For example:

>>> print(dns.name.from_text("denörmaLizes.Wildcard").canonicalize())
xn--denrmalizes-tfb.wildcard.

Sorry for such a long response, but it's an unexpectedly complicated area!

2 replies

rthalley Feb 13, 2024
Maintainer

Also, IDNA 2008 is far stricter than IDNA 2003, so much so that dnspython has a "practical" mode for it, as the strict mode would break real uses: "_sip._tcp.Königsgäßchen.example", for example, would be illegal in strict mode. The label "Königsgäßchen" is fun because it maps differently in 2003 and 2008: "xn--knigsgsschen-lcb0w" in 2003, and "xn--knigsgchen-b4a3dun" in 2008. I'll stop now :)
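The difference between the two A-labels can be inspected with the Python standard library alone, as a rough sketch. Python's built-in "idna" codec implements IDNA 2003 only, so the 2008 A-label below is copied from the comment above and merely decoded with the "punycode" codec, without any IDNA 2008 validation. The helper names to_alabel_2003 and to_ulabel are hypothetical.

```python
def to_alabel_2003(label: str) -> str:
    """Encode one label with Python's built-in IDNA 2003 codec."""
    return label.encode("idna").decode("ascii")


def to_ulabel(alabel: str) -> str:
    """Undo only the Punycode step of an A-label (no validity checks)."""
    if not alabel.startswith("xn--"):
        raise ValueError("not an A-label")
    return alabel[4:].encode("ascii").decode("punycode")


# IDNA 2003 maps the label before encoding it (lowercasing, ß -> ss).
print(to_alabel_2003("Königsgäßchen"))
print(to_ulabel(to_alabel_2003("Königsgäßchen")))

# The IDNA 2008 A-label quoted above still carries the ß after decoding.
print(to_ulabel("xn--knigsgchen-b4a3dun"))
```

Decoding both A-labels shows where they diverge: the 2003 form has lost the ß to the "ss" mapping, while the 2008 form preserves it as a distinct character.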

TauPan Feb 14, 2024
Author

Thanks for your helpful hints.

It seems I do need to recheck whether we implement the algorithm correctly, and get_deepest_match() might be helpful here.

You also seem to confirm that dropping the initial '!' from the rule is correct.

Regarding IDNA 2008: we will be accepting a lot of German domain names, and the character ß is bound to appear at some point. The German NIC (DENIC) officially supports IDNA 2008, so it would be unreasonable for us to be stuck with IDNA 2003.

In particular, staying on IDNA 2003 would lead to collisions with legacy names such as "sparkasse-giessen.de" (which is mentioned in the Deviations section of the UTS 46 document). IDNA 2003 maps "sparkasse-gießen.de" to "sparkasse-giessen.de", a domain that existed before 2003 and is not even a punycode domain, while IDNA 2008 maps it to "xn--sparkasse-gieen-2ib.de". Incidentally, the actual customer portal of a savings bank in Gießen is hosted on the old domain name only. Both domains appear to belong to the same organization, though, so the possible phishing case outlined in UTS 46 is prevented.
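The collision can be reproduced with a small check, sketched here with Python's built-in "idna" codec, which implements IDNA 2003. The helper names to_ascii_2003 and collides_2003 are hypothetical, and the second name in the last call is made up purely as a non-colliding contrast.

```python
def to_ascii_2003(name: str) -> str:
    """Map a domain name to lowercase ASCII with Python's IDNA 2003 codec."""
    return name.encode("idna").decode("ascii").lower()


def collides_2003(name_a: str, name_b: str) -> bool:
    """True if two names become indistinguishable after IDNA 2003 mapping."""
    return to_ascii_2003(name_a) == to_ascii_2003(name_b)


# The ß name and the legacy "ss" name end up as the same ASCII name.
print(to_ascii_2003("sparkasse-gießen.de"))
print(collides_2003("sparkasse-gießen.de", "sparkasse-giessen.de"))

# A made-up name with a single "s" does not collide.
print(collides_2003("sparkasse-gießen.de", "sparkasse-giesen.de"))
```

A check like this could be run against already-registered names when migrating from an IDNA 2003 normalization to IDNA 2008, to find entries that were previously merged.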

rthalley (Maintainer) · 2024-02-14T16:14:50Z

You might find these tests helpful for testing your PSL compliance if you haven't seen them. That directory also has the test in other forms.

I agree that 2008 is the right call for DENIC names.

0 replies
