[\W\D] fails to match alphabetic characters #241

Open
@muellerj2

The (ECMAScript) regular expression [\W\D] describes a character class that matches the union of (a) all non-alphanumeric characters and (b) all non-digits. Since every non-alphanumeric character is also a non-digit, the class should effectively be equivalent to [\D] and thus match all non-digits. However, Boost.Regex actually matches only non-alphanumeric characters.

Test case:

#include <iostream>
#include <boost/regex.hpp>

using namespace boost;

int main()
{
    regex re(R"([\W\D])");
    std::cout << "matches alphabetic: " << regex_match("a", re) << '\n'
         << "matches digit: " << regex_match("0", re) << '\n' 
         << "matches non-alphanumeric: " << regex_match(".", re);
    
    return 0;
}

https://godbolt.org/z/jPf79j5nr

This prints:

matches alphabetic: 0
matches digit: 0
matches non-alphanumeric: 1

But it should print:

matches alphabetic: 1
matches digit: 0
matches non-alphanumeric: 1

I think the problem lies here:

void add_negated_class(m_type m)
{
    m_negated_classes |= m;
    m_empty = false;
}

The negated character classes are bitwise or'ed, but De Morgan's law says that (not w) or (not d) = not (w and d), so the bit masks should really be bitwise and'ed.
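
To make the De Morgan argument concrete, here is a toy model rather than Boost.Regex code. Every name in it (`mask_t`, `kDigit`, `kWord`, `classify`, `toy_isctype`, the `match_*` functions) is hypothetical. It assumes that the matcher tests the accumulated negated mask with a single "character is in none of these classes" check, which is consistent with the output above but which I have not traced through the matcher.

```cpp
// Toy model of negated character classes stored as bit masks.
// Not Boost.Regex code: all names here are made up for illustration.
using mask_t = unsigned;

constexpr mask_t kDigit      = 1u << 0;
constexpr mask_t kAlpha      = 1u << 1;
constexpr mask_t kUnderscore = 1u << 2;
constexpr mask_t kWord       = kDigit | kAlpha | kUnderscore;  // \w

// Classify an ASCII character into the toy mask bits.
constexpr mask_t classify(char c)
{
    return (c >= '0' && c <= '9') ? kDigit
         : ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) ? kAlpha
         : (c == '_') ? kUnderscore
         : 0u;
}

// Assumed semantics: true if c belongs to at least one class named in m.
constexpr bool toy_isctype(char c, mask_t m)
{
    return (classify(c) & m) != 0;
}

// Masks or'ed together, then negated once: not (w or d).
constexpr bool match_ored(char c)
{
    return !toy_isctype(c, kWord | kDigit);
}

// What [\W\D] is supposed to mean: (not w) or (not d).
constexpr bool match_union(char c)
{
    return !toy_isctype(c, kWord) || !toy_isctype(c, kDigit);
}

// Masks and'ed together, then negated once: not (w and d).
constexpr bool match_anded(char c)
{
    return !toy_isctype(c, kWord & kDigit);
}

// or'ing reproduces the reported output (0, 0, 1).
static_assert(!match_ored('a') && !match_ored('0') && match_ored('.'),
              "or'ed masks match only non-word characters");
// The intended union gives the expected output (1, 0, 1).
static_assert(match_union('a') && !match_union('0') && match_union('.'),
              "union of the negated classes matches all non-digits");
// In this model, and'ing agrees with the intended union.
static_assert(match_anded('a') && !match_anded('0') && match_anded('.'),
              "and'ed masks match all non-digits here");
```

The and'ed version only works in this model because `kWord` is built from `kDigit`, so `kWord & kDigit` happens to be exactly the digit mask.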

But bitwise and'ing would be problematic as well, because traits classes are not required to ensure that and'ing two character class bit masks yields a mask for the intersection of the two character classes. I guess and'ing will probably still work for the traits classes provided by Boost.Regex (although I haven't checked that), but it's not guaranteed to do the right thing for user-provided traits classes.
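
As an illustration of that concern, here is a toy model of a traits class that gives each named class its own independent bit. All names (`mask_t`, `kDigitBit`, `kWordBit`, `classify`, `toy_isctype`, `match_anded`, `match_each`) are hypothetical, and the last function is only one conceivable alternative, not a proposal for how Boost.Regex should store its sets.

```cpp
// Toy model of a user-provided traits class whose class masks are
// independent bits. Not Boost.Regex code: all names are made up.
using mask_t = unsigned;

constexpr mask_t kDigitBit = 1u << 0;  // \d
constexpr mask_t kWordBit  = 1u << 1;  // \w, deliberately not built from kDigitBit

// A digit carries both bits; letters and '_' carry only the word bit.
constexpr mask_t classify(char c)
{
    return (c >= '0' && c <= '9') ? (kDigitBit | kWordBit)
         : ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_') ? kWordBit
         : 0u;
}

// Assumed semantics: true if c belongs to at least one class named in m.
constexpr bool toy_isctype(char c, mask_t m)
{
    return (classify(c) & m) != 0;
}

// kWordBit & kDigitBit is 0 here, so the inner test never succeeds
// and the negation accepts every character.
constexpr bool match_anded(char c)
{
    return !toy_isctype(c, kWordBit & kDigitBit);
}

// One conceivable alternative: test each negated class on its own.
constexpr bool match_each(char c)
{
    return !toy_isctype(c, kWordBit) || !toy_isctype(c, kDigitBit);
}

static_assert(match_anded('0'),
              "and'ed masks wrongly accept a digit with these traits");
static_assert(match_each('a') && !match_each('0') && match_each('.'),
              "per-class tests match exactly the non-digits");
```

Testing each negated class separately does not depend on how the traits class lays out its bits, at the cost of storing more than one mask per set.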
