GH-6114: Static path matching #6146

Draft: wants to merge 32 commits into base: main

Commits (32):
0ec8baf Initial stub implementations and test cases (dantleech, Mar 7, 2025)
875a816 Temporary utility method to explore current behavior (dantleech, Mar 7, 2025)
e6259a4 Added initial test cases (dantleech, Mar 10, 2025)
8bf4038 Add edge case (dantleech, Mar 10, 2025)
0d51a58 Fixing tests (dantleech, Mar 10, 2025)
00562d8 Implementing FileMatcher (dantleech, Mar 10, 2025)
42250d9 Support character groups (dantleech, Mar 10, 2025)
dfff3ea Escaping unterminated openening brackets (dantleech, Mar 12, 2025)
5966038 Update (dantleech, Mar 12, 2025)
5eb851a Tokenizing (dantleech, Mar 12, 2025)
7b10051 Progressing (dantleech, Mar 12, 2025)
ca252ad Failing unterminated (dantleech, Mar 12, 2025)
1a984fd Unterminated bracket (dantleech, Mar 12, 2025)
05a1087 Complementation (dantleech, Mar 12, 2025)
05e58c4 Negated group (dantleech, Mar 12, 2025)
f3837cc Fix complementation (dantleech, Mar 12, 2025)
47efb1f Ssquare (dantleech, Mar 12, 2025)
633eb7c Skip test (dantleech, Mar 12, 2025)
f9a8bc0 Fix CS (dantleech, Mar 12, 2025)
901c050 Fix nested brackets (dantleech, Mar 12, 2025)
8048d0e Support char classes (dantleech, Mar 12, 2025)
a1b7467 Add doc (dantleech, Mar 12, 2025)
6a3eaea Add explanation (dantleech, Mar 12, 2025)
3b0c666 Add more comments (dantleech, Mar 12, 2025)
69209c2 Add comment and additional test case (dantleech, Mar 12, 2025)
213b5aa Inplementing the file matcher (dantleech, Mar 13, 2025)
2b7bd24 Initial implementation in SourceFilter (dantleech, Mar 13, 2025)
87de370 Fix missing types (dantleech, Mar 13, 2025)
e07a96f Failing test after rebase (dantleech, Mar 15, 2025)
ad68e68 Add missing types (dantleech, Mar 18, 2025)
ad0b105 Use the filematcherpattern (dantleech, Mar 18, 2025)
39baf16 Apply PHPStan fixes (dantleech, Mar 18, 2025)

69 changes: 56 additions & 13 deletions src/TextUI/Configuration/SourceFilter.php
@@ -9,43 +9,86 @@
*/
namespace PHPUnit\TextUI\Configuration;

use PHPUnit\Util\FileMatcherPattern;
use function array_map;
use PHPUnit\Util\FileMatcher;
use PHPUnit\Util\FileMatcherRegex;

/**
* TODO: Does not take into account suffixes and prefixes - and tests don't cover it.
*
* @no-named-arguments Parameter names are not covered by the backward compatibility promise for PHPUnit
*
* @internal This class is not covered by the backward compatibility promise for PHPUnit
*/
final class SourceFilter
{
private static ?self $instance = null;
private Source $source;

/**
* @var list<FileMatcherRegex>
*/
private array $includeDirectoryRegexes;

/**
* @var array<non-empty-string, true>
* @var list<FileMatcherRegex>
*/
private readonly array $map;
private array $excludeDirectoryRegexes;

public static function instance(): self
{
if (self::$instance === null) {
self::$instance = new self(
(new SourceMapper)->map(
Registry::get()->source(),
),
);
$source = Registry::get()->source();
self::$instance = new self($source);

return self::$instance;
}

return self::$instance;
}

/**
* @param array<non-empty-string, true> $map
*/
public function __construct(array $map)
public function __construct(Source $source)
{
$this->map = $map;
$this->source = $source;
$this->includeDirectoryRegexes = array_map(static function (FilterDirectory $directory)
{
return FileMatcher::toRegEx(new FileMatcherPattern($directory->path()));
}, $source->includeDirectories()->asArray());
$this->excludeDirectoryRegexes = array_map(static function (FilterDirectory $directory)
{
return FileMatcher::toRegEx(new FileMatcherPattern($directory->path()));
}, $source->excludeDirectories()->asArray());
}

public function includes(string $path): bool
{
return isset($this->map[$path]);
$included = false;

foreach ($this->source->includeFiles() as $file) {
if ($file->path() === $path) {
$included = true;
}
}

foreach ($this->includeDirectoryRegexes as $directoryRegex) {
if ($directoryRegex->matches($path)) {
[Review comment, Contributor Author] This will need to change to account for include/exclude matching on the basename, either in the regex or otherwise.

$included = true;
}
}

foreach ($this->source->excludeFiles() as $file) {
if ($file->path() === $path) {
$included = false;
}
}

foreach ($this->excludeDirectoryRegexes as $directoryRegex) {
if ($directoryRegex->matches($path)) {
$included = false;
}
}

return $included;
}
}
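
Reader note: the precedence that `includes()` applies above (include files, then include directories, then exclude files, then exclude directories, with exclusions overriding inclusions) can be tried in isolation. The following is a minimal sketch, not the PR's code: `isIncluded()` and the sample paths are hypothetical, and the regex strings are hand-written in the shape that `FileMatcher::mapToRegex()` appears to produce for plain directory paths.

```php
<?php declare(strict_types=1);

// Hypothetical stand-in for SourceFilter::includes(); none of these names
// exist in PHPUnit. The regexes are plain PCRE strings instead of
// FileMatcherRegex objects.
function isIncluded(
    string $path,
    array $includeFiles,
    array $includeRegexes,
    array $excludeFiles,
    array $excludeRegexes
): bool {
    // explicitly listed files are included first
    $included = in_array($path, $includeFiles, true);

    // any matching include directory also includes the path
    foreach ($includeRegexes as $regex) {
        if (preg_match($regex, $path) === 1) {
            $included = true;
        }
    }

    // exclusions run afterwards, so they override inclusions
    if (in_array($path, $excludeFiles, true)) {
        $included = false;
    }

    foreach ($excludeRegexes as $regex) {
        if (preg_match($regex, $path) === 1) {
            $included = false;
        }
    }

    return $included;
}

// Assumed sample configuration: include /app/src, exclude /app/src/Generated.
$include = ['{^/app/src(/|$)}'];
$exclude = ['{^/app/src/Generated(/|$)}'];

var_dump(isIncluded('/app/src/Foo.php', [], $include, [], $exclude));           // bool(true)
var_dump(isIncluded('/app/src/Generated/Bar.php', [], $include, [], $exclude)); // bool(false)
var_dump(isIncluded('/app/tests/FooTest.php', [], $include, [], $exclude));     // bool(false)
```

Because the exclusion checks run last, a path that matches both an include and an exclude rule ends up excluded, which mirrors the loop order in the diff.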
293 changes: 293 additions & 0 deletions src/Util/FileMatcher.php
@@ -0,0 +1,293 @@
<?php declare(strict_types=1);
/*
* This file is part of PHPUnit.
*
* (c) Sebastian Bergmann <[email protected]>
*
* For the full copyright and license information, please view the LICENSE
* file that was distributed with this source code.
*/
namespace PHPUnit\Util;

use function array_key_last;
use function array_pop;
use function count;
use function ctype_alpha;
use function preg_quote;
use function strlen;

/**
* FileMatcher ultimately attempts to emulate the behavior of `php-file-iterator`,
* which *mostly* comes down to emulating PHP's glob function on file paths
* based on POSIX.2:
*
* - https://en.wikipedia.org/wiki/Glob_(programming)
* - https://man7.org/linux/man-pages/man7/glob.7.html
*
* The file matcher compiles the regex in three passes:
*
* - Tokenise interesting chars in the glob grammar.
* - Process the tokens and reorient them to produce regex.
* - Map the processed tokens to regular expression segments.
*
* @no-named-arguments Parameter names are not covered by the backward compatibility promise for PHPUnit
*
* @internal This class is not covered by the backward compatibility promise for PHPUnit
*
* @phpstan-type token array{self::T_*,string}
*/
final readonly class FileMatcher
{
private const string T_BRACKET_OPEN = 'bracket_open';
private const string T_BRACKET_CLOSE = 'bracket_close';
private const string T_BANG = 'bang';
private const string T_HYPHEN = 'hyphen';
private const string T_ASTERIX = 'asterix';
private const string T_SLASH = 'slash';
private const string T_BACKSLASH = 'backslash';
private const string T_CHAR = 'char';
private const string T_GREEDY_GLOBSTAR = 'greedy_globstar';
private const string T_QUERY = 'query';
private const string T_GLOBSTAR = 'globstar';
private const string T_COLON = 'colon';
private const string T_CHAR_CLASS = 'char_class';

/**
* Compile a regex for the given glob.
*/
public static function toRegEx(FileMatcherPattern $pattern): FileMatcherRegex
{
$tokens = self::tokenize($pattern->path);
$tokens = self::processTokens($tokens);

return self::mapToRegex($tokens);
}

/**
* @param list<token> $tokens
*/
private static function mapToRegex(array $tokens): FileMatcherRegex
{
$regex = '';

foreach ($tokens as $token) {
$type = $token[0];
$regex .= match ($type) {
// literal char
self::T_CHAR => preg_quote($token[1]),

// literal directory separator
self::T_SLASH => '/',
self::T_QUERY => '.',
self::T_BANG => '^',

// match any segment up until the next directory separator
self::T_ASTERIX => '[^/]*',
self::T_GREEDY_GLOBSTAR => '.*',
self::T_GLOBSTAR => '/([^/]+/)*',
self::T_BRACKET_OPEN => '[',
self::T_BRACKET_CLOSE => ']',
self::T_HYPHEN => '-',
self::T_COLON => ':',
self::T_BACKSLASH => '\\',
[Review comment, Contributor Author] COLON and BACKSLASH are not tested.

self::T_CHAR_CLASS => '[:' . $token[1] . ':]',
};
}
$regex .= '(/|$)';

return new FileMatcherRegex('{^' . $regex . '}');
}

/**
* @return list<token>
*/
private static function tokenize(string $glob): array
{
$length = strlen($glob);

$tokens = [];

for ($i = 0; $i < $length; $i++) {
$c = $glob[$i];

$tokens[] = match ($c) {
'[' => [self::T_BRACKET_OPEN, $c],
']' => [self::T_BRACKET_CLOSE, $c],
'?' => [self::T_QUERY, $c],
'-' => [self::T_HYPHEN, $c],
'!' => [self::T_BANG, $c],
'*' => [self::T_ASTERIX, $c],
'/' => [self::T_SLASH, $c],
'\\' => [self::T_BACKSLASH, $c],
':' => [self::T_COLON, $c],
default => [self::T_CHAR, $c],
};
}

return $tokens;
}

/**
* @param list<token> $tokens
*
* @return list<token>
*/
private static function processTokens(array $tokens): array
{
$resolved = [];
$escaped = false;
$bracketOpen = false;
$brackets = [];

for ($offset = 0; $offset < count($tokens); $offset++) {
[$type, $char] = $tokens[$offset];
$nextType = $tokens[$offset + 1][0] ?? null;

if ($type === self::T_BACKSLASH && false === $escaped) {
// skip the backslash and set flag to escape next token
$escaped = true;

continue;
}

if ($escaped === true) {
// escaped flag is set, so make this a literal char and unset
// the escaped flag
$resolved[] = [self::T_CHAR, $char];
$escaped = false;

continue;
}

// globstar must be preceded by and succeeded by a directory separator
if (
$type === self::T_SLASH &&
$nextType === self::T_ASTERIX && ($tokens[$offset + 2][0] ?? null) === self::T_ASTERIX && ($tokens[$offset + 3][0] ?? null) === self::T_SLASH
) {
$resolved[] = [self::T_GLOBSTAR, '**'];

// we eat the two `*` and the trailing slash
$offset += 3;

continue;
}

// greedy globstar (trailing?)
// TODO: this should probably only apply at the end of the string according to the webmozart implementation and therefore would be "T_TRAILING_GLOBSTAR"
if (
$type === self::T_SLASH &&
($tokens[$offset + 1][0] ?? null) === self::T_ASTERIX && ($tokens[$offset + 2][0] ?? null) === self::T_ASTERIX
) {
$resolved[] = [self::T_GREEDY_GLOBSTAR, '**'];

// we eat the two `*` in addition to the slash
$offset += 2;

continue;
}

// two consecutive ** which are not surrounded by `/` are invalid and
// we interpret them as literals.
if ($type === self::T_ASTERIX && ($tokens[$offset + 1][0] ?? null) === self::T_ASTERIX) {
$resolved[] = [self::T_CHAR, $char];
$resolved[] = [self::T_CHAR, $char];

continue;
}

// complementation - only parse BANG if it is at the start of a character group
if ($type === self::T_BANG && isset($resolved[array_key_last($resolved)]) && $resolved[array_key_last($resolved)][0] === self::T_BRACKET_OPEN) {
$resolved[] = [self::T_BANG, '!'];

continue;
}

// if this was _not_ a bang preceded by a `[` token then convert it
// to a literal char
if ($type === self::T_BANG) {
$resolved[] = [self::T_CHAR, $char];

continue;
}

// https://man7.org/linux/man-pages/man7/glob.7.html
// > The string enclosed by the brackets cannot be empty; therefore
// > ']' can be allowed between the brackets, provided that it is
// > the first character.
if ($type === self::T_BRACKET_OPEN && $nextType === self::T_BRACKET_CLOSE) {
$bracketOpen = true;
$resolved[] = [self::T_BRACKET_OPEN, '['];
$brackets[] = array_key_last($resolved);
$resolved[] = [self::T_CHAR, ']'];
$offset++;

continue;
}

// if we're already in a bracket and the next two chars are [: then
// start parsing a character class...
if ($bracketOpen && $type === self::T_BRACKET_OPEN && $nextType === self::T_COLON) {
// this looks like a named [:character:] class
$class = '';
$offset += 2;

// parse the character class name
while (ctype_alpha($tokens[$offset][1])) {
$class .= $tokens[$offset++][1];
}

// if followed by a `:` then it's a character class
if ($tokens[$offset][0] === self::T_COLON) {
$offset++;
$resolved[] = [self::T_CHAR_CLASS, $class];

continue;
}

// otherwise it's a harmless literal
$resolved[] = [self::T_CHAR, ':' . $class];
}

// if bracket is already open and we have another open bracket
// interpret it as a literal
if ($bracketOpen === true && $type === self::T_BRACKET_OPEN) {
$resolved[] = [self::T_CHAR, $char];

continue;
}

// if we are NOT in an open bracket and we have an open bracket
// then push the bracket onto the stack and enter bracket-mode.
if ($bracketOpen === false && $type === self::T_BRACKET_OPEN) {
$bracketOpen = true;
$resolved[] = [$type, $char];
$brackets[] = array_key_last($resolved);

continue;
}

// if we are in a bracket and we reach a closing bracket then
// pop the last open bracket off the stack and continue
//
// TODO: $bracketOpen === true below is not tested
if ($bracketOpen === true && $type === self::T_BRACKET_CLOSE) {
// TODO: this is not tested
$bracketOpen = false;

array_pop($brackets);
$resolved[] = [$type, $char];

continue;
}

$resolved[] = [$type, $char];
[Review comment, Contributor Author] This handles any "left over tokens", including T_ASTERIX. Maybe all tokens should be handled explicitly?

}

// for each unterminated bracket, replace it with a literal char
foreach ($brackets as $unterminatedBracket) {
$resolved[$unterminatedBracket] = [self::T_CHAR, '['];
}

return $resolved;
}
}
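
Reader note: the token-to-regex mapping in `mapToRegex()` can be illustrated with a much smaller translator. The sketch below is a simplification under stated assumptions, not the PR's implementation: `sketchGlobToRegex()` is a hypothetical name, and it handles only literal characters, `?`, `*` and a globstar surrounded by slashes, skipping bracket groups, character classes, escaping and the trailing "greedy" globstar. It reuses the regex fragments shown in the diff (`[^/]*`, `.`, `/([^/]+/)*`, the `(/|$)` suffix and the `{^...}` delimiters).

```php
<?php declare(strict_types=1);

// Hypothetical, reduced glob-to-regex translation. It only mirrors the
// T_CHAR, T_QUERY, T_ASTERIX, T_SLASH and T_GLOBSTAR mappings from the diff.
function sketchGlobToRegex(string $glob): string
{
    $regex  = '';
    $length = strlen($glob);

    for ($i = 0; $i < $length; $i++) {
        // a globstar between two slashes matches zero or more whole segments
        if (substr($glob, $i, 4) === '/**/') {
            $regex .= '/([^/]+/)*';
            $i += 3; // skip the two asterisks and the trailing slash

            continue;
        }

        $regex .= match ($glob[$i]) {
            '*'     => '[^/]*', // anything inside a single path segment
            '?'     => '.',     // any single character, as in the diff
            default => preg_quote($glob[$i]),
        };
    }

    // the match has to end on a segment boundary or at the end of the path
    return '{^' . $regex . '(/|$)}';
}

$single = sketchGlobToRegex('/src/*/Foo.php');
$deep   = sketchGlobToRegex('/src/**/Foo.php');

echo $single, "\n"; // {^/src/[^/]*/Foo\.php(/|$)}
echo $deep, "\n";   // {^/src/([^/]+/)*Foo\.php(/|$)}

var_dump(preg_match($single, '/src/Util/Foo.php'));     // int(1)
var_dump(preg_match($single, '/src/Util/Sub/Foo.php')); // int(0)
var_dump(preg_match($deep, '/src/Foo.php'));            // int(1)
var_dump(preg_match($deep, '/src/a/b/Foo.php'));        // int(1)
```

The single `*` stops at the next directory separator, while the globstar form also matches zero intermediate directories, which is why `/src/Foo.php` matches the second pattern.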