At its core, Nosey Parker is a regular expression-based content matcher. It uses a set of rules defined in YAML syntax to determine what matching content to report.
The default rules that Nosey Parker uses get embedded within the compiled noseyparker
binary.
The source for these rules appears in the <data/default/rules> directory.
Nosey Parker's rules are written in YAML syntax.
A rules file contains a top-level YAML object with a rules
field that is a list of rules.
Each rule is a YAML object, comprising a name, a regular expression, a list of references, a list of example inputs, and an optional list of non-example inputs. It is easier to understand this from looking at sample rules.
The GitHub Personal Access Token
rule looks like this:
- name: GitHub Personal Access Token
id: np.github.1
pattern: '\b(ghp_[a-zA-Z0-9]{36})\b'
references:
- https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/about-authentication-to-github
- https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
- https://github.blog/2021-04-05-behind-githubs-new-authentication-token-formats/
examples:
- 'GITHUB_KEY=ghp_XIxB7KMNdAr3zqWtQqhE94qglHqOzn1D1stg'
- "let g:gh_token='ghp_4U3LSowpDx8XvYE7A8GH56oxU5aWnY2mzIbV'"
- |
## git devaloper settings
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
The name
field is intended primarily for human consumption, and is used heavily in various reports.
This can be any string.
The id
field is a globally-unique identifier for the rule.
It must be a string of alphanumeric segments, optionally separated by hyphens or periods, and at most 20 characters long.
The pattern
field in the rule the most essential part.
This is the regular expression pattern that controls what input gets matched.
Each pattern must have one or more capture groups that indicate where the secret or sensitive content is within the entire match.
In the case of this GitHub Personal Access Token
rule, there is one capture group that is the entire match content.
Not all rules with be this clean; many token formats are more difficult to match precisely, and hence require matching on surrounding context instead of just the secret.
The references
field is a list of freeform strings.
In practice, these are URLs that describe the format of the thing being matched.
The examples
field is a list of strings that are asserted to be matched by the rule.
This field is used for automated testing via noseyparker rules check
.
The negative_examples
field, if provided, is a list of strings that are asserted not to be matched by the rule.
This field is also used for automated testing via noseyparker rules check
.
Nosey Parker uses a combination of regular expression engines in its implementation.
The pattern syntax that is accepted is (approximately) the intersection of Hyperscan and Rust regex
crate syntax.
Specifically, the following constructs are supported:
-
Literal characters
-
Wildcard:
.
-
Alternation:
<PATTERN>|<PATTERN>
-
Repetition:
- Zero or one:
?
- Zero or more:
*
- One or more:
+
i
andj
inclusive:{<i>,<j>}
- At least
i
:{<i>,}
- Example
i
:{<i>}
- Zero or one:
-
Character classes, ranges, and negated character classes, e.g.,
[abc]
,[a-zA-Z]
,[^abc]
-
Specially named character classes:
- Whitespace and non-whitespace:
\s
and\S
- Word and non-word:
\w
and\W
- Digit and non-digit:
\d
and\D
- Whitespace and non-whitespace:
-
Grouping:
(<PATTERN>)
-
Non-capturing grouping:
(?:<PATTERN>)
-
Word boundary and non-word-boundary zero-width assertions:
\b
,\B
-
Start-of-input and end-of-input anchors:
^
,$
-
Inline comments:
(?# <COMMENT>)
-
Inline flags:
- The case-insensitive flag
(?i)
- The "extended syntax" flag
(?x)
(Helpful for making complicated patterns more legible!) - The "dotall" flag
(?s)
- The "multiline" flag
(?m)
- The case-insensitive flag
The following non-inclusive list of constructs are not supported:
- Backreferences, e.g.,
\1
Note: Nosey Parker does regular expression matching over bytestrings, not over UTF-8 or otherwise encoded input.
For more reference:
- The Hyperscan syntax reference
- The Rust
regex
crate syntax reference and caveats of matching bytestrings
If a rule identifies secret or sensitive content used by a well-known service and does so with high precision (i.e., few false positives), it is a candidate to be added to Nosey Parker's default rules. Please open a pull request!
Some types of secrets have well-specified formats with distinctive prefixes or suffixes that are unlikely to appear accidentally.
For example, AWS API keys start with one of a few 4-character prefixes followed by 16 hex digits. A pattern that matches these needs no additional context.
Other types of secrets have a well-specified format but lack distinctiveness, such as Sauce tokens, which appear to be simply version 4 UUIDs. A pattern to match these requires looking at surrounding context, which is more likely to produce false positives.
Each rule pattern must include at least 1 capture group that isolates the content of the secret from the surrounding context.
Multiple captures groups are permitted; this can be useful for some types of secrets that involve multiple parts, such as a username and password.
For an example of this, see the netrc Credentials
rule:
- name: netrc Credentials
id: np.netrc.1
pattern: |
(?x)
(?: (machine \s+ [^\s]+) | default)
\s+
login \s+ ([^\s]+)
\s+
password \s+ ([^\s]+)
references:
- https://everything.curl.dev/usingcurl/netrc
- https://devcenter.heroku.com/articles/authentication#api-token-storage
examples:
- 'machine api.github.com login ziggy^stardust password 012345abcdef'
- |
```
machine raw.github.com
login visionmedia
password pass123
```
- |
"""
machine api.wandb.ai
login user
password 7cc938e45e63e9014f88f811be240ba0395c02dd
"""
This rule uses 3 capture groups to extract the optional machine name, the login name, and the password.
Each rule should include at least 1 reference, typically a URL. This helps maintainers and operators better understand what a match might be.
Rules in Nosey Parker are selected to produce few false positives.
A rule's pattern should be precise as possible while minimizing its size.
It's always possible to expand a pattern to eliminate false positives, but doing so is usually a bad tradeoff in terms of comprehensibility.
For example, the Credentials in ODBC Connection String
and LinkedIn Secret Key
rules are at the borderline of complexity we prefer to see.
Patterns comprehensibility decreases as patterns get longer. A few tricks can ameliorate this:
- Use YAML multiline scalars to avoid having to write escapes.
- Use the "extended syntax" regular expression flag (
(?x)
) to let you split the pattern over multiple lines. (Note that you will need to explicitly escape whitespace in this mode if you want it to match.) - Use inline regular expression comments (
(?# COMMENT )
) judiciously.
The pattern in the JSON Web Token (base64url-encoded)
rule demonstrates all these tricks:
# `header . payload . signature`, all base64-encoded
# Unencoded, the header and payload are JSON objects, usually starting with
# `{"`, which gets base64-encoded starting with `ey`.
pattern: |
(?x)
\b
(
ey[a-zA-Z0-9_-]+ (?# header )
\.
ey[a-zA-Z0-9_-]+ (?# payload )
\.
[a-zA-Z0-9_-]+ (?# signature )
)
(?:[^a-zA-Z0-9_-]|$) (?# this instead of a \b anchor because that doesn't play nicely with `-` )
Each rule should include at least 1 example, preferably something that looks real that was found on the public internet.
The examples are used in automated testing via the noseyparker rules check
command.
Please invalidate any possible credentials in examples that you include by mangling the secret content in a structure-preserving way. We don't want to facilitate drive-by hacking by advertising real leaked secrets here.
The noseyparker rules check PATH
command runs a number of checks over the rules found at PATH
and ensures that the examples in the rule match (or not) as expected.
Nosey Parker's implementation builds on top of the Hyperscan library to efficiently match against all its rules simultaneously. This is much faster than the naive approach of trying to match each regex rule in a loop on each input. Adding more well-written rules to Nosey Parker should not significantly affect performance.