Document the properties of a high quality OSV record (#2193)

Reference from existing documentation on data.

Part of #2186

---------

Co-authored-by: Jon <[email protected]>
Co-authored-by: Chris Bloom <[email protected]>
google · Jul 15, 2024 · 8c2b48a
1 parent 3ecaec6
Showing 2 changed files with 111 additions and 0 deletions.
diff --git a/docs/data.md b/docs/data.md
@@ -92,6 +92,10 @@ Between the data served in OSV and the data converted to OSV the following ecosy
 -   RubyGems
 -   SwiftURL
 
+## Data Quality
+
+The quality of the data in OSV.dev [is very important to us](https://google.github.io/osv.dev/faq/#ive-found-something-wrong-with-the-data). The minimum quality bar for OSV records acceptable for import is documented in [Properties of a High Quality OSV Record](data_quality.md).
+
 ## Data dumps
 
 For convenience, these sources are aggregated and [continuously](https://github.com/google/osv.dev/blob/master/deployment/clouddeploy/gke-workers/base/exporter.yaml) 

diff --git a/docs/data_quality.md b/docs/data_quality.md
@@ -0,0 +1,107 @@
+# Properties of a High Quality OSV Record
+
+## Version
+
+1.0.0 (SEMVER)
+
+## Purpose
+
+Describe the “good enough” OSV record that OSV.dev will import.
+
+### Out of scope
+
+This document does not address record bit rot after a successful initial import. Continuous revalidation and the treatment of already-imported records will be covered separately in the companion document, *Managing the Perishability of OSV Records*.
+
+Deferred to a future iteration: validating the existence of vulnerable functions in the `ecosystem_specific` field, if supplied.
+
+## Audience
+
+1. OSV record producers
+2. Downstream OSV.dev record consumers
+
+## Rationale
+
+OSV.dev seeks to be a comprehensive, accurate and timely database of known vulnerabilities that is highly automation-friendly. To meet this accuracy goal, a quality bar must be both defined and sustainably enforced.
+
+## Properties of a High Quality OSV Record
+
+### Valid
+
+As a prerequisite, it is assumed that a record passes [JSON Schema validation](#appendix-a-osv-schema-validation) for the version of the [OSV Schema](https://ossf.github.io/osv-schema/) it declares in the `schema_version` field, or for 1.0.0 if that field is absent. It is also assumed that the vulnerability described in the OSV record is valid and affects the software described.
+
+### Precise
+
+A high quality OSV record allows a consumer to answer the following questions in an **automated** way, at scale:
+
+* "Does this vulnerability, as described, impact me?"
+  * "What version do I need to upgrade to, or what patches do I need to apply, for it not to impact me?"
+  * "Should I replace or remove this (potentially orphaned) package with known unfixed vulnerabilities?"
+
+The definition of "impact" will vary depending on how fine-grained the information available is (i.e. package-level or symbol-level for software library packages). Package-level precision is the minimum standard.
+
+* for version and commit ranges
+  * `affected[]`.`ranges[]`.`events[]`.`introduced` is defined
+  * prefer `affected[]`.`ranges[]`.`events[]`.`fixed` over `affected[]`.`ranges[]`.`events[]`.`last_affected`
+    * this minimizes false negatives
+  * distinct ranges for `introduced..fixed` and/or `introduced..last_affected` *(i.e. introduced and fixed versions or commits can't be the same)*
+  * values in `introduced` are before/less than `fixed`/`last_affected` according to the canonical package registry or project version control
+  * for version (`ECOSYSTEM` and `SEMVER`) ranges
+    * the versions exist in the specific package ecosystem
+  * for commit (`GIT`) ranges
+    * the commits exist in the specified `repo` *(i.e. they are not from another GitHub fork)*
+* the `package.ecosystem`, and a unique `identifier` prefix for it, are defined in the OSV Schema
+* the `package.name` exists within the defined `package.ecosystem`, and is canonically encoded to be unambiguous *(i.e. normalized)*
+* Package URLs in the `package.purl` field conform to the [specification](https://github.com/package-url/purl-spec)
+* `references[].url` values return a 2xx or 3xx HTTP response at the time of publication
+
+### Identifiable
+
+* Where relevant, the ID of the equivalent CVE record is listed in `aliases`
+* Where an OSV record consolidates multiple vulnerabilities in another ecosystem (or universe), multiple `related` identifiers are present
+
+## Examples
+
+* [GO-2024-2687](https://api.osv.dev/v1/vulns/GO-2024-2687)
+  * Has `introduced` and `fixed` versions
+  * Has an alias to a CVE record ID
+  * Has a purl
+* [OSV-2024-98](https://api.osv.dev/v1/vulns/OSV-2024-98)
+  * Has `introduced` and `fixed` commits
+    * commits exist in repo
+* [DSA-5678-1](https://api.osv.dev/v1/vulns/DSA-5678-1)
+  * Has `introduced` and `fixed` versions
+  * Has multiple `related` CVE record IDs
+
+## Appendix A: OSV Schema validation
+
+(As at version 1.6.3, generated by Gemini from the [OSV JSON schema](https://github.com/ossf/osv-schema/blob/v1.6.3/validation/schema.json))
+
+**Required top-level fields:**
+
+* **id:** A unique string identifier for the vulnerability.
+* **modified:** A timestamp (in RFC3339 format, in UTC, ending in "Z") indicating when the vulnerability information was last updated.
+
+**Optional, but validated when present:**
+
+* **schema\_version:** A string specifying the version of the schema being used.
+* **published/withdrawn:** Timestamps (in RFC3339 format, in UTC, ending in "Z") for when the vulnerability was published or withdrawn.
+* **aliases/related:** Arrays of strings for alternate identifiers or related vulnerabilities.
+* **summary/details:** String descriptions of the vulnerability.
+* **severity:** An array of objects detailing the severity using different scoring systems (e.g., CVSS v2, v3, or v4), if available.
+* **affected:** An array of objects describing which packages are affected, including details like:
+  * **package:** The ecosystem (e.g., npm, PyPI), name, and Package URL (PURL) of the affected package.
+  * **severity:** Severity for the specific package (if different from the overall severity).
+  * **ranges:** Information on the affected version ranges, commit ranges, or ecosystem-specific identifiers.
+  * **versions:** A list of specific affected versions.
+  * **ecosystem\_specific/database\_specific:** Additional data specific to the package ecosystem or the vulnerability database.
+* **references:** An array of objects providing URLs to external resources about the vulnerability, categorized by type (e.g., advisory, article, discussion).
+* **credits:** An array of objects giving credit to individuals or organizations involved in discovering, reporting, or fixing the vulnerability.
+* **database\_specific:** A flexible object for any extra information specific to the database using this schema.
+
+**Additional Validation Rules:**
+
+* **timestamp:** A custom definition that ensures timestamps adhere to the RFC3339 date-time format (e.g., "2023-11-15T12:34:56Z").
+* **additionalProperties: false:** This prevents any extra properties from being added to the JSON object beyond those defined in the schema.
+* **Specific Requirements in `affected` Array:**
+  * There are conditional validations based on the `type` of range, ensuring the correct properties are present (e.g., `repo` is required when `type` is `GIT`).
+  * A logical check ensures that if `last_affected` is specified in `events`, then `fixed` cannot be present in the same `events` array.
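
Several of the range rules documented in this change lend themselves to a mechanical check: an `introduced` event is defined, `introduced` and `fixed` are not the same value, a `GIT` range carries a `repo`, and `fixed` does not share an `events` array with `last_affected`. The following is a minimal sketch of such a check and is not part of the commit. The function name `check_ranges` and the sample records `GOOD` and `BAD` are hypothetical, introduced only for illustration. The sketch deliberately skips version ordering and existence checks, which would need ecosystem-specific tooling.

```python
from typing import Any


def check_ranges(record: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in a record's affected[].ranges[]."""
    problems: list[str] = []
    for a_idx, affected in enumerate(record.get("affected", [])):
        for r_idx, rng in enumerate(affected.get("ranges", [])):
            where = f"affected[{a_idx}].ranges[{r_idx}]"

            # Collect the values seen for each event type in this range.
            seen: dict[str, list[str]] = {}
            for event in rng.get("events", []):
                for key, value in event.items():
                    seen.setdefault(key, []).append(value)

            # `repo` is required when the range type is GIT.
            if rng.get("type") == "GIT" and not rng.get("repo"):
                problems.append(f"{where}: GIT range has no repo")
            # Every range needs at least one `introduced` event.
            if "introduced" not in seen:
                problems.append(f"{where}: no introduced event")
            # `fixed` and `last_affected` must not share an events array.
            if "fixed" in seen and "last_affected" in seen:
                problems.append(f"{where}: both fixed and last_affected present")
            # An empty range (introduced == fixed) matches nothing.
            overlap = set(seen.get("introduced", [])) & set(seen.get("fixed", []))
            for value in sorted(overlap):
                problems.append(f"{where}: introduced and fixed are both {value}")
    return problems


# Hypothetical sample data, not taken from any real OSV record.
GOOD = {
    "affected": [
        {"ranges": [{"type": "SEMVER",
                     "events": [{"introduced": "0"}, {"fixed": "1.2.3"}]}]}
    ]
}
BAD = {
    "affected": [
        {"ranges": [{"type": "GIT",
                     "events": [{"introduced": "abc"}, {"fixed": "abc"},
                                {"last_affected": "def"}]}]}
    ]
}

if __name__ == "__main__":
    for name, record in (("GOOD", GOOD), ("BAD", BAD)):
        print(name, check_ranges(record))
```

Returning a list of problems rather than raising on the first one lets an importer report everything wrong with a record in a single pass.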