From @rick_C137 - Discord Discussion #42

madhuakula · 2023-04-04T20:19:03Z

madhuakula
Apr 4, 2023

rick_C137

Hi! vet looks like a really interesting project.
While going through the documentation, I stumbled upon the following FAQ:
Can I use this tool without an API Key for Insight Service?
--> Probably no

Although the architecture image shows two data sources

Insights
Engines

The Engines use OpenSSF tools and custom plugins. Does it mean w/o an API key, I would be able to use Engines but not Insights ? With only Enignes, the tool could yet be powerful.

@abhisek
Good question @rick_C137 The API spec for Insights API for which API key is required is open source. We host a server in the backend that:

Fetches package metadata from OSV, Deps.dev (Bigquery dataset export) etc.
Caches locally in a MySQL to avoid expenses queries to external services
Provides an uniform view (as per OpenAPI spec) of package metadata from different sources

We also plan to work on building tools that enriches packages with more metadata such as provenance, malicious behaviour.
Theoritically you can avoid the need for API key and fetch directly from public sources, but it will make the scan significantly slower. If you want to cache, you need to maintain state across your CI runs.

These complexities are currently handled in our backend

rick_C137
Interesting! Makes sense.
You mentioned that the backend services are also opensource, I suppose they do the major heavylifting by putting together data from various data sources and combining them into usable form. Can you share the repository for this backend ? Would it be possible to self host this ?
As per my understanding, the current project has the below (see image) architecture where "APIs" stand for the backend you mentioned.

I also observed that you are using osv's scanner to look for lockfiles, is there a plan to add to the toolset for finding dependencies (ie not only based on lockfiles but on actual presence. Imagine a repository without proper lockfiles or too broad package version range compatibility declarations such as those in setup.cfg ) ?

@abhisek

The backend service implementation is not open source. The API spec implemented by the service is available at https://github.com/safedep/vet/blob/main/api/insights-v1.yml

vet will work with any service implementing this API and not just our backend.

I don't think our backend is easy to self-host which is one of the reason why we have not really open sourced it yet. We import data from:

https://docs.deps.dev/bigquery/v1/
https://github.com/google/osv.dev#data-dumps

We import this to our HBase storage, which acts as like a data lake (not really in true sense) because directly querying BigQuery is super expensive.

We also cache recently queried package data into a MySQL (don't ask why not redis 🙂 because querying HBase is expensive as well.

I would love to learn more about your thoughts on detecting dependencies without lockfiles.

I know lockfiles are easy start because they are easy to parse but they are not entirely reliable. Also some of the lockfile formats like pom.xml do not have transitive dependencies.

rick_C137
There are some tools I've worked with that detect dependencies without lockfiles. I think it'd be nice to start with syft and scancode's codebase. Essentially they try to build a signature out of the filename and use that to figure out the dependency's manifest/purl.

Another problem with lockfiles is that they are not always guaranteed to be present or updated. I've worked with projects with outdated/non-existent lockfiles and a relaxed dependency requirement file that cannot pin-point one dependency but provides a range of compatible ones. For those projects, it becomes much harder to detect installed packages.

Problem escalates when Docker comes into picture, as packages can now be installed system(container)-wide too.
Syft does tackle few of these problems to some extent. (They've yet to solve for transitive dependencies)
While understanding vet's codebase, I realised that if the documentation would have a basic walkthrough of how vet works, it'd be a very nice addition. Based on my findings (and little help from ChatGPT), I created the following piece. Perhaps it could find a place in the docs after required tweaks?

Safedeps/vet is a tool that helps developers check for potential security issues and other problems in OSS projects. It works by analyzing the project's dependencies.

To start, safedeps/vet looks for files called "lock files" or "manifests" in the project directory. These files contain information about the project's dependencies, such as which version of each dependency is being used.

To find lock files or manifests, safedeps/vet walks through the project directory and examines each file. It attempts to treat each file as a lock file and triggers OSV Scanner’s “FindParser" method on it.

If osv's FindParser can successfully parse a file, safedeps/vet considers it to be a lock file and adds the extracted information to an internal list of dependencies for the project.

Next, safedeps/vet adds important information to the the project's dependencies using a process called "Enriching." This process takes each dependency and adds additional information to it, such as whether it has known vulnerabilities or other metrics from various providers.

One way that safedeps/vet enriches dependencies is by using an "Insight Based Enricher." This can communicate with a backend service that provides detailed information about each dependency.

The backend service used by safedeps/vet is made up of multiple sources, including OSSF's OSV, scorecard, and deps.dev. By combining information from these sources, safedeps/vet can provide developers with a comprehensive understanding of their project's dependencies and any potential issues that need to be addressed.

@abhisek
@madhuakula ^^
@rick_C137 Makes sense. I will have a look at Syft. One of the reason why we wanted to take package manifest / lockfile approach is so that we can make a dependency tree (eventually).

Without dependency tree, remediation suggestion is hard. Consider this case when there is a vulnerability / policy violation identified for a library which is N level deep in the dependency tree.

Almost all tools today (including Syft I believe but I will verify) will report the issue and expect users (developers) to figure out how to update the library.

In reality, there is no easy way to update library introduced transitively. Mostly we will have to manually build the dependency graph and identify which direct dependency introduced it and will try to update that.

One of our future goal is to make remediation easier by analysing the dependency graph and suggesting the direct dependencies that should be upgraded to fix most of the issues.

@madhuakula
Thanks @rick_C137 for the feedback, I agree with you on docs part. If you want please feel free to make a PR, I will try to add some additions as well

rick_C137

Almost all tools today (including Syft I believe but I will verify) will report the issue and expect users (developers) to figure out how to update the library.

They are trying to figure it out. There are pros and cons of using lockfiles vs file fingerprinting. There is limited support but they've added fields like metadata that could help in finding out the parent of a package.
Ref: anchore/syft#572

Other tools, like ScanCode, are also working towards it but no single tool - in my knowledge - covers it for all the ecosystems exhaustively or satisfactorily.
The CycloneDX spec does allow room for dependency graph but the limitation lies during the detection itself.

there is no easy way to update library introduced transitively

You're right. Different people are trying to solve it in different manners. For eg, dependabot follows the following approcah:

For npm, Dependabot will raise a pull request to update an explicitly defined dependency to a secure version, even if it means updating the parent dependency or dependencies, or even removing a sub-dependency that is no longer needed by the parent. For other ecosystems, Dependabot is unable to update an indirect or transitive dependency if it would also require an update to the parent dependency.

build the dependency graph and identify which direct dependency introduced it and will try to update that.

This would be the ideal way for remediation. Although, from a another perspective, if there is a security update available for a dependency N level deep but the direct dependency of the project has not updated their dependencies even after updating to the latest version, then the direct dependency must be flagged.

Another case to ponder is where 2 direct dependencies require an outdated transitive package. To update the transitive package, which direct dependency should be chosen ? What if there is no update available for one of the direct dependencies ? In that case, updating the transitive will break one direct dependency. Also, one project cannot have two versions of same transitive dependencies.

goal is to make remediation easier by analysing the dependency graph and suggesting the direct dependencies that should be upgraded to fix most of the issues

The current path, in my understanding, includes fetching direct deps then finding their transitive deps from deps.dev and the corresponding vulnerabilities from osv. This is an interesting model, although delegating the task of finding transitives to deps.dev is a partially effective choice but comes with its own set of limitations.

rick_C137
I'd love to send a PR tomorrow

@abhisek
Thanks for the suggestion. This is an excellent thread. I think we should move this to Github issue so that this doesn't get lost in Discord

The current path, in my understanding, includes fetching direct deps then finding their transitive deps from deps.dev and the corresponding vulnerabilities from osv. This is an interesting model, although delegating the task of finding transitives to deps.dev is a partially effective choice but comes with its own set of limitations.

Yes. You are right. This is the model taken in vet. This model leverages pre-computed dependency data available from deps.dev at the cost of storing, maintaining and querying large data set.

I have had good results in building dependency graph in past by using Maven / Gradle native plugins (my requirement was Java ecosystem only). But that solution makes it strongly coupled with package managers which has its own nuances and corner cases that I want to avoid as much as possible.

Syft like approach may be something to look at. But in any case, effective analysis for mitigation would need a graph representation of dependencies

Hritik14 · 2023-04-17T05:21:25Z

Hritik14
Apr 17, 2023

Hi! Continuing from the discord thread.

abhisek
—
@rick_C137 Do you want to push this up? Now with Google making deps.dev API official, it may be possible to have an implementation of Insights API (spec) completely using Deps.dev and OSV API. Thus decoupling the need for API Key

This sounds really interesting. Although, I was thinking, instead of an Insights API, the concept could be represented as class interfaces than can be implemented locally in a file.
For eg: let's say I want OSV, I implement the required methods that fetch data from OSV and the framework does the rest.
This would ensure local runs with inspect-able and trustable network calls.

1 reply

abhisek Apr 18, 2023
Maintainer

@Hritik14 We can achieve similar outcome with having Insights API defined as an OpenAPI spec and generate interfaces from the spec. Look at https://github.com/safedep/vet/blob/main/gen/insightapi/insights.client.go#L90

The benefit of OpenAPI spec is that it can be implemented by services that are not within the boundary of vet. For example, we have an implementation of Insights API that uses Deps.dev and OSV imported data set with local caching. We may add more metadata in future.

To start with, we are hosting a separate cluster of Insights API without API key. We will introduce a configuration flag for vet to enable community mode, where vet will use this cluster without the need for API key. This will functionally eliminate the mandatory need to register and get API key

Have a look at
#76

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

From @rick_C137 - Discord Discussion #42

{{title}}

Replies: 1 comment 1 reply

{{title}}

{{title}}

Select a reply

From @rick_C137 - Discord Discussion #42

madhuakula Apr 4, 2023

Replies: 1 comment · 1 reply

Hritik14 Apr 17, 2023

abhisek Apr 18, 2023 Maintainer

madhuakula
Apr 4, 2023

Replies: 1 comment 1 reply

Hritik14
Apr 17, 2023

abhisek Apr 18, 2023
Maintainer