From @rick_C137 - Discord Discussion #42
-
Hi! Continuing from the Discord thread.
This sounds really interesting. Although, I was thinking, instead of an Insights API, the concept could be represented as class interfaces that can be implemented locally in a file.
-
rick_C137
Hi! vet looks like a really interesting project.
While going through the documentation, I stumbled upon the following FAQ:
Can I use this tool without an API Key for Insight Service?
--> Probably no
However, the architecture image shows two data sources.
The Engines use OpenSSF tools and custom plugins. Does that mean without an API key I would be able to use Engines but not Insights? With only Engines, the tool could still be quite powerful.
@abhisek
Good question @rick_C137. The API spec for the Insights API, for which an API key is required, is open source. We host a server in the backend that aggregates data from public sources (such as deps.dev and OSV) and serves it through this API.
We also plan to work on building tools that enrich packages with more metadata, such as provenance and malicious behaviour.
Theoretically you can avoid the need for an API key and fetch directly from public sources, but it will make the scan significantly slower. If you want to cache, you need to maintain state across your CI runs.
These complexities are currently handled in our backend.
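For illustration, this is roughly what the direct route looks like against one public source, OSV's query API (a minimal Go sketch; the package name and version are arbitrary, and error handling and caching are omitted):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Query OSV's public API for known vulnerabilities of a single package
// version. The request and response shapes are trimmed to the fields
// used here; see https://osv.dev for the full API documentation.
func main() {
	reqBody, _ := json.Marshal(map[string]any{
		"package": map[string]string{
			"name":      "lodash",
			"ecosystem": "npm",
		},
		"version": "4.17.20",
	})

	resp, err := http.Post("https://api.osv.dev/v1/query",
		"application/json", bytes.NewReader(reqBody))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var result struct {
		Vulns []struct {
			ID      string `json:"id"`
			Summary string `json:"summary"`
		} `json:"vulns"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		panic(err)
	}
	for _, v := range result.Vulns {
		fmt.Println(v.ID, "-", v.Summary)
	}
}
```

Multiply one such HTTP round trip by every package in every scan, and it is easy to see where the slowdown comes from without a cache.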
rick_C137
Interesting! Makes sense.
You mentioned that the backend services are also open source. I suppose they do the major heavy lifting by putting together data from various data sources and combining it into a usable form. Can you share the repository for this backend? Would it be possible to self-host it?
As per my understanding, the current project has the architecture below (see image), where "APIs" stands for the backend you mentioned.
I also observed that you are using OSV's scanner to look for lockfiles. Is there a plan to extend the toolset for finding dependencies, i.e. based not only on lockfiles but on actual presence? Imagine a repository without proper lockfiles, or with overly broad version-range compatibility declarations such as those in setup.cfg.
@abhisek
The backend service implementation is not open source. The API spec implemented by the service is available at https://github.com/safedep/vet/blob/main/api/insights-v1.yml
vet will work with any service implementing this API and not just our backend.
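As a toy illustration of that decoupling, any HTTP service speaking the spec would do. The route and response shape below are entirely made up; the real contract lives in insights-v1.yml:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// Hypothetical stub of an Insights-style service. The actual paths,
// parameters and schemas are defined in insights-v1.yml; everything
// below is illustrative only.
func main() {
	http.HandleFunc("/insights", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]any{
			"package":         r.URL.Query().Get("package"), // hypothetical parameter
			"vulnerabilities": []string{},                    // empty placeholder data
		})
	})
	http.ListenAndServe(":8080", nil)
}
```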
I don't think our backend is easy to self-host, which is one of the reasons why we have not really open sourced it yet. We import data from:
https://docs.deps.dev/bigquery/v1/
https://github.com/google/osv.dev#data-dumps
We import this into our HBase storage, which acts like a data lake (not really in the true sense), because directly querying BigQuery is super expensive.
We also cache recently queried package data in MySQL (don't ask why not Redis 🙂) because querying HBase is expensive as well.
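In other words, a classic read-through cache in front of an expensive store. A minimal sketch, with illustrative in-memory stores standing in for MySQL and HBase:

```go
package main

import "fmt"

// Store is any backing data source: the fast cache or the expensive lake.
// These interfaces are illustrative, not vet's actual backend code.
type Store interface {
	Get(key string) (string, bool)
	Put(key, value string)
}

type memStore map[string]string

func (m memStore) Get(k string) (string, bool) { v, ok := m[k]; return v, ok }
func (m memStore) Put(k, v string)             { m[k] = v }

// lookup implements a read-through cache: serve from the fast cache when
// possible, otherwise query the expensive store and populate the cache.
func lookup(cache, lake Store, key string) (string, bool) {
	if v, ok := cache.Get(key); ok {
		return v, true
	}
	v, ok := lake.Get(key)
	if ok {
		cache.Put(key, v)
	}
	return v, ok
}

func main() {
	cache := memStore{}
	lake := memStore{"pkg:npm/lodash@4.17.20": "package metadata..."}

	v, _ := lookup(cache, lake, "pkg:npm/lodash@4.17.20") // miss: hits the lake
	fmt.Println(v)
	v, _ = lookup(cache, lake, "pkg:npm/lodash@4.17.20") // hit: served from cache
	fmt.Println(v)
}
```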
I would love to learn more about your thoughts on detecting dependencies without lockfiles.
I know lockfiles are an easy start because they are easy to parse, but they are not entirely reliable. Also, some manifest formats like pom.xml do not list transitive dependencies.
rick_C137
There are some tools I've worked with that detect dependencies without lockfiles. I think it'd be nice to start with Syft's and ScanCode's codebases. Essentially, they try to build a signature out of the filename and use that to figure out the dependency's manifest/purl (see the sketch below).
Another problem with lockfiles is that they are not always guaranteed to be present or up to date. I've worked with projects that had outdated or non-existent lockfiles and a relaxed dependency requirement file that cannot pinpoint one dependency but only provides a range of compatible ones. For those projects, it becomes much harder to detect installed packages.
The problem escalates when Docker comes into the picture, as packages can then also be installed system-wide (container-wide).
Syft does tackle a few of these problems to some extent (they've yet to solve for transitive dependencies).
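Stripped down to its essence, the filename-signature idea looks something like this (a Go sketch with a deliberately tiny signature table; real tools like Syft match on much more than file names):

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

// A trimmed-down illustration of filename-based detection: map well-known
// manifest/lockfile names to their ecosystem.
var manifestEcosystem = map[string]string{
	"package.json":      "npm",
	"package-lock.json": "npm",
	"requirements.txt":  "pypi",
	"setup.cfg":         "pypi",
	"pom.xml":           "maven",
	"go.mod":            "go",
}

func main() {
	// Walk the current directory and report every file whose name
	// matches a known manifest signature.
	filepath.WalkDir(".", func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		if eco, ok := manifestEcosystem[filepath.Base(path)]; ok {
			fmt.Printf("%s -> %s manifest\n", path, eco)
		}
		return nil
	})
}
```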
While understanding vet's codebase, I realised that a basic walkthrough of how vet works would be a very nice addition to the documentation. Based on my findings (and a little help from ChatGPT), I created the following piece. Perhaps it could find a place in the docs after the required tweaks?
@abhisek
@madhuakula ^^
@rick_C137 Makes sense. I will have a look at Syft. One of the reasons why we wanted to take the package manifest / lockfile approach is so that we can build a dependency tree (eventually).
Without a dependency tree, suggesting remediation is hard. Consider the case where a vulnerability / policy violation is identified in a library that is N levels deep in the dependency tree.
Almost all tools today (including Syft, I believe, but I will verify) will report the issue and expect users (developers) to figure out how to update the library.
In reality, there is no easy way to update a library introduced transitively. Mostly we have to manually build the dependency graph, identify which direct dependency introduced it, and try to update that.
One of our future goals is to make remediation easier by analysing the dependency graph and suggesting the direct dependencies that should be upgraded to fix most of the issues.
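At its core, that suggestion is a reverse walk of the dependency graph: given a vulnerable node N levels deep, find the direct dependencies whose subtrees pull it in. A minimal sketch with a hypothetical adjacency-map graph:

```go
package main

import "fmt"

// deps maps each package to the packages it depends on.
// "app" is the root; its direct dependencies are "a" and "b".
var deps = map[string][]string{
	"app": {"a", "b"},
	"a":   {"c"},
	"b":   {"d"},
	"c":   {"vulnerable-lib"},
}

// introducedBy returns the direct dependencies of root whose subtree
// contains target, i.e. the candidates to upgrade for remediation.
func introducedBy(root, target string) []string {
	var out []string
	for _, direct := range deps[root] {
		if reaches(direct, target, map[string]bool{}) {
			out = append(out, direct)
		}
	}
	return out
}

// reaches does a depth-first search from node looking for target.
func reaches(node, target string, seen map[string]bool) bool {
	if node == target {
		return true
	}
	if seen[node] {
		return false
	}
	seen[node] = true
	for _, next := range deps[node] {
		if reaches(next, target, seen) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(introducedBy("app", "vulnerable-lib")) // [a]
}
```

Running it against the toy graph prints [a]: the direct dependency worth upgrading.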
@madhuakula
Thanks @rick_C137 for the feedback. I agree with you on the docs part. Please feel free to make a PR if you want; I will try to add some additions as well.
rick_C137
Syft is trying to figure it out. There are pros and cons to using lockfiles vs. file fingerprinting. There is limited support, but they've added fields like metadata that could help in finding the parent of a package.
Ref: anchore/syft#572
Other tools, like ScanCode, are also working towards it, but no single tool - to my knowledge - covers it for all the ecosystems exhaustively or satisfactorily.
The CycloneDX spec does allow room for a dependency graph, but the limitation lies in the detection itself.
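For reference, the spec models the graph as a flat list of ref / dependsOn pairs. An illustrative excerpt (the refs are made up):

```json
{
  "dependencies": [
    { "ref": "pkg:npm/app@1.0.0", "dependsOn": ["pkg:npm/a@2.0.0"] },
    { "ref": "pkg:npm/a@2.0.0", "dependsOn": ["pkg:npm/vulnerable-lib@0.1.0"] }
  ]
}
```

So the format can carry the graph; the hard part is producing it during detection.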
You're right. Different people are trying to solve it in different ways. For example, Dependabot follows this approach:
For npm, Dependabot will raise a pull request to update an explicitly defined dependency to a secure version, even if it means updating the parent dependency or dependencies, or even removing a sub-dependency that is no longer needed by the parent. For other ecosystems, Dependabot is unable to update an indirect or transitive dependency if it would also require an update to the parent dependency.
This would be the ideal way to remediate. Although, from another perspective, if there is a security update available for a dependency N levels deep, but the project's direct dependency has not updated its dependencies even in its latest version, then the direct dependency itself must be flagged.
Another case to ponder is where two direct dependencies require an outdated transitive package. To update the transitive package, which direct dependency should be chosen? What if there is no update available for one of the direct dependencies? In that case, updating the transitive package will break one direct dependency. Also, one project cannot have two versions of the same transitive dependency.
The current path, in my understanding, includes fetching direct deps, then finding their transitive deps from deps.dev and the corresponding vulnerabilities from OSV. This is an interesting model, although delegating the task of finding transitives to deps.dev is a partially effective choice that comes with its own set of limitations.
rick_C137
I'd love to send a PR tomorrow.
@abhisek
Thanks for the suggestion. This is an excellent thread. I think we should move this to a GitHub issue so that it doesn't get lost in Discord.
The current path, in my understanding, includes fetching direct deps, then finding their transitive deps from deps.dev and the corresponding vulnerabilities from OSV. This is an interesting model, although delegating the task of finding transitives to deps.dev is a partially effective choice that comes with its own set of limitations.
Yes, you are right. This is the model taken in vet. It leverages pre-computed dependency data available from deps.dev, at the cost of storing, maintaining and querying a large data set.
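As a sketch of that model, deps.dev exposes the resolved graph over a public API. The endpoint and the handful of response fields below are as I understand them from the deps.dev docs, trimmed for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Fetch the pre-computed, resolved dependency graph for one package
// version from deps.dev. Response fields are trimmed to the ones used
// here; see https://docs.deps.dev for the full API.
func main() {
	url := "https://api.deps.dev/v3alpha/systems/npm/packages/express/versions/4.18.2:dependencies"
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var graph struct {
		Nodes []struct {
			VersionKey struct {
				Name    string `json:"name"`
				Version string `json:"version"`
			} `json:"versionKey"`
			Relation string `json:"relation"` // SELF, DIRECT or INDIRECT
		} `json:"nodes"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&graph); err != nil {
		panic(err)
	}
	for _, n := range graph.Nodes {
		fmt.Println(n.Relation, n.VersionKey.Name, n.VersionKey.Version)
	}
}
```

Each node can then be fed to OSV (as in the earlier sketch) to attach the corresponding vulnerabilities.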
I have had good results building dependency graphs in the past using Maven / Gradle native plugins (my requirement was the Java ecosystem only). But that solution is strongly coupled with the package managers, which have their own nuances and corner cases that I want to avoid as much as possible.
A Syft-like approach may be something to look at. But in any case, effective analysis for mitigation would need a graph representation of dependencies.