-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance improvements for large filesystems #155
Comments
With the changes in the perf_improvements branch, does (de)serializing the dataclasses to/from JSON still result in the same output? The one thing to be careful of is changing the data classes in a way that results in a (CyTRICS) SBOM being output that doesn't pass validation with the (CyTRICS) schema. If the java file changes make a difference, I'd merge a PR with that change now. I'm not sure the early return for the find functions is valid -- if I'm understanding the logic, it looks like it will make the search exit as soon as the first non-matching entry is encountered, without considering that any subsequent entries in the various lists could be the one that actually matches; I think the only time the correct result would be returned in this scenario would be if the matching entry is the first one checked. I've been mulling over decoupling the internal BOM data structures used by Surfactant from the CyTRICS schema -- the current set up definitely makes reading/writing CyTRICS SBOMs easy (json.[load|dump]), but makes other tasks really difficult. Establishing relationships between files based on the filesystem structure would lend itself well to internally storing things as a graph representation, and should make most of the relationship establishing algorithms O(nlogn) instead of O(n^2) -- from a recently run, establishing relationships for a large number of files took most of the time. It would also make so we could just use graph edges to represent relationships between files, without the kind of awkward relationship lists. I think this structure would still let us have a lookup table by hash for software entries. Or at least a set to tracks which hashes have been seen (md5, sha1, sha256), since ideally the common case is an entry not already existing with a matching hash. Thoughts or comments? |
You are correct. On a second look I misunderstood that we were directly serializing the dataclass to JSON. I do believe that the lookup table provided the greatest performance improvements of the changes, so I would prefer to discuss ways to implement that and potentially just tack on the java changes as well. For a quick fix we could just remove that field from the
It doesn't appear that the adjustments to change relationships to Sets or adding the Does this seem like something you would accept a pull request for, or would you want to wait until the internal structure is reworked? |
Also after looking at the "fail fast" logic I wrote previously, I think we should be alright to implement it if we just replace all of the returns I had with "continue" as once So from:
To:
|
Maybe a brief comment saying For the dataclass changes, I'd mostly want to make sure that serializing sets to/from JSON works for all the versions of Python we're supporting (3.8+) -- and same for |
As far as testing against all supported python versions to verify I don't break the schema. The current tests run by Github action don't currently cover validating the test generations against the full schema, correct? Should I proceed with testing manually, or is there a better way? |
I think manually is the best way for now -- there should be some CI tests that try generating a complete SBOM, but there isn't any validation against the CyTRICS schema itself (which technically hasn't been released). The main thing to look for is all of the fields that were changed to a "set" able to loaded and saved correctly, at least on Python 3.8 and 3.12 would probably be enough for reasonable confidence that things also work on the versions in between. A small script like this could be used to test outside of running Surfactant: from surfactant.sbomtypes import SBOM
with open("test_sbom.json") as infile:
SBOM.from_json(infile.read())
# print some fields in SBOM, make sure the loaded info looks right
# create a test sbom = SBOM() object with some fake data added to the various changed field types
print(sbom.to_json(indent=2)) |
I used this script and tested across 3.8, 3.10, and 3.12. Output files attached. Current branch is https://github.com/mcutshaw/Surfactant/tree/perf2. If this is sufficient I would plan to add some minor additional improvements (java, 1 read for all hashes, and then open a pull).
3.8_test.log |
Ah okay, that looks like it will be fine then (I guess dataclasses-json handles support for serializing set). It looks like the java change is just stylistic -- functions called to the right of the |
Is your feature request related to a problem? Please describe.
We have been having some performance issues with "very large" filesystems > 1 million files.
Describe the solution you'd like
A lot of these performance issues seem to stem from iterating through lists that contain each software or relationship entry. If we use Dicts or Sets where applicable we will be reducing the cost of those lookup operations from O(n) to amortized O(1).
Describe alternatives you've considered
None
Additional context
Please see https://github.com/mcutshaw/Surfactant/tree/perf_improvements for some possible adjustments. With one filesystem we've tested with it reduced a run that took several hours to less then 5 minutes.
The text was updated successfully, but these errors were encountered: