Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

don't split licenses from License field from python packaging #5204

Closed
DmitriyLewen opened this issue Sep 18, 2023 Discussed in #5155 · 10 comments · Fixed by #7336
Closed

don't split licenses from License field from python packaging #5204

DmitriyLewen opened this issue Sep 18, 2023 Discussed in #5155 · 10 comments · Fixed by #7336
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@DmitriyLewen
Copy link
Contributor

Discussed in #5155

License field in .dist-info/METADATA files contains textual information about license/exceptions/specific versions/etc... - https://packaging.python.org/en/latest/specifications/core-metadata/#license

We can't split this text correctly.
Save this field without division.

@DmitriyLewen DmitriyLewen added the kind/bug Categorizes issue or PR as related to a bug. label Sep 18, 2023
@DmitriyLewen DmitriyLewen self-assigned this Sep 18, 2023
@dus7eh
Copy link
Contributor

dus7eh commented Feb 7, 2024

I'm not sure if this this the way to handle such case.
The mentioned structlog 23.1.0 contains a License field with full license. The sbom and probably similarly license scans generate license name not license text. If we had the full text as the output would we be able to add some custom classifications for such licenses in trivy config at all?

Thankfully there are additional classifier fields in METADATA file which can serve such info. Part of the file from structlog:

Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python

Maybe it would be a good option to use Classifier: License fields when present and fallback to License field otherwise.

I stumbled on a similar issue with scrapli package and this approach would solve the issue there as well. License scan output:
Screenshot from 2024-02-07 16-11-05

@DmitriyLewen
Copy link
Contributor Author

DmitriyLewen commented Feb 8, 2024

Hello @dus7eh
Thanks for your information!

We already check Classifier: License - https://github.com/aquasecurity/go-dep-parser/blob/4f19ab402b0babc81876dc518be0ed82f93f0255/pkg/python/packaging/parse.go#L51
I also thought about checking Classifier: License before License, but we can miss some moments - aquasecurity/go-dep-parser#256 (comment)

Perhaps we can try to using https://github.com/google/licenseclassifier for License field. But i am not sure that License field contains text of license.

But License and Classifier: License fields are deprecated - https://peps.python.org/pep-0639/#deprecate-license-field.
Such problems will decrease over time.

@dus7eh
Copy link
Contributor

dus7eh commented Feb 8, 2024

Didn't notice earlier that these fields are deprecated. Still it will take some time before they actually go out of use.

For the scrapli package case I've mentioned there's a tab character in the new line after the MIT License statement which messes things up. I've checked that removing it fixes how the License field is parsed.
Are you open for a PR with tab trimming on processing this field?

@DmitriyLewen
Copy link
Contributor Author

DmitriyLewen commented Feb 8, 2024

Can you send License field for METADATA file for scrapli please?

@dus7eh
Copy link
Contributor

dus7eh commented Feb 8, 2024

Sure thing, here is the field with some context around it:

Author-email: Carl Montanari <[email protected]>
License: MIT License
        
        Copyright (c) 2021 Carl Montanari
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Changelog, https://carlmontanari.github.io/scrapli/changelog
Project-URL: Changelog, https://carlmontanari.github.io/scrapli/changelog
Project-URL: Docs, https://carlmontanari.github.io/scrapli/
Project-URL: Homepage, https://github.com/carlmontanari/scrapli
Keywords: arista,automation,cisco,eos,iosxe,iosxr,juniper,junos,netconf,network,nxos,ssh,telnet
Classifier: License :: OSI Approved :: MIT License

Although I was slightly mistaken, it's a series of 8 spaces rather than a tab that messes things. When you remove 4 or all 8 spaces in the line between MIT License and Copyright... trivy simply outputs MIT License in the report which is expected. So this actually might be an issue that exactly 4 leading spaces are trimmed (and not a multiply of 4).

@DmitriyLewen
Copy link
Contributor Author

It's just a lucky coincidence that License field starts with the license name.
Your solution doesn't help in this case - #5155 (comment)

Also we still have a lot of meaningless licenses ( e.g. copy, merge, etc.).

#5211 doesn't separate licenses from License field. I also thought about using only first x (e.g. 30) characters for License field to avoid bit size of license name.

@dus7eh
Copy link
Contributor

dus7eh commented Feb 26, 2024

But License and Classifier: License fields are deprecated - https://peps.python.org/pep-0639/#deprecate-license-field.
Such problems will decrease over time.

Just a quick remartk about the article you've mentioned (regarding License-expression). It's in draft status for almost 4 years now. So I wouldn't state that License and Classifier: License are deprecated yet :)

Another thing is that according to the description and usage (https://packaging.python.org/en/latest/specifications/core-metadata/#license) of License field I'd say tha Classifier type should have higher precedence. If Classifier: License is present it defines the license type with some modifications or additional specifications which might be described in License field. And yes, some nuances may apply but maybe they could be denoted in the output? For example somehow translated into lower confidence level.

If Classifier is not present, then License from METADATA could treated as license. This is similar to how other python specific tools handle the case currently: pip-licenses.

I'm not sure how the full flow works but you've mentioned about using licenseclassifier for recognizing licenses for this case. Python packages in the [package-name]-dist-info dir often contain LICENSE file which could be analyzed as well. Are they taken into account?

@DmitriyLewen
Copy link
Contributor Author

Just a quick remartk about the article you've mentioned (regarding License-expression). It's in draft status for almost 4 years now. So I wouldn't state that License and Classifier: License are deprecated yet :)

Hm... i missed status for PEP639.

Then we can try to change fields priority.

And yes, some nuances may apply but maybe they could be denoted in the output? For example somehow translated into lower confidence level.

I have another idea - we can add new field with text from LICENSE field.
By default, we will include in report license from Classifier: License and add note indicating that this license may have been updated. User can check this in json format ( we will add new filed to json format).
I will try this idea when will move PR from go-dep-parser to Trivy.

Thanks for your help and information about this case!

I'm not sure how the full flow works but you've mentioned about using licenseclassifier for recognizing licenses for this case. Python packages in the [package-name]-dist-info dir often contain LICENSE file which could be analyzed as well. Are they taken into account?

licenseclassifier uses a lot of resources. So we only check LICENSE file if License-File field is present.

@dus7eh
Copy link
Contributor

dus7eh commented Feb 27, 2024

I've adjusted the implementation to:

if l := h.Get("License-Expression"); l != "" {
	licenses = types.LicensesFromString(l, types.LicenseTypeName)
	// } else if l := h.Get("License"); l != "" {
} else {
	// read license name from license classifier
	for _, classifier := range h.Values("Classifier") {
		if strings.HasPrefix(classifier, "License :: ") {
			values := strings.Split(classifier, " :: ")
			// there can be several classifiers with licenses
			licensename := values[len(values)-1]
			// According to the classifier list https://pypi.org/classifiers/ there is one grouping classifier
			// without a specific license definition (Classifier: License :: OSI Approved) - it is skipped
			if licensename != "OSI Approved" {
				licenses = append(licenses, types.LicensesFromString(licensename, types.LicenseTypeName)...)
			}
		}
	}
	// The license field can contain information about different licenses, license exceptions , etc.:
	// https://packaging.python.org/en/latest/specifications/core-metadata/#license,
	// but it is impossible to define a delimiter to separate them.
	// Mark them, so we don't have to separate them later.
        if l := h.Get("License"); l != "" && len(licenses) == 0 {
		licenses = types.LicensesFromString(l, types.LicenseTypeNonSeparable)
	}
}
if len(licenses) == 0 && h.Get("License-File") != "" {
	licenses = types.LicensesFromStringSlice(h.Values("License-File"), types.LicenseTypeFile)
}

which seems to work quite well. If you handle the go-dep-parser migration I could prepare a PR with proper changes. For now I started off from your dont-split-python-licenses branch and managed to get more expected python licenses results.

@DmitriyLewen
Copy link
Contributor Author

It takes time to update my PR and integrate these changes.
Therefore, you can open a new PR (if you have time) with only the priority changed (use the master branch as a base).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
2 participants