Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spotlessSetLicenseHeaderYearsFromGitHistory should respect existing headers and current date #783

Closed
roxspring opened this issue Jan 29, 2021 · 3 comments

Comments

@roxspring
Copy link

I'm trying use Spotless 2.7.0 to onboard a repository and inject license headers, via the maven plugin. Existing sources are in a variety of states, many without any header, some with a outdated header that hasn't been maintained, some with a history pre-dating the git history.

For example, with the following files, running mvn spotless:apply I get the "Default Result", whereas running mvn spotless:apply -DspotlessSetLicenseHeaderYearsFromGitHistory=true I get the "FromGit Result", but I'd really like to get the "Desired Result".

File Current Year Header Range Git Range Default Result FromGit Result Desired Result
new.java 2021 None 2021 2021 2021 ✅ 2021
old.java 2021 None 2019-2020 2021 2019-2020 ⛔ 2019-2021
outdated.java 2021 2019 2018-2020 2019-2021 2018-2020 ⛔2018-2021
prehistory.java 2021 2016-2020 2019-2020 2016-2021 2019-2020 ⛔2016-2021

For old.java, outdated.java, prehistory.java with the -DspotlessSetLicenseHeaderYearsFromGitHistory=true option I think the Git Range should be extended to include the Current Year, as the change will be committed in the current year. This effect can be achieved manually by rerunning without the -DspotlessSetLicenseHeaderYearsFromGitHistory=true option but I don't think that should be necessary.

For prehistory.java with the -DspotlessSetLicenseHeaderYearsFromGitHistory=true option I think the date range should also be extended to incorporate the date range claimed in the existing license Header Range. To ignore the existing license header range results in data loss and a narrower copyright claim which is a problem.

Would it be reasonable to extend spotlessSetLicenseHeaderYearsFromGitHistory functionality to incorporate these new behaviours?

Would it be necessary to add further options so that current behaviour can be maintained too?

@nedtwigg
Copy link
Member

nedtwigg commented Feb 1, 2021

I'm worried about a combinatoric explosion of features in the license header step, but I'm open to new features if necessary. For now, it looks to me like there's a misunderstanding.

First off, any project which wants to enforce headers can be in one of two states:

  • the headers are accurate / we trust them
  • the headers are inaccurate / we don't trust them

If the headers are accurate, then updating them is easy. You can either use ratchetFrom and their date will update only when the file is actually changed (IANAL, but I believe this is the legally correct policy). Or you can use updateYearWithLatest true which always does createdYear-todayYear.

If the headers are inaccurate, then the only option is a one-time manual process to make them accurate. Once they are accurate, then you can use one of the simple workflows above. One starting point for this manual process is "what does git say", which is nice because although it's not perfect, it is nonetheless at least a verifiable lower bound.

Another starting point is "parse what is in there now". The problem is that if the headers have not been enforced in the past, then it is guaranteed that there is going to be a lot of noise. And that means that a lot of the headers cannot be accurately parsed automatically (e.g. (C) 2009 author A, (C) 2011 author B, I have no idea what Spotless will parse from this). So if you want to go this route, it will unavoidably include a manual review.

If I understand correctly, you want to combine "git" with "what is in there now", but you want to do that automatically. I think that is an unavoidably manual process, because parsing "what is in there now" is unavoidably manual if the existing headers are dirty. So I am inclined to not merge a feature to support this behavior, because I think the specification is inconsistent. Manual steps are unavoidable, and simple tools can help with the manual process. I am suspicious of complex tools, because I think the combination of new-user-confusion, power-user-overconfidence, and implementation complexity leads to net-negative value.

@roxspring
Copy link
Author

roxspring commented Feb 2, 2021

Obviously I disagree with your assessment 😄

First off, any project which wants to enforce headers can be in one of two states:

I think this statement over-simplifies things by assuming that all files within a project are in the same consistent state. The truth is that when onboarding projects there are likely to be a mix of files with accurate headers, missing headers, once accurate but now outdated headers, and some nonsense oddities too. Some of these files will have copyright statements pre-dating the git repository they currently reside in because not all files start life in git.

And that means that a lot of the headers cannot be accurately parsed automatically (e.g. (C) 2009 author A, (C) 2011 author B, I have no idea what Spotless will parse from this)

(FYI Spotless will establish a date range of 2009-2011 to use as it's $YEAR token in the license header template).

You're right that there is a "garbage in - garbage out" problem here, but the reality is that often the "garbage in" is only a little inconsistent, and not totally irreparable nonsense. The projects I've been working with of late have a few minor variations of license format between files but primarily vary between simply having a header vs not. Whichever mode you use Spotless's license header functionality in, if existing headers and the new template are of vastly different structure then the tool will produce considerable changes to review and likely correct (more on that below).

One starting point for this manual process is "what does git say", which is nice because although it's not perfect, it is nonetheless at least a verifiable lower bound.

Given a file claiming copyright 2016-2020, but not seen in git until 2019, what are the chances that we should ever change that initial claim year to 2019? The chances seem very slim to me. The claim that copyright started in 2016 can be parsed from the file content and is just as verifiable as the as the date associated with the git commit - each could be faked but we have no reason to assume either has been.

So if you want to go this route, it will unavoidably include a manual review

Review of the tools output is surely unavoidable anyway. When onboarding a project nobody should be blindly accepting the copyright claims generated by the tool. The thing is that you want to the tool to have done the reasonably right thing so that when reviewing said output the number of corrections is minimised.

As you point out, the tool's "use git history" mode is clearly the best way to add verifiable copyright statements to files with missing copyright headers. The problem is that it assumes it knows best and blindly overwrites the claims of pre-existing files, sometimes shortening their claimed range, which I don't think is ever acceptable. This resulting reduction in copyright claim is the part that really troubles me about the "use git history" mode. Currently I'm left with the dissatisfactory the choice between:

  • "use git history" and then have hundreds of pre-existing-claim files to review and correct because the tool has reduced the claim.
  • "don't use git history" and have hundreds of missing-header files to review and correct because the tool has ignored their readily available history.

If I've persuaded you on any of these points then #784 is available for review, otherwise I'm happy to drop it. For the projects I've been working with my version produces output which, of course, needs review but is correct the vast majority of the time and only needs minimal corrections - all of which suits me perfectly. I'll likely end up onboarding using my fork, but the nice thing is that it shouldn't be needed beyond the onboarding process and we'll be able to switch to a stable maintained version of Spotless for ongoing ratcheting. Beyond this issue Spotless looks to be a pretty spot-on tool for our team!

@nedtwigg
Copy link
Member

nedtwigg commented Feb 2, 2021

Given a file claiming copyright 2016-2020, but not seen in git until 2019

If there are a ton of files like this, could you find-replace (C) 2019- with (C) 2016-? I suppose there are some projects where this heuristic would work well, but others where there is perhaps more diversity in the headers, and you gain a lot by actually doing the parse as in your PR.

I'll likely end up onboarding using my fork, but the nice thing is that it shouldn't be needed beyond the onboarding process and we'll be able to switch to a stable maintained version of Spotless for ongoing ratcheting

Great! Thanks very much for the PR. At least for now, I am closing this as won't-implement. I added a comment to your PR #784 so that others can run it. If other people use/need this and feel that I made a mistake, I hope you'll leave a comment. The history of this license header step is that it's been buried in combinatoric explosion-of-features, so I'm protective of adding more complexity to it, perhaps overly so.

In particular, there are other open issues like #774 and #323 which I think are more common, and I don't want to make implementing them harder based on what I perceive to be a very rare case.

@nedtwigg nedtwigg closed this as completed Feb 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants