Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Version file tree diff: design doc #11507
Version file tree diff: design doc #11507
Changes from all commits
2ba73fa
07e2dba
8fb3941
1f2b58a
299e000
71f5b18
32ce43d
e41c720
173c56d
de82f28
301c492
623ea93
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥼 Feels like the start of a Sci-Fi plot :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand these manifest will hash only the "main content" of the document. We should mention that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mentioned in the "Changed files" and "Initial implementation" section, this section doesn't touch on how the hash is calculated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't find an option to exclude the unchanged files from the list, we could log each type to three different files if we want, but I like no having to deal with files...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, that would be nice, especially for very large projects since most files won't have changed. Could always do something like
grep -v "= "
or similar, but that still returns it, but we can avoid it in the Python process at least.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced we need the lines changed in v1. I'd like to avoid this overhead if we can -- I think the start should just be the rclone approach, since sorting the files is a "nice to have" and not required for any of the functionality we build. It would be an obvious v2 feature once we see how much people are using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, though I'm not really sure where line differences really fits in on RTD, even in a next iteration. I'm not sure there is a good UX we can give with line numbers in a diffs, especially where our diffs will not align with changes from a PR -- because a PR base branch might not be the same as our default version.
I think the source file diffing is best left to GitHub and we should focus solely on the rendered diff experience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we have the list of changed files, will we download all those files and execute the diff over each of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we looking to do with these line numbers/source diffs? How are we planning on making this diff something users get value from?
I feel like adding line numbers in the diffs is at very least not yet necessary, though I would probably say this is a part of this feature we don't quite need either.
I think more applicable is making sure that visual diffing has enough UI/UX to be highly useful in review -- easier navigation between chunks, etc. I don't see line numbers or source diffs needed for something like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the count of lines changed could be useful to sort the changed files, so it's easy to know which files require the most attention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, if the current date or commit is injected into each page, all pages will be marked as changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, another thought, maybe our detection of changes should be some combination of user configuration and HTML parsing. Users give us the project's main content selector(s), or we have a reliable default, and we store hashes of this content inside this selector at the end of the build process? However even this might only help filter out file changes where dynamic content comes from the theme templates (commit id, date changed, etc in the footer/header).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we could use our search indexing code to get a better diff, that already parses all sections on the main content. We already have that information in ES (well, not for PR previews).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little bit, but in that UX we're still giving the user a huge list of files that changed, putting the burden on the user to figure out what is important. We should be aiming to give the user a minimal, accurate list of files changed by a PR. If a PR only makes small changes to some files, our UX should point the user to those files. Sorting only seems to be a work around for a number of cases, not a UX we actually want to give to users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking something like "quick list of maybe relevant changes" (10 top files deleted/changed/added) and then a "view all changed files" link below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even with that, if we can't distinguish between a small change an author makes and a small change that the build tool makes, sorting won't accurately give the top 10 files changed. Legitimate small changes wouldn't sort any higher up than many dynamic content changes.
I don't know if parsing HTML and only hashing nested content gets around this issue, but rclone will definitely be subject to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably slower as well, but I think I'm liking the idea of doing the diff in Python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want diff between all the versions and I would leave this infer for a next iteration in case we need it. Currently, we only want the diff between the default version (as source) and the PR build.
What are other cases that you are considering the diff being useful for outside the one I mentioned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's useful to see the diff between other versions, stable vs latest, v1 vs v2, etc. This feature isn't just useful for PR builds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand on what are those use cases you have in mind to see the files differences between those other versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sames as checking the diff between a PR preview and latest/stable. You want to know which files has changed, suggestions for possible redirects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I know. I'm asking for more context here because I'm not finding useful knowing which files has changed between versions that don't include a PR version in the comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see a case where a v1 is out and a v2 is released, and you need to figure out what files are different to create redirects on
latest
. So you'd want a diff betweenlatest
andv2
, or similar. Or even a diff betweenv1
andv2
to better understand the work done.It doesn't seem like a super common case though, but I can see some uses for it. It seems safe to avoid doing that work in the initial version of this, and see if people request it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restricting the versions that are allowed would add a special case to the API, I mean, is not a tone of work, but allowing all versions isn't either.
If we are going to restrict the versions from
version_a
to external versions, are we going to restrict the versions fromversion_b
? Just latest or stable?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API could always return 404 or similar HTTP error code if there is no PR involved in the comparison; for example. I think that's not a problem and should be easy to implement.
I'd start with
version_a
being thedefault version
orlatest
andversion_b
being an external version.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some edge cases where users might want to compare two non-default versions, but this will be a rare use case.
This sort of comparison feels like a better tool for git/GitHub than something we maintain. I'd agree with the above that comparing against only the default version is definitely where we should start, but I suspect there isn't much incentive for us to complicate this feature too much more than that.
Either way, the UI for linking to files that changed is only going to be visible on PR builds -- either as a comment/description in the PR or via a hovering UI in the PR built documentation. We don't have either option for non-PR build versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have to download the files, this will be expensive (in particular with big files: bandwidth and CPU). I would make this optional in a way that we can enable/disable easily. Or even maybe in a second iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use the complete
Version
objects to keep consistency with all the other endpoints, yes 👍🏼There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by "being calculated in another request"? I understand the
VersionDiff
data is going to be pre-calculated after build succeeds, so the data will be always ready for the endpoint.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We won't run the diff on every build, only on demand. Running on every build will be expensive if not all projects need or use this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't start with the complexity added of running it on demand. That would also degrade the user experience that have to wait for it to finish after opening the PR preview.
We should start building this feature to be run after each successful build in a Celery task for those project with PR builds enabled and grow from there. That will give us a better understanding of the costs of this feature. We had a similar thought some time ago when we built PR builds and it ended not being that expensive and being one of our top features.
I'd apply the same thought logic here and keep the initial implementation simple and user friendly, providing the best UX we can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think complexity is a problem here, triggering a task after each build or calling that task when the API is requested are not a complex task.
If we do that, we are starting big, running this task on each PR preview, for everyone.
Depending on how this will integrate with the GitHub action, users will still need to wait for the build to finish in order to be able to retrieve the diff, there is no downside by doing this on demand if used with the GH pooling this API every x seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was picturing we'd do what @humitos is describing here. This feels like it should be a build time artifact, mainly because there doesn't really feel like a use case that benefits from this being async.
If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is, actually, a great idea and not only for cost of s3 calls. We already have the modeling for this: we can expand the
HTMLFile
model to add ahash512
orhashmd5
field there. Then use that field to compare against versions when hitting the API endpoint. All data will be in the database and we won't require hitting S3 at all -- that should be fast.Having all this data will allow us to compare any version against any other version in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aren't tracking all HTML files in the DB anymore, that was taking a lot of space in our DB. We are only tracking the existence of index.html files. Saving a manifest seems good, maybe save it as a field in the DB for the latest build only, but also having that S3 should be ok as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't worry too much about this yet. Jumping into this terrain at this stage will make everything more complex from the beginning. I think the first step here is having the diff data stored in the database and the API endpoint that returns the diff if the data exists.
The PR preview only works if the build has succeed. So, we will always hit the endpoint when the data is available already. That's not a problem.
The GitHub action could use a simple polling technique for now: polling the build API every X seconds to know if the build has finished. In case it succeed, call the diff API to update the message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The action doesn't know when the build finished, that's the problem.
Could be an option, but it may take a while for some builds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably we could just pass a commit to the API, and confirm we're getting data from that commit, and then poll if not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API will return the build object, which has the commit. But polling will still be an issue for long builds, the action will need to keep running until it times out or a diff is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like having this feature in the dashboard, but the "link me to relevant files changed in this PR" feature would be great for authors as well, which may not have access to the RTD dashboard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we initially expose this feature in the Addons PR Build Warning message, which will only be shown on completed builds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that, but also is something we have 100% control over and we can play with it before going live and let users to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed we should use addons UI to start.
I think the build detail view could be another place eventually, as this is another place that maintainers will interact with RTD as they are reviewing. This could be for the next iteration though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want, we could automate this a little, and just ask the user for a GitHub token, and we create the webhook behind the scenes, and then ask them to use our GitHub action as a final step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we're integrating this feature more, we should build a platform feature for it, and remove the GitHub Action approach. That way we can just have the comment update once the build is successful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That will require using a GitHub app :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does seem like a good next step, but it doesn't sound like we need to start there either. Using addons to show this information is a good first step.
For a next iteration, we would need to decide what it would take to surface this information in GitHub -- using GitHub Actions or not. This doesn't feel immediately necessary though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we going to create two different manifest files? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Why did you think we need two?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not saying we need/want two manifest files, but I'm confused about if the manifest will have the hash for the main content and another hash for the full file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the main content, don't think we need the hash from the whole content for anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What results are you referencing here? Didn't we say to store the hash in
ImportFile
or similar object?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The results from the diff, hashes will be stored in storage in a JSON file. But we may want to store the results from the diff for faster responses. We don't need to use
ImportFile
for any of that.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm, are you considering to trigger a task on the background when the API endpoint is hit the first time?
I thought we wanted to pre- calculate the diff only for projects with PR build enabled after a build for the PR succeeded. I think that all (most) users using PR builds will use this feature; so it makes sense to me to do it automatically at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #11507 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd agree with @humitos here, but also, let's optimize for cost savings later. We can gauge if this additional usage is really meaningful compared to engineering and maintaining a complex system.