-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Version file tree diff: design doc #11507
Conversation
# ['+ ore', '- one', '- two', ' three', ' four', '+ five'] | ||
|
||
A good thing of using Python is that we don't need to write the files to disk, | ||
and the result is easier to parse. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably slower as well, but I think I'm liking the idea of doing the diff in Python.
$ rclone check --combined=- /usr/src/app/checkouts/readthedocs.org/a /usr/src/app/checkouts/readthedocs.org/b | ||
+ new.txt | ||
- deleted.txt | ||
= unchanged.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't find an option to exclude the unchanged files from the list, we could log each type to three different files if we want, but I like no having to deal with files...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, that would be nice, especially for very large projects since most files won't have changed. Could always do something like grep -v "= "
or similar, but that still returns it, but we can avoid it in the Python process at least.
docs/dev/design/file-tree-diff.rst
Outdated
A combination of using the DB and S3 would be really useful, | ||
as the model will serve as the lock for the API, | ||
and will be easier to keep track of which build was used to generate the diff | ||
without having to download the diff file from S3. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm liking the idea of using both, or even just store everything in DB, maybe automatically clean diffs that haven't been accessed in a while if we are worried about this taking a lot of space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I would go with the DB-only approach. Dealing with static data on S3 is complicated, in particular if we need to perform queries on them or update its structure in the future.
The VersionDiff
data is temporal so we can clean up it pretty often if we are worried about the size.
- Expose this feature in the dashboard for now, and use our GitHub action to simply link to the dashboard. | ||
Maybe don't even expose the API to the public, just use it internally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like having this feature in the dashboard, but the "link me to relevant files changed in this PR" feature would be great for authors as well, which may not have access to the RTD dashboard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we initially expose this feature in the Addons PR Build Warning message, which will only be shown on completed builds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that, but also is something we have 100% control over and we can play with it before going live and let users to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed we should use addons UI to start.
I think the build detail view could be another place eventually, as this is another place that maintainers will interact with RTD as they are reviewing. This could be for the next iteration though.
docs/dev/design/file-tree-diff.rst
Outdated
- Detect changes in sections of HTML files. | ||
- Expand to other file types? | ||
- Integrate with addons? | ||
- Allow doing a diff between versions of different projects? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this would be useful, hmm.
docs/dev/design/file-tree-diff.rst
Outdated
|
||
The endpoint will be: | ||
|
||
GET /api/v3/projects/{project_slug}/versions/{version_slug}/diff/{other_version_slug}/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This endpoint could also be:
/api/v3/projects/project-slug/diff?version_a=one&version_b=foo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm liking more this endpoint. It's easier for clients to just change a query param than a path.
- Use a custom `repository dispatch event <https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#repository_dispatch>`__ | ||
to trigger the action from our webhook. This requires the user to do some additional setup, | ||
and for our webhooks to support custom headers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want, we could automate this a little, and just ask the user for a GitHub token, and we create the webhook behind the scenes, and then ask them to use our GitHub action as a final step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we're integrating this feature more, we should build a platform feature for it, and remove the GitHub Action approach. That way we can just have the comment update once the build is successful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That will require using a GitHub app :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does seem like a good next step, but it doesn't sound like we need to start there either. Using addons to show this information is a good first step.
For a next iteration, we would need to decide what it would take to surface this information in GitHub -- using GitHub Actions or not. This doesn't feel immediately necessary though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting all of this together. This is going in a great direction 💯
I left a few questions (that could be clarified in the document as well) and suggestions about the proposed implementation that I'd like to discuss more before moving forward with this.
It would be good to receive feedback from other folks as well, since this feature could be resources/costs intensive.
Lines changed between two files | ||
------------------------------- | ||
|
||
Once we have the list of files that changed, we can use a tool like ``diff`` to get the lines that changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we have the list of changed files, will we download all those files and execute the diff over each of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we looking to do with these line numbers/source diffs? How are we planning on making this diff something users get value from?
I feel like adding line numbers in the diffs is at very least not yet necessary, though I would probably say this is a part of this feature we don't quite need either.
I think more applicable is making sure that visual diffing has enough UI/UX to be highly useful in review -- easier navigation between chunks, etc. I don't see line numbers or source diffs needed for something like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the count of lines changed could be useful to sort the changed files, so it's easy to know which files require the most attention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, if the current date or commit is injected into each page, all pages will be marked as changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, another thought, maybe our detection of changes should be some combination of user configuration and HTML parsing. Users give us the project's main content selector(s), or we have a reliable default, and we store hashes of this content inside this selector at the end of the build process? However even this might only help filter out file changes where dynamic content comes from the theme templates (commit id, date changed, etc in the footer/header).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we could use our search indexing code to get a better diff, that already parses all sections on the main content. We already have that information in ES (well, not for PR previews).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will still give the full list of files that changed, but the sorting will help the user to identify which files are more relevant.
A little bit, but in that UX we're still giving the user a huge list of files that changed, putting the burden on the user to figure out what is important. We should be aiming to give the user a minimal, accurate list of files changed by a PR. If a PR only makes small changes to some files, our UX should point the user to those files. Sorting only seems to be a work around for a number of cases, not a UX we actually want to give to users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking something like "quick list of maybe relevant changes" (10 top files deleted/changed/added) and then a "view all changed files" link below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even with that, if we can't distinguish between a small change an author makes and a small change that the build tool makes, sorting won't accurately give the top 10 files changed. Legitimate small changes wouldn't sort any higher up than many dynamic content changes.
I don't know if parsing HTML and only hashing nested content gets around this issue, but rclone will definitely be subject to this.
docs/dev/design/file-tree-diff.rst
Outdated
{ | ||
"version_a": {"id": 1, "build": {"id": 1}}, | ||
"version_b": {"id": 2, "build": {"id": 2}}, | ||
"diff": { | ||
"added": [{"file": "new.txt"}], | ||
"removed": [{"file": "deleted.txt"}], | ||
"modified": [{"file": "changed.txt", "lines": {"added": 1, "removed": 1}}] | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting to use a JSON field here on the database?
If that's the case, I'd propose to use a JSON field only for the diff
attribute since we will want to perform queries quickly over the version that are compared:
class VersionDiff:
version_a = models.ForeignKey(Version)
version_b = models.ForeignKey(Version)
diff = models.JSONField()
That will simplify the queries a lot in the future and make the faster as well, due to the indexes created for the version fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example for the storing this in S3, for the DB version it will be FKs, we also need to link to the builds of each version.
} | ||
} | ||
|
||
The version and build can be the full objects, or just the IDs and slugs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use the complete Version
objects to keep consistency with all the other endpoints, yes 👍🏼
The version and build can be the full objects, or just the IDs and slugs. | ||
|
||
We will generate a lock on this request, to avoid multiple calls to the API for the same versions. | ||
We can reply with a ``202 Accepted`` if the diff is being calculated in another request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by "being calculated in another request"? I understand the VersionDiff
data is going to be pre-calculated after build succeeds, so the data will be always ready for the endpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We won't run the diff on every build, only on demand. Running on every build will be expensive if not all projects need or use this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't start with the complexity added of running it on demand. That would also degrade the user experience that have to wait for it to finish after opening the PR preview.
We should start building this feature to be run after each successful build in a Celery task for those project with PR builds enabled and grow from there. That will give us a better understanding of the costs of this feature. We had a similar thought some time ago when we built PR builds and it ended not being that expensive and being one of our top features.
I'd apply the same thought logic here and keep the initial implementation simple and user friendly, providing the best UX we can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't start with the complexity added of running it on demand. That would also degrade the user experience that have to wait for it to finish after opening the PR preview.
Don't think complexity is a problem here, triggering a task after each build or calling that task when the API is requested are not a complex task.
We should start building this feature to be run after each successful build in a Celery task for those project with PR builds enabled and grow from there. That will give us a better understanding of the costs of this feature.
If we do that, we are starting big, running this task on each PR preview, for everyone.
I'd apply the same thought logic here and keep the initial implementation simple and user friendly, providing the best UX we can.
Depending on how this will integrate with the GitHub action, users will still need to wait for the build to finish in order to be able to retrieve the diff, there is no downside by doing this on demand if used with the GH pooling this API every x seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was picturing we'd do what @humitos is describing here. This feels like it should be a build time artifact, mainly because there doesn't really feel like a use case that benefits from this being async.
If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.
This is, actually, a great idea and not only for cost of s3 calls. We already have the modeling for this: we can expand the HTMLFile
model to add a hash512
or hashmd5
field there. Then use that field to compare against versions when hitting the API endpoint. All data will be in the database and we won't require hitting S3 at all -- that should be fast.
Having all this data will allow us to compare any version against any other version in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aren't tracking all HTML files in the DB anymore, that was taking a lot of space in our DB. We are only tracking the existence of index.html files. Saving a manifest seems good, maybe save it as a field in the DB for the latest build only, but also having that S3 should be ok as well.
Integrations | ||
------------ | ||
|
||
You may be thinking that once we have an API, it will be just a matter of calling that API from a GitHub action. Wrong! | ||
|
||
Doing the API call is easy, but knowing *when* to call it is hard. | ||
We need to call the API after the build has finished successfully, | ||
or we will be comparing the files of an incomplete or stale build. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't worry too much about this yet. Jumping into this terrain at this stage will make everything more complex from the beginning. I think the first step here is having the diff data stored in the database and the API endpoint that returns the diff if the data exists.
The PR preview only works if the build has succeed. So, we will always hit the endpoint when the data is available already. That's not a problem.
The GitHub action could use a simple polling technique for now: polling the build API every X seconds to know if the build has finished. In case it succeed, call the diff API to update the message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR preview only works if the build has succeed. So, we will always hit the endpoint when the data is available already. That's not a problem.
The action doesn't know when the build finished, that's the problem.
The GitHub action could use a simple polling technique for now: polling the build API every X seconds to know if the build has finished. In case it succeed, call the diff API to update the message.
Could be an option, but it may take a while for some builds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably we could just pass a commit to the API, and confirm we're getting data from that commit, and then poll if not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API will return the build object, which has the commit. But polling will still be an issue for long builds, the action will need to keep running until it times out or a diff is returned.
But since we are doing this on demand, and we can cache the results, we can minimize the costs | ||
(maybe is not that much). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm, are you considering to trigger a task on the background when the API endpoint is hit the first time?
I thought we wanted to pre- calculate the diff only for projects with PR build enabled after a build for the PR succeeded. I think that all (most) users using PR builds will use this feature; so it makes sense to me to do it automatically at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #11507 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd agree with @humitos here, but also, let's optimize for cost savings later. We can gauge if this additional usage is really meaningful compared to engineering and maintaining a complex system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a lot of thinking about the logistics of managing file line count diffing, but I think we really just want to start with the actual files changed. The tradeoff between the features provided by knowing the number of lines changes (smarter sorting) vs. complexity of the feature feels like it's not worth it to me.
I think v1 of this is just using rclone, and showing the files changed. If we keep the functionality lightweight a ton of the complexity here seems like it goes away, we can run it on all PR builds, and we can figure out other approaches in the future for file diffing.
I'd recommend that we investigate mapping the changes files to the git repo changes before we do any of the proposed solutions where we download files from S3, but that should definitely be in a v2 after we do the initial work here of making a useful feature without sorting.
Git providers may already offer a way to compare file trees, but again, | ||
they work on the source files, and not on the generated files. | ||
|
||
All hope was lost for having nice features like this, until now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥼 Feels like the start of a Sci-Fi plot :D
$ rclone check --combined=- /usr/src/app/checkouts/readthedocs.org/a /usr/src/app/checkouts/readthedocs.org/b | ||
+ new.txt | ||
- deleted.txt | ||
= unchanged.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, that would be nice, especially for very large projects since most files won't have changed. Could always do something like grep -v "= "
or similar, but that still returns it, but we can avoid it in the Python process at least.
|
||
To start, we will only consider HTML files (``--include=*.html``). | ||
|
||
Lines changed between two files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced we need the lines changed in v1. I'd like to avoid this overhead if we can -- I think the start should just be the rclone approach, since sorting the files is a "nice to have" and not required for any of the functionality we build. It would be an obvious v2 feature once we see how much people are using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, though I'm not really sure where line differences really fits in on RTD, even in a next iteration. I'm not sure there is a good UX we can give with line numbers in a diffs, especially where our diffs will not align with changes from a PR -- because a PR base branch might not be the same as our default version.
I think the source file diffing is best left to GitHub and we should focus solely on the rendered diff experience.
docs/dev/design/file-tree-diff.rst
Outdated
API | ||
--- | ||
|
||
This operation is expensive, so it won't be available to unauthenticated users. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree making this authed seems like a bad place to start. I'd much rather make the endpoint be less expensive. I don't really understand why this would be slow if we already have VersionDiff objects in the DB? It seems like it should just be a DB query.
Integrations | ||
------------ | ||
|
||
You may be thinking that once we have an API, it will be just a matter of calling that API from a GitHub action. Wrong! | ||
|
||
Doing the API call is easy, but knowing *when* to call it is hard. | ||
We need to call the API after the build has finished successfully, | ||
or we will be comparing the files of an incomplete or stale build. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably we could just pass a commit to the API, and confirm we're getting data from that commit, and then poll if not?
- Expose this feature in the dashboard for now, and use our GitHub action to simply link to the dashboard. | ||
Maybe don't even expose the API to the public, just use it internally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we initially expose this feature in the Addons PR Build Warning message, which will only be shown on completed builds.
- Use a custom `repository dispatch event <https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#repository_dispatch>`__ | ||
to trigger the action from our webhook. This requires the user to do some additional setup, | ||
and for our webhooks to support custom headers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we're integrating this feature more, we should build a platform feature for it, and remove the GitHub Action approach. That way we can just have the comment update once the build is successful.
I just checked the latest commit with the updates and I feel there are still some things I'd like to discuss and to be reflected in this document:
|
I'm +1 on that, but wanted to see other's opinions
I'm not really sold on this idea, the order will be kind of random and unexpected for the changes done on the PR.
I'm okay leaving that for later
It should also point to the versions, since builds aren't deleted when a version is deleted. Or we can manually delete diff objects once a version is deleted.
That will be a breaking change, we should decide if we want to have it under auth or not.
This is assuming we will only run a diff between two sets of versions, PR preview and latest? or stable? I still think it's useful to allow diffing between any versions on demand. From Eric's comments, I think we don't want to expose this as an API yet anyway, but only from the addons endpoint. Which again, requires us to choose which version we will make the comparison over, I guess we will choose the one that the default branch points to. |
I'm saying that we can implement the authed version only for Read the Docs for Business to access private projects. For now, we can start implementing this feature on Read the Docs for Community to get some experience/feedback and grow from there for the v2.
We should use the same setting we are already using for DocDiff, which is currently
There are good points for each of the different approaches. I think this decision is important since it will dictate how we and other people will use this feature. @ericholscher @agjohnson what are your thoughts on each of these different approaches? |
It doesn't really seem like we need to make a decision about diffing different versions now? We just ship with the initial implementation that diffs the PR build against the latest version? We can always add some way to trigger diffing of arbitrary versions in the future, but it sounds like we don't need to do that in the initial implementation ("v1")? |
@ericholscher what's your position about 1) implementing the diffing as a background task and requiring hitting the API twice to get the data, or 2) always perform the diffing at build time and hit the API endpoint only once always getting the same response? |
For clarification, the first call will trigger the task and wait for the result. |
It seems pretty straightforward to support both the situation where we have the files on disk, or need to query S3 for a set of them. I think to start we do it during build time while we have the files on disk for PR builds? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes to the doc look good to me 👍
docs/dev/design/file-tree-diff.rst
Outdated
build_a = models.ForeignKey(Build, on_delete=models.CASCADE, related_name='diff_a') | ||
build_b = models.ForeignKey(Build, on_delete=models.CASCADE, related_name='diff_b') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do wonder if this should be a build object, or just a commit hash? I guess they work similarly, but there could be multiple different versions on the same commit, so it would be nice to not have to duplicate that work. However, we can also pretty easily get the commit via the Build.
Not 100% sure what is best here, but perhaps something to consider?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A build can be triggered for the same commit and still change the output, say it takes into consideration the time or an env variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The immediate plan here seems reasonable 👍
It seems we have the most questions on future plans for next iterations, but this seems like discussion we can come back to. There are some technical implementations in this document that can be dialed back a good deal to match what we want to build initially, and for the medium term -- for example, async execution of diffing and line diffing incur the most complexity that we don't yet need. Perhaps it's worth clarifying here before we get to building this.
|
||
To start, we will only consider HTML files (``--include=*.html``). | ||
|
||
Lines changed between two files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, though I'm not really sure where line differences really fits in on RTD, even in a next iteration. I'm not sure there is a good UX we can give with line numbers in a diffs, especially where our diffs will not align with changes from a PR -- because a PR base branch might not be the same as our default version.
I think the source file diffing is best left to GitHub and we should focus solely on the rendered diff experience.
Lines changed between two files | ||
------------------------------- | ||
|
||
Once we have the list of files that changed, we can use a tool like ``diff`` to get the lines that changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are we looking to do with these line numbers/source diffs? How are we planning on making this diff something users get value from?
I feel like adding line numbers in the diffs is at very least not yet necessary, though I would probably say this is a part of this feature we don't quite need either.
I think more applicable is making sure that visual diffing has enough UI/UX to be highly useful in review -- easier navigation between chunks, etc. I don't see line numbers or source diffs needed for something like this.
- Once we have the diff between versions ``A`` and ``B``, we can infer the diff between ``B`` and ``A``. | ||
We can store that information as well, or just calculate it on the fly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some edge cases where users might want to compare two non-default versions, but this will be a rare use case.
This sort of comparison feels like a better tool for git/GitHub than something we maintain. I'd agree with the above that comparing against only the default version is definitely where we should start, but I suspect there isn't much incentive for us to complicate this feature too much more than that.
Either way, the UI for linking to files that changed is only going to be visible on PR builds -- either as a comment/description in the PR or via a hovering UI in the PR built documentation. We don't have either option for non-PR build versions.
The version and build can be the full objects, or just the IDs and slugs. | ||
|
||
We will generate a lock on this request, to avoid multiple calls to the API for the same versions. | ||
We can reply with a ``202 Accepted`` if the diff is being calculated in another request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was picturing we'd do what @humitos is describing here. This feels like it should be a build time artifact, mainly because there doesn't really feel like a use case that benefits from this being async.
If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.
- Expose this feature in the dashboard for now, and use our GitHub action to simply link to the dashboard. | ||
Maybe don't even expose the API to the public, just use it internally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed we should use addons UI to start.
I think the build detail view could be another place eventually, as this is another place that maintainers will interact with RTD as they are reviewing. This could be for the next iteration though.
- Use a custom `repository dispatch event <https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#repository_dispatch>`__ | ||
to trigger the action from our webhook. This requires the user to do some additional setup, | ||
and for our webhooks to support custom headers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does seem like a good next step, but it doesn't sound like we need to start there either. Using addons to show this information is a good first step.
For a next iteration, we would need to decide what it would take to surface this information in GitHub -- using GitHub Actions or not. This doesn't feel immediately necessary though.
docs/dev/design/file-tree-diff.rst
Outdated
- Expose this feature only via the addons feature. | ||
- Allow to diff an external version against the version that points to the default branch/tag of the project only. | ||
- Use a feature flag to enable this feature on projects. | ||
- Run the diff while we have the files on disk (end of the build), if possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, regarding all the discussion above, this does match where we've discussed starting with this feature 👍
I think this is a medium term plan too. Your ideas on how to build on top of this feature sounds like we have some ways to iterate on this feature if we want, but I think this plan gives users the most useful feature we can without building something that is too complex for us to maintain well.
But since we are doing this on demand, and we can cache the results, we can minimize the costs | ||
(maybe is not that much). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd agree with @humitos here, but also, let's optimize for cost savings later. We can gauge if this additional usage is really meaningful compared to engineering and maintaining a complex system.
I like this idea |
We already built this, and then ripped it out :) |
Heh yeah, though we ripped it out because it was expensive and we also weren't using it. I don't feel we'd need a per-file model for this either way, a single manifest file seems reasonable. I am starting to feel that rclone is the wrong solution to build on top of. Rclone only does comparison at the file metadata level, which is only the bare minimum we need. It still requires us to do secondary comparison to avoid giving the user a long list of files that haven't actually changed. This may or may not require something like a hash manifest file, but definitely requires we do something on top of rclone. Some prior art to look at here is sitediff: A few different options it has for avoiding false positive diffs are:
So, really quite advanced, and obviously way more than we'd build. But it shows the difficulty of comparing two sets of HTML and some of the solutions necessary. Selector targeting seems like the easiest to accomplish. I think it would help to see actual output of rclone between two versions of our docs, maybe another project too. We've only discussed many of the plans here theoretically so far, it would help to figure out exactly what we're going to be working with. |
I gave a quick test to sitediff, it's a neat tool. I didn't want to crawl our prod docs, so just did the dev docs. I'd expect no changes between these two versions. Unconfigured, the output is what I expected from rclone -- quite noisy: Diffs show a lot of noise outside of content changes in all these files, any place Using a configuration with The noisy changes with version names and commit hashs are larger avoided, but the diffs here are mostly from intersphinx altering the link titles. Unfortunately, these are also invisible changes to the user, so the UX of telling users these files changed and them linking them to a doc page where nothing looks changed is going to be something we need to work around. The second is better, but I am still expecting no changes from this at all, so not ideal still. |
@agjohnson the research around I did a quick test by myself as well and it worked great, finding differences only on one file after defining the
They mention similar issues on their README file and solve them by using That said, I think we should continue discussing what's the path we want to follow here since there are a few possible implementations opened already and it seems there are more research we can do before making the next step. |
Yeah, and I wouldn't build exactly what sitediff built but testing with it does highlight what we can expect from similar approaches we're discussing. We'll hit these hurdles too, and might find ourselves needing similar features -- not that I think we should build something this complex. Agreed we should continue discussing this more. I don't have a great feeling for where we do go, but it's feeling like we'll need to give more user control than file-based comparison. I suppose to echo what we're doing with output paths, we could expect the build to output a manifest. We could make this a job step that the user can override and apply advanced transform rules if they want. A left field option could be using docdiff to do the comparisons, and crawling each site. This is back to the async issue though and could be slow. However, it would be much better UX in that the list of files would align with what docdiff would give the user visually. |
I thought more about this. I started thinking about "build an integration with Then, I went back to the origins of all this discussion: "we need a way to map a If we are thinking of giving users the ability to somehow configure the "file tree diff" feature to be able to detect what are the changes between two rendered HTML by defining what's the main selector, and potentially allowing them to define other selectors to ignore some chunks, and... blah blah -- I would like to propose some a lot easier for everybody: why not just ask them "how to map a Knowing that, we can use the right tool designed to compare files that we all know: Git. With that information, we can then run the mapping the user provided us 1 over each of the changed files to get the URL. Git will tell us:
There are two cons that I'm seeing here:
In my opinion, this looks like a good idea to discuss and explore more compared to the alternatives we have been discussing. It may not be perfect, but looks a lot easier to understand, to implement, and to explain to users. I'd love to know your feedback about this. Where do you see potential issues? What do you like? What do you dislike? How do you compare it with the alternatives? Footnotes
|
Agreed. And sorry if I wasn't clear above but I was just testing sitediff, I also did not want to build on top of this tool. It just already does much of what we described building, so worth comparing at least.
I also had this same thought path. Though I was also considering building on top of what files PRs list as changed. I realized that this is not a good option as the project might not build the PR base branch or the version might not match. Relying on Git to detect changes is a closer possibility. I would still vote to keep this as a backup option for now. It requires us maintaining more custom Git operations and the user has to give us a mapping. I could see us requiring a mapping file in the build, or we might even use a However, after thinking through source file mapping, and requiring the user tell us how to map changes, I came all the way back to just asking the user to tell us what changed:
That is, projects would have to output a hash manifest file of changes to HTML files in their build:
The plan above for describing differences is still just pulling the two checksum files and noting what doesn't match. We have the page URLs in the checksum file, can note page deletions/additions, and when our defaults don't work the user has full control. It does seem like we should involve the user at some point, at least for some portion of projects. |
We discussed this in the call today. Takeaways:
|
For a first pass, I feel we should do whatever is quickest so that we can work on the UX here too. This might still be file based hashing just to start experimenting even. Relying on our search code to break down a DOM to text only representation feels like a second or third pass maybe? Thinking about text representation vs DOM comparison more, DOM comparison seems like where we should be ultimately though. Users should be alerted on changes like image URLs, HTML classes, and non-text DOM changes, not just text changes. Maybe text-only mode is a configurable option later though. |
I think we can just start with extracting the main content of the file (we already do this for search) and just hash it as is. I have updated the design doc to include things that we discussed, I left the old ideas in case they are useful, but I'm fine removing them. |
Yeah, I think we're talking about the same thing. I'm saying we avoid rendering the DOM to text and instead hash the inside contents of an element inside the HTML document. I'm not familiar with what we are technically doing with search indexing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update looks good to me. I think there are more explicit descriptions to be done still because I'm still not sure we are all in the same page. I left some suggestion to explicitly mention what we discuss in the meeting as way to be sure of that.
Besides those comments in the PR,
- The document mentions rclone, but I understand we are not going to use it at all for this. If we want to keep that section, we should add a note or similar saying that this was considered originally but we are not going to implement it.
- Same for "Lines changed between two files" section, since we won't implement that approach.
VersionDiff
object from "Storing results" is not required anymore; since we will useImportedFile
with some modifications if required. So, that section may need to be re-written/updated/removed.
Also, with the new implementation discussed for v1; how would be the API response? Do we need to adjust anything there?
When a build finishes, we generate this manifest for all HTML files, and store it. | ||
When we need to compare two versions, we can just compare the manifests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand these manifest will hash only the "main content" of the document. We should mention that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mentioned in the "Changed files" and "Initial implementation" section, this section doesn't touch on how the hash is calculated.
- Generate a manifest of all HTML files from the versions that we want to compare. | ||
This will be done at the end of the build. | ||
- Generate the hash based on the main content of the file, | ||
not the whole file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we going to create two different manifest files? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Why did you think we need two?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not saying we need/want two manifest files, but I'm confused about if the manifest will have the hash for the main content and another hash for the full file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the main content, don't think we need the hash from the whole content for anything.
- Don't store the results in the DB, | ||
we can store the results in a next iteration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What results are you referencing here? Didn't we say to store the hash in ImportFile
or similar object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The results from the diff, hashes will be stored in storage in a JSON file. But we may want to store the results from the diff for faster responses. We don't need to use ImportFile
for any of that.
The initial implementation won't have a public API, we will only expose this feature within the addons. But don't think the response proposed before needs major changes. |
I'd like to move forward on this. It feels like we're mostly on the same page, so I think we can resolve most of the remaining questions on an actual implementation at this point 👍 |
Going to merge this since we are working on the implementation now. |
ref #11319
Manual linking to the preview since we don't have this feature yet :D https://dev--11507.org.readthedocs.build/en/11507/design/file-tree-diff.html
📚 Documentation previews 📚
docs
): https://docs--11507.org.readthedocs.build/en/11507/dev
): https://dev--11507.org.readthedocs.build/en/11507/