Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Version file tree diff: design doc #11507

Merged
merged 12 commits into from
Oct 9, 2024
359 changes: 359 additions & 0 deletions docs/dev/design/file-tree-diff.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,359 @@
Version file tree diff
======================

Goals
-----

- Compare files from two versions to identify the files that have been added, removed, or modified.
- Provide an API for this feature.
- Integrate this feature to suggest redirects on files that were removed.
- Integrate this feature to list the files that changed in a pull request.
stsewd marked this conversation as resolved.
Show resolved Hide resolved

Non-goals
---------

- Replace the `docdiff <https://github.com/readthedocs/addons?tab=readme-ov-file#docdiff>`__ feature from addons.
That works on the client side, and it's good for comparing the content of files.

Current problems
----------------

Currently, when a user opens a PRs, they need to manually search for the files of interest (new and modified files).
We have a GitHub action that links to the root of the documentation preview, that helps a little, but it's not enough.

When files are removed or renamed, users may not be aware that a redirect may be needed.
We track 404s in our traffic analytics, but they don't keep track of the version,
and it may be too late to add a redirect when users are already seeing a 404.

In the past, we haven't implemented those features, because it's hard to map the source files to the generated files,
since that depends on the build tool and configuration used by the project.

Git providers may already offer a way to compare file trees, but again,
they work on the source files, and not on the generated files.

All hope was lost for having nice features like this, until now.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥼 Feels like the start of a Sci-Fi plot :D


Proposed solution
-----------------

Since redirects and files of interest are related to the generated files,
instead of working over the source files, we will work over the generated files, which we have access to.

The key points of this feature are:

- Get the diff of the file tree between two versions.
- Expose that as an API.
- Integrate that in PR previews.

Diff between two versions
-------------------------

Using a manifest
~~~~~~~~~~~~~~~~

We can create a manifest that contains the hashes and other important metadata of the files,
we can save this manifest in storage or in the DB.

When a build finishes, we generate this manifest for all HTML files, and store it.
When we need to compare two versions, we can just compare the manifests.
Comment on lines +57 to +58
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand these manifest will hash only the "main content" of the document. We should mention that here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mentioned in the "Changed files" and "Initial implementation" section, this section doesn't touch on how the hash is calculated.


This doesn't require downloading the files, but it requires building a version to generate the manifest.

The manifest will be a JSON object with the following structure:

.. code:: json

{
"build": {
"id": 1
},
"files": {
"index.html": {
"hash": "1234567890"
},
"path/to/file.html": {
"hash": "1234567890"
}
}
}

Using rclone
stsewd marked this conversation as resolved.
Show resolved Hide resolved
~~~~~~~~~~~~

.. note::

This solution won't be used in the final implementation, it's kept here for reference.

We are already using ``rclone`` to speed up uploads to S3,
``rclone`` has a command (``rclone check``) to return the diff between two directories.
For this, it uses the metadata of the files, like size and hash
(it doesn't download the files).

.. code:: console

$ ls a
changed.txt new.txt unchanged.txt
$ ls b
changed.txt deleted.txt unchanged.txt
$ rclone check --combined=- /usr/src/app/checkouts/readthedocs.org/a /usr/src/app/checkouts/readthedocs.org/b
+ new.txt
- deleted.txt
= unchanged.txt
Copy link
Member Author

@stsewd stsewd Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't find an option to exclude the unchanged files from the list, we could log each type to three different files if we want, but I like no having to deal with files...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, that would be nice, especially for very large projects since most files won't have changed. Could always do something like grep -v "= " or similar, but that still returns it, but we can avoid it in the Python process at least.

* changed.txt

The result is a list of files with a mark indicating if they were added, removed, or modified, or if they were unchanged.
The result is easy to parse.
stsewd marked this conversation as resolved.
Show resolved Hide resolved
There is no option to exclude the files that were unchanged when using ``--combined``,
another option can be to output each type of change to a different file (``--missing-on-dst``, ``--missing-on-src``, ``--differ``).

To start, we will only consider HTML files (``--include=*.html``).

Changed files
-------------

Listing the files that were added or deleted is straightforward,
but when listing the files that were modified, we want to list files that had relevant changes only.

For example, if the build injects some content that changes on every build (like a timestamp or commit),
we don't want to list all files as modified.

We have a couple of options to improve this list.

Hashing the main content
~~~~~~~~~~~~~~~~~~~~~~~~

Timestamps and other metadata is usually added in the footer of the files, outside the main content.
Instead of hashing the whole file, we can hash only the main content of the file,
and use that hash to compare the files.

This will allow us to better detect files that were modified in a meaningful way.

Since we don't need a secure hash, we can use MD5, since it's built-in in Python.

Lines changed between two files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced we need the lines changed in v1. I'd like to avoid this overhead if we can -- I think the start should just be the rclone approach, since sorting the files is a "nice to have" and not required for any of the functionality we build. It would be an obvious v2 feature once we see how much people are using it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, though I'm not really sure where line differences really fits in on RTD, even in a next iteration. I'm not sure there is a good UX we can give with line numbers in a diffs, especially where our diffs will not align with changes from a PR -- because a PR base branch might not be the same as our default version.

I think the source file diffing is best left to GitHub and we should focus solely on the rendered diff experience.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

This solution won't be used in the final implementation, it's kept here for reference.

In order to provide more useful information, we can sort the files by some metrics,
like the number of lines that changed.

Once we have the list of files that changed, we can use a tool like ``diff`` to get the lines that changed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we have the list of changed files, will we download all those files and execute the diff over each of them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we looking to do with these line numbers/source diffs? How are we planning on making this diff something users get value from?

I feel like adding line numbers in the diffs is at very least not yet necessary, though I would probably say this is a part of this feature we don't quite need either.

I think more applicable is making sure that visual diffing has enough UI/UX to be highly useful in review -- easier navigation between chunks, etc. I don't see line numbers or source diffs needed for something like this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the count of lines changed could be useful to sort the changed files, so it's easy to know which files require the most attention.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, if the current date or commit is injected into each page, all pages will be marked as changed.

Copy link
Contributor

@agjohnson agjohnson Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, another thought, maybe our detection of changes should be some combination of user configuration and HTML parsing. Users give us the project's main content selector(s), or we have a reliable default, and we store hashes of this content inside this selector at the end of the build process? However even this might only help filter out file changes where dynamic content comes from the theme templates (commit id, date changed, etc in the footer/header).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could use our search indexing code to get a better diff, that already parses all sections on the main content. We already have that information in ES (well, not for PR previews).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will still give the full list of files that changed, but the sorting will help the user to identify which files are more relevant.

A little bit, but in that UX we're still giving the user a huge list of files that changed, putting the burden on the user to figure out what is important. We should be aiming to give the user a minimal, accurate list of files changed by a PR. If a PR only makes small changes to some files, our UX should point the user to those files. Sorting only seems to be a work around for a number of cases, not a UX we actually want to give to users.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking something like "quick list of maybe relevant changes" (10 top files deleted/changed/added) and then a "view all changed files" link below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with that, if we can't distinguish between a small change an author makes and a small change that the build tool makes, sorting won't accurately give the top 10 files changed. Legitimate small changes wouldn't sort any higher up than many dynamic content changes.

I don't know if parsing HTML and only hashing nested content gets around this issue, but rclone will definitely be subject to this.

This is useful to link to the most relevant files that changed in a PR.

.. code:: console

$ cat a.txt
One
Two
Three
Four
Five
$ cat b.txt
Ore
Three
Four
Five
Six
$ diff --side-by-side --suppress-common-lines a.txt b.txt
One | Ore
Two <
> Six

.. note::

Taken from https://stackoverflow.com/questions/27236891/diff-command-to-get-number-of-different-lines-only.

The command will return only the lines that changed between the two files.
We can just count the lines, or maybe even parse each symbol to check if the line was added or removed.

Another alternative is to use the `difflib <https://docs.python.org/3/library/difflib.html>`__ module,
the only downside is that it doesn't distinguish lines that were changed from lines that were added or removed.
But maybe that's ok? Do we really need to know if a line was changed instead of added or removed?

.. code:: python

import difflib

diff = difflib.ndiff(["one", "two", "three", "four"], ["ore", "three", "four", "five"])
print(list(diff))
# ['+ ore', '- one', '- two', ' three', ' four', '+ five']

A good thing of using Python is that we don't need to write the files to disk,
and the result is easier to parse.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably slower as well, but I think I'm liking the idea of doing the diff in Python.


Alternative metrics
+++++++++++++++++++

.. note::

This solution won't be used in the final implementation, it's kept here for reference.

Checking the number of lines that changed is a good metric, but it requires downloading the files.
Another metric we could use is the size of the files, that can be obtained from the metadata (no need of downloading the files),
The most a file size has changed, the most lines have likely been added or removed,
this still leaves lines that changed with the same amount of characters as irrelevant in the listing.

Storing results
---------------

Doing a diff between two versions can be expensive, so we need to store the results.

We can store the results in the DB (``VersionDiff``).
The information to store would contain some information about the versions compared, the builds, and the diff itself.

.. code:: python

class VersionDiff(models.Model):
version_a = models.ForeignKey(
Version, on_delete=models.CASCADE, related_name="diff_a"
)
version_b = models.ForeignKey(
Version, on_delete=models.CASCADE, related_name="diff_b"
)
build_a = models.ForeignKey(Build, on_delete=models.CASCADE, related_name="diff_a")
build_b = models.ForeignKey(Build, on_delete=models.CASCADE, related_name="diff_b")
diff = JSONField()

The diff will be a JSON object with the files that were added, removed, or modified.
With an structure like this:

.. code:: json

{
"added": [{"file": "new.txt"}],
"removed": [{"file": "deleted.txt"}],
"modified": [{"file": "changed.txt", "lines": {"added": 1, "removed": 1}}]
}

The information is stored in a similar way that it will be returned by the API.
Things important to note:

- We need to take into consideration the diff of the latest successful builds only.
If any of the builds from the stored diff don't match the latest successful build of any of the versions,
we need to the diff again.
stsewd marked this conversation as resolved.
Show resolved Hide resolved
- Once we have the diff between versions ``A`` and ``B``, we can infer the diff between ``B`` and ``A``.
We can store that information as well, or just calculate it on the fly.
Comment on lines +237 to +238
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want diff between all the versions and I would leave this infer for a next iteration in case we need it. Currently, we only want the diff between the default version (as source) and the PR build.

What are other cases that you are considering the diff being useful for outside the one I mentioned?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's useful to see the diff between other versions, stable vs latest, v1 vs v2, etc. This feature isn't just useful for PR builds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand on what are those use cases you have in mind to see the files differences between those other versions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sames as checking the diff between a PR preview and latest/stable. You want to know which files has changed, suggestions for possible redirects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I know. I'm asking for more context here because I'm not finding useful knowing which files has changed between versions that don't include a PR version in the comparison.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see a case where a v1 is out and a v2 is released, and you need to figure out what files are different to create redirects on latest. So you'd want a diff between latest and v2, or similar. Or even a diff between v1 and v2 to better understand the work done.

It doesn't seem like a super common case though, but I can see some uses for it. It seems safe to avoid doing that work in the initial version of this, and see if people request it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like a super common case though, but I can see some uses for it. It seems safe to avoid doing that work in the initial version of this, and see if people request it.

Restricting the versions that are allowed would add a special case to the API, I mean, is not a tone of work, but allowing all versions isn't either.

If we are going to restrict the versions from version_a to external versions, are we going to restrict the versions from version_b? Just latest or stable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API could always return 404 or similar HTTP error code if there is no PR involved in the comparison; for example. I think that's not a problem and should be easy to implement.

I'd start with version_a being the default version or latest and version_b being an external version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some edge cases where users might want to compare two non-default versions, but this will be a rare use case.

This sort of comparison feels like a better tool for git/GitHub than something we maintain. I'd agree with the above that comparing against only the default version is definitely where we should start, but I suspect there isn't much incentive for us to complicate this feature too much more than that.

Either way, the UI for linking to files that changed is only going to be visible on PR builds -- either as a comment/description in the PR or via a hovering UI in the PR built documentation. We don't have either option for non-PR build versions.

- The list of files are objects, so we can store additional information in the future.
- When a file has been modified, we also store the number of lines that changed.
We could also show this for files that were added or removed.
Comment on lines +240 to +241
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have to download the files, this will be expensive (in particular with big files: bandwidth and CPU). I would make this optional in a way that we can enable/disable easily. Or even maybe in a second iteration.

- If a project or version is deleted (or deactivated), we should delete the diff as well.
stsewd marked this conversation as resolved.
Show resolved Hide resolved
- Using the DB to save this information will serve as the lock for the API,
so we don't calculate the diff multiple times for the same versions.

We could store the changed files sorted by the number of changes, or make that an option in the API,
or just let the client sort the files as they see fit.

API
---

The initial diff operation can be expensive, so we may consider not exposing this feature to unauthenticated users.
And a diff can only be done between versions of the same project that the user has access to.

The endpoint will be:

GET /api/v3/projects/{project_slug}/diff/?version_a={version_a}&version_b={version_b}

And the response will be:

.. code:: json

{
"version_a": {"id": 1, "build": {"id": 1}},
"version_b": {"id": 2, "build": {"id": 2}},
"diff": {
"added": [{"file": "new.txt"}],
"removed": [{"file": "deleted.txt"}],
"modified": [{"file": "changed.txt", "lines": {"added": 1, "removed": 1}}]
}
}

The version and build can be the full objects, or just the IDs and slugs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the complete Version objects to keep consistency with all the other endpoints, yes 👍🏼


We will generate a lock on this request, to avoid multiple calls to the API for the same versions.
humitos marked this conversation as resolved.
Show resolved Hide resolved
We can reply with a ``202 Accepted`` if the diff is being calculated in another request.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "being calculated in another request"? I understand the VersionDiff data is going to be pre-calculated after build succeeds, so the data will be always ready for the endpoint.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We won't run the diff on every build, only on demand. Running on every build will be expensive if not all projects need or use this feature.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't start with the complexity added of running it on demand. That would also degrade the user experience that have to wait for it to finish after opening the PR preview.

We should start building this feature to be run after each successful build in a Celery task for those project with PR builds enabled and grow from there. That will give us a better understanding of the costs of this feature. We had a similar thought some time ago when we built PR builds and it ended not being that expensive and being one of our top features.

I'd apply the same thought logic here and keep the initial implementation simple and user friendly, providing the best UX we can.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't start with the complexity added of running it on demand. That would also degrade the user experience that have to wait for it to finish after opening the PR preview.

Don't think complexity is a problem here, triggering a task after each build or calling that task when the API is requested are not a complex task.

We should start building this feature to be run after each successful build in a Celery task for those project with PR builds enabled and grow from there. That will give us a better understanding of the costs of this feature.

If we do that, we are starting big, running this task on each PR preview, for everyone.

I'd apply the same thought logic here and keep the initial implementation simple and user friendly, providing the best UX we can.

Depending on how this will integrate with the GitHub action, users will still need to wait for the build to finish in order to be able to retrieve the diff, there is no downside by doing this on demand if used with the GH pooling this API every x seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was picturing we'd do what @humitos is describing here. This feels like it should be a build time artifact, mainly because there doesn't really feel like a use case that benefits from this being async.

If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.

Copy link
Member

@humitos humitos Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're worried about performance and cost of s3 calls, I would suggest we consider storing a file manifest for each build with file hashes instead of using rclone at all.

This is, actually, a great idea and not only for cost of s3 calls. We already have the modeling for this: we can expand the HTMLFile model to add a hash512 or hashmd5 field there. Then use that field to compare against versions when hitting the API endpoint. All data will be in the database and we won't require hitting S3 at all -- that should be fast.

Having all this data will allow us to compare any version against any other version in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't tracking all HTML files in the DB anymore, that was taking a lot of space in our DB. We are only tracking the existence of index.html files. Saving a manifest seems good, maybe save it as a field in the DB for the latest build only, but also having that S3 should be ok as well.


Integrations
------------

You may be thinking that once we have an API, it will be just a matter of calling that API from a GitHub action. Wrong!

Doing the API call is easy, but knowing *when* to call it is hard.
We need to call the API after the build has finished successfully,
or we will be comparing the files of an incomplete or stale build.
Comment on lines +278 to +285
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't worry too much about this yet. Jumping into this terrain at this stage will make everything more complex from the beginning. I think the first step here is having the diff data stored in the database and the API endpoint that returns the diff if the data exists.

The PR preview only works if the build has succeed. So, we will always hit the endpoint when the data is available already. That's not a problem.

The GitHub action could use a simple polling technique for now: polling the build API every X seconds to know if the build has finished. In case it succeed, call the diff API to update the message.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR preview only works if the build has succeed. So, we will always hit the endpoint when the data is available already. That's not a problem.

The action doesn't know when the build finished, that's the problem.

The GitHub action could use a simple polling technique for now: polling the build API every X seconds to know if the build has finished. In case it succeed, call the diff API to update the message.

Could be an option, but it may take a while for some builds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably we could just pass a commit to the API, and confirm we're getting data from that commit, and then poll if not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API will return the build object, which has the commit. But polling will still be an issue for long builds, the action will need to keep running until it times out or a diff is returned.


Luckily, we have a webhook that tells us when a build has finished successfully.
But, we don't want users to have to implement the integration by themselves.

We could:

- Use this as an opportunity to explore using GitHub Apps.
- Request additional permissions in our existing OAuth2 integration (``project`` scope). Probably not a good idea.
- Expose this feature in the dashboard for now, and use our GitHub action to simply link to the dashboard.
Maybe don't even expose the API to the public, just use it internally.
Comment on lines +294 to +295
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like having this feature in the dashboard, but the "link me to relevant files changed in this PR" feature would be great for authors as well, which may not have access to the RTD dashboard.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we initially expose this feature in the Addons PR Build Warning message, which will only be shown on completed builds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that, but also is something we have 100% control over and we can play with it before going live and let users to use it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed we should use addons UI to start.

I think the build detail view could be another place eventually, as this is another place that maintainers will interact with RTD as they are reviewing. This could be for the next iteration though.

- Use a custom `repository dispatch event <https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#repository_dispatch>`__
to trigger the action from our webhook. This requires the user to do some additional setup,
and for our webhooks to support custom headers.
Comment on lines +296 to +298
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want, we could automate this a little, and just ask the user for a GitHub token, and we create the webhook behind the scenes, and then ask them to use our GitHub action as a final step.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we're integrating this feature more, we should build a platform feature for it, and remove the GitHub Action approach. That way we can just have the comment update once the build is successful.

Copy link
Member Author

@stsewd stsewd Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will require using a GitHub app :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That does seem like a good next step, but it doesn't sound like we need to start there either. Using addons to show this information is a good first step.

For a next iteration, we would need to decide what it would take to surface this information in GitHub -- using GitHub Actions or not. This doesn't feel immediately necessary though.

- Hit the API repeatedly from the GitHub action until the diff is ready.
This is not ideal, some build may take a long time, and the action may time out.
- Expose this feature in the addons API only, which will hit the service when a user views the PR preview.

Initial implementation
----------------------

For the initial implementation, we will:

- Generate a manifest of all HTML files from the versions that we want to compare.
This will be done at the end of the build.
- Generate the hash based on the main content of the file,
not the whole file.
Comment on lines +308 to +311
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to create two different manifest files? 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Why did you think we need two?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not saying we need/want two manifest files, but I'm confused about if the manifest will have the hash for the main content and another hash for the full file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the main content, don't think we need the hash from the whole content for anything.

- MD5 will be the hashing algorithm used.
- Only expose the files that were added, removed, or modified (HTML files only).
The number of lines that changed wont be exposed.
- Don't store the results in the DB,
we can store the results in a next iteration.
Comment on lines +315 to +316
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What results are you referencing here? Didn't we say to store the hash in ImportFile or similar object?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The results from the diff, hashes will be stored in storage in a JSON file. But we may want to store the results from the diff for faster responses. We don't need to use ImportFile for any of that.

- Expose this feature only via the addons feature.
- Allow to diff an external version against the version that points to the default branch/tag of the project only.
- Use a feature flag to enable this feature on projects.

Other features that are not mentioned here, like exposing the number of lines that changed,
or a public API, will not be implemented in the initial version,
and may be considered in the future (and thier implementation is subject to change).

Possible issues
---------------

In the case that we use a manifest,
hashing the contents of the files may add some overhead to the build.

In the case that we use ``rclone``,
even if we don't download files from S3, we are still making calls to S3, and AWS charges for those calls.
But since we are doing this on demand, and we can cache the results, we can minimize the costs
(maybe is not that much).
Comment on lines +333 to +334
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm, are you considering to trigger a task on the background when the API endpoint is hit the first time?

I thought we wanted to pre- calculate the diff only for projects with PR build enabled after a build for the PR succeeded. I think that all (most) users using PR builds will use this feature; so it makes sense to me to do it automatically at this point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd agree with @humitos here, but also, let's optimize for cost savings later. We can gauge if this additional usage is really meaningful compared to engineering and maintaining a complex system.


``rclone check`` returns only the list of files that changed,
if we want to make additional checks over those files, we will need to make additional calls to S3.

We should also just check a X number of files, we don't want to run a diff of thousands of files,
and also a limit on the size of the files.

Future improvements and ideas
-----------------------------

- Detect moved files.
stsewd marked this conversation as resolved.
Show resolved Hide resolved
This will imply checking the hashes of deleted and added files,
if that same hash of a file that was deleted matches one from a file that was added,
we have a move.
In case we use rclone, since we don't have access to those hashes after rclone is run,
we would need to re-fetch that metadata from S3.
Could be a feature request for rclone.
- Detect changes in sections of HTML files.
stsewd marked this conversation as resolved.
Show resolved Hide resolved
We could re-use the code we have for search indexing.
- Expand to other file types
- Allow doing a diff between versions of different projects
- Allow to configure how the main content of the file is detected
(like a CSS selector).
- Allow to configure content that should be ignored when hashing the file
(like a CSS selector).