
Support large(r) file uploads by directly uploading to S3 #2012

Merged
merged 20 commits into from
Oct 30, 2024

Conversation

carolinemodic
Contributor

@carolinemodic carolinemodic commented Oct 23, 2024

#1674

Support large(r) file uploads by directly uploading to S3

In order to enable larger file uploads in Arlo, upload directly from the client to S3 rather than doing "passthrough" file uploads through our server. This is the Heroku-recommended approach and the one generally recommended by most other Heroku/S3 users. There will still be a 5GB max limit, above which we would need to chunk the transfer to S3. The process involves:

  1. Making a GET to /upload-url to get the url to POST the file to for the actual upload. If configured to save files to the local file system this url will be a local server endpoint, otherwise it will be a presigned AWS url.
  2. Make a POST to the url returned in step 1 with the file attached. Again, this will either be a server endpoint (/file) or an AWS endpoint.
  3. Make a POST to /upload-complete to mark the file as uploaded and do any post-upload steps like updating DB elements and kicking off background jobs.
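
The first step of the flow above can be sketched as follows. This is a minimal illustration: the names `UploadPlan` and `get_upload_url`, the endpoint paths, and the bucket URL are all placeholders, not Arlo's actual code (in production the presigned URL would come from boto3's `generate_presigned_post`).

```python
from dataclasses import dataclass


@dataclass
class UploadPlan:
    url: str        # where step 2 POSTs the file
    is_local: bool  # True when files are saved to the local file system


def get_upload_url(use_local_storage: bool, election_id: str) -> UploadPlan:
    """Step 1: GET /upload-url decides where the file should be POSTed."""
    if use_local_storage:
        # Local dev: step 2 POSTs back to our own server's /file endpoint.
        return UploadPlan(url=f"/api/election/{election_id}/file", is_local=True)
    # Production: step 2 POSTs to a presigned AWS URL (placeholder here).
    return UploadPlan(url="https://arlo-bucket.s3.amazonaws.com/", is_local=False)
```

Step 3 then stays the same in both environments, which keeps branching between local dev and prod minimal.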

I landed on keeping the upload-url endpoint for step 1 rather than the other suggested approaches: different files have different authentication, which directly relates to the file path that the endpoint needs to construct (it needs to know about the election_id and in some cases jurisdiction_id), so it seemed simplest to leave it like this.

This PR also changes CVR uploads to be a single file that is always the "wrapper" zip we previously created for Hart/ESS. In the Hart case, we now check whether there are any zip files inside the top-level zip after unzipping, to handle the situation where someone just uploads the zip of CVRs without "wrapping" it in anything. Some Hart validation logic around not having images or other file types in the zip now happens in the processing step instead.
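
The fallback check described above could look roughly like this, using the standard `zipfile` module. The function name `contains_nested_zips` is hypothetical, not Arlo's actual helper:

```python
import io
import zipfile


def contains_nested_zips(zip_bytes: bytes) -> bool:
    """Return True if the top-level zip holds any .zip entries, i.e. it
    looks like the "wrapper" zip rather than a bare zip of CVRs."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as top_level:
        return any(name.lower().endswith(".zip") for name in top_level.namelist())
```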

I recommend reviewing by commit. I tried to isolate the bigger test refactors, but let me know if you want me to break up any commits further.

If we wanted to support multiple file uploads in the future without changing this approach, I think we would need to:

  • Upload all selected files to the s3 bucket (this part is easy enough; we can even call the presigned POST url multiple times)
  • Change the way it references a single CVR file in the DB to either reference a path to the s3 bucket containing all the files, or allow a list of cvr files
  • Change the hart/ess background tasks to fetch all the files in that bucket instead of fetching the zip and unzipping.

@@ -41,6 +41,7 @@ def s3(): # pylint: disable=invalid-name
"s3",
aws_access_key_id=config.AWS_ACCESS_KEY_ID,
aws_secret_access_key=config.AWS_SECRET_ACCESS_KEY,
region_name="us-west-1",
Contributor Author

I had to add this, or the url returned in the get presigned post call will not be domain specific, which causes errors. I assume this should be an env var?
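
If this does become an env var, a minimal sketch might look like the following. The `AWS_REGION` variable name is an assumption for illustration, not an existing Arlo config value; the lookup dict would be `os.environ` in practice.

```python
def s3_region(env: dict, default: str = "us-west-1") -> str:
    # Read the region from the environment, falling back to the currently
    # hardcoded value. "AWS_REGION" is a hypothetical variable name.
    return env.get("AWS_REGION", default)
```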

@@ -237,7 +233,7 @@ const SelectSystemStep: React.FC<{
<HTMLSelect
large
onChange={setSystemType}
value={systemType ?? undefined}
value={systemType || undefined}
Contributor Author

the pre commit hook yelled about this line and made me change it

Contributor

weird! I wonder how it got past the hook before

@jonahkagan
Contributor

Overall feedback:

  • I think the three step approach is an elegant solution to supporting both s3 and local file uploads.
  • As you mentioned, I think we can take the idea a little further by making the local file upload interface more generic. It could even just mimic the S3 interface to simplify the rest of the code even further. Then it could definitely just be a single endpoint.
  • If we mimic the S3 interface for local uploads, then step 1 could become simpler: we'd just need logic somewhere that tells us the S3 path for each file upload. This could potentially be a backend endpoint for each file upload like you have currently, or it could be a single backend endpoint. It could even just be a function on the frontend or hardcoded into each file upload on the frontend.
  • It makes total sense to me to have all of the clearing logic and such in the step 3 endpoint handler. I think we could go further and do all the validation there as well. We don't really need to validate anything before uploading a file to S3, do we?

Python notes:

  • Python type-checking is not great. I wouldn't waste too much time on trying to get it to work.

@arsalansufi
Contributor

The endpoint naming convention I've adopted, the idea of calling back to the server on local filesystem uploads.

I like the endpoint names! Very clear. I also can't think of a way to make this fewer than 3 calls.

And I like the idea of keeping the 3 calls for local file system uploads, too. To keep branching between local dev and prod as minimal as possible.

The endpoint implementations in the server for Step 2 are basically identical, other than a few cases where the file type is different. Should I have a generic endpoint for this or keep it duplicated for each file option?

I like the idea of consolidating if easy to do so, again just to make local dev as similar to prod as possible.

I moved any sort of "clearing" step from occurring BEFORE the new uploaded file is stored to after, in Step 3. I did this so that all of the db updates happen in the same endpoint, but I'm not sure if that matters and if this should happen somewhere else.

This makes sense to me!

Validation logic - There are a few things like file type that get validated in multiple steps, is that necessary? Are there places where I should add more validation? I think some of my step 2 refactors may need more file type validation.

I like the idea of consolidating validation logic, probably in step 3. The only validation that I'd want to happen sooner is preventing wrong file types, and I think we already cover that on the frontend by virtue of the allowed file types that we specify when we instantiate the file input.

Is it "filename" or "file_name" lol

I've always used file_name for consistency with file_path, file_type, etc. haha, but don't feel strongly

I noticed in general that the validation in these endpoints was often refactored to its own function, however I would then hit issues where I needed to re-do it or add "type: ignore" markers because it wouldn't know in the main server endpoint body that some field was already validated to not be None. Is there a reason why we usually do this? Is it ok to not do this? Any suggestions?

I have in some cases used typing.cast to provide the Python type checker knowledge it's missing, e.g.,

typing.cast(
    Optional[Dict[str, Dict[str, Optional[str]]]],
    jurisdiction.contest_choice_name_standardizations,
)

Might help in some of the cases you're describing? But also okay with punting as Jonah suggested.
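
As a runnable illustration of that cast pattern (the names `validate_name` and `greet` are made up for this sketch, not Arlo helpers):

```python
import typing
from typing import Optional


def validate_name(name: Optional[str]) -> None:
    # Validation lives in its own function, as in the refactors discussed.
    if name is None:
        raise ValueError("name is required")


def greet(name: Optional[str]) -> str:
    validate_name(name)
    # The type checker can't see that validate_name ruled out None,
    # so tell it explicitly instead of sprinkling "type: ignore".
    return "Hello, " + typing.cast(str, name)
```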


To the questions re wrapper ZIPs, assumptions around Hart/ES&S, etc., I believe we covered all those in Slack but am happy to answer any follow up questions!

@carolinemodic carolinemodic force-pushed the caro/large_file_uploads branch 3 times, most recently from fa3fd69 to f79cacd Compare October 29, 2024 01:34
@carolinemodic carolinemodic marked this pull request as ready for review October 29, 2024 01:43
@carolinemodic carolinemodic force-pushed the caro/large_file_uploads branch 2 times, most recently from 07c8327 to 1e3f911 Compare October 29, 2024 15:12
Contributor

@jonahkagan jonahkagan left a comment

Thanks for splitting this up into easy to read commits! It was pretty straightforward to review.

I've got a bunch of small notes but nothing major (hopefully)

try {
const file = files[0]
Contributor

If we're only supporting a single file, should we change the files: File[] param to just be file: File?

: {
fileType: file.type,
}
const getUploadResponse = await axios(
Contributor

IIRC, we only used axios over fetch in order to gain the ability to track upload progress. I can't remember how it impacts mocks and testing, but might consider switching all requests but the s3 upload request to fetch.

`/api${url}/upload-complete`,
addCSRFToken({
method: 'POST',
data: finalizeUploadformData,
Contributor

I think these endpoints should probably just take JSON for consistency with the rest of the API, since we don't need form data to send a file. Though I realize at this point that might be an annoying large change to make.


multiple:
values.cvrFileType === CvrFileType.ESS ||
values.cvrFileType === CvrFileType.HART,
multiple: false,
Contributor

Nit: I think this is false by default, and could be omitted now

@@ -1035,19 +1034,31 @@ def parse_hart_cvrs(

cvr_zip_files: Dict[str, BinaryIO] = {} # { file_name: file }
scanned_ballot_information_files: List[BinaryIO] = []
hasNonCsvZipFiles = []
Contributor

Nit: remove has since this isn't a boolean

if len(cvr_zip_files) == 0 and len(scanned_ballot_information_files) == 0:
cvr_zip_files[jurisdiction.cvr_file.name] = retrieve_file(
jurisdiction.cvr_file.storage_path
)
Contributor

Do we need to download the file again in this case, or is there a way to reuse it since we already have it?

Contributor Author

since the wrapper zip is immediately closed I didn't see an obvious way, and wanted to err on the side of touching this code as little as possible haha

Contributor

Maybe we could just do wrapper_zip.seek(0) and move the closing of the file to later?
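
A tiny sketch of that rewind-and-reuse idea (illustrative only; the in-memory buffer stands in for the downloaded wrapper zip):

```python
import io
import zipfile

# Build an in-memory stand-in for the downloaded wrapper zip.
wrapper_zip = io.BytesIO()
with zipfile.ZipFile(wrapper_zip, "w") as z:
    z.writestr("cvrs.csv", "a,b\n1,2\n")

# Instead of closing the file and downloading it again, rewind and reread.
wrapper_zip.seek(0)
with zipfile.ZipFile(wrapper_zip) as z:
    contents = z.read("cvrs.csv")
```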

Contributor Author

filing a followup for this #2025

@@ -377,11 +365,12 @@ def test_dominion_cvrs_with_leading_equal_signs(
rv = client.get(
f"/api/election/{election_id}/jurisdiction/{jurisdiction_ids[0]}/cvrs"
)
print(json.loads(rv.data))
Contributor

remove print

data={
"jurisdictions": (
# We expect the API to order the jurisdictions by name, so we
# upload them out of order.
Contributor

Maybe keep this comment

Contributor Author

good catch

)


def setup_ballot_manifest_upload(
Contributor

These helpers are great! One nit: they are doing more than just setting up the upload, they are doing it, so it might be slightly clearer to name them upload_ballot_manifest, etc.

@carolinemodic
Contributor Author

was hoping to merge tonight but there is some security issue I'm not understanding in my test deployment so will pick up tomorrow

@carolinemodic carolinemodic temporarily deployed to arlo-caro-large-file-up-ekdvmj October 30, 2024 02:48 Inactive
@carolinemodic carolinemodic force-pushed the caro/large_file_uploads branch from 7f1eed4 to ea5df28 Compare October 30, 2024 15:53
@carolinemodic carolinemodic temporarily deployed to arlo-caro-large-file-up-ekdvmj October 30, 2024 15:53 Inactive
@carolinemodic carolinemodic force-pushed the caro/large_file_uploads branch from ea5df28 to 2dcf903 Compare October 30, 2024 16:30
@carolinemodic carolinemodic temporarily deployed to arlo-caro-large-file-up-ekdvmj October 30, 2024 16:31 Inactive
@carolinemodic carolinemodic merged commit 699b659 into main Oct 30, 2024
5 checks passed
@carolinemodic carolinemodic deleted the caro/large_file_uploads branch October 30, 2024 17:27