
Support large(r) file uploads by directly uploading to S3 #2012

Merged
merged 20 commits into from
Oct 30, 2024

Conversation

carolinemodic
Contributor

@carolinemodic carolinemodic commented Oct 23, 2024

#1674

Support large(r) file uploads by directly uploading to S3

In order to enable larger file uploads in Arlo, upload directly from the client to S3 rather than doing "passthrough" file uploads through our server. This is the Heroku-recommended approach and the one generally recommended by most other Heroku/S3 users. There will still be a 5GB max limit, above which we would need to chunk the transfer to S3. The process involves:

  1. Making a GET to /upload-url to get the url to POST the file to for the actual upload. If configured to save files to the local file system this url will be a local server endpoint, otherwise it will be a presigned AWS url.
  2. Make a POST to the url returned in step 1 with the file attached. Again, this will either be a server endpoint (/file) or an AWS endpoint.
  3. Make a POST to /upload-complete to mark the file as uploaded and do any post-upload steps like updating DB elements and kicking off background jobs.
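
The first step of the flow above can be sketched as follows. This is a minimal illustration: the names `UploadPlan` and `get_upload_url`, the endpoint paths, and the bucket URL are all placeholders, not Arlo's actual code (in production the presigned URL would come from boto3's `generate_presigned_post`).

```python
from dataclasses import dataclass


@dataclass
class UploadPlan:
    url: str        # where step 2 POSTs the file
    is_local: bool  # True when files are saved to the local file system


def get_upload_url(use_local_storage: bool, election_id: str) -> UploadPlan:
    """Step 1: GET /upload-url decides where the file should be POSTed."""
    if use_local_storage:
        # Local dev: step 2 POSTs back to our own server's /file endpoint.
        return UploadPlan(url=f"/api/election/{election_id}/file", is_local=True)
    # Production: step 2 POSTs to a presigned AWS URL (placeholder here).
    return UploadPlan(url="https://arlo-bucket.s3.amazonaws.com/", is_local=False)
```

Step 3 then stays the same in both environments, which keeps branching between local dev and prod minimal.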

I landed on keeping the upload-url endpoint for step 1 rather than the other suggested approaches: different files have different authentication, which directly relates to the file path that the endpoint needs to construct (it needs to know about the election_id and in some cases jurisdiction_id), so it seemed simplest to leave it like this.

This PR also changes CVR uploads to be a single file that is always the "wrapper" zip we previously created for Hart/ESS. In the Hart case, we now check whether there are any zip files inside the top-level zip after unzipping, to handle the situation where someone just uploads the zip of CVRs without "wrapping" it in anything. Some Hart validation logic around not having images or other file types in the zip now happens in the processing step instead.
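
The fallback check described above could look roughly like this, using the standard `zipfile` module. The function name `contains_nested_zips` is hypothetical, not Arlo's actual helper:

```python
import io
import zipfile


def contains_nested_zips(zip_bytes: bytes) -> bool:
    """Return True if the top-level zip holds any .zip entries, i.e. it
    looks like the "wrapper" zip rather than a bare zip of CVRs."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as top_level:
        return any(name.lower().endswith(".zip") for name in top_level.namelist())
```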

I recommend reviewing by commit. I tried to isolate the bigger test refactors, but let me know if you want me to break up any commits further.

If we wanted to support multiple file uploads in the future without changing this approach, I think we would need to:

  • Upload all selected files to the s3 bucket (this part is easy enough; we can even call the presigned POST url multiple times)
  • Change the way it references a single CVR file in the DB to either reference a path to the s3 bucket containing all the files, or allow a list of cvr files
  • Change the hart/ess background tasks to fetch all the files in that bucket instead of fetching the zip and unzipping.

@@ -41,6 +41,7 @@ def s3(): # pylint: disable=invalid-name
"s3",
aws_access_key_id=config.AWS_ACCESS_KEY_ID,
aws_secret_access_key=config.AWS_SECRET_ACCESS_KEY,
region_name="us-west-1",
Contributor Author

I had to add this, or the url returned in the get presigned post call will not be domain specific, which causes errors. I assume this should be an env var?
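
If this does become an env var, a minimal sketch might look like the following. The `AWS_REGION` variable name is an assumption for illustration, not an existing Arlo config value; the lookup dict would be `os.environ` in practice.

```python
def s3_region(env: dict, default: str = "us-west-1") -> str:
    # Read the region from the environment, falling back to the currently
    # hardcoded value. "AWS_REGION" is a hypothetical variable name.
    return env.get("AWS_REGION", default)
```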

@@ -237,7 +233,7 @@ const SelectSystemStep: React.FC<{
<HTMLSelect
large
onChange={setSystemType}
value={systemType ?? undefined}
value={systemType || undefined}
Contributor Author

the pre commit hook yelled about this line and made me change it

Contributor

weird! I wonder how it got past the hook before

@jonahkagan
Contributor

Overall feedback:

  • I think the three step approach is an elegant solution to supporting both s3 and local file uploads.
  • As you mentioned, I think we can take the idea a little further by making the local file upload interface more generic. It could even just mimic the S3 interface to simplify the rest of the code even further. Then it could definitely just be a single endpoint.
  • If we mimic the S3 interface for local uploads, then step 1 could become simpler: we'd just need logic somewhere that tells us the S3 path for each file upload. This could potentially be a backend endpoint for each file upload like you have currently, or it could be a single backend endpoint. It could even just be a function on the frontend or hardcoded into each file upload on the frontend.
  • It makes total sense to me to have all of the clearing logic and such in the step 3 endpoint handler. I think we could go further and do all the validation there as well. We don't really need to validate anything before uploading a file to S3, do we?

Python notes:

  • Python type-checking is not great. I wouldn't waste too much time on trying to get it to work.

@arsalansufi
Contributor

The endpoint naming convention I've adopted, the idea of calling back to the server on local filesystem uploads.

I like the endpoint names! Very clear. I also can't think of a way to make this fewer than 3 calls.

And I like the idea of keeping the 3 calls for local file system uploads, too. To keep branching between local dev and prod as minimal as possible.

The endpoint implementations in the server for Step 2 are basically identical, other than a few cases where the file type is different. Should I have a generic endpoint for this or keep it duplicated for each file option?

I like the idea of consolidating if easy to do so, again just to make local dev as similar to prod as possible.

I moved any sort of "clearing" step from occurring BEFORE the new uploaded file is stored to after, in Step 3. I did this so that all of the db updates happen in the same endpoint, but I'm not sure if that matters and if this should happen somewhere else.

This makes sense to me!

Validation logic - There are a few things like file type that get validated in multiple steps, is that necessary? Are there places where I should add more validation? I think some of my step 2 refactors may need more file type validation.

I like the idea of consolidating validation logic, probably in step 3. The only validation that I'd want to happen sooner is preventing wrong file types, and I think we already cover that on the frontend by virtue of the allowed file types that we specify when we instantiate the file input.

Is it "filename" or "file_name" lol

I've always used file_name for consistency with file_path, file_type, etc. haha, but don't feel strongly

I noticed in general that the validation in these endpoints was often refactored to its own function, however I would then hit issues where I needed to re-do it or add "type: ignore" markers because it wouldn't know in the main server endpoint body that some field was already validated to not be None. Is there a reason why we usually do this? Is it ok to not do this? Any suggestions?

I have in some cases used typing.cast to provide the Python type checker knowledge it's missing, e.g.,

typing.cast(
    Optional[Dict[str, Dict[str, Optional[str]]]],
    jurisdiction.contest_choice_name_standardizations,
)

Might help in some of the cases you're describing? But also okay with punting as Jonah suggested.
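
As a runnable illustration of that cast pattern (the names `validate_name` and `greet` are made up for this sketch, not Arlo helpers):

```python
import typing
from typing import Optional


def validate_name(name: Optional[str]) -> None:
    # Validation lives in its own function, as in the refactors discussed.
    if name is None:
        raise ValueError("name is required")


def greet(name: Optional[str]) -> str:
    validate_name(name)
    # The type checker can't see that validate_name ruled out None,
    # so tell it explicitly instead of sprinkling "type: ignore".
    return "Hello, " + typing.cast(str, name)
```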


To the questions re wrapper ZIPs, assumptions around Hart/ES&S, etc., I believe we covered all those in Slack but am happy to answer any follow up questions!

@carolinemodic carolinemodic force-pushed the caro/large_file_uploads branch 3 times, most recently from fa3fd69 to f79cacd Compare October 29, 2024 01:34
@carolinemodic carolinemodic marked this pull request as ready for review October 29, 2024 01:43
@carolinemodic carolinemodic force-pushed the caro/large_file_uploads branch 2 times, most recently from 07c8327 to 1e3f911 Compare October 29, 2024 15:12
Contributor

@jonahkagan jonahkagan left a comment

Thanks for splitting this up into easy to read commits! It was pretty straightforward to review.

I've got a bunch of small notes but nothing major (hopefully)

try {
const file = files[0]
Contributor

If we're only supporting a single file, should we change the files: File[] param to just be file: File?

: {
fileType: file.type,
}
const getUploadResponse = await axios(
Contributor

IIRC, we only used axios over fetch in order to gain the ability to track upload progress. I can't remember how it impacts mocks and testing, but might consider switching all requests but the s3 upload request to fetch.

`/api${url}/upload-complete`,
addCSRFToken({
method: 'POST',
data: finalizeUploadformData,
Contributor

I think these endpoints should probably just take JSON for consistency with the rest of the API, since we don't need form data to send a file. Though I realize at this point that might be an annoying large change to make.


multiple:
values.cvrFileType === CvrFileType.ESS ||
values.cvrFileType === CvrFileType.HART,
multiple: false,
Contributor

Nit: I think this is false by default, and could be omitted now

@@ -1035,19 +1034,31 @@ def parse_hart_cvrs(

cvr_zip_files: Dict[str, BinaryIO] = {} # { file_name: file }
scanned_ballot_information_files: List[BinaryIO] = []
hasNonCsvZipFiles = []
Contributor

Nit: remove has since this isn't a boolean

if len(cvr_zip_files) == 0 and len(scanned_ballot_information_files) == 0:
cvr_zip_files[jurisdiction.cvr_file.name] = retrieve_file(
jurisdiction.cvr_file.storage_path
)
Contributor

Do we need to download the file again in this case, or is there a way to reuse it since we already have it?

Contributor Author

since the wrapper zip is immediately closed I didn't see an obvious way, and wanted to err on the side of touching this code as little as possible haha

Contributor

Maybe we could just do wrapper_zip.seek(0) and move the closing of the file to later?
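
A tiny sketch of that rewind-and-reuse idea (illustrative only; the in-memory buffer stands in for the downloaded wrapper zip):

```python
import io
import zipfile

# Build an in-memory stand-in for the downloaded wrapper zip.
wrapper_zip = io.BytesIO()
with zipfile.ZipFile(wrapper_zip, "w") as z:
    z.writestr("cvrs.csv", "a,b\n1,2\n")

# Instead of closing the file and downloading it again, rewind and reread.
wrapper_zip.seek(0)
with zipfile.ZipFile(wrapper_zip) as z:
    contents = z.read("cvrs.csv")
```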

Contributor Author

filing a followup for this #2025

@@ -377,11 +365,12 @@ def test_dominion_cvrs_with_leading_equal_signs(
rv = client.get(
f"/api/election/{election_id}/jurisdiction/{jurisdiction_ids[0]}/cvrs"
)
print(json.loads(rv.data))
Contributor

remove print

data={
"jurisdictions": (
# We expect the API to order the jurisdictions by name, so we
# upload them out of order.
Contributor

Maybe keep this comment

Contributor Author

good catch

)


def setup_ballot_manifest_upload(
Contributor

These helpers are great! One nit: they are doing more than just setting up the upload, they are doing it, so it might be slightly clearer to name them upload_ballot_manifest, etc.

@carolinemodic
Contributor Author

was hoping to merge tonight but there is some security issue I'm not understanding in my test deployment so will pick up tomorrow

@carolinemodic carolinemodic temporarily deployed to arlo-caro-large-file-up-ekdvmj October 30, 2024 02:48 Inactive
@carolinemodic carolinemodic force-pushed the caro/large_file_uploads branch from 7f1eed4 to ea5df28 Compare October 30, 2024 15:53
@carolinemodic carolinemodic temporarily deployed to arlo-caro-large-file-up-ekdvmj October 30, 2024 15:53 Inactive
@carolinemodic carolinemodic force-pushed the caro/large_file_uploads branch from ea5df28 to 2dcf903 Compare October 30, 2024 16:30
@carolinemodic carolinemodic temporarily deployed to arlo-caro-large-file-up-ekdvmj October 30, 2024 16:31 Inactive
@carolinemodic carolinemodic merged commit 699b659 into main Oct 30, 2024
5 checks passed
@carolinemodic carolinemodic deleted the caro/large_file_uploads branch October 30, 2024 17:27