A major use case of mine for this workflow is indexing files in non-Tower buckets that are attached to Synapse. Using a workflow like this is much more time-efficient and requires far less babysitting than indexing on a single EC2 instance, especially as datasets grow large (>1 TB).
However, I've occasionally had to re-run this workflow on the same bucket after additional data was added, which means the entire bucket gets re-downloaded and re-indexed on every run. It would be helpful to have one or both of the following features to make repeat runs more time- and cost-efficient:
- add a modified/created date parameter so that any files last modified (or created) before the given date are skipped during re-indexing
- add an option to skip any S3 keys that already exist in the target Synapse project/folder (a rough filtering sketch follows this list)
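To make the request concrete, here is a minimal sketch (in Python, outside the workflow itself) of the filtering the two options imply. It uses boto3 and synapseclient/synapseutils, and it assumes the bucket's key layout mirrors the Synapse folder hierarchy under the target folder; the bucket name, folder ID, and cutoff date below are placeholders, not values from this issue.

```python
"""Illustrative sketch only (not the workflow's actual code): pre-filter
S3 keys before indexing, based on the two proposed options."""
from datetime import datetime, timezone

import boto3
import synapseclient
import synapseutils


def existing_synapse_paths(syn: synapseclient.Synapse, folder_id: str) -> set[str]:
    """Collect relative paths of files already indexed under a Synapse folder."""
    paths = set()
    # synapseutils.walk behaves like os.walk; dirpath is a (path, id) tuple
    for (dirname, _dir_id), _dirs, files in synapseutils.walk(syn, folder_id):
        # strip the root folder name so paths are relative to the start folder
        prefix = dirname.split("/", 1)[1] + "/" if "/" in dirname else ""
        for file_name, _file_id in files:
            paths.add(prefix + file_name)
    return paths


def keys_to_index(bucket: str, folder_id: str, modified_after: datetime | None) -> list[str]:
    """Return S3 keys that still need indexing, applying both proposed filters."""
    syn = synapseclient.login()  # assumes cached Synapse credentials
    already_indexed = existing_synapse_paths(syn, folder_id)

    s3 = boto3.client("s3")
    keep = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if modified_after and obj["LastModified"] < modified_after:
                continue  # proposed feature 1: skip objects older than the cutoff
            if obj["Key"] in already_indexed:
                continue  # proposed feature 2: skip keys already in the Synapse folder
            keep.append(obj["Key"])
    return keep


if __name__ == "__main__":
    cutoff = datetime(2023, 1, 1, tzinfo=timezone.utc)  # hypothetical cutoff date
    for key in keys_to_index("my-synapse-bucket", "syn12345678", cutoff):
        print(key)
```

In the workflow itself, this would presumably translate into a params entry (e.g. a hypothetical modified-after date and a skip-existing flag) applied as a filter over the listed keys before the download/indexing step.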