Minimize the risk of expired SAS tokens AND simplify node management #627
Labels
Code Quality Improvements
Make code more readable, maintainable, prevent bugs, improve security
enhancement
New feature or request
Robustness
Enable users to run tasks without bugs or with mitigation of known bugs
Scalability
Enable users to scale TES workloads
TES Priority: P2
Groomed to a Priority 2 issue
Problem:
For every work task, two things stored in Azure Blob Storage MUST be on the node: the task runner and the task runner's task JSON. Both currently have to be referenced with a SAS token, because they cannot be downloaded without one. Any start task that needs a resource from blob storage has the same problem.
A task that has been added to a job cannot have its command line changed (e.g. to update a SAS token) without first terminating the task, deleting it from the job, and replacing it with a new one (which goes to the end of the line). This is a problem when running at scale, because it is quite conceivable (and has actually happened) that the token expires before the task finally starts running.
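To make the failure mode concrete, here is a minimal sketch (not TES code) that compares a SAS URL's expiry with the time a queued task is expected to start. It relies on the `se` (signed expiry) query parameter that SAS tokens carry. The function names, the example URL, and the `expected_queue_wait` estimate are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def sas_expiry(url: str) -> datetime:
    """Return the expiry time from the SAS URL's `se` query parameter."""
    query = parse_qs(urlparse(url).query)
    expiry_text = query["se"][0]
    # Assumption: `se` is an ISO 8601 UTC value such as 2024-01-01T12:00:00Z.
    expiry = datetime.fromisoformat(expiry_text.replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        # Values without a zone designator are treated as UTC.
        expiry = expiry.replace(tzinfo=timezone.utc)
    return expiry


def expires_before_start(url: str, now: datetime, expected_queue_wait: timedelta) -> bool:
    """True if the token is expected to be expired by the time the task starts."""
    return sas_expiry(url) <= now + expected_queue_wait


if __name__ == "__main__":
    # Hypothetical URL; the signature value is a placeholder.
    url = ("https://example.blob.core.windows.net/container/runner"
           "?sv=2022-11-02&se=2024-01-01T12%3A00%3A00Z&sig=placeholder")
    now = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
    print(expires_before_start(url, now, timedelta(hours=3)))
```

A check like this can only detect the risk at submission time. Because the command line is fixed once the task is in the job, it cannot repair a token that expires while the task is still queued.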
Start tasks that must download anything requiring a SAS token have it worse, because the start task is generated at pool creation and so becomes a long-lived entity. In Terra today, SAS tokens live shorter lives than pools do (and a pool's "lifetime" setting limits when new tasks can be added to the pool's job, NOT when tasks START). Start tasks can be updated, but that appears to require either a different Batch client (from the C# library) than the one we currently use, or a different approach to how we call the Batch data-plane APIs.
Solution:
Describe alternatives you've considered
Do nothing, knowing that these issues will continue, especially as environments ask for shorter SAS token lifetimes over time.
Sub Tasks
Code dependencies
Will this require code changes in:
CoA, for new and/or existing deployments? No
TES standalone, for new and/or existing deployments? No
Terra, for new and/or existing deployments? No
Build pipeline? No
Integration tests? No
Additional context
Completing this feature will enable easier implementation and/or largely or fully complete the following issues: