Hive worker cannot resume jobs which use native dlls after pausing (and after automatic snapshots every 18 hours) #3166

Open
gkronber opened this issue Jun 29, 2022 · 0 comments

Describe the bug

The Hive worker creates an AppDomain for each job and stores the job's assemblies in a folder Temp/PluginTemp/{jobGuid}.
When the job is stopped (e.g. for a snapshot), the AppDomain is disposed and the folder with the assemblies is cleared. However, this does not work for native dlls: they cannot be unloaded, so the Hive worker process keeps the native dll locked. Trying to delete the dll and the folder raises an exception, which is caught by the Hive worker.

The problem arises when the same job is resumed on the same worker. After downloading the job from the server, the worker tries to create the folder for the job and write the assemblies. Since the folder and the file still exist, another exception is raised (again caught by the Hive worker). However, the job cannot be resumed and is marked as failed on the Hive server.

To Reproduce
Steps to reproduce the behavior:

  1. Create a GP SymReg job and set Evaluator to "Parameter Optimization Evaluator" (this uses the hl-native-interpreter plugin)
  2. Configure the GP run so that it takes a few minutes (about 10)
  3. Run in Hive but select only a single worker
  4. Open the job manager, wait for the job to be "running", and then pause the job.
  5. Wait for the job to be paused, then resume the job
  6. The job will be stopped with state "Failed". The error message will show a problem with "hl-native-interpreter.dll"

Proposed fix
Check whether the folder for the jobGuid already exists in the Hive worker and reuse the existing folder. Additionally, check whether plugin files already exist in the folder and do not overwrite those files. Since it is the same job, we can reuse the old files.
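
The idea can be sketched as follows. This is only an illustration in Python, not the Hive worker code (which is .NET). The function name `prepare_job_folder` and its parameters are hypothetical. It assumes the plugin files are available as local source paths and that a file with the same name already in the job folder is the right one, because it belongs to the same job.

```python
import shutil
from pathlib import Path
from typing import Dict, List


def prepare_job_folder(plugin_temp: Path, job_guid: str,
                       plugin_files: Dict[str, Path]) -> List[str]:
    """Populate plugin_temp/job_guid, reusing whatever is already there.

    Hypothetical helper: returns the names of the files that were
    actually written, so the caller can see what was reused.
    """
    job_dir = plugin_temp / job_guid
    # Reuse the folder if a previous run of the same job left it behind.
    job_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for name, source in plugin_files.items():
        target = job_dir / name
        if target.exists():
            # The file may still be locked by the worker process
            # (e.g. a native dll), so keep it instead of overwriting.
            continue
        shutil.copyfile(source, target)
        written.append(name)
    return written
```

With this approach, resuming the job on the same worker no longer depends on the earlier cleanup having succeeded: the leftover folder and the locked dll are simply reused.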

@gkronber gkronber changed the title Hive worker cannot process jobs which use native dlls and take longer than 18hours Hive worker cannot resume jobs which use native dlls after pausing (and after automatic snapshots every 18 hours) Jun 29, 2022