-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement kill, delete and remove endpoints #64
Implement kill, delete and remove endpoints #64
Conversation
f61699c
to
6e86eb1
Compare
d64eb2a
to
bd52dc7
Compare
bd52dc7
to
92ce100
Compare
92ce100
to
9d9c21f
Compare
In this PR, and unlike what is done in DIRAC, I used the ON DELETE CASCADE constraint. This change would therefore require an ALTER statement on some tables of the DB. Should we do this or duplicate the logic of what is currently done in DIRAC ? |
This seems very sensible to me, would any changes be needed in DIRAC v8.1 to make it tollerant to this? We should probably discuss it in the BiLD, can you open an issue in vanilla DIRAC? |
I think DIRAC is already tolerant: by adding the |
src/diracx/db/jobs/schema.py
Outdated
@@ -198,3 +210,99 @@ class LoggingInfo(JobLoggingDBBase): | |||
StatusTimeOrder = Column(Numeric(precision=12, scale=3), default=0) | |||
StatusSource = Column(String(32), default="Unknown") | |||
__table_args__ = (PrimaryKeyConstraint("JobID", "SeqNum"),) | |||
|
|||
|
|||
class TaskQueues(TaskQueueDBBase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just stylistic, but: you added this in the same module where JobDB (and JobLoggingDB) is. SandboxMetadataDB is instead defined in another module. We either put all "WMS-related" schemas together or we split them in different files like it was in DIRAC.
src/diracx/db/jobs/schema.py
Outdated
) | ||
|
||
|
||
class BannedSitesQueue(TaskQueueDBBase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you check if this is still used anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still used by the JobScheduling optimizer (__sendToTQ method that create/fills the JobRequirements section of the JDL, which is then interpreted by the insertIntoTQ method of the JobState class which calls the TQDB insertJob method). This can be the case if a user specifies a list of banned sites and the job doesn't require any site in particular.
However, I would argue that having a BannedSitesQueue table is not very appealing.
One solution would be to get the list of all sites, remove from this list the banned sites and insert the remaining sites in the tq_TQToSites
table.
I think we should also ask the community if banning sites is a real use case, because this feature brings quite a lot of complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feature is there because, most probably, of some (LHCb) user requirement of long time ago. This reminds me that we have often argued that it did not make much sense and that could even create issues. I would be tempted to remove it also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as we use the ON DELETE CASCADE constraint in DIRAC to remove the entries from this table, then I think we can remove this table from diracx.
9b4990c
to
993d2d4
Compare
await asyncio.gather( | ||
*( | ||
delete_job( | ||
job_id, | ||
config, | ||
job_db, | ||
job_logging_db, | ||
task_queue_db, | ||
background_task, | ||
) | ||
for job_id in job_ids | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be practical to make this an actual bulk operation on the DB side? i.e. something like:
DELETE FROM JOBS WHERE JOB_ID IN (123, 124, 125, 126)
rather than
DELETE FROM JOBS WHERE JOB_ID = 123
DELETE FROM JOBS WHERE JOB_ID = 124
DELETE FROM JOBS WHERE JOB_ID = 125
DELETE FROM JOBS WHERE JOB_ID = 126
(This is probably better as an issue and then a separate PR rather than adding complexity to this one.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up doing it here because I had problems using the TaskGroup+except* as you advised, and I wanted to see if making bulk methods could help me avoid theses. In the end, I still had to use them, but most of the work was already done. Sorry for the complexity.
for job_id in job_ids | ||
) | ||
) | ||
except NoResultFound as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably use a TaskGroup
+except*
instead of gather
so we get the better exception handling.
Ideally I think we should return a more explicit error like:
detail={
"message": "Failed to remove 5 jobs out of 167",
"failed_ids": [1, 78, 32, 25, 90]
}
or maybe even:
detail={
"message": "Failed to remove 5 jobs out of 167",
"failed_ids": {
"1": str(e1),
"78": str(e2),
"32": str(e3),
"25": str(e4),
"90": str(e5)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should cancel any action if one of them fail ? (for example here should we remove the other 162 jobs or not ?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that’s practical I think that would be nice.
I concur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chaen Agrees as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linked discussion: #97
f17af3a
to
f503216
Compare
@chaen I think the pytest-integration tests are failing because the TaskQueueDB and SandboxMetadataDB are not in diracx-charts. Is this normal ? If so, do you want me to add them in a pull request there, or should I let you do it? |
f503216
to
30e2adf
Compare
30e2adf
to
1721185
Compare
1721185
to
b333ae4
Compare
Implement kill, delete and remove endpoints (see #22, #63)