-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(server): Asynchronous server-side background task execution #4317
base: master
Are you sure you want to change the base?
feat(server): Asynchronous server-side background task execution #4317
Conversation
3484290
to
60a262c
Compare
60a262c
to
2361221
Compare
ed75dbc
to
a75244f
Compare
if db_task.username == self._get_username(): | ||
has_right_to_query_status = True | ||
should_set_consumed_flag = db_task.is_in_terminated_state | ||
elif db_task.product_id is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this elif
? Why is it mutually exclusive if username or product is provided?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because different workflows are executed. You can only query the status of the task if:
- You are the user who was responsible for creating the task. (In this case, but only for terminated tasks, the
consumed
flag is set to true.) - The task is associated with a product (this might not always be the case!) and you are an admin of said product. (In this case, the
consumed
flag is left intact.) - You are a superuser. (In this case, the
consumed
flag is left intact.)
If this was not an elif
, a product admin or superuser querying their own task would leave the consumed
flag intact.
with DBSession(self._database_factory) as session: | ||
try: | ||
db_task = session.query(DBTask).get(task_obj.token) | ||
session.expunge(db_task) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the goal of this expunge()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removes the DBTask
object from the Session
's internal cache. Results in clearing the __sa_state
member, tearing off the connection between the Python object and the ORM-handled "view" of it. This ensures that callers of get_task_record()
, even if they capture a reference to what is being returned here, are NOT allowed to mutate the state of the object in a way that the database state is also mutated.
(In order to safely mutate a DBTask
entity, _mutate_task_record
exists.)
to communicate large inputs (that should not be put in the `Queue`) | ||
to the `execute` method of an `AbstractTask`. | ||
""" | ||
task_tmp_root = Path(tempfile.gettempdir()) / "codechecker_tasks" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a real question, I'm just thinking loudly: we put things under /tmp
, because it's server side, right? Only the client is using the workspace directory instead of /tmp
for temporary files, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. This is all server-side. Just a temporary directory that's more identifiable than a truly "random" temporary directory for the task's tmp data (like the unzip of a mass-store-run). Because servers can fail in a way that the cleaning of temp directories are not executed, as we are no longer using the tempfile... context managers here.
LOG.warning("Failed to remove background task's data_dir at " | ||
"'%s':\n%s", self.data_path, str(ex)) | ||
|
||
def _implementation(self, _task_manager: "TaskManager") -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it get an @abstractmethod
annotation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, @abstractmethod
isn't a built-in, but coming from the abc
library, which makes the object not pickleable.
Also, methods that raise NotImplementedError()
are type-checked by PyRight and there is a linter warning in the editor that your class is not proper.
a75244f
to
d2bb4d5
Compare
This patch implements the whole support ecosystem for server-side background tasks, in order to help lessen the load (and blocking) of API handlers in the web-server for long-running operations. A **Task** is represented by two things in strict co-existence: a lightweight, `pickle`-able implementation in the server's code (a subclass of `AbstractTask`) and a corresponding `BackgroundTask` database entity, which resides in the "configuration" database (shared across all products). A Task is created by API request handlers and then the user is instructed to retain the `TaskToken`: the task's unique identifier. Following, the server will dispatch execution of the object into a background worker process, and keep status synchronisation via the database. Even in a service cluster deployment, load balancing will not interfere with users' ability to query a task's status. While normal users can only query the status of a single task (which is usually automatically done by client code, and not the user manually executing something); product administrators, and especially server administrators have the ability to query an arbitrary set of tasks using the potential filters, with a dedicated API function (`getTasks()`) for this purpose. Tasks can be cancelled only by `SUPERUSER`s, at which point a special binary flag is set in the status record. However, to prevent complicating inter-process communication, cancellation is supposed to be implemented by `AbstractTask` subclasses in a co-operative way. The execution of tasks in a process and a `Task`'s ability to "communicate" with its execution environment is achieved through the new `TaskManager` instance, which is created for every process of a server's deployment. Unfortunately, tasks can die gracelessly if the server is terminated (either internally, or even externally). For this reason, the `DROPPED` status will indicate that the server has terminated prior to, or during a task's execution, and it was unable to produce results. The server was refactored significantly around the handling of subprocesses in order to support various server shutdown scenarios. Servers will start `background_worker_processes` number of task handling subprocesses, which are distinct from the already existing "API handling" subprocesses. By default, if unconfigured, `background_worker_processes` is equal to `worker_processes` (the number of API processes to spawn), which is equal to `$(nproc)` (CPU count in the system). This patch includes a `TestingDummyTask` demonstrative subclass of `AbstractTask` which counts up to an input number of seconds, and each second it gracefully checks whether it is being killed. The corresponding testing API endpoint, `createDummyTask()` can specify whether the task should simulate a failing status. This endpoint can only be used from, but is used extensively, the unit testing of the project. This patch does not include "nice" or "ergonomic" facilities for admins to manage the tasks, and so far, only the server-side of the corresponding API calls are supported.
d2bb4d5
to
19302e4
Compare
Important
This is patch 1 of the Asynchronous Store Protocol (#3672).
This patch implements the whole support ecosystem for server-side background tasks, in order to help lessen the load, occupancy, and potentially detrimental blocking behaviour on API handler processes, occupied by long-running operations in the web-server, such as
massStoreRun()
.API workers vs. background/task workers
In this patch, the processing model of the
CodeChecker server
is extended with the concept of background workers. These special processes deal with consuming tasks off of the server's queue, and executing them. The previous clone processes of a server are now and henceforth termed API workers, and they do the same as prior: respond to Thrift RPC calls.The server will start
background_worker_processes
number of background processes, which is, by default, the same number asworker_processes
, the number of API handlers. (Which defaults to$(nproc)
.)Self-healing against dead children
A fix is also implemented in this patch which is the result of a side effect that stemmed from reviewing and reworking both the threading and the signalling model of a
CodeChecker server
process tree. Previously, exceptions escaping from an API worker process (or it getting OOM-killed, etc.) would result in an initially large number of workers to die off one by one, leaving the server with a monotonously descending number of workers to dispatch RPC requests to. In this patch, the server is scripted to handleSIGCHLD
and respawn dead child processes (both API and Task) when given the opportunity.Graceful server shutdown
The server's life cycle and signal handling logic refactoring also includes changes which makes the server process(es) more graceful in their termination, which is needed to accurately clean-up pending tasks (see later).
In case a brutal and unclean server shutdown is needed, following this patch, the
SIGINT
orSIGTERM
signal must be sent twice to the server.(Alternatively, just use
SIGKILL
! 😉)Task management
A Task is represented by two things in strict co-existence: a lightweight,
pickle
-able implementation of the task's execution logic (subclassed fromAbstractTask
and overriding_implementation()
); and a correspondingBackgroundTask
entity in the database. These records reside in the CONFIGURATION database, and the task information is server-, or service-wide, shared across all products.The database-synchronised record contains several human-readable metadata about the task, such as a one-line
summary
field and the larger, log-likecomments
column, with several timestamps recorded for the crucial moments during the task's execution.The most important flag is a task's status, which can be:
ALLOCATED
,ENQUEUED
,RUNNING
,COMPLETED
,FAILED
,CANCELLED
, andDROPPED
.Normal path
The life cycle of a task is as follows:
Task
is created as the result of an API request (executed in the API worker process).a. (Currently, for the sake of both testing and demonstration, only the
createDummyTask()
API, and only in a test context, can create a task.)ALLOCATED
, together with theBackgroundTask
database entity.Task
object into a shared, synchronised queue within the server's memory space.At this point, the task is considered
ENQUEUED
.a.
AbstractTask
subclasses MUST bepickle
-able and reasonably small.b. The library offers means to store additional large data on the file system, in a temporary directory specific to the task.
getTaskInfo()
API (executed in the context of any API worker process, synchronised over the database) to query whether the task was completed, if the user wishes to receive this information.Task
object from the queue.After some bookkeeping, the task will be
RUNNING
.MyTaskClass::_implementation()
is called, which executes the task's primary business logic._implementation()
returns without a failure, the task will be considered successfullyCOMPLETED
.Any exception escaping from the method will set the task to
FAILED
, and exception information is logged into theBackgroundTask.comments
column of the database.Together, these two are the "Normal termination states.".
Task
is available from the queue.Abnormal path 1: admin cancellation
At any point following
ALLOCATED
status, but most likely in theENQUEUED
andRUNNING
statuses, aSUPERUSER
may issue acancelTask()
order.This will set
BackgroundTask.cancel_flag
, and the task is expected (although not required!) to poll its ownshould_cancel()
status internally in checkpoints, and terminate gracefully to this request. This is done by_implementation()
exiting by raising aTaskCancelHonoured
exception.(If the task does not raise one, it will be allowed to conclude normally, or fail in some other manner.
Tasks cancelled gracefully will have the
CANCELLED
status.For example, a background task that performs an action over a set of input files generally should be implemented like this:
Abnormal path 2: server shutdown
Alternatively, at any point in this life cycle, the server might receive the command to terminate itself (kill signals
SIGINT
,SIGTERM
; alternatively caused byCodeChecker server --stop
). Following the termination of API workers, the background workers will also shut down one by one.At this point, the default behaviour is to cause a special cancel event which tasks currently
RUNNING
may still gracefully honour, as-if it was aSUPERUSER
's single-task cancel request. All other tasks that have not started executing yet and are in theALLOCATED
orENQUEUED
status will never start.All tasks not in a normal termination state will be set to the
DROPPED
status, with thecomments
field containing a log about the specifics of in which state the task was dropped, and why. (Together,CANCELLED
andDROPPED
are the "abnormal termination states", indicating that the task terminated due to some external influence.)Task querying
The
getTaskInfo()
API, querying the status of one task, is available to the user who caused the task to spawn, thePRODUCT_ADMIN
s of the product associated with the task (if any), andSUPERUSER
s.The
getTasks()
API, which queries multiple tasks based on a filter set, is available only toPRODUCT_ADMIN
s (results restricted to the products they are admin thereof) andSUPERUSER
s (unrestricted).Just about anything that is available as information in the database about a task can be queried.
--machine-id
Unfortunately, servers don't always terminate gracefully (cue the aforementioned
SIGKILL
, but also the container, VM, or the host machine could simply die during execution, in ways the server is not able to handle). Because tasks are not shared across server processes, and there are crucial bits of information in the now dead process's memory which would have been needed to execute the task, a server later restarting in place of a previously dead one should be able to identify which tasks its "predecessor" left behind without clean-up.This is achieved by storing the running computer's identifier, configurable via
CodeChecker server --machine-id
, as an additional piece of information for each task. By default, the machine ID is constructed fromgethostname():portnumber
, e.g.,cc-server:80
.In containerised environments, relying on
gethostname()
may not be entirely stable!For example, Docker exposes the first 12 digits of the container's unique hash as the "hostname" of the insides of the container. If the container is started with
--restart always
or--restart unless-stopped
, then this is fine, however, more advanced systems, such as Docker swarm will create a new container in case the old one died (!), resulting in a new value ofgethostname()
.In such environments, service administrators must pay additional caution and configure their instances by setting
--machine-id
for subsequent executions of the "same" server accordingly. If a server with machine IDM
starts up (usually after a container or "system" restart), it will set every task not in any "termination states" and associated with machine IDM
to theDROPPED
status (with an appropriately formatted comment accompanying), signifying that the previous instance "dropped" these tasks, but had no chance of recording this fact.