
Add support for continuously starting load jobs as slots free up in the loader #1494

Merged · sh-rp merged 104 commits into devel from feat/continuous-load-jobs on Aug 4, 2024

Conversation

@sh-rp (Collaborator) commented Jun 19, 2024

Description

In the current implementation we more or less always start n (=max workers) load jobs, let them complete and then rerun the whole loader to schedule the next n jobs. In this PR we submit new load jobs as slots free up.
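A minimal sketch of that scheduling idea, using a plain ThreadPoolExecutor; all helper names (get_new_job_files, start_job, MAX_WORKERS) are assumptions for illustration, not the actual loader code:

# Sketch only: top the worker pool up whenever a slot frees instead of
# submitting a fixed batch of n jobs and waiting for all of them to finish.
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List

MAX_WORKERS = 20  # corresponds to "n (=max workers)" above


def run_until_done(
    get_new_job_files: Callable[[], List[str]],
    start_job: Callable[[str], None],
) -> None:
    running: Dict[str, Future] = {}
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        while True:
            # reap finished jobs so their slots become available again
            for file_name, future in list(running.items()):
                if future.done():
                    future.result()  # re-raises any exception from the worker thread
                    del running[file_name]
            # schedule new jobs into the freed slots
            new_files = get_new_job_files()
            for file_name in new_files[: MAX_WORKERS - len(running)]:
                running[file_name] = pool.submit(start_job, file_name)
            if not running and not new_files:
                break  # nothing running and nothing new: the package is done
            time.sleep(0.1)  # idle wait between scheduling passes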

What is happening here:

  • The loader now periodically checks all jobs to see whether they are done and schedules new ones as needed.
  • Runnable jobs now manage their own internal state (file moving still needs to be done by the loader on the main thread) and have a dedicated "run" method, which is the method executed on a worker thread.
  • Renaming of the job base classes to make clearer what is going on (see the sketch after this list):
    • RunnableLoadJob: Jobs that actually do something and should be executed on a thread
    • FollowupJob: Class that creates a new job persisted to disk
    • HasFollowupJobs (ex-FollowupJob): Trait that tells the loader to look for followup jobs
    • FinalizedJob: Not runnable because it already has an actionable state (completed, failed, retry). Used for indicating failed restored jobs, completed restored jobs and cases where nothing needs to be done
  • FollowupJobs always go to the "new_jobs" folder and need to be picked up by the loader like any other jobs; previously some were executed directly on the main thread, which was not good. :) We now assume that creating FollowupJobs does not raise exceptions; I don't think handling that case is needed, and it makes the code simpler.
  • (Hopefully) simplification of the load class
  • Restoring jobs is now handled the same way as creating jobs, which simplifies the code in a few places. Jobs that are found in the running folder are now simply started again; the BigQuery job can figure out on its own whether it needs to be resumed.
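A rough sketch of how the renamed classes relate; only the class names and set_run_vars come from this PR, the other methods and state values are assumptions for illustration:

# Rough sketch of the renamed job classes; class names match the list above,
# everything else (method names, state values) is assumed for illustration.
from abc import ABC, abstractmethod
from typing import List, Literal

TJobState = Literal["ready", "running", "completed", "failed", "retry"]


class LoadJob(ABC):
    """Common base: every job exposes an actionable state to the loader."""

    @abstractmethod
    def state(self) -> TJobState: ...


class RunnableLoadJob(LoadJob):
    """Does actual work; run() is executed on a worker thread and the job
    manages its own internal state (file moves stay on the main thread)."""

    def set_run_vars(self, load_id: str, schema, load_table) -> None:
        self._load_id, self._schema, self._load_table = load_id, schema, load_table

    @abstractmethod
    def run(self) -> None: ...


class FollowupJob:
    """Creates a new job file persisted to disk (ends up in 'new_jobs')."""

    def new_file_path(self) -> str:
        raise NotImplementedError


class HasFollowupJobs:
    """Trait: tells the loader to ask for followup jobs when this job finishes."""

    def create_followup_jobs(self, final_state: TJobState) -> List[FollowupJob]:
        return []


class FinalizedJob(LoadJob):
    """Not runnable: already in an actionable state (completed, failed, retry)."""

    def __init__(self, state: TJobState) -> None:
        self._state = state

    def state(self) -> TJobState:
        return self._state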

Possible FollowupWork:

  • Performance tuning in the loader; a lot of time is spent on parsing filenames (I think).

netlify bot commented Jun 19, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: 2c38f13
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/66acccc36859290008ab7658

dlt/load/load.py (outdated):
if (
    len(self.load_storage.list_new_jobs(load_id)) == 0
    and len(self.load_storage.normalized_packages.list_started_jobs(load_id)) == 0
):
sh-rp (Collaborator, Author):
is this the correct "package completion" condition? I think so, but am not 100% sure.

Collaborator:

this was the previous condition:

        if file_count == 0:
            logger.info(f"No new jobs found in {load_id}")
            return 0, []

so it was checking just the new jobs. I do not fully get why you check it again here; the loop above should exit only when all jobs are completed, no?

@@ -96,15 +96,15 @@ def test_unsupported_write_disposition() -> None:
load.load_storage.normalized_packages.save_schema(load_id, schema)
sh-rp (Collaborator, Author):

I needed to change a bunch of tests, since we no longer rely on multiple executions of the run method. All the changes make sense; it might be good to add a few more test cases specific to the new implementation.

@sh-rp sh-rp marked this pull request as ready for review June 19, 2024 14:57
dlt/load/load.py (outdated):
remaining_jobs: List[LoadJob] = []
# if an exception condition was met, return it to the main runner
pending_exception: Exception = None
sh-rp (Collaborator, Author):

I need to collect the exception to raise in the main loop here now; we could alternatively collect all the problems we find and report them, not just raise on the first exception.
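A hedged sketch of the pattern described here; the job interface used (state(), file_name(), exception()) is a placeholder, not dlt's actual API:

# Sketch: remember the first problem as pending_exception while checking
# running jobs, and let the main loop raise it.
from typing import List, Optional, Tuple


def complete_jobs(running_jobs: List["LoadJob"]) -> Tuple[List["LoadJob"], Optional[Exception]]:
    remaining_jobs: List["LoadJob"] = []
    # if an exception condition was met, return it to the main runner
    pending_exception: Optional[Exception] = None
    for job in running_jobs:
        state = job.state()
        if state == "failed" and pending_exception is None:
            # alternatively, collect *all* problems here and report them together
            pending_exception = Exception(f"job {job.file_name()} failed: {job.exception()}")
        elif state != "completed":
            remaining_jobs.append(job)
    return remaining_jobs, pending_exception


# in the main loop (sketch):
# remaining_jobs, pending_exception = complete_jobs(running_jobs)
# if pending_exception is not None:
#     raise pending_exception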

@sh-rp sh-rp added the enhancement New feature or request label Jun 20, 2024
@rudolfix rudolfix added the sprint Marks group of tasks with core team focus at this moment label Jun 26, 2024
# Conflicts:
#	dlt/load/load.py
#	tests/load/test_dummy_client.py
@sh-rp sh-rp force-pushed the feat/continuous-load-jobs branch from c265ecc to b4d05c8 Compare June 27, 2024 13:31
@sh-rp sh-rp force-pushed the feat/continuous-load-jobs branch from 0a9b5c3 to da8c9e6 Compare July 2, 2024 10:45
@rudolfix rudolfix removed the sprint Marks group of tasks with core team focus at this moment label Jul 3, 2024
@sh-rp (Collaborator, Author) commented Jul 18, 2024

@rudolfix this can go into another review. Two open questions from my side:

  1. Should we change the behavior of is_package_partially_loaded (see above)?
  2. What exactly is the desired behavior, from a conceptual standpoint, when creating a followup job fails? In the old version, loading just continues and those jobs are marked as failed. I don't think this makes sense, because the load will be useless if, for example, a merge job can't be created and executed. So either we decide:
  • this is a transient problem (e.g. the user can manually fix the schema and restart the load); in that case we just raise an exception on the main thread, so that the job that triggered the scheduling of a followup job remains in "started_jobs" and will be rerun on the next pipeline execution, including the scheduling of the followup job (it is implemented and tested like this now)
  • or this is a terminal problem, in which case we should also stop the load but mark the load package as failed

# Conflicts:
#	dlt/destinations/impl/filesystem/filesystem.py
@sh-rp sh-rp force-pushed the feat/continuous-load-jobs branch from 835a49d to bd252f0 Compare July 18, 2024 16:31
@sh-rp sh-rp force-pushed the feat/continuous-load-jobs branch from bd252f0 to 1c73de1 Compare July 18, 2024 16:33
@rudolfix (Collaborator) left a comment

> What exactly is the desired behavior, from a conceptual standpoint, when creating a followup job fails? In the old version, loading just continues and those jobs are marked as failed. I don't think this makes sense, because the load will be useless if, for example, a merge job can't be created and executed. So either we decide:
> • this is a transient problem (e.g. the user can manually fix the schema and restart the load); in that case we just raise an exception on the main thread, so that the job that triggered the scheduling of a followup job remains in "started_jobs" and will be rerun on the next pipeline execution, including the scheduling of the followup job (it is implemented and tested like this now)
> • or this is a terminal problem, in which case we should also stop the load but mark the load package as failed

my take:
make it a transient error

and we need to decide how we deal with failed packages. right now we continue the load and do not raise an exception at the end. IMO we should change that: we should continue the load but raise an exception at the end automatically.

I'll write a ticket for that - it is a breaking change
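A hedged sketch of the transient-error option described above; the exception class, load_storage.add_new_job and the job methods are placeholders, not the actual dlt code:

# Sketch of the "transient problem" behavior: a failure while creating followup
# jobs is re-raised on the main thread, so the triggering job stays in
# "started_jobs" and the whole step is retried on the next pipeline run.
class FollowupJobCreationFailedException(Exception):
    def __init__(self, job_id: str) -> None:
        self.job_id = job_id
        super().__init__(f"could not create followup jobs for job {job_id}")


def schedule_followup_jobs(load_storage, load_id: str, job) -> None:
    try:
        for followup in job.create_followup_jobs("completed"):
            # persist to the "new_jobs" folder; the loader picks it up like any other job
            load_storage.add_new_job(load_id, followup.new_file_path())
    except Exception as ex:
        # transient: do not mark anything as failed, just surface the error;
        # the originating job is not moved out of "started_jobs"
        raise FollowupJobCreationFailedException(job.job_id()) from ex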

dlt/load/load.py (outdated):
) and job_client.should_load_data_to_staging_dataset(load_table)

# set job vars
job.set_run_vars(load_id=load_id, schema=schema, load_table=load_table)
Collaborator:

We need to change tagging stuff:

  1. we tag the session in:
def create_load_job(
        self, table: TTableSchema, file_path: str, load_id: str, restore: bool = False
    ) -> LoadJob:
        """Starts SqlLoadJob for files ending with .sql or returns None to let derived classes to handle their specific jobs"""
        self._set_query_tags_for_job(load_id, table)

which is called from the main thread and is not supposed to open any connection. IDK how it works now :) but even if it does, it will tag a session on the main thread; then we immediately close the connection and reopen it on the worker thread, and in that case tagging does not happen.

  2. Since you passed all required params in set_vars, we do not need to take any parameters in here:
def _set_query_tags_for_job(self, load_id: str, table: TTableSchema) -> None:
  3. this function should be called in the run method of all jobs that have a sql_client (created by the sql_job_client), so we need to move it from the client to the job... which is a good move
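A hedged sketch of points 2 and 3; the job class and the sql_client methods used here are placeholders, not the actual dlt implementation. The tagging call takes no parameters anymore because load_id and the table arrive via set_run_vars, and it runs inside run() on the worker thread that actually owns the connection:

# Sketch: tag the session inside run(), on the worker thread that opens the
# connection, instead of in create_load_job on the main thread.
class SqlRunnableLoadJob:
    def __init__(self, sql_client, file_path: str) -> None:
        self._sql_client = sql_client
        self._file_path = file_path
        self._load_id = self._schema = self._load_table = None

    def set_run_vars(self, load_id, schema, load_table) -> None:
        self._load_id, self._schema, self._load_table = load_id, schema, load_table

    def _set_query_tags_for_job(self) -> None:
        # no parameters needed anymore: load_id and load_table came in via set_run_vars
        self._sql_client.set_query_tags({"load_id": self._load_id, "table": self._load_table["name"]})

    def run(self) -> None:
        # the connection is opened here, on the worker thread, so the session tag sticks
        with self._sql_client.open_connection():
            self._set_query_tags_for_job()
            with open(self._file_path, encoding="utf-8") as f:
                self._sql_client.execute_sql(f.read())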

dlt/load/load.py (outdated):
# this will raise on signal
sleep(1)
sleep(
Collaborator:

see the above. are we still reading any job listings when looping on idle?

venv = Venv.restore_current()
with pytest.raises(CalledProcessError) as cpe:
    print(venv.run_script("chess_pipeline.py"))

# move job into running folder manually
Collaborator:

I think we need to change how is_package_partially_loaded works. partially loaded -> has jobs that are not completed and has jobs that are completed.

the current implementation assumes that failed jobs, started jobs and retried jobs modify the destination. this is IMO wrong (we assume that jobs are atomic - most of them are). Now I think that was wrong from the start.
WDYT?
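A hedged sketch of the proposed semantics; the shape of package_info.jobs (a dict of job-state name to job list) is an assumption, not necessarily the actual LoadPackageInfo layout:

# Sketch: a package counts as partially loaded only when it has both completed
# jobs and jobs in any other state.
def is_package_partially_loaded(package_info) -> bool:
    jobs_by_state = package_info.jobs  # e.g. {"completed_jobs": [...], "failed_jobs": [...], ...}
    completed = len(jobs_by_state.get("completed_jobs", []))
    not_completed = sum(
        len(jobs) for state, jobs in jobs_by_state.items() if state != "completed_jobs"
    )
    # partially loaded only if some jobs modified the destination while others did not
    return completed > 0 and not_completed > 0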

)

# job will be automatically found and resumed
Collaborator:

OK! I probably overlooked the lines below!

# sanity check
assert duration > 5

# we want 1000 empty processed jobs to need less than 15 seconds total (locally it runs in 10)
Collaborator:

OK. but this idle loop where we sleep 0.1 seconds worries me. do we have 100% CPU usage when this test runs? maybe it is a little bit faster, but we saturate the CPU (while having some threads working). pls take a look, with 50k or 100k jobs maybe.

we must avoid starving threads by an idle loop that reads 50k files over and over
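One possible way to address this concern (an illustration only, not what the PR implements): cache the new-jobs listing and refresh it only when a slot actually frees up, so the 0.1 second idle loop does not re-list tens of thousands of files on every pass.

# Illustration: cache the expensive directory listing between scheduling passes.
from typing import Callable, List, Optional


class CachedNewJobsListing:
    def __init__(self, list_new_jobs: Callable[[], List[str]]) -> None:
        self._list_new_jobs = list_new_jobs
        self._cache: Optional[List[str]] = None

    def get(self) -> List[str]:
        if self._cache is None:
            self._cache = self._list_new_jobs()  # expensive: lists the package folder
        return self._cache

    def invalidate(self) -> None:
        # call only when a job completes and a slot frees up (or new jobs are added)
        self._cache = None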

with pytest.raises(PipelineStepFailed):
    pipeline.run(airtable_emojis())
# move job into running folder manually
Collaborator:

    @property
    def has_pending_data(self) -> bool:
        """Tells if the pipeline contains any extracted files or pending load packages"""
        return (
            len(self.list_normalized_load_packages()) > 0
            or len(self.list_extracted_load_packages()) > 0
        )

it does not even look into the package content. any not-completed package is pending and will be executed before a new package is created. this check is the main reason this function exists

I answered the is_package_partially_loaded question above

@sh-rp (Collaborator, Author) commented Jul 30, 2024

> my take: make it a transient error
>
> and we need to decide how we deal with failed packages. right now we continue the load and do not raise an exception at the end. IMO we should change that: we should continue the load but raise an exception at the end automatically.
>
> I'll write a ticket for that - it is a breaking change

Ok, I changed it slightly and added new exceptions to indicate what went wrong, plus tests. Regarding the load package with failing jobs: I totally agree that it should raise at the end of the load if there were failed jobs. Right now these errors are pretty much hidden.

@sh-rp sh-rp force-pushed the feat/continuous-load-jobs branch from d1b2144 to ce3e1c9 Compare July 30, 2024 14:50
@@ -723,19 +723,12 @@ def build_job_file_name(

@staticmethod
def is_package_partially_loaded(package_info: LoadPackageInfo) -> bool:
sh-rp (Collaborator, Author):

the behavior is now unified between the different package states; I'd say this is correct.

@rudolfix (Collaborator) left a comment

LGTM!

@sh-rp sh-rp merged commit 3bb677f into devel Aug 4, 2024
54 checks passed
@sh-rp sh-rp deleted the feat/continuous-load-jobs branch August 4, 2024 22:08
Labels: ci full (run the full load tests on pr), enhancement (New feature or request)
Projects: Status: Done
2 participants