
Improve SQL Server Incremental Statement #447

Open

dvdrndlph opened this issue Nov 22, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

dvdrndlph commented Nov 22, 2024

Issue Description

I am incrementally slinging data from SQL Server into a large Oracle table (loaded via sqlldr). Here is an example command:

sling run --src-conn CDW_MSSQL --src-stream dbo.BillingTransactionFact --tgt-conn ORABOODLE --tgt-object BillingTransactionFact --mode incremental --update-key BillingTransactionKey --limit 250000 -d

The rows of this table are wide, so they take a lot of space to manage whenever ordering is required. For this table, I must move data in very small increments to avoid the following error:

mssql: Could not allocate a new page for database 'TEMPDB' because the 'DEFAULT' filegroup is full due to lack of storage space or database files reaching the maximum allowed size. Note that UNLIMITED files are still limited to 16TB. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
  • Description of the issue:
    The command spawns a query that is logged like this:
2024-11-21 18:59:16 INF reading from source database
2024-11-21 18:59:16 DBG select top 250000 * from "dbo"."BillingTransactionFact" where "BillingTransactionKey" > 12249997 order by "BillingTransactionKey" asc

This query, after considerable processing time, produces the TEMPDB error above.

The query produces the same error outside sling.

The following equivalent query, on the other hand, performs better and uses much less TEMPDB space:

with ids as (
  select top 400000 "BillingTransactionKey"
    from "dbo"."BillingTransactionFact" 
   where "BillingTransactionKey" > 12249997 
   order by "BillingTransactionKey" asc
) select *
    from "dbo"."BillingTransactionFact" btf
   inner join ids i
      on btf."BillingTransactionKey" = i."BillingTransactionKey";

Could sling use logic like this to read from a SQL Server source database?

I think this optimization could help a lot.

  • Sling version (sling --version):
    Version: 1.2.22

  • Operating System (linux, mac, windows):
    Windows

  • Replication Configuration:
    N/A

  • Log Output (please run command with -d):

See above.

flarco commented Nov 22, 2024

Interesting.

You should be able to use a custom SQL stream for this, using the {incremental_value} placeholder.
See here for details.

# create query.sql
echo '
with ids as (
  select top 400000 "BillingTransactionKey"
    from "dbo"."BillingTransactionFact" 
   where "BillingTransactionKey" > coalesce({incremental_value}, 0)
   order by "BillingTransactionKey" asc
) select *
    from "dbo"."BillingTransactionFact" btf
   inner join ids i
      on btf."BillingTransactionKey" = i."BillingTransactionKey";
' > query.sql

# use query.sql as stream
sling run --src-conn CDW_MSSQL --src-stream file://query.sql --tgt-conn ORABOODLE --tgt-object BillingTransactionFact --mode incremental --update-key BillingTransactionKey --limit 250000 -d

@dvdrndlph

One little tweak to make it work:

echo '
with ids as (
  select top 400000 "BillingTransactionKey"
    from "dbo"."BillingTransactionFact" 
   where "BillingTransactionKey" > coalesce({incremental_value}, 0)
   order by "BillingTransactionKey" asc
) select btf.*
    from "dbo"."BillingTransactionFact" btf
   inner join ids i
      on btf."BillingTransactionKey" = i."BillingTransactionKey";
' > query.sql

This made a huge difference. I went from processing 250K rows per iteration at a throughput of about 1K rows/sec to 2M rows per iteration at about 3,600 rows/sec. I really think this should be the default query. It is worth the wordiness.

Cheers,
Dave

dvdrndlph commented Nov 22, 2024

In the interest of full disclosure, we also more than doubled the TEMPDB storage on SQL Server, which may have something to do with the observed performance improvement. But I think the change here essentially removes TEMPDB as a bottleneck. Wherever possible, for SQL Server at least, I think we want to limit projections when ordering results, and subqueries can really help performance.
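To illustrate what I mean in general terms (illustrative T-SQL only, using hypothetical WideTable / SortKey names, not sling's actual generated statement): sort and limit on the narrow key column alone, then join back to the wide table for the remaining columns.

-- Illustrative pattern, not sling's generated SQL: sort/limit only the
-- narrow key column, then join back for the wide columns, so the sort
-- spills far less into TEMPDB.
with keys as (
  select top 400000 "SortKey"
    from "dbo"."WideTable"
   where "SortKey" > 12345
   order by "SortKey" asc
) select w.*
    from "dbo"."WideTable" w
   inner join keys k
      on w."SortKey" = k."SortKey";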

flarco commented Nov 22, 2024

Thanks for the details. It's interesting that it improved that much, I would not have guessed. Each database has its own engine optimization and particularities... I will look into setting that logic as the default for SQL Server.
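Roughly, the default incremental read for SQL Server could then be templated along these lines (just a sketch that reuses the custom-SQL placeholder style; the braced names are illustrative, not the actual generated statement):

-- Sketch only: braced placeholders are illustrative, mirroring the
-- custom-SQL syntax above, not confirmed sling internals.
with ids as (
  select top {limit} {update_key}
    from {source_table}
   where {update_key} > {incremental_value}
   order by {update_key} asc
) select t.*
    from {source_table} t
   inner join ids i
      on t.{update_key} = i.{update_key}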

flarco added the enhancement (New feature or request) label on Nov 22, 2024
@dvdrndlph

Also, we are running SQL Server version 15.0.4410. YMMV.

flarco changed the title from 'Incremental load of "wide" SQL Server table exhausts TEMPDB' to 'Improve SQL Server Incremental Statement' on Nov 22, 2024
dvdrndlph commented Nov 26, 2024

Okay, my subquery hack is not including all records. I really thought using TOP in a subquery would honor the ORDER BY, but after all iterations I find that a bunch of records have been missed. So this is what my new query.sql file looks like:

with foo as (
  select *
    from "dbo"."BillingTransactionFact"
   where "BillingTransactionKey" > coalesce({incremental_value}, -4)
     and "BillingTransactionKey" < coalesce({incremental_value}, -4) + 10000000
) select *
    from foo;

where 10000000 matches what is passed to sling via --limit (and is, I think, ultimately ignored because of what the stream's SQL specifies).

This effectively processes at most the specified limit of rows per iteration and does a simple insert to save time. It also avoids the ORDER BY entirely, which relieves the pressure on TEMPDB.
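For example, with an {incremental_value} of 12249997 (the value shown in the log above), the resolved query would read only keys 12249998 through 22249996, with no sort at all:

-- Resolved example, assuming {incremental_value} = 12249997 and the
-- 10,000,000-key window above; no ORDER BY, so no large sort in TEMPDB.
with foo as (
  select *
    from "dbo"."BillingTransactionFact"
   where "BillingTransactionKey" > 12249997
     and "BillingTransactionKey" < 22249997
) select *
    from foo;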
