
Improve SQL Server Incremental Statement #447

Open

dvdrndlph opened this issue Nov 22, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

dvdrndlph commented Nov 22, 2024

Issue Description

I am incrementally slinging data from SQL Server into a large Oracle table (loaded via sqlldr). Here is an example command:

sling run --src-conn CDW_MSSQL --src-stream dbo.BillingTransactionFact --tgt-conn ORABOODLE --tgt-object BillingTransactionFact --mode incremental --update-key BillingTransactionKey --limit 250000 -d

The rows of this table are wide, so they take a lot of space to manage whenever ordering is required. For this table, I must move data in very small increments to avoid the following error:

mssql: Could not allocate a new page for database 'TEMPDB' because the 'DEFAULT' filegroup is full due to lack of storage space or database files reaching the maximum allowed size. Note that UNLIMITED files are still limited to 16TB. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
  • Description of the issue:
    The command spawns a query that is logged like this:
2024-11-21 18:59:16 INF reading from source database
2024-11-21 18:59:16 DBG select top 250000 * from "dbo"."BillingTransactionFact" where "BillingTransactionKey" > 12249997 order by "BillingTransactionKey" asc

This query, after considerable processing time, produces the TEMPDB error above.

The query produces the same error outside sling.

The following equivalent query, on the other hand, performs better and uses much less TEMPDB space:

with ids as (
  select top 400000 "BillingTransactionKey"
    from "dbo"."BillingTransactionFact" 
   where "BillingTransactionKey" > 12249997 
   order by "BillingTransactionKey" asc
) select *
    from "dbo"."BillingTransactionFact" btf
   inner join ids i
      on btf."BillingTransactionKey" = i."BillingTransactionKey";

Could sling use logic like this to read from a SQL Server source database?

I think this optimization could help a lot.

  • Sling version (sling --version):
    Version: 1.2.22

  • Operating System (linux, mac, windows):
    Windows

  • Replication Configuration:
    N/A

  • Log Output (please run command with -d):

See above.

flarco commented Nov 22, 2024

Interesting.

You should be able to use a custom SQL stream for this, using the {incremental_value} placeholder.
See here for details.

# create query.sql
echo '
with ids as (
  select top 400000 "BillingTransactionKey"
    from "dbo"."BillingTransactionFact" 
   where "BillingTransactionKey" > coalesce({incremental_value}, 0)
   order by "BillingTransactionKey" asc
) select *
    from "dbo"."BillingTransactionFact" btf
   inner join ids i
      on btf."BillingTransactionKey" = i."BillingTransactionKey";
' > query.sql

# use query.sql as stream
sling run --src-conn CDW_MSSQL --src-stream file://query.sql --tgt-conn ORABOODLE --tgt-object BillingTransactionFact --mode incremental --update-key BillingTransactionKey --limit 250000 -d

@dvdrndlph

One little tweak to make it work:

echo '
with ids as (
  select top 400000 "BillingTransactionKey"
    from "dbo"."BillingTransactionFact" 
   where "BillingTransactionKey" > coalesce({incremental_value}, 0)
   order by "BillingTransactionKey" asc
) select btf.*
    from "dbo"."BillingTransactionFact" btf
   inner join ids i
      on btf."BillingTransactionKey" = i."BillingTransactionKey";
' > query.sql

This made a huge difference. I went from processing 250K rows per iteration at a throughput of about 1K rows/sec to 2M rows per iteration at about 3,600 rows/sec. I really think this should be the default query. It is worth the wordiness.

Cheers,
Dave

dvdrndlph commented Nov 22, 2024

In the interest of full disclosure, we also more than doubled the TEMPDB storage on SQL Server, which may have something to do with the observed performance improvement. But I think the change here essentially removes TEMPDB as a bottleneck. Wherever possible, for SQL Server at least, I think we want to limit projections when ordering results, and subqueries can really help performance.
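To illustrate what I mean in general terms (illustrative T-SQL only, using hypothetical WideTable / SortKey names, not sling's actual generated statement): sort and limit on the narrow key column alone, then join back to the wide table for the remaining columns.

-- Illustrative pattern, not sling's generated SQL: sort/limit only the
-- narrow key column, then join back for the wide columns, so the sort
-- spills far less into TEMPDB.
with keys as (
  select top 400000 "SortKey"
    from "dbo"."WideTable"
   where "SortKey" > 12345
   order by "SortKey" asc
) select w.*
    from "dbo"."WideTable" w
   inner join keys k
      on w."SortKey" = k."SortKey";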

flarco commented Nov 22, 2024

Thanks for the details. It's interesting that it improved that much, I would not have guessed. Each database has its own engine optimization and particularities... I will look into setting that logic as the default for SQL Server.
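Roughly, the default incremental read for SQL Server could then be templated along these lines (just a sketch that reuses the custom-SQL placeholder style; the braced names are illustrative, not the actual generated statement):

-- Sketch only: braced placeholders are illustrative, mirroring the
-- custom-SQL syntax above, not confirmed sling internals.
with ids as (
  select top {limit} {update_key}
    from {source_table}
   where {update_key} > {incremental_value}
   order by {update_key} asc
) select t.*
    from {source_table} t
   inner join ids i
      on t.{update_key} = i.{update_key}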

flarco added the enhancement (New feature or request) label on Nov 22, 2024
@dvdrndlph

Also, we are running SQL Server version 15.0.4410. YMMV.

flarco changed the title from 'Incremental load of "wide" SQL Server table exhausts TEMPDB' to 'Improve SQL Server Incremental Statement' on Nov 22, 2024
dvdrndlph commented Nov 26, 2024

Okay, my subquery hack is not including all records. I really thought using TOP in a subquery would honor the ORDER BY, but after all iterations I find that a bunch of records have been missed. So this is what my new query.sql file looks like:

with foo as (
  select *
    from "dbo"."BillingTransactionFact"
   where "BillingTransactionKey" > coalesce({incremental_value}, -4)
     and "BillingTransactionKey" < coalesce({incremental_value}, -4) + 10000000
) select *
    from foo;

where 10000000 matches what is passed to sling via --limit (and is, I think, ultimately ignored because of what the stream's SQL specifies).

This effectively processes at most the specified limit of rows per iteration and does a simple insert to save time. It also avoids the ORDER BY entirely, which relieves the pressure on TEMPDB.
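For example, with an {incremental_value} of 12249997 (the value shown in the log above), the resolved query would read only keys 12249998 through 22249996, with no sort at all:

-- Resolved example, assuming {incremental_value} = 12249997 and the
-- 10,000,000-key window above; no ORDER BY, so no large sort in TEMPDB.
with foo as (
  select *
    from "dbo"."BillingTransactionFact"
   where "BillingTransactionKey" > 12249997
     and "BillingTransactionKey" < 22249997
) select *
    from foo;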
