
add default max batch size and batchSize parameter for SF updates #55

Open · wants to merge 41 commits into base: master

Conversation

unintellisense

I recently started using the project to perform bulk updates to Salesforce, and found it was not creating multiple batches when the row count exceeded the maximum of 10,000 rows allowed per batch.

This PR sets a default batch size of 5000 (since there is also a maximum size per batch, I preferred not to start at 10k) and adds a batchSize parameter so you can change the batch size if you prefer. Often the SF API will process several smaller batches faster than one large batch, though note that your API limit is measured by the number of batches.
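A minimal sketch of the batching arithmetic this PR describes (the object and method names here are hypothetical, not the project's actual API): rows are split into ceil(rowCount / batchSize) batches, with a default batchSize of 5000, safely under Salesforce's 10,000-row-per-batch limit.

```scala
// Hypothetical sketch of the batch-count arithmetic; not the PR's actual code.
object BatchSizing {
  val DefaultBatchSize = 5000    // default proposed by this PR
  val SalesforceMaxRows = 10000  // Salesforce's per-batch row limit

  // Number of batches needed to cover rowCount rows, batchSize rows each.
  def batchCount(rowCount: Long, batchSize: Int = DefaultBatchSize): Long = {
    require(batchSize > 0 && batchSize <= SalesforceMaxRows,
      "batchSize must be positive and within the Salesforce limit")
    if (rowCount == 0) 0L
    else (rowCount + batchSize - 1) / batchSize  // integer ceiling division
  }
}
```

For example, 12,000 rows at the default batch size of 5000 would produce three batches instead of one oversized request.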


mosche commented Sep 10, 2020

Running into the same issue, this would be very helpful @springml :)

build.sbt Outdated
Comment on lines 14 to 15
"com.springml" % "salesforce-wave-api" % "1.0.10",
"com.github.loanpal-engineering" % "salesforce-wave-api" % "eb71436",

@unintellisense Wondering, is this related to this change? Why do you need to switch to a fork of the API client?


mosche commented Sep 10, 2020

Looking into it a bit further, it might be better to improve Utils.repartition to also consider the number of rows when calculating the target number of partitions, rather than basing the partition count only on the estimated data size.

Comment on lines +35 to +36
val partitionCnt = (1 + csvRDD.count() / batchSize).toInt
val partitionedRDD = csvRDD.repartition(partitionCnt)
mosche commented Sep 15, 2020

This doesn't seem to be the right place to repartition, as it just leads to a second round of shuffling the data around :/ Partitioning to control the size of ingest batches is already done in Utils.repartition, so the limit of records per batch should be considered there:
https://github.com/springml/spark-salesforce/pull/59/files#diff-b359f3e710dff2341dbedadb012b9ff4R62-R73
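One way to fold a per-batch row limit into the existing size-based calculation, as suggested above, is to take the larger of the size-based and row-based partition estimates so each partition stays under both constraints. This is a hypothetical sketch, not the project's actual Utils.repartition; all names and parameters here are illustrative.

```scala
// Hypothetical sketch of combining a byte target with a row limit when
// choosing a partition count; not the actual Utils.repartition code.
object RepartitionMath {
  def targetPartitions(estimatedBytes: Long,
                       rowCount: Long,
                       maxBytesPerPartition: Long,
                       maxRowsPerPartition: Long): Int = {
    // Partitions needed to keep each partition under the byte target.
    val bySize = math.ceil(estimatedBytes.toDouble / maxBytesPerPartition).toLong
    // Partitions needed to keep each partition under the row limit.
    val byRows = math.ceil(rowCount.toDouble / maxRowsPerPartition).toLong
    // Satisfy both constraints; always at least one partition.
    math.max(1L, math.max(bySize, byRows)).toInt
  }
}
```

Doing this inside the existing repartition step avoids the second shuffle criticized above, since the data is partitioned once under both constraints.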


A PR for the alternative approach is here #59
