add default max batch size and batchSize parameter for SF updates #55
base: master
Conversation
Add batch size update
Nulls not empty
Running into the same issue, this would be very helpful @springml :)
build.sbt
"com.springml" % "salesforce-wave-api" % "1.0.10", | ||
"com.github.loanpal-engineering" % "salesforce-wave-api" % "eb71436", |
@unintellisense Wondering, is this related to this change? Why do you need to change to a fork of the api client?
Looking into it a bit further, it might be better to improve
val partitionCnt = (1 + csvRDD.count() / batchSize).toInt
val partitionedRDD = csvRDD.repartition(partitionCnt)
This doesn't seem to be the right place to repartition, as it just leads to a second round of shuffling the data around :/ Partitioning to control the size of ingest batches is already done in Utils.repartition, so the limit of records per batch should be considered there:
https://github.com/springml/spark-salesforce/pull/59/files#diff-b359f3e710dff2341dbedadb012b9ff4R62-R73
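For reference, a minimal sketch of the direction suggested above: folding a records-per-batch cap into the existing size-based partition calculation. The helper name, the byte cap, and the maxRecordsPerBatch parameter are illustrative assumptions, not the actual Utils.repartition signature in this repo.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.util.SizeEstimator

object BatchRepartition {
  // Derive the partition count from BOTH the estimated payload size and a
  // maximum number of records per batch, so no single batch exceeds the
  // Bulk API's 10,000-record limit.
  def repartition(rdd: RDD[Row],
                  maxBytesPerBatch: Long = 10L * 1024 * 1024, // assumed byte cap per batch
                  maxRecordsPerBatch: Long = 10000L): RDD[Row] = {
    val totalRecords = rdd.count()
    if (totalRecords == 0) {
      rdd
    } else {
      // Estimate the average row size from a small sample (assumes roughly uniform rows).
      val sample = rdd.take(10)
      val avgRowBytes =
        math.max(1L, sample.map(SizeEstimator.estimate(_)).sum / sample.length)

      val partitionsBySize    = math.ceil(totalRecords.toDouble * avgRowBytes / maxBytesPerBatch).toInt
      val partitionsByRecords = math.ceil(totalRecords.toDouble / maxRecordsPerBatch).toInt
      val partitions = math.max(1, math.max(partitionsBySize, partitionsByRecords))

      if (partitions == rdd.getNumPartitions) rdd else rdd.repartition(partitions)
    }
  }
}
```

Handling the record limit here would avoid the extra repartition (and shuffle) after the CSV conversion.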
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A PR for the alternative approach is here #59
spark 3.0/scala 2.12 compatibility
swap out java-sizeof version and remove scala version from the artifact-id
update gitignore a
use salesforce magic null string value
add support for max column width to csv parser
pom changes
shade dependency
support pkChunks with filtering (handling empty batches)
bulk api 2.0
change force api to 53
fix max columns
I recently started using the project to perform bulk updates to Salesforce, and found that it was not creating multiple batches when the row count exceeded the 10,000-row maximum allowed per batch.
This PR sets a default batch size of 5,000 rows (since there is also a maximum size per batch, I preferred not to start at the 10k limit) and adds a batchSize parameter so you can change it. The Salesforce API will often process several smaller batches faster than one large batch, though note that your API limit is measured by the number of batches.
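For context, here is roughly how the new option would be used when writing a DataFrame to Salesforce. This is a sketch only: the format string and the username/password/sfObject option names are taken from the project README, and batchSize is the parameter this PR adds.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sf-bulk-update").getOrCreate()

// DataFrame of records to update; the Id column tells Salesforce which rows to touch.
val updatesDF = spark.read.option("header", "true").csv("/path/to/contact_updates.csv")

updatesDF.write
  .format("com.springml.spark.salesforce")
  .option("username", sys.env("SF_USERNAME"))
  .option("password", sys.env("SF_PASSWORD")) // password concatenated with the security token
  .option("sfObject", "Contact")
  .option("batchSize", "5000")                // rows per Bulk API batch; 5,000 is the new default
  .save()
```

With around 23,000 rows and the default of 5,000, the update would be split across five batches instead of being sent as a single oversized request.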