
add default max batch size and batchSize parameter for SF updates #55

Open · wants to merge 41 commits into base: master

Conversation

unintellisense

I recently started using the project to perform bulk updates to Salesforce, and found it was not creating multiple batches when the row count exceeded the maximum of 10,000 rows allowed per batch.

This PR sets a default batch size of 5000 (since there is also a maximum size per batch, I preferred not to start at 10k) and adds a batchSize parameter so you can change the batch size if you prefer. Often the SF API will process several smaller batches faster than one large batch, though note that your API limit is measured by the number of batches.
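A minimal sketch of the batching arithmetic this PR describes (the object and method names here are hypothetical, not the project's actual API): rows are split into ceil(rowCount / batchSize) batches, with a default batchSize of 5000, safely under Salesforce's 10,000-row-per-batch limit.

```scala
// Hypothetical sketch of the batch-count arithmetic; not the PR's actual code.
object BatchSizing {
  val DefaultBatchSize = 5000    // default proposed by this PR
  val SalesforceMaxRows = 10000  // Salesforce's per-batch row limit

  // Number of batches needed to cover rowCount rows, batchSize rows each.
  def batchCount(rowCount: Long, batchSize: Int = DefaultBatchSize): Long = {
    require(batchSize > 0 && batchSize <= SalesforceMaxRows,
      "batchSize must be positive and within the Salesforce limit")
    if (rowCount == 0) 0L
    else (rowCount + batchSize - 1) / batchSize  // integer ceiling division
  }
}
```

For example, 12,000 rows at the default batch size of 5000 would produce three batches instead of one oversized request.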


mosche commented Sep 10, 2020

Running into the same issue, this would be very helpful @springml :)

build.sbt Outdated
Comment on lines 14 to 15
"com.springml" % "salesforce-wave-api" % "1.0.10",
"com.github.loanpal-engineering" % "salesforce-wave-api" % "eb71436",

@unintellisense Wondering, is this related to this change? Why do you need to switch to a fork of the API client?


mosche commented Sep 10, 2020

Looking into it a bit further, it might be better to improve Utils.repartition to also consider the number of rows when calculating the target number of partitions, rather than basing the partition count only on the estimated data size.

Comment on lines +35 to +36
val partitionCnt = (1 + csvRDD.count() / batchSize).toInt
val partitionedRDD = csvRDD.repartition(partitionCnt)
mosche commented Sep 15, 2020

This doesn't seem to be the right place to repartition, as it just leads to a second round of shuffling the data around :/ Partitioning to control the size of ingest batches is already done in Utils.repartition, so the limit of records per batch should be considered there:
https://github.com/springml/spark-salesforce/pull/59/files#diff-b359f3e710dff2341dbedadb012b9ff4R62-R73
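One way to fold a per-batch row limit into the existing size-based calculation, as suggested above, is to take the larger of the size-based and row-based partition estimates so each partition stays under both constraints. This is a hypothetical sketch, not the project's actual Utils.repartition; all names and parameters here are illustrative.

```scala
// Hypothetical sketch of combining a byte target with a row limit when
// choosing a partition count; not the actual Utils.repartition code.
object RepartitionMath {
  def targetPartitions(estimatedBytes: Long,
                       rowCount: Long,
                       maxBytesPerPartition: Long,
                       maxRowsPerPartition: Long): Int = {
    // Partitions needed to keep each partition under the byte target.
    val bySize = math.ceil(estimatedBytes.toDouble / maxBytesPerPartition).toLong
    // Partitions needed to keep each partition under the row limit.
    val byRows = math.ceil(rowCount.toDouble / maxRowsPerPartition).toLong
    // Satisfy both constraints; always at least one partition.
    math.max(1L, math.max(bySize, byRows)).toInt
  }
}
```

Doing this inside the existing repartition step avoids the second shuffle criticized above, since the data is partitioned once under both constraints.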


A PR for the alternative approach is here #59
