
Add parallel upload for BulkImportWriter #134

Merged: 6 commits merged into master from parallel-upload on Aug 29, 2024

Conversation

chezou (Member) commented on Aug 16, 2024

This patch introduces parallel upload capability to BulkImportWriter.

It splits the data into chunks of 10,000 records by default and uploads the chunks in parallel. It also reduces memory consumption when uploading in msgpack format.

Upload becomes about 4.4x faster, and memory consumption can be controlled by writing chunks to temporary files and tuning the chunk_record_size parameter.
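
For illustration only, a minimal sketch of the chunking idea (hypothetical helper name records_to_msgpack_files; not the PR's actual code): each chunk of records is serialized into its own gzipped msgpack temporary file, so only one chunk has to be held in memory at a time and each file can later be uploaded as a separate part.

import gzip
import tempfile

import msgpack

def records_to_msgpack_files(records, chunk_record_size=10_000):
    # Illustrative sketch, not pytd's actual implementation.
    # Serialize records chunk by chunk into gzipped msgpack temporary files.
    # Returns (file, size) pairs; each file is rewound and ready to upload.
    files = []
    for start in range(0, len(records), chunk_record_size):
        chunk = records[start:start + chunk_record_size]
        fp = tempfile.TemporaryFile()
        with gzip.GzipFile(mode="wb", fileobj=fp) as gz:
            packer = msgpack.Packer()
            for record in chunk:
                gz.write(packer.pack(record))
        size = fp.tell()  # read the size before rewinding
        fp.seek(0)
        files.append((fp, size))
    return files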

Dummy data creation for benchmark

>>> import numpy as np; import pandas as pd
>>> def fake_data(n):
...     users = np.random.choice([0., 1., 2.], (n, 1))
...     items = np.random.choice([0., 1., 2.], (n, 1))
...     weight = np.random.rand(n, 1)
...     return np.concatenate((users, items, weight), axis=1)
...
>>> d1 = fake_data(10_000_000)
>>> df = pd.DataFrame(d1, columns=["users", "items", "scores"])
>>> import pytd; import os
>>> client = pytd.Client(database="aki", apikey=os.environ["TD_API_KEY"])

Upload with a single thread

>>> import time
>>> s=time.time()
>>> client.load_table_from_dataframe(df, "aki.pytd_bi_test", writer="bulk_import", if_exists="overwrite", fmt="msgpack", max_workers=1, chunk_record_size=10_000_000)
>>> print(f"elapsed time:{time.time() - s} sec")
uploading data converted into a msgpack file
uploaded data in 64.06 sec
performing a bulk import job
[job id 2172780406] imported 10000000 records.
elapsed time:281.57144117355347 sec

Upload with 6 threads

>>> import pytd; import os
>>> import time
>>> s=time.time()
>>> client.load_table_from_dataframe(df, "aki.pytd_bi_test", writer="bulk_import", if_exists="overwrite", fmt="msgpack", max_workers=6, chunk_record_size=1_000_000)
>>> print(f"elapsed time:{time.time() - s} sec")
uploading data converted into a msgpack file
uploaded data in 14.56 sec
performing a bulk import job
[job id 2172780795] imported 10000000 records.
elapsed time:209.5141899585724 sec
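
For reference, the 4.4x figure corresponds to the per-part upload times reported above: 64.06 sec / 14.56 sec ≈ 4.4. The single-thread run uploads one 10M-record chunk, while the 6-thread run splits the data into ten 1M-record chunks. Total elapsed time improves less (281.6 sec → 209.5 sec) because it also includes the bulk import perform job, which presumably does not benefit from parallel upload.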

@chezou requested a review from tung-vu-td on August 17, 2024 03:46

Commit: "This change is to avoid the need to keep the entire msgpack in memory, which can be a problem for large data sets."

@chezou requested a review from shroman on August 27, 2024 17:08
pytd/writer.py Outdated (resolved)

pytd/writer.py Outdated
@@ -449,7 +482,7 @@ def _bulk_import(self, table, file_like, if_exists, fmt="csv"):
         table : :class:`pytd.table.Table`
             Target table.

-        file_like : File like object
+        file_like : List of file like objects

Contributor suggested change:
-        file_like : List of file like objects
+        file_likes : List of file like objects

pytd/writer.py Outdated
             stack.close()

-    def _bulk_import(self, table, file_like, if_exists, fmt="csv"):
+    def _bulk_import(self, table, file_like, if_exists, fmt="csv", max_workers=5):

Contributor suggested change:
-    def _bulk_import(self, table, file_like, if_exists, fmt="csv", max_workers=5):
+    def _bulk_import(self, table, file_likes, if_exists, fmt="csv", max_workers=5):

pytd/writer.py Outdated
             # To skip API._prepare_file(), which recreate msgpack again.
-            bulk_import.upload_part("part", file_like, size)
+            with ThreadPoolExecutor(max_workers=max_workers) as executor:
+                for i, fp in enumerate(file_like):

Contributor suggested change:
-                for i, fp in enumerate(file_like):
+                for i, fp in enumerate(file_likes):
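
For context, a generic sketch of this parallel-part-upload pattern (not the PR's exact code; variable names are illustrative). Leaving the with-block waits for all submitted uploads, and calling result() on each future re-raises any upload error:

from concurrent.futures import ThreadPoolExecutor, as_completed

def upload_parts(bulk_import, file_likes_with_sizes, max_workers=5):
    # Illustrative sketch, not pytd's actual implementation.
    # Submit one upload_part call per chunk file; join them all before returning.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(bulk_import.upload_part, f"part-{i}", fp, size)
            for i, (fp, size) in enumerate(file_likes_with_sizes)
        ]
        for future in as_completed(futures):
            future.result()  # re-raise the first upload failure, if any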

pytd/writer.py Outdated
         else:
-            bulk_import.upload_file("part", fmt, file_like)
+            fp = file_like[0]

Contributor suggested change:
-            fp = file_like[0]
+            fp = file_likes[0]

@@ -542,7 +591,9 @@ def _write_msgpack_stream(self, items, stream):
             mp = packer.pack(normalized_msgpack(item))
             gz.write(mp)

-        stream.seek(0)

Contributor: Did this #seek become unnecessary?

Member Author (chezou): Yes, it is moved here: https://github.com/treasure-data/pytd/pull/134/files#diff-7e3490636bfbd0197f47ef9694662f2621beb294fb178b5185c8f257082c53e7R533
This is because, to get the file size via stream.tell() correctly, the #seek has to happen later, after the size has been read.
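
A minimal sketch of the ordering being described (illustrative names, not the exact pytd code): the caller reads the size with tell() while the position is still at the end of the written data, and only then rewinds with seek(0).

import gzip
import io

import msgpack

def write_msgpack_stream(items, stream):
    # Illustrative sketch, not pytd's actual implementation.
    # Write gzipped msgpack into `stream`; leave rewinding to the caller.
    with gzip.GzipFile(mode="wb", fileobj=stream) as gz:
        packer = msgpack.Packer()
        for item in items:
            gz.write(packer.pack(item))

stream = io.BytesIO()
write_msgpack_stream([{"a": 1}, {"a": 2}], stream)
size = stream.tell()  # correct only while the position is still at the end
stream.seek(0)        # rewind afterwards, right before uploading the part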

                df.to_dict(orient="records"), fp
            )
            api_client.create_bulk_import().upload_part.assert_called_with(
                "part-0", ANY, 62

Contributor: Maybe I missed something, but where does this 62 come from?

Member Author (chezou): I tried to get the file size from fp, but I gave up because it doesn't match the actual file size in the _write_msgpack_stream() method.

@chezou merged commit b560475 into master on Aug 29, 2024
21 checks passed
@chezou deleted the parallel-upload branch on August 29, 2024 19:51