---
Hi @sslivkoff, thanks for the discussion and the information. ConnectorX mainly targets fetching large query results. It speeds up the process by optimizing the client-side execution as well as by saturating both network and machine resources through parallelism. However, when query execution on the database server is the bottleneck (e.g. query execution time is long but the result is relatively small), the improvement from ConnectorX becomes minor, and it can sometimes even be slower than sqlalchemy due to the overhead of fetching metadata (e.g. data types, number of rows in the result). I don't think supporting list/dict destinations would solve the problem, since that overhead still exists (please let me know if you think it is possible). As for pandas, ConnectorX internally allocates the corresponding numpy arrays needed by the pandas block manager and then converts them to a pandas dataframe at negligible cost.
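to illustrate that last point with a python-level sketch (the real allocation happens in ConnectorX's Rust core, so this is only an analogy, not the actual implementation): once a correctly-typed numpy array already holds the result, wrapping it in a dataframe copies nothing.

```python
# python-level analogy only: wrapping an already-filled numpy array in a
# DataFrame reuses the same memory instead of copying it
import numpy as np
import pandas as pd

values = np.zeros((1_000_000, 4), dtype='float64')  # pretend the driver filled this

df = pd.DataFrame(values, columns=['a', 'b', 'c', 'd'], copy=False)  # no data copy
print('df shares memory with source array:', np.shares_memory(df.values, values))
```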
---
hi. thanks for the response. what you're saying makes a lot of sense, and I realize that my suggestions might not align with the design goals of connector-x.

I believe you're correct that supporting additional destinations will not help with any server bottlenecks. it would however help with the client-side execution bottlenecks in the case of medium-sized queries to a local db. in this case the client (e.g. python) can still remain the main bottleneck, at least for the tests I ran.

in the tests I ran I was connecting to a local db and running queries of many sizes. in all cases I saw a substantial improvement using connector-x over sqlalchemy, as long as I manually imported pandas prior to running the connector-x query. but if I did not manually import pandas beforehand, then connector-x took longer. I tested this for each of the supported connector-x destinations and each gave similar results.

I am not familiar with the pandas block manager and I don't know rust, so I don't have much insight into the relevant constraints here. would it be nontrivial to avoid the pandas block manager and instead read the raw bytes using something like numpy's buffer interfaces?
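to make that concrete, here's a rough sketch of the kind of thing I'm imagining, using `np.frombuffer` as one possible mechanism (my assumption, not an existing connector-x API):

```python
# hypothetical sketch: if connector-x handed back raw column bytes, numpy
# could view them in place, with no copy and no dataframe library imported
import numpy as np

raw = (123).to_bytes(8, 'little') * 4      # stand-in for one int64 column's raw bytes
column = np.frombuffer(raw, dtype='<i8')   # zero-copy view over the buffer
print(column)                              # [123 123 123 123]
```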
---
hi. thanks for bearing with me. here are a couple of examples of the import bottlenecks I was running into. each of these examples should be run in a fresh python session to demonstrate the effects of imports. (i'm also just showing a single run of each example, but they show the same general trend when run multiple times)

### import timings

so just starting with a baseline of how long things take to import...

timing of connectorx import:

```python
import time
import sys

print('timing import of connectorx...')
start_time = time.time()
import connectorx as cx
end_time = time.time()
print('connectorx took ' + str(end_time - start_time) + 's to import')
print('pandas is implicitly imported by connectorx:', 'pandas' in sys.modules)
print('numpy is implicitly imported by connectorx:', 'numpy' in sys.modules)
```
timing of numpy import:

```python
import time

print('timing import of numpy...')
start_time = time.time()
import numpy as np
end_time = time.time()
print('numpy took ' + str(end_time - start_time) + 's to import')
```
timing of pandas import:

```python
import time

print('timing import of pandas...')
start_time = time.time()
import pandas as pd
end_time = time.time()
print('pandas took ' + str(end_time - start_time) + 's to import')
```
also note that once a package is imported, importing it again is trivially fast:

```python
import time

print('timing import of pandas...')
start_time = time.time()
import pandas as pd
end_time = time.time()
print('pandas took ' + str(end_time - start_time) + 's to import')

start_time = time.time()
import pandas as pd
end_time = time.time()
print('pandas took ' + str(end_time - start_time) + 's to import')
```
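(side note: a cleaner way to collect these numbers without restarting a session for every measurement is to time each import in a fresh interpreter via subprocess. a sketch; note that each number also includes python's own startup time, roughly tens of ms:)

```python
# time each import in a brand-new interpreter so sys.modules caching in the
# current session can't skew the result
import subprocess
import sys
import time

def time_fresh_import(module):
    start_time = time.time()
    subprocess.run([sys.executable, '-c', 'import ' + module], check=True)
    return time.time() - start_time

for module in ['connectorx', 'numpy', 'pandas', 'sqlalchemy']:
    print(module + ' took ' + str(time_fresh_import(module)) + 's to import')
```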
### query timings

now how long does it take to run a connectorx query? here I am running a query on a small sqlite db with about 100k rows:

```python
import time
import connectorx as cx

db_url = '<db_path>'
query = 'SELECT * FROM table'

start_time = time.time()
cx.read_sql(db_url, query)
end_time = time.time()
print('query took ' + str(end_time - start_time) + 's')

start_time = time.time()
cx.read_sql(db_url, query)
end_time = time.time()
print('query took ' + str(end_time - start_time) + 's')

start_time = time.time()
cx.read_sql(db_url, query)
end_time = time.time()
print('query took ' + str(end_time - start_time) + 's')
```
the first query is about ~390ms slower than subsequent queries. this difference is also not dependent on query size. at first I thought this effect might be due to warmup of the db or the storage layer, but we can eliminate most of it by simply importing pandas prior to the first query (this is what I meant by "manual import"):

```python
import time
import connectorx as cx
import pandas as pd

db_url = '<db_path>'
query = 'SELECT * FROM table'

start_time = time.time()
cx.read_sql(db_url, query)
end_time = time.time()
print('query took ' + str(end_time - start_time) + 's')

start_time = time.time()
cx.read_sql(db_url, query)
end_time = time.time()
print('query took ' + str(end_time - start_time) + 's')

start_time = time.time()
cx.read_sql(db_url, query)
end_time = time.time()
print('query took ' + str(end_time - start_time) + 's')
```
with pandas imported up front, the ~390ms delay cuts down to ~35ms. afaict these results are also not unique to pandas: if you use the other output destinations, you see the same first-query delay.
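(as an aside, the "manual import" trick can even be moved off the critical path entirely by starting the import on a background thread at cli startup, so it overlaps argument parsing and other setup. plain python, nothing connector-x specific:)

```python
# warm up the heavy import in the background. python's import machinery is
# thread-safe, so a later `import pandas` on the main thread simply reuses
# (or waits on) this one
import threading

threading.Thread(target=lambda: __import__('pandas'), daemon=True).start()
```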
so I don't think server-side query execution is always the bottleneck. if the client is on the same machine as the db host (or if they are closely colocated), I think these tests suggest that dataframe library import times can be the largest bottleneck.

I also want to add the caveat that I don't want to burden you, and I don't know how much work it would be to implement a pure numpy solution without any dataframe libraries. but I am very curious what you think of all this. I do strongly believe that if connectorx had an option for a pure numpy output destination, it would be widely regarded as the best tool for cli applications and other applications built around short-lived queries. thanks
---
hi. connector-x is an absolute gem of a library. initial tests show a speedup of up to 20x over sqlalchemy for some of my queries.
I'm wondering if you have considered any export destinations beyond dataframes. something like primitive lists and dicts, or plain numpy arrays.
the main reason I ask is import times. I write a lot of cli tools that run sql queries. for cli tools, import times are a real killer, and dataframe libraries generally have very long import times compared to other packages. for example, on my decently fast machine connector-x takes 0.015s to import, whereas pandas takes around 0.250s. this is in addition to all of the other imports and data processing that a cli tool needs to do. cli tools have a significantly better user experience when they are quick and responsive, and this can be difficult in python depending on which libraries you're using.
I would love to switch all my sql-related code over from sqlalchemy to connector-x. however, sqlalchemy takes just 0.055s to import, so for small queries it's actually still faster than connector-x. this makes it difficult to judge which library should be used in which situation.
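one partial workaround is to defer the import into the function that actually runs a query, so help text and non-query code paths stay fast (plain python sketch; the first query still pays the full cost):

```python
# defer the connectorx import (and the pandas import it triggers) until a
# query is actually issued; repeat calls hit the sys.modules cache
def run_query(db_url, query):
    import connectorx as cx
    return cx.read_sql(db_url, query)
```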
being able to export to primitive lists and dicts (and even plain numpy arrays) as a destination would basically solve this problem.
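to be concrete about the output shape I mean, today the closest equivalent is converting the pandas result after the fact, which of course still pays the pandas import cost this proposal is trying to avoid:

```python
# converting after the fact gives the desired primitive shapes, but only
# after pandas has already been imported, so it doesn't help startup time
import connectorx as cx

df = cx.read_sql('<db_path>', 'SELECT * FROM table')
rows = df.to_dict(orient='records')     # list of {column: value} dicts
columns = df.to_dict(orient='list')     # {column: [values]} dict of lists
```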
another reason to consider using primitive types for exports is that dataframes are a somewhat heavy dependency. some projects might be open to adding connector-x as a dependency but not pandas
curious to hear your thoughts. thanks
Beta Was this translation helpful? Give feedback.