---
Hi @sslivkoff, thanks for the discussion and the information. ConnectorX mainly targets fetching large query results. It speeds up the process by optimizing the client-side execution as well as by saturating both network and machine resources through parallelism. However, when query execution on the database server is the bottleneck (e.g. query execution time is long but the result is relatively small), the improvement from ConnectorX becomes minor, and it can sometimes even be slower than sqlalchemy due to the overhead of fetching metadata (e.g. data types, number of rows in the result). I don't think supporting list/dict destinations would solve the problem, since that overhead still exists (please let me know if you think it is possible). As for pandas, ConnectorX internally allocates the corresponding numpy arrays needed by the pandas block manager and then converts them to a pandas dataframe at negligible cost.
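to illustrate that last point with a python-level sketch (the real allocation happens in ConnectorX's Rust core, so this is only an analogy, not the actual implementation): once a correctly-typed numpy array already holds the result, wrapping it in a dataframe copies nothing.

```python
# python-level analogy only: wrapping an already-filled numpy array in a
# DataFrame reuses the same memory instead of copying it
import numpy as np
import pandas as pd

values = np.zeros((1_000_000, 4), dtype='float64')  # pretend the driver filled this

df = pd.DataFrame(values, columns=['a', 'b', 'c', 'd'], copy=False)  # no data copy
print('df shares memory with source array:', np.shares_memory(df.values, values))
```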
---
hi. thanks for the response. what you're saying makes a lot of sense, and I realize that my suggestions might not align with the design goals of connector-x.

I believe you're correct that supporting additional destinations will not help with any server bottlenecks. it would however help with the client-side execution bottlenecks in the case of medium-sized queries to a local db. in this case the client (e.g. python) can still remain the main bottleneck, at least for the tests I ran.

in the tests I ran I was connecting to a local db and running queries of many sizes. in all cases I saw a substantial improvement using connector-x over sqlalchemy, as long as I manually imported pandas prior to running the connector-x query. but if I did not manually import pandas beforehand, then connector-x took longer. I tested this for each of the supported connector-x destinations and each gave similar results.

I am not familiar with the pandas block manager and I don't know rust, so I don't have much insight into the relevant constraints here. would it be nontrivial to avoid the pandas block manager and instead read the raw bytes using something like numpy's buffer interfaces?
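to make that concrete, here's a rough sketch of the kind of thing I'm imagining, using `np.frombuffer` as one possible mechanism (my assumption, not an existing connector-x API):

```python
# hypothetical sketch: if connector-x handed back raw column bytes, numpy
# could view them in place, with no copy and no dataframe library imported
import numpy as np

raw = (123).to_bytes(8, 'little') * 4      # stand-in for one int64 column's raw bytes
column = np.frombuffer(raw, dtype='<i8')   # zero-copy view over the buffer
print(column)                              # [123 123 123 123]
```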
---
hi. thanks for bearing with me. here are a couple of examples of the import bottlenecks I was running into. each of these examples should be run in a fresh python session to demonstrate the effects of imports. (i'm also just showing a single run of each example, but they show the same general trend when run multiple times)

### import timings

so just starting with a baseline of how long things take to import...

timing of connectorx import:

```python
import time
import sys

print('timing import of connectorx...')
start_time = time.time()
import connectorx as cx
end_time = time.time()
print('connectorx took ' + str(end_time - start_time) + 's to import')
print('pandas is implicitly imported by connectorx:', 'pandas' in sys.modules)
print('numpy is implicitly imported by connectorx:', 'numpy' in sys.modules)
```
timing of numpy import:

```python
import time

print('timing import of numpy...')
start_time = time.time()
import numpy as np
end_time = time.time()
print('numpy took ' + str(end_time - start_time) + 's to import')
```
timing of pandas import:

```python
import time

print('timing import of pandas...')
start_time = time.time()
import pandas as pd
end_time = time.time()
print('pandas took ' + str(end_time - start_time) + 's to import')
```
also note that once a package is imported, importing it again is trivially fast:

```python
import time

print('timing import of pandas...')
start_time = time.time()
import pandas as pd
end_time = time.time()
print('pandas took ' + str(end_time - start_time) + 's to import')

start_time = time.time()
import pandas as pd
end_time = time.time()
print('pandas took ' + str(end_time - start_time) + 's to import')
```
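(side note: a cleaner way to collect these numbers without restarting a session for every measurement is to time each import in a fresh interpreter via subprocess. a sketch; note that each number also includes python's own startup time, roughly tens of ms:)

```python
# time each import in a brand-new interpreter so sys.modules caching in the
# current session can't skew the result
import subprocess
import sys
import time

def time_fresh_import(module):
    start_time = time.time()
    subprocess.run([sys.executable, '-c', 'import ' + module], check=True)
    return time.time() - start_time

for module in ['connectorx', 'numpy', 'pandas', 'sqlalchemy']:
    print(module + ' took ' + str(time_fresh_import(module)) + 's to import')
```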
### query timings

now how long does it take to run a connectorx query? here I am running a query on a small sqlite db with about 100k rows:

```python
import time
import connectorx as cx

db_url = '<db_path>'
query = 'SELECT * FROM table'

start_time = time.time()
cx.read_sql(db_url, query)
end_time = time.time()
print('query took ' + str(end_time - start_time) + 's')

start_time = time.time()
cx.read_sql(db_url, query)
end_time = time.time()
print('query took ' + str(end_time - start_time) + 's')

start_time = time.time()
cx.read_sql(db_url, query)
end_time = time.time()
print('query took ' + str(end_time - start_time) + 's')
```
the first query is about ~390ms slower than subsequent queries. this difference is also not dependent on query size. at first I thought this effect might be due to warmup of the db or the storage layer, but we can eliminate most of it by simply importing pandas prior to the first query (this is what I meant by "manual import"):

```python
import time
import connectorx as cx
import pandas as pd

db_url = '<db_path>'
query = 'SELECT * FROM table'

start_time = time.time()
cx.read_sql(db_url, query)
end_time = time.time()
print('query took ' + str(end_time - start_time) + 's')

start_time = time.time()
cx.read_sql(db_url, query)
end_time = time.time()
print('query took ' + str(end_time - start_time) + 's')

start_time = time.time()
cx.read_sql(db_url, query)
end_time = time.time()
print('query took ' + str(end_time - start_time) + 's')
```
with pandas imported up front, the ~390ms delay cuts down to ~35ms. afaict these results are also not unique to pandas: if you use the other output destinations, you see the same first-query delay.
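(as an aside, the "manual import" trick can even be moved off the critical path entirely by starting the import on a background thread at cli startup, so it overlaps argument parsing and other setup. plain python, nothing connector-x specific:)

```python
# warm up the heavy import in the background. python's import machinery is
# thread-safe, so a later `import pandas` on the main thread simply reuses
# (or waits on) this one
import threading

threading.Thread(target=lambda: __import__('pandas'), daemon=True).start()
```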
so I don't think server-side query execution is always the bottleneck. if the client is on the same machine as the db host (or if they are closely colocated), I think these tests suggest that dataframe library import times can be the largest bottleneck.

I also want to add the caveat that I don't want to burden you, and I don't know how much work it would be to implement a pure numpy solution without any dataframe libraries. but I am very curious what you think of all this. I do strongly believe that if connectorx had an option for a pure numpy output destination, it would be widely regarded as the best tool for cli applications and other applications built around short-lived queries. thanks
---
hi. connector-x is an absolute gem of a library. initial tests show a speedup of up to 20x over sqlalchemy for some of my queries.
I'm wondering if you have considered any export destinations beyond dataframes. something like primitive lists and dicts, or plain numpy arrays.
the main reason I ask is import times. I write a lot of cli tools that run sql queries. for cli tools, import times are a real killer, and dataframe libraries generally have very long import times compared to other packages. for example, on my decently fast machine connector-x takes 0.015s to import, whereas pandas takes around 0.250s. this is in addition to all of the other imports and data processing that a cli tool needs to do. cli tools have a significantly better user experience when they are quick and responsive, and this can be difficult in python depending on which libraries you're using.
I would love to switch all my sql-related code over from sqlalchemy to connector-x. however, sqlalchemy takes just 0.055s to import, so for small queries it's actually still faster than connector-x. this makes it difficult to judge which library should be used in which situation.
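one partial workaround is to defer the import into the function that actually runs a query, so help text and non-query code paths stay fast (plain python sketch; the first query still pays the full cost):

```python
# defer the connectorx import (and the pandas import it triggers) until a
# query is actually issued; repeat calls hit the sys.modules cache
def run_query(db_url, query):
    import connectorx as cx
    return cx.read_sql(db_url, query)
```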
being able to export to primitive lists and dicts (and even plain numpy arrays) as a destination would basically solve this problem.
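to be concrete about the output shape I mean, today the closest equivalent is converting the pandas result after the fact, which of course still pays the pandas import cost this proposal is trying to avoid:

```python
# converting after the fact gives the desired primitive shapes, but only
# after pandas has already been imported, so it doesn't help startup time
import connectorx as cx

df = cx.read_sql('<db_path>', 'SELECT * FROM table')
rows = df.to_dict(orient='records')     # list of {column: value} dicts
columns = df.to_dict(orient='list')     # {column: [values]} dict of lists
```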
another reason to consider using primitive types for exports is that dataframes are a somewhat heavy dependency. some projects might be open to adding connector-x as a dependency but not pandas
curious to hear your thoughts. thanks
Beta Was this translation helpful? Give feedback.