Fast Data Retrieval for Analytics Databases #27


Closed
wants to merge 19 commits

Conversation

YYYasin19
Contributor

@YYYasin19 commented Jun 14, 2022

Hey there,

after introducing a statistical test with #23, I now want to integrate a way to load large batches of data (> 100,000,000 rows) within a reasonable amount of time and with reasonable memory use.

For this, Snowflake's SQLAlchemy connector already provides a convenient method that works surprisingly well. You can read more about it here.

Since I dislike or am unsure about some of the changes here, this will be a draft first. Concretely, the following points are currently unclear:

  • Is there a way to avoid using pandas as an intermediate representation? (It adds a dependency, the data needs to be transformed, etc.)
  • Can we only do this for Snowflake, or are there similar methods for other database systems as well?

@ivergara
Collaborator

My guess is that we should revolve around arrow tables and use the following method instead. This could also help.

I'd be fine using a pyarrow Table as an internal backing storage instead of a dataframe. And I mean it as a general intermediate step within datajudge.

@YYYasin19
Copy link
Contributor Author

> My guess is that we should revolve around arrow tables and use the following method instead. This could also help.

Yes, that could also make sense. I thought it wouldn't really make a difference, since we'd need a new requirement either way and, in my experience, most projects already have pandas installed for some reason.
Or is Arrow already used somewhere within the project?

@ivergara
Collaborator

>> My guess is that we should revolve around arrow tables and use the following method instead. This could also help.

> Yes, that could also make sense. I thought it wouldn't really make a difference, since we'd need a new requirement either way and, in my experience, most projects already have pandas installed for some reason. Or is Arrow already used somewhere within the project?

pyarrow is already installed by snowflake-connector

https://github.com/snowflakedb/snowflake-connector-python/blob/2ccd3c811958170854ff4cac159e7c125bf8b4f3/pyproject.toml#L2-L10

In summary, Arrow intends to be the common in-memory representation of tables/dataframes, enabling interoperability between different analytics applications. It has plenty of other advantages, which `snowflake-connector` is using, for example, to improve data transfer.

And I'm more comfortable going that route than outright requiring pandas. pandas, for what you want to do, is only an intermediary... and a rather bad one at that, in my opinion.

@YYYasin19
Contributor Author

> pyarrow is already installed by snowflake-connector

Oh, I didn't catch that! Then it makes perfect sense to just build on Arrow. None of our Snowflake users will have to install anything additional then.

@@ -9,6 +9,7 @@
from typing import Callable, Sequence, final, overload

import sqlalchemy as sa
+from snowflake.connector.cursor import SnowflakeCursor
Collaborator

Guard this import

try:
    from...
except ImportError:
    print("a useful message")
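For concreteness, a minimal sketch of such a guard, using the SnowflakeCursor import added in this diff (the fallback sentinel and the message text are illustrative, not the PR's actual code):

try:
    from snowflake.connector.cursor import SnowflakeCursor
except ImportError:
    # sentinel so later code can check whether the connector is available
    SnowflakeCursor = None
    print(
        "snowflake-connector-python is not installed; "
        "Snowflake-specific optimizations are disabled."
    )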

Contributor

Btw nowadays we have ModuleNotFoundError
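A hypothetical variant of the guard using the more specific exception; ModuleNotFoundError is a subclass of ImportError and has existed since Python 3.6:

try:
    from snowflake.connector.cursor import SnowflakeCursor
except ModuleNotFoundError:
    SnowflakeCursor = None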

Contributor Author

I removed the import statement, since it was only there as a type hint for the user.

Collaborator

Then you should do

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from ....
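For illustration, a concrete sketch of that pattern for this case (the helper name and annotation usage are hypothetical, not the PR's code):

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # imported only during static type checking, never at runtime
    from snowflake.connector.cursor import SnowflakeCursor

def get_snowflake_cursor(engine) -> "SnowflakeCursor":
    # the string annotation avoids needing the import at runtime
    return engine.connect().connection.cursor()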

Contributor Author

I don't think I really get what you mean.

I initially added the import to have a

cursor: SnowflakeCursor = engine.connect().connection.cursor()

line, where the IDE then knows that cursor has additional methods provided by Snowflake.

Since this was no longer needed (not important enough), I removed it again, so it served no type-checking purpose. It should be fine to just leave it out.

@@ -650,14 +651,12 @@ def get_column(
selection = sa.select([column])

if is_snowflake(engine): # check if we have a snowflake cursor
-snowflake_cursor = engine.connect().connection.cursor()
+snowflake_cursor: SnowflakeCursor = engine.connect().connection.cursor()

# note: this step requires pandas to be installed
Collaborator

This comment is outdated I believe.

Collaborator

yes

Contributor Author

Is it, though? Currently, it requires pyarrow and pandas, because we get the segfault otherwise. I'll change it to reflect that, though.

Collaborator

I wrote it before I was aware of the segfault without pandas.

@YYYasin19
Contributor Author

Unfortunately, the snowflake connector is causing a segmentation fault when pandas is not installed.
The segmentation fault occurs in the snowflake-connector-python code, precisely when yielding from the iterator within the ThreadPoolExecutor.

I'll investigate this further.

@ivergara
Collaborator

ivergara commented Jun 20, 2022

> Unfortunately, the snowflake connector is causing a segmentation fault when pandas is not installed. The segmentation fault occurs in the snowflake-connector-python code, precisely when yielding from the iterator within the ThreadPoolExecutor.

There's not much info on segmentation faults in snowflake-connector-python; this and this might be relevant, where it's "solved" by installing pandas.

> I'll investigate this further.

👍

@YYYasin19 self-assigned this Jun 21, 2022
@YYYasin19 added the enhancement (New feature or request) label Jun 21, 2022
@YYYasin19
Contributor Author

YYYasin19 commented Jun 22, 2022

Update: Although we may now have an SQL-only solution for the initial motivation of this PR (performing the KS test, as mentioned in #23), I still think that there is value in allowing very fast data loading.

Although this adds complexity to our code, it is, in my opinion, still justifiable because:

(1) we do it at a highly centralised place in our code: the get_column method is called by practically every data retrieval method. Our current solution is just a hook into this pipeline step (see the sketch below). In the future, we could add other hooks on demand, e.g., for PostgreSQL, without changing any structural elements.

(2) the snowflake-connector-python solution with .fetch_arrow_all() adds virtually no overhead even for small sample sizes; tested with a quick timeit run.
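For illustration, a minimal sketch of what this hook could look like (is_snowflake and pandas_available are names used elsewhere in this PR; the query-compilation details are assumptions, not the PR's exact code):

import sqlalchemy as sa

def get_column(engine, column):
    selection = sa.select([column])
    if is_snowflake(engine) and pandas_available:
        # fast path: let the Snowflake cursor fetch the result as a pyarrow.Table
        cursor = engine.connect().connection.cursor()
        compiled = selection.compile(engine, compile_kwargs={"literal_binds": True})
        cursor.execute(str(compiled))
        arrow_table = cursor.fetch_arrow_all()
        return arrow_table.column(0).to_pylist()
    # default path: plain sqlalchemy, row-by-row retrieval
    return engine.connect().execute(selection).scalars().all()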

The following sub-issues are currently open:

  • I'd like to include a fast check of whether pandas and pyarrow are installed, ideally run only once, after which the code just works.
  • pandas and pyarrow should only be included as optional dependencies, i.e., not be included in the conda recipe.
  • Open an issue at snowflake-connector-python to solve the segfault when pandas is not installed.

@ivergara
Collaborator

The approach concerning dependencies for interacting with different databases has, until now, been to leave the responsibility of having the appropriate drivers to the end user and not to add them to the dependencies of the package. If you want to use Postgres, you have to manually install psycopg2, for example. The same goes for Snowflake. Once snowflake-connector-python is installed, it also installs pyarrow; however, it doesn't install pandas.

If this experiment gets merged, we have to add a note in the documentation concerning Snowflake use and the need to manually install pandas.

@YYYasin19
Contributor Author

> If this experiment gets merged, we have to add a note in the documentation concerning Snowflake use and the need to manually install pandas.

I think you are right, and this is currently the best way.
I have added a warning message, though, so that Snowflake users are aware that the optimizations won't run when pandas is not available.
The only issue is that I don't know how to show the message to Snowflake users only.

@YYYasin19
Contributor Author

Also, the issue is now open here in snowflake-connector-python.

result = engine.connect().execute(selection).scalars().all()

# snowflake-specific optimization iff pandas is installed
if is_snowflake(engine) and pandas_available:
Collaborator

@ivergara Jun 23, 2022

Suggested change:
-if is_snowflake(engine) and pandas_available:
+if is_snowflake(engine):
+    if not pandas_available:
+        # print message and exit the snowflake code, returning to the sqlalchemy path

Concerning the message, this is one option: you delay using the flag you created until here.

I did split the if to make it clearer, but you can find a way to do this without splitting the if; you just have to check first whether pandas is available.

Contributor Author

I wanted to try that one as well; unfortunately, the error message would then be printed for each and every query.

Collaborator

Fair enough. Then you can try to import snowflake; if that succeeds, you can do the pandas check and inform the user in case it's not present.

Contributor Author

I have created a check_module_installed(...) helper, since it seems like this functionality might be needed more often in such a dynamic environment.

Now, we just do

from .utils import check_module_installed

snowflake_available = check_module_installed("snowflake")
pandas_available = check_module_installed("pandas")
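A sketch of how this helper might be implemented, based on the importlib snippet reviewed further below (the except branch is an assumption):

import importlib

def check_module_installed(module_name: str) -> bool:
    # attempt the import; importlib raises if the module is unavailable
    try:
        mod = importlib.import_module(module_name)
        return mod is not None
    except ImportError:
        return False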

Contributor Author

@ivergara are you satisfied with this solution?

Collaborator

Sure.

@YYYasin19 marked this pull request as ready for review June 29, 2022 14:59
@YYYasin19
Contributor Author

After having implemented the KS test in SQL only, we don't have a pressing use case for getting these changes merged.
I still think that this is valuable and, in addition, it does not affect a lot of our code base.

The essential question at this point is: do we implement fast(er) loading for mssql and postgres as well, or are we satisfied with just Snowflake for now, maybe until someone opens an issue for another backend?

@jonashaag
Contributor

jonashaag commented Jun 29, 2022

IMO Snowflake-only is fine

@jonashaag closed this Jun 29, 2022
@jonashaag reopened this Jun 29, 2022
@ivergara
Collaborator

> The essential question at this point is: do we implement fast(er) loading for mssql and postgres as well, or are we satisfied with just Snowflake for now, maybe until someone opens an issue for another backend?

Just Snowflake is fine, since the work is already done. For other backends, we'll see in the future.

Collaborator

@ivergara left a comment

Looks good to me. What's your opinion @kklein?


try:
    mod = importlib.import_module(module_name)
    return mod is not None
Collaborator

Why not just return True? Or can it happen that the module is not found and still not trigger the exception?

Contributor Author

Nope, just checked, it always throws an error :)

result = engine.connect().execute(selection).scalars().all()

# snowflake-specific optimization iff pandas is installed
if is_snowflake(engine) and pandas_available:
Collaborator

Sure.

@ivergara requested a review from kklein June 29, 2022 21:23
@ivergara
Collaborator

Since the main motivation for this solution has been addressed differently (see #28), there is no pressing need to use native Snowflake communication. Thus, this PR is currently out of scope for the project and will be closed.

Thank you for your effort @YYYasin19. Maybe the needs will change in the future and we'll find ourselves needing to do something that cannot be pushed to the engine; then this exploration might come in handy.

@ivergara closed this Jul 20, 2022
@YYYasin19
Contributor Author

I agree, let's re-evaluate this in the future, maybe even with other database engines supported as well.

Additionally, I hope that some of the snowflake-connector-python issues will have been solved by then.
