Fast Data Retrieval for Analytics Databases #27


Closed
wants to merge 19 commits

Conversation

YYYasin19
Contributor

@YYYasin19 commented Jun 14, 2022

Hey there,

after introducing a statistical test with #23, I now want to integrate a way to load large batches of data (> 100,000,000 rows) within a reasonable amount of time and with reasonable memory use.

For this, Snowflake's SQLAlchemy connector already provides a convenient method that works surprisingly well. You can read more about it here.

Since I dislike or am unsure about some of the changes here, this will be a draft first. Concretely, the following points are currently unclear:

  • Is there a way to avoid using pandas as an intermediate representation? (It adds a dependency, the data needs to be transformed, etc.)
  • Can we only do this for Snowflake, or are there similar methods for other database systems as well?

@ivergara
Collaborator

My guess is that we should revolve around arrow tables and use the following method instead. This could also help.

I'd be fine using a pyarrow Table as an internal backing storage instead of a dataframe. And I mean it as a general intermediate step within datajudge.

@YYYasin19
Copy link
Contributor Author

> My guess is that we should revolve around arrow tables and use the following method instead. This could also help.

Yes, that could also make sense. I thought it wouldn't really make a difference, since we'd need a new requirement either way and, in my experience, most projects already have pandas installed for some reason.
Or is Arrow already used somewhere within the project?

@ivergara
Collaborator

>> My guess is that we should revolve around arrow tables and use the following method instead. This could also help.

> Yes, that could also make sense. I thought it wouldn't really make a difference, since we'd need a new requirement either way and, in my experience, most projects already have pandas installed for some reason. Or is Arrow already used somewhere within the project?

pyarrow is already installed by snowflake-connector

https://github.com/snowflakedb/snowflake-connector-python/blob/2ccd3c811958170854ff4cac159e7c125bf8b4f3/pyproject.toml#L2-L10

In summary, Arrow intends to be the common in-memory representation of tables/dataframes, enabling interoperability between different analytics applications. It has plenty of other advantages, which `snowflake-connector` is using, for example, to improve data transfer.

And I'm more comfortable going that route than outright requiring pandas. pandas, for what you want to do, is only an intermediary... and a rather bad one at that, in my opinion.

@YYYasin19
Contributor Author

> pyarrow is already installed by snowflake-connector

Oh, I didn't catch that! Then it makes perfect sense to just build on Arrow. None of our Snowflake users will have to install anything additional then.

@@ -9,6 +9,7 @@
from typing import Callable, Sequence, final, overload

import sqlalchemy as sa
+from snowflake.connector.cursor import SnowflakeCursor
Collaborator

Guard this import

try:
    from...
except ImportError:
    print("a useful message")
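For concreteness, a minimal sketch of such a guard, using the SnowflakeCursor import added in this diff (the fallback sentinel and the message text are illustrative, not the PR's actual code):

try:
    from snowflake.connector.cursor import SnowflakeCursor
except ImportError:
    # sentinel so later code can check whether the connector is available
    SnowflakeCursor = None
    print(
        "snowflake-connector-python is not installed; "
        "Snowflake-specific optimizations are disabled."
    )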

Contributor

Btw nowadays we have ModuleNotFoundError
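A hypothetical variant of the guard using the more specific exception; ModuleNotFoundError is a subclass of ImportError and has existed since Python 3.6:

try:
    from snowflake.connector.cursor import SnowflakeCursor
except ModuleNotFoundError:
    SnowflakeCursor = None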

Contributor Author

I removed the import statement, since it was only there as a type hint for the user.

Collaborator

Then you should do

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from ....
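For illustration, a concrete sketch of that pattern for this case (the helper name and annotation usage are hypothetical, not the PR's code):

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # imported only during static type checking, never at runtime
    from snowflake.connector.cursor import SnowflakeCursor

def get_snowflake_cursor(engine) -> "SnowflakeCursor":
    # the string annotation avoids needing the import at runtime
    return engine.connect().connection.cursor()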

Contributor Author

I don't think I really get what you mean.

I initially added the import to have a

cursor: SnowflakeCursor = engine.connect().connection.cursor()

line, where the IDE then knows that cursor has additional methods provided by Snowflake.

Since this was no longer needed (not important enough), I removed it again, so it served no type-checking purpose. It should be fine to just leave it out.

@@ -650,14 +651,12 @@ def get_column(
selection = sa.select([column])

if is_snowflake(engine): # check if we have a snowflake cursor
-snowflake_cursor = engine.connect().connection.cursor()
+snowflake_cursor: SnowflakeCursor = engine.connect().connection.cursor()

# note: this step requires pandas to be installed
Collaborator

This comment is outdated I believe.

Collaborator

yes

Contributor Author

Is it, though? Currently, it requires pyarrow and pandas, because we get the segfault otherwise. I'll change it to reflect that, though.

Collaborator

I wrote it before I was aware of the segfault without pandas.

@YYYasin19
Contributor Author

Unfortunately, the snowflake connector is causing a segmentation fault when pandas is not installed.
The segmentation fault occurs in the snowflake-connector-python code, precisely when yielding from the iterator within the ThreadPoolExecutor.

I'll investigate this further.

@ivergara
Collaborator

ivergara commented Jun 20, 2022

> Unfortunately, the snowflake connector is causing a segmentation fault when pandas is not installed. The segmentation fault occurs in the snowflake-connector-python code, precisely when yielding from the iterator within the ThreadPoolExecutor.

There's not much info on segmentation faults in snowflake-connector-python; this and this might be relevant, where it's "solved" by installing pandas.

> I'll investigate this further.

👍

@YYYasin19 self-assigned this Jun 21, 2022
@YYYasin19 added the enhancement (New feature or request) label Jun 21, 2022
@YYYasin19
Contributor Author

YYYasin19 commented Jun 22, 2022

Update: Although we may now have an SQL-only solution for the initial motivation of this PR (performing the KS test, as mentioned in #23), I still think that there is value in allowing very fast data loading.

Although this adds complexity to our code, it is, in my opinion, still justifiable because:

(1) we do it at a highly centralised place in our code: the get_column method is called by practically every data retrieval method. Our current solution is just a hook into this pipeline step (see the sketch below). In the future, we could add other hooks on demand, e.g., for PostgreSQL, without changing any structural elements.

(2) the snowflake-connector-python solution with .fetch_arrow_all() adds virtually no overhead even for small sample sizes; tested with a quick timeit run.
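For illustration, a minimal sketch of what this hook could look like (is_snowflake and pandas_available are names used elsewhere in this PR; the query-compilation details are assumptions, not the PR's exact code):

import sqlalchemy as sa

def get_column(engine, column):
    selection = sa.select([column])
    if is_snowflake(engine) and pandas_available:
        # fast path: let the Snowflake cursor fetch the result as a pyarrow.Table
        cursor = engine.connect().connection.cursor()
        compiled = selection.compile(engine, compile_kwargs={"literal_binds": True})
        cursor.execute(str(compiled))
        arrow_table = cursor.fetch_arrow_all()
        return arrow_table.column(0).to_pylist()
    # default path: plain sqlalchemy, row-by-row retrieval
    return engine.connect().execute(selection).scalars().all()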

The following sub-issues are currently open:

  • I'd like to include a fast check of whether pandas and pyarrow are installed, ideally run only once, after which the code just works.
  • pandas and pyarrow should only be included as optional dependencies, i.e., not be included in the conda recipe.
  • Open an issue at snowflake-connector-python to solve the segfault when pandas is not installed.

@ivergara
Collaborator

The approach concerning dependencies for interacting with different databases has, until now, been to leave the responsibility of having the appropriate drivers to the end user and not to add them to the dependencies of the package. If you want to use Postgres, you have to manually install psycopg2, for example. The same goes for Snowflake. Once snowflake-connector-python is installed, it also installs pyarrow; however, it doesn't install pandas.

If this experiment gets merged, we have to add a note in the documentation concerning Snowflake use and the need to manually install pandas.

@YYYasin19
Contributor Author

> If this experiment gets merged, we have to add a note in the documentation concerning Snowflake use and the need to manually install pandas.

I think you are right, and this is currently the best way.
I have added a warning message, though, so that Snowflake users are aware that the optimizations won't run when pandas is not available.
The only issue is that I don't know how to show the message to Snowflake users only.

@YYYasin19
Contributor Author

Also, the issue is now open here in snowflake-connector-python.

result = engine.connect().execute(selection).scalars().all()

# snowflake-specific optimization iff pandas is installed
if is_snowflake(engine) and pandas_available:
Collaborator

@ivergara Jun 23, 2022

Suggested change:
-if is_snowflake(engine) and pandas_available:
+if is_snowflake(engine):
+    if not pandas_available:
+        # print message and exit the snowflake code, returning to the sqlalchemy path

Concerning the message, this is one option: you delay using the flag you created until here.

I did split the if to make it clearer, but you can find a way to do this without splitting the if; you just have to check first whether pandas is available.

Contributor Author

I wanted to try that one as well; unfortunately, the error message would then be printed for each and every query.

Collaborator

Fair enough. Then you can try to import snowflake; if that succeeds, you can do the pandas check and inform the user in case it's not present.

Contributor Author

I have created a check_module_installed(...) helper, since it seems like this functionality might be needed more often in such a dynamic environment.

Now, we just do

from .utils import check_module_installed

snowflake_available = check_module_installed("snowflake")
pandas_available = check_module_installed("pandas")
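A sketch of how this helper might be implemented, based on the importlib snippet reviewed further below (the except branch is an assumption):

import importlib

def check_module_installed(module_name: str) -> bool:
    # attempt the import; importlib raises if the module is unavailable
    try:
        mod = importlib.import_module(module_name)
        return mod is not None
    except ImportError:
        return False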

Contributor Author

@ivergara are you satisfied with this solution?

Collaborator

Sure.

@YYYasin19 marked this pull request as ready for review June 29, 2022 14:59
@YYYasin19
Contributor Author

After having implemented the KS test in SQL only, we don't have a pressing use case for getting these changes merged.
I still think that this is valuable and, in addition, it does not affect a lot of our code base.

The essential question at this point is: do we implement fast(er) loading for mssql and postgres as well, or are we satisfied with just Snowflake for now, maybe until someone opens an issue for another backend?

@jonashaag
Contributor

jonashaag commented Jun 29, 2022

IMO Snowflake-only is fine

@jonashaag closed this Jun 29, 2022
@jonashaag reopened this Jun 29, 2022
@ivergara
Collaborator

> The essential question at this point is: do we implement fast(er) loading for mssql and postgres as well, or are we satisfied with just Snowflake for now, maybe until someone opens an issue for another backend?

Just Snowflake is fine, since the work is already done. For other backends, we'll see in the future.

Collaborator

@ivergara left a comment

Looks good to me. What's your opinion @kklein?


try:
    mod = importlib.import_module(module_name)
    return mod is not None
Collaborator

Why not just return True? Or can it happen that the module is not found and still not trigger the exception?

Contributor Author

Nope, just checked, it always throws an error :)

result = engine.connect().execute(selection).scalars().all()

# snowflake-specific optimization iff pandas is installed
if is_snowflake(engine) and pandas_available:
Collaborator

Sure.

@ivergara requested a review from kklein June 29, 2022 21:23
@ivergara
Collaborator

Since the main motivation for this solution has been addressed differently (see #28), there is no pressing need to use native Snowflake communication. Thus, this PR is currently out of scope for the project and will be closed.

Thank you for your effort @YYYasin19. Maybe the needs will change in the future and we'll find ourselves needing to do something that cannot be pushed to the engine; then this exploration might come in handy.

@ivergara closed this Jul 20, 2022
@YYYasin19
Contributor Author

I agree, let's re-evaluate this in the future, maybe even with other database engines supported as well.

Additionally, I hope that some of the snowflake-connector-python issues will have been solved by then.
