Fast Data Retrieval for Analytics Databases #27

Closed
wants to merge 19 commits into from
11 changes: 11 additions & 0 deletions docs/source/installation.rst
@@ -12,3 +12,14 @@ or from a conda environment
::

    conda install datajudge -c conda-forge



Snowflake
^^^^^^^^^

If your backend is ``snowflake`` and you are querying large datasets,
you can additionally install ``pandas`` to enable a much faster query loading path
(up to 50x speedup for large datasets).
Note: The ``pandas`` requirement stems from a bug in ``snowflake-connector-python``
and should no longer be needed in the future.
1 change: 1 addition & 0 deletions environment.yml
@@ -3,6 +3,7 @@ channels:
- conda-forge
- nodefaults
dependencies:
- pandas
- python>=3.8
- pytest
- pytest-cov
27 changes: 26 additions & 1 deletion src/datajudge/db_access.py
@@ -3,6 +3,7 @@
import functools
import json
import operator
import warnings
from abc import ABC, abstractmethod
from collections import Counter
from dataclasses import dataclass
@@ -11,6 +12,17 @@
import sqlalchemy as sa
from sqlalchemy.sql.expression import FromClause

from .utils import check_module_installed

snowflake_available = check_module_installed("snowflake")
pandas_available = check_module_installed("pandas")


if snowflake_available and not pandas_available:
    warnings.warn(
        "For snowflake users: `pandas` is not installed, that means optimized data loading is not available."
    )


def is_mssql(engine: sa.engine.Engine) -> bool:
    return engine.name == "mssql"
@@ -648,7 +660,20 @@ def get_column(

     if not aggregate_operator:
         selection = sa.select([column])
-        result = engine.connect().execute(selection).scalars().all()
+
+        # snowflake-specific optimization iff pandas is installed
+        if is_snowflake(engine) and pandas_available:
Collaborator

@ivergara ivergara Jun 23, 2022

Suggested change
- if is_snowflake(engine) and pandas_available:
+ if is_snowflake(engine):
+     if not pandas_available:
+         print message and exit the snowflake code returning to sqlalchemy path.

Concerning the message, this is one option: you delay using the flag you created until here.

I split the if to make it clearer, but you can do this without splitting it; you just have to check first whether pandas is available.

Contributor Author

I wanted to try that as well. Unfortunately, the message would then be printed for each and every query.
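Editorial sketch, not part of this PR: one way to keep the check inside the query path without repeating the message is to route it through a cached helper, so that the warning is emitted only on the first call. The names `warn_once` and `load_column` are hypothetical stand-ins; `load_column` only mimics the branching of `get_column` and returns a label instead of running a query.

```python
import functools
import warnings


@functools.lru_cache(maxsize=None)
def warn_once(message: str) -> None:
    """Emit `message` as a warning on the first call only.

    Later calls with the same message hit the cache, so the body
    (and therefore warnings.warn) is not executed again.
    """
    warnings.warn(message)


def load_column(use_snowflake: bool, pandas_available: bool) -> str:
    """Hypothetical stand-in for get_column: returns which path was taken."""
    if use_snowflake:
        if pandas_available:
            return "arrow"
        # Falls through to the generic path after warning (at most once).
        warn_once("pandas is not installed; optimized data loading is not available.")
    return "sqlalchemy"
```

With this shape the two branches stay split as suggested in the review, and repeated queries do not repeat the message. The PR itself resolves the issue differently, with a module-level check at import time.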

Collaborator

Fair enough. Then you can try to import snowflake; if that succeeds, you can do the pandas check and inform the user in case it's not present.

Contributor Author

I have created a check_module_installed(...) since it seems like this functionality might be needed more often in such a dynamic environment.

Now, we just do

from .utils import check_module_installed

snowflake_available = check_module_installed("snowflake")
pandas_available = check_module_installed("pandas")

Contributor Author

@ivergara are you satisfied with this solution?

Collaborator

Sure.

+            snowflake_cursor = engine.connect().connection.cursor()
+
+            # note: in addition to pyarrow, this currently requires pandas as well
+            pa_table = snowflake_cursor.execute(str(selection)).fetch_arrow_all()
+            if pa_table:  # snowflake connector returns None when the table is empty
+                result = pa_table.column(0).to_numpy()
+            else:
+                result = []
+
+        else:
+            result = engine.connect().execute(selection).scalars().all()
 
     else:
         selection = sa.select([aggregate_operator(column)])
8 changes: 8 additions & 0 deletions src/datajudge/utils.py
@@ -0,0 +1,8 @@
def check_module_installed(module_name: str) -> bool:
    import importlib

    try:
        mod = importlib.import_module(module_name)
        return mod is not None
Collaborator

Why not just return True? Or can it happen that the module is not found and still not trigger the exception?

Contributor Author

Nope, just checked, it always throws an error :)

    except ModuleNotFoundError:
        return False
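
Editorial sketch, not part of this PR: for a top-level module name, the same yes/no answer can be obtained without importing (and thus executing) the module, using the standard library's `importlib.util.find_spec`, which returns `None` when the module cannot be located. The function name `module_available` is hypothetical.

```python
import importlib.util


def module_available(module_name: str) -> bool:
    """Return True if a top-level module can be located on the import path.

    Unlike importlib.import_module, find_spec does not execute the
    module itself, so the check has no import side effects for
    top-level names.
    """
    return importlib.util.find_spec(module_name) is not None
```

This variant is only meant for top-level names such as `pandas`. For dotted names, `find_spec` imports the parent packages first and can raise `ModuleNotFoundError` if a parent is missing, so the `try`/`except` form used in the PR would still be needed there.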