Fast Data Retrieval for Analytics Databases #27
Conversation
Yes, that could also make sense. I thought it wouldn't really make a difference, since we'd need a new requirement for both of them and, in my experience, most projects already have `pandas` installed.
`pyarrow` is already installed by `snowflake-connector-python`. In summary, arrow intends to be the common in-memory representation of tables/dataframes, enabling interoperability between different analytics applications. It has plenty of other advantages that the `snowflake-connector` is using, for example, to improve data transfer. And I'm more comfortable going that route than outright requiring pandas. pandas, for what you want to do, is only an intermediary... and a rather bad one at that, in my opinion.
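A minimal sketch of that "common in-memory representation" idea, assuming `pyarrow` and `pandas` are installed (the column name is made up):

```python
import pyarrow as pa

# Build a table once in arrow's columnar format ...
table = pa.table({"value": [1, 2, 3]})

# ... and hand it to different consumers; to_pandas() is zero-copy
# where the column types allow it.
df = table.to_pandas()
```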
Oh, didn't catch that! Then it makes perfect sense to just build on arrow. Our snowflake users won't have to install anything additional then.
src/datajudge/db_access.py
Outdated
```diff
@@ -9,6 +9,7 @@
 from typing import Callable, Sequence, final, overload

 import sqlalchemy as sa
+from snowflake.connector.cursor import SnowflakeCursor
```
Guard this import:

```python
try:
    from snowflake.connector.cursor import SnowflakeCursor
except ImportError:
    print("a useful message")
```
Btw, nowadays we have `ModuleNotFoundError`.
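A minimal sketch of the difference, reusing the guarded import from above; `ModuleNotFoundError` is a subclass of `ImportError`, so it catches only the missing-module case:

```python
try:
    from snowflake.connector.cursor import SnowflakeCursor
except ModuleNotFoundError:
    # Only raised when the module itself is absent; other import-time
    # failures still surface as a plain ImportError.
    SnowflakeCursor = None
```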
I removed the import statement, since it only served as a type hint for the user.
Then you should do:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from snowflake.connector.cursor import SnowflakeCursor
```
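Since the `if TYPE_CHECKING:` block is skipped at runtime, this is safe for local variable annotations (they are never evaluated); for module- or class-level annotations you would quote the name or use `from __future__ import annotations`. A sketch, with a hypothetical wrapper function around the cursor line discussed below:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only seen by static analysis tools; never executed at runtime.
    from snowflake.connector.cursor import SnowflakeCursor


def fetch_one_column(engine):  # hypothetical wrapper
    # The IDE/type checker knows this is a SnowflakeCursor,
    # but the annotation costs nothing at runtime.
    cursor: SnowflakeCursor = engine.connect().connection.cursor()
    return cursor
```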
I don't think I really get what you mean. I initially added the import to have a

```python
cursor: SnowflakeCursor = engine.connect().connection.cursor()
```

line, where the IDE then knows that the cursor has additional methods provided by snowflake. Since this was not needed anymore (not important enough), I removed it again, so it had no checking functionality. It should therefore be fine to just remove it.
src/datajudge/db_access.py
Outdated
```diff
@@ -650,14 +651,12 @@ def get_column(
     selection = sa.select([column])

     if is_snowflake(engine):  # check if we have a snowflake cursor
-        snowflake_cursor = engine.connect().connection.cursor()
+        snowflake_cursor: SnowflakeCursor = engine.connect().connection.cursor()

         # note: this step requires pandas to be installed
```
This comment is outdated, I believe.
yes
Is it, though? Currently it requires `pyarrow` *and* `pandas`, because we get the segfault otherwise. I'll change the comment to reflect that, though.
I wrote it before I was aware of the segfault without pandas.
Unfortunately, the snowflake connector causes a segmentation fault when pandas is not installed. I'll investigate this further.
Not much info on segmentation faults in the snowflake-connector-python repository.

👍
Update: Although we now may have an SQL-only solution for the initial motivation of this PR (performing the KS test, as mentioned in #23), I still think that there is value in allowing very fast data loading. Although this adds complexity to our code, it is, in my opinion, still justifiable because: (2) the snowflake-connector-python solution with

These are therefore the currently open sub-issues:
The approach concerning dependencies for interacting with different databases has so far been to leave the responsibility for having the appropriate drivers to the end user, and not to add them to the dependencies of the package. If you want to use Postgres, you have to manually install `psycopg2`. If this experiment gets merged, we have to add a note in the documentation concerning Snowflake usage and the need to manually install `pandas`.
I think you are right and this is the best way currently.
Also, the issue is now open in the snowflake-connector-python repository.
```python
result = engine.connect().execute(selection).scalars().all()

# snowflake-specific optimization iff pandas is installed
if is_snowflake(engine) and pandas_available:
```
```diff
-if is_snowflake(engine) and pandas_available:
+if is_snowflake(engine):
+    if not pandas_available:
+        # print a message and exit the snowflake code, returning to the sqlalchemy path
```
Concerning the message, this is an option: you delay using the flag you created until here. I split the if to make it clearer, but you can find a way to do this without splitting the if; you just have to check first whether pandas is available or not. See the sketch below.
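A runnable sketch of that shape, reusing the names from the diff (`is_snowflake`, `pandas_available`, `selection`, `engine`); `get_column_values` and `fetch_via_snowflake` are hypothetical stand-ins for the surrounding function and the fast path:

```python
def get_column_values(engine, selection):
    if is_snowflake(engine):
        if not pandas_available:
            # Inform the user and fall through to the generic path below.
            print(
                "pandas is not installed; falling back to the default "
                "sqlalchemy fetching path for Snowflake."
            )
        else:
            return fetch_via_snowflake(engine, selection)  # hypothetical fast path
    return engine.connect().execute(selection).scalars().all()
```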
I wanted to try that one as well; unfortunately, the error message would then be printed for each and every query.
Fair enough. Then you can try to import snowflake; if that succeeds, you can do the pandas check and inform the user in case it's not present.
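A sketch of that idea at module level, so the message is emitted once at import time rather than per query (the flag names are the ones used elsewhere in this PR):

```python
try:
    import snowflake.connector  # noqa: F401

    snowflake_available = True
except ModuleNotFoundError:
    snowflake_available = False

pandas_available = False
if snowflake_available:
    try:
        import pandas  # noqa: F401

        pandas_available = True
    except ModuleNotFoundError:
        # snowflake itself is usable, but the fast-fetch path needs pandas
        print("pandas is not installed; Snowflake fast fetching is disabled.")
```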
I have created a `check_module_installed(...)` helper, since it seems like this functionality might be needed more often in such a dynamic environment. Now we just do:

```python
from .utils import check_module_installed

snowflake_available = check_module_installed("snowflake")
pandas_available = check_module_installed("pandas")
```
@ivergara are you satisfied with this solution?
Sure.
After having implemented the KS test in SQL only, we don't have a pressing use case to get these changes implemented. The essential question at this point is: do we implement fast(er) loading for Snowflake only, or for other backends as well?
IMO Snowflake-only is fine.
Just Snowflake is fine, since the work is done already. For other backends, we'll see in the future.
Looks good to me. What's your opinion @kklein?
src/datajudge/utils.py
Outdated
```python
try:
    mod = importlib.import_module(module_name)
    return mod is not None
```
Why not just return `True`? Or can it happen that the module is not found and still not trigger the exception?
Nope, just checked, it always throws an error :)
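Putting the feedback together, the helper could then look like this (a sketch; only the name `check_module_installed` is from the thread above):

```python
import importlib


def check_module_installed(module_name: str) -> bool:
    """Return True iff `module_name` can be imported."""
    try:
        importlib.import_module(module_name)
        return True
    except ModuleNotFoundError:
        return False
```

An alternative would be `importlib.util.find_spec(module_name)`, which checks availability without actually importing the module.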
Since the main motivation to go for this solution has been solved differently (see #28), there is no pressing need to use native Snowflake communication. Thus, this PR is currently out of scope for the project and will be closed. Thank you for your effort @YYYasin19. Maybe the needs will change in the future and we'll find ourselves needing to do something that cannot be pushed to the engine; this exploration might then come in handy.
I agree, let's reevaluate this in the future, maybe even with other database engines supported as well. Additionally, I hope that some of the snowflake-connector-python issues will be solved by then.
Hey there,

after introducing a statistical test with #23, I now want to integrate a way to load a large batch of data (> 100,000,000 rows) within a reasonable amount of time and memory use. For this, snowflake's sqlalchemy connector already provides a convenient method that works surprisingly well. You can read more about it here.
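For context, a minimal sketch of this kind of fast fetch via the snowflake-connector-python cursor (`fetch_pandas_all()` requires `pyarrow` and `pandas`); the table and column names are hypothetical:

```python
# reach through sqlalchemy to the raw snowflake cursor
cursor = engine.connect().connection.cursor()
cursor.execute("SELECT some_column FROM some_table")

# bulk transfer via arrow instead of row-by-row fetching
df = cursor.fetch_pandas_all()
values = df["SOME_COLUMN"].to_list()  # snowflake upper-cases unquoted identifiers
```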
Since I dislike or am unsure about some changes here, this will be a draft first. Concretely speaking, the following things are unclear right now: