PyAthenaJDBC is a Python DB API 2.0 (PEP 249) compliant wrapper for Amazon Athena JDBC driver.
- Python
- CPython 2,7, 3,4, 3.5, 3.6, 3.7
- Java
- Java >= 8 (JDBC 4.2)
Version | JDBC driver version | Vendor |
---|---|---|
< 2.0.0 | == 1.1.0 | AWS (Early released JDBC driver. It is incompatible with Simba's JDBC driver) |
>= 2.0.0 | >= 2.0.5 | Simba |
$ pip install PyAthenaJDBC
Extra packages:
Package | Install command | Version |
---|---|---|
Pandas | pip install PyAthenaJDBC[Pandas] |
>=0.19.0 |
SQLAlchemy | pip install PyAthenaJDBC[SQLAlchemy] |
>=1.0.0, <2.0.0 |
from pyathenajdbc import connect
conn = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
region_name='us-west-2')
try:
with conn.cursor() as cursor:
cursor.execute("""
SELECT * FROM one_row
""")
print(cursor.description)
print(cursor.fetchall())
finally:
conn.close()
from pyathenajdbc import connect
conn = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
region_name='us-west-2')
try:
with conn.cursor() as cursor:
cursor.execute("""
SELECT * FROM many_rows LIMIT 10
""")
for row in cursor:
print(row)
finally:
conn.close()
Supported DB API paramstyle is only PyFormat
.
PyFormat
only supports named placeholders with old %
operator style and parameters specify dictionary format.
from pyathenajdbc import connect
conn = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
region_name='us-west-2')
try:
with conn.cursor() as cursor:
cursor.execute("""
SELECT col_string FROM one_row_complex
WHERE col_string = %(param)s
""", {'param': 'a string'})
print(cursor.fetchall())
finally:
conn.close()
if %
character is contained in your query, it must be escaped with %%
like the following:
SELECT col_string FROM one_row_complex
WHERE col_string = %(param)s OR col_string LIKE 'a%%'
In the connect method or connection object, you can specify JVM options with a string array.
You can increase the JVM heap size like the following:
from pyathenajdbc import connect
conn = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
region_name='us-west-2',
jvm_options=['-Xms1024m', '-Xmx4096m'])
try:
with conn.cursor() as cursor:
cursor.execute("""
SELECT * FROM many_rows
""")
print(cursor.fetchall())
finally:
conn.close()
If you want to use JDBC 4.1, download the corresponding JDBC driver
and specify the path of the downloaded JDBC driver as the argument driver_path
of the connect method or connection object.
- The AthenaJDBC41-2.0.7.jar is compatible with JDBC 4.1 and requires JDK 7.0 or later.
from pyathenajdbc import connect
conn = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
region_name='us-west-2',
driver_path='/path/to/AthenaJDBC41_2.0.7.jar')
The connect method or connection object pass keyword arguments as options to the JDBC driver. If you want to change the behavior of the JDBC driver, specify the option as a keyword argument in the connect method or connection object.
from pyathenajdbc import connect
conn = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
region_name='us-west-2',
LogPath='/path/to/pyathenajdbc/log/',
LogLevel='6')
For details of the JDBC driver options refer to the official documentation.
NOTE: Option names and values are case-sensitive. The option value is specified as a character string.
If you want to specify where athena stores the txt
/cxv
and txt.metadata
/csv.metadata
files containing the result of each query, you can specify it as follows:
from pyathenajdbc import connect
conn = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
region_name='us-west-2',
LogPath='/path/to/pyathenajdbc/log/',
LogLevel='6',
S3OutputLocation='s3://YOUR_S3_BUCKET/path/to/query_results/')
For details see the Athena Documentation:
Install SQLAlchemy with pip install SQLAlchemy>=1.0.0
or pip install PyAthenaJDBC[SQLAlchemy]
.
Supported SQLAlchemy is 1.0.0 or higher and less than 2.0.0.
import contextlib
from urllib.parse import quote_plus # PY2: from urllib import quote_plus
from sqlalchemy.engine import create_engine
from sqlalchemy.sql.expression import select
from sqlalchemy.sql.functions import func
from sqlalchemy.sql.schema import Table, MetaData
conn_str = 'awsathena+jdbc://{access_key}:{secret_key}@athena.{region_name}.amazonaws.com:443/'\
'{schema_name}?s3_staging_dir={s3_staging_dir}'
engine = create_engine(conn_str.format(
access_key=quote_plus('YOUR_ACCESS_KEY'),
secret_key=quote_plus('YOUR_SECRET_ACCESS_KEY'),
region_name='us-west-2',
schema_name='default',
s3_staging_dir=quote_plus('s3://YOUR_S3_BUCKET/path/to/')))
try:
with contextlib.closing(engine.connect()) as conn:
many_rows = Table('many_rows', MetaData(bind=engine), autoload=True)
print(select([func.count('*')], from_obj=many_rows).scalar())
finally:
engine.dispose()
The connection string has the following format:
awsathena+jdbc://{access_key}:{secret_key}@athena.{region_name}.amazonaws.com:443/{schema_name}?s3_staging_dir={s3_staging_dir}&driver_path={driver_path}&...
If you do not specify access_key
and secret_key
using instance profile or boto3 configuration file:
awsathena+jdbc://:@athena.{region_name}.amazonaws.com:443/{schema_name}?s3_staging_dir={s3_staging_dir}&driver_path={driver_path}&...
NOTE: s3_staging_dir
requires quote. If access_key
, secret_key
and other parameter contain special characters, quote is also required.
Minimal example for Pandas DataFrame:
from pyathenajdbc import connect
import pandas as pd
conn = connect(access_key='YOUR_ACCESS_KEY_ID',
secret_key='YOUR_SECRET_ACCESS_KEY',
s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
region_name='us-west-2',
jvm_path='/path/to/jvm') # optional, as used by JPype
df = pd.read_sql("SELECT * FROM many_rows LIMIT 10", conn)
As Pandas DataFrame:
import contextlib
from pyathenajdbc import connect
from pyathenajdbc.util import as_pandas
with contextlib.closing(
connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/'
region_name='us-west-2'))) as conn:
with conn.cursor() as cursor:
cursor.execute("""
SELECT * FROM many_rows
""")
df = as_pandas(cursor)
print(df.describe())
Redash query runner example
Support AWS CLI credentials, Properties file credentials and AWS credentials provider chain.
~/.aws/credentials
[default]
aws_access_key_id=YOUR_ACCESS_KEY_ID
aws_secret_access_key=YOUR_SECRET_ACCESS_KEY
~/.aws/config
[default]
region=us-west-2
output=json
$ export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
$ export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
$ export AWS_DEFAULT_REGION=us-west-2
Additional environment variable:
$ export AWS_ATHENA_S3_STAGING_DIR=s3://YOUR_S3_BUCKET/path/to/
Create a property file of the following format.
/path/to/AWSCredentials.properties
accessKeyId:YOUR_ACCESS_KEY_ID
secretKey:YOUR_SECRET_ACCESS_KEY
Specify the property file path with credential_file
of the connect method or connection object.
from pyathenajdbc import connect
conn = connect(credential_file='/path/to/AWSCredentials.properties',
s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
region_name='us-west-2')
PyAthenaJDBC uses the property file to authenticate Amazon Athena.
See AWS credentials provider chain
AWS credentials provider chain that looks for credentials in this order:
- Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (RECOMMENDED since they are recognized by all the AWS SDKs and CLI except for .NET), or AWS_ACCESS_KEY and AWS_SECRET_KEY (only recognized by Java SDK)
- Java System Properties - aws.accessKeyId and aws.secretKey
- Credential profiles file at the default location (~/.aws/credentials) shared by all AWS SDKs and the AWS CLI
- Credentials delivered through the Amazon EC2 container service if AWS_CONTAINER_CREDENTIALS_RELATIVE_URI" environment variable is set and security manager has permission to access the variable,
- Instance profile credentials delivered through the Amazon EC2 metadata service
In the connect method or connection object, you can connect by specifying at least s3_staging_dir
and region_name
.
It is not necessary to specify access_key
and secret_key
.
from pyathenajdbc import connect
conn = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
region_name='us-west-2')
Terraform Instance profile example:
Depends on the following environment variables:
$ export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
$ export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY
$ export AWS_DEFAULT_REGION=us-west-2
$ export AWS_ATHENA_S3_STAGING_DIR=s3://YOUR_S3_BUCKET/path/to/
$ pip install pipenv
$ pipenv install --dev
$ pipenv run scripts/test_data/upload_test_data.sh
$ pipenv run pytest
$ pipenv run scripts/test_data/delete_test_data.sh
$ pip install pipenv
$ pipenv install --dev
$ pipenv run scripts/test_data/upload_test_data.sh
$ pyenv local 3.7.2 3.6.8 3.5.7 3.4.10 2.7.16
$ pipenv run tox
$ pipenv run scripts/test_data/delete_test_data.sh
The license of all Python code except JDBC driver is MIT license.
For the license of JDBC driver, please check the following link.