Contributing Guidelines

Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional documentation, we greatly value feedback and contributions from our community.

Please read through this document before submitting any issues or pull requests to ensure we have all the necessary information to effectively respond to your bug report or contribution.

Reporting Bugs/Feature Requests

We welcome you to use the GitHub issue tracker to report bugs or suggest features.

When filing an issue, please check existing open and recently closed issues to make sure somebody else hasn't already reported it. Please try to include as much information as you can. Details like these are incredibly useful:

  • A reproducible test case or series of steps
  • The version of our code being used (a one-liner to capture this follows the list)
  • Any modifications you've made relevant to the bug
  • Anything unusual about your environment or deployment
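
For example, the installed library version can be captured with a one-liner (awswrangler is the project's import name):

python -c "import awswrangler; print(awswrangler.__version__)"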

Here is a list of tags to label issues and help us triage them:

  • question: A question about the library. Consider starting a discussion instead
  • bug: An error encountered when using the library
  • feature: A completely new idea not currently covered by the library
  • enhancement: A suggestion to enhance an existing feature

Contributing via Pull Requests

Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:

  1. You are working against the latest source on the main branch.
  2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
  3. You open an issue to discuss any significant work; we would hate for your time to be wasted.

To send us a pull request, please:

  1. Fork the repository.
  2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
  3. Ensure local tests pass.
  4. Commit to your fork using clear commit messages.
  5. Send us a pull request, answering any default questions in the pull request interface.
  6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.

GitHub provides additional documentation on forking a repository and creating a pull request.
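
As a rough sketch of that workflow (the username, branch name, and commit message below are placeholders, not project conventions):

git clone https://github.com/<your-username>/aws-data-wrangler.git
cd aws-data-wrangler
git checkout -b my-fix
# ... edit, then make sure the local tests pass ...
git commit -am "Describe the specific change here"
git push origin my-fix

Then open the pull request from your branch against the upstream main branch.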

Note: An automated AWS CodeBuild run is triggered with every pull request. To skip it, add the prefix [skip-ci] to your commit message.
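
For example (hypothetical commit message):

git commit -m "[skip-ci] Fix typos in the contributing guide"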

Finding contributions to work on

Looking at the existing issues is a great way to find something to contribute to. Our projects use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), so looking at any 'help wanted' issues is a great place to start.

Code of Conduct

This project has adopted the Amazon Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Security issue notifications

If you discover a potential security issue in this project, we ask that you notify AWS/Amazon Security via our vulnerability reporting page. Please do not create a public GitHub issue.

Licensing

See the LICENSE file for our project's licensing. We will ask you to confirm the licensing of your contribution.

We may ask you to sign a Contributor License Agreement (CLA) for larger changes.

Environments

We have hundreds of test functions that run against several AWS services. You don't need to test everything to open a pull request. You can choose from three different environments to test your fixes/changes, based on what makes sense for your case; a credentials sanity check is suggested after the list.

  • Mocked test environment

    • Based on moto.
    • Does not require real AWS resources.
    • Fastest approach.
    • Essentially limited to Amazon S3 tests.
  • Data Lake test environment

    • Requires some AWS services.
    • Amazon S3, Amazon Athena, AWS Glue Catalog, AWS KMS.
    • Enables real tests for typical data lake use cases.
  • Full test environment

    • Requires many real AWS services.
    • Amazon S3, Amazon Athena, AWS Glue Catalog, AWS KMS, Amazon Redshift, Aurora PostgreSQL, Aurora MySQL, Amazon QuickSight, etc.
    • Enables real tests for all use cases.
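
Before deploying either of the environments that use real AWS services, it's worth confirming which account and identity your AWS credentials resolve to, for example with the standard AWS CLI call:

aws sts get-caller-identity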

Step-by-step

Mocked test environment

  • Use a Linux or macOS machine.
  • Install Python 3.7, 3.8, or 3.9 and poetry for package management (see the poetry install sketch after the virtual environment commands below)
  • Fork the AWS Data Wrangler repository and clone it into your development environment
  • Go to the project's directory and create a Python virtual environment for the project

python3 -m venv .venv && source .venv/bin/activate

or

python -m venv .venv && source .venv/bin/activate
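
If poetry is not installed yet, one option is the official installer; the URL below is the upstream poetry installer, but check the poetry documentation for the currently recommended method:

curl -sSL https://install.python-poetry.org | python3 -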

  • Install dependencies:

poetry install --extras "sqlserver"

  • Run the validation script:

./validate.sh

  • To run a specific test function:

pytest tests/test_moto.py::test_get_bucket_region_succeed

  • To run all mocked test functions (Using 8 parallel processes):

pytest -n 8 tests/test_moto.py

Data Lake test environment

DISCLAIMER: Make sure you know what you are doing. These steps will incur charges for some services on your AWS account and require minimum security skills to keep your environment safe.

  • Use a Linux or macOS machine.
  • Install Python 3.7, 3.8, or 3.9 and poetry for package management
  • Fork the AWS Data Wrangler repository and clone it into your development environment
  • Go to the project's directory and create a Python virtual environment for the project

python3 -m venv .venv && source .venv/bin/activate

or

python -m venv .venv && source .venv/bin/activate

  • Install dependencies:

poetry install --extras "sqlserver"

  • Go to the test_infra directory

cd test_infra

  • Install CDK dependencies:

poetry install

  • [OPTIONAL] Set AWS_DEFAULT_REGION to define the region the Data Lake Test environment will deploy into. You may want to choose a region which you don't currently use:

export AWS_DEFAULT_REGION=ap-northeast-1

  • Go to the scripts directory

cd scripts

  • Deploy the base CDK stack

./deploy-stack.sh base
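
You can optionally confirm that the deployment completed before moving on; the stack name below is an assumption, so check the CDK output of the deploy script for the actual name:

aws cloudformation describe-stacks --stack-name aws-data-wrangler-base --query "Stacks[0].StackStatus"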

  • Return to the project root directory

cd ../../

  • Run the validation script:

./validate.sh

  • To run a specific test function:

pytest tests/test_athena_parquet.py::test_parquet_catalog

  • To run all data lake test functions (Using 8 parallel processes):

pytest -n 8 tests/test_athena*

  • [OPTIONAL] To remove the base test environment CloudFormation stack after testing:

./test_infra/scripts/delete-stack.sh base

Full test environment

DISCLAIMER: Make sure you know what you are doing. These steps will incur charges for some services on your AWS account and require minimum security skills to keep your environment safe.

DISCLAIMER: This environment contains Aurora MySQL, Aurora PostgreSQL, and Amazon Redshift (single-node) clusters, which will incur costs while running.

  • Use a Linux or macOS machine.
  • Install Python 3.7, 3.8, or 3.9 and poetry for package management
  • Fork the AWS Data Wrangler repository and clone it into your development environment
  • Go to the project's directory and create a Python virtual environment for the project

python -m venv .venv && source .venv/bin/activate

  • Then run the command below to install all dependencies:

poetry install --extras "sqlserver"

  • Go to the test_infra directory

cd test_infra

  • Install CDK dependencies:

poetry install

  • [OPTIONAL] Set AWS_DEFAULT_REGION to define the region the Full Test environment will deploy into. You may want to choose a region which you don't currently use:

export AWS_DEFAULT_REGION=ap-northeast-1

  • Go to the scripts directory

cd scripts

  • Deploy the base and databases CDK stacks. This step could take about 15 minutes.

./deploy-stack.sh base

./deploy-stack.sh databases

  • [OPTIONAL] Deploy the lakeformation CDK stack (if you need to test against the AWS Lake Formation Service). You must ensure Lake Formation is enabled in the account.

./deploy-stack.sh lakeformation

  • [OPTIONAL] Deploy the opensearch CDK stack (if you need to test against the Amazon OpenSearch Service). This step could take about 15 minutes to deploy.

./deploy-stack.sh opensearch

  • Go to the EC2 -> Security Groups console, open the aws-data-wrangler-* security group, and configure it to accept connections from your IP on any TCP port.

    • Alternatively run:

    ./security-group-databases-add-local-ip.sh

    • Check local IP was applied:

    ./security-group-databases-check.sh

P.S. Make sure that your security group is not open to the world! Configure your security group to give access only to your IP.
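
If you prefer the AWS CLI over the console or the scripts above, a rule limited to your current IP can be added along these lines (the security group ID is a placeholder; look up the real aws-data-wrangler-* group ID in the EC2 console first):

MY_IP=$(curl -s https://checkip.amazonaws.com)
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 0-65535 --cidr ${MY_IP}/32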

  • Return to the project root directory

cd ../../

  • [OPTIONAL] If you intend to run all tests, you also need to make sure that Amazon QuickSight is activated and that your AWS user is registered with it.

  • Run the validation script:

./validate.sh

  • To run a specific test function:

pytest tests/test_db.py::test_sql

  • To run all database test functions (Using 8 parallel processes):

pytest -n 8 tests/test_db.py

  • To run all data lake test functions for all Python versions (only if Amazon QuickSight is activated and the Amazon OpenSearch template is deployed):

./test.sh

  • [OPTIONAL] To remove the base and databases test environment CloudFormation stacks after testing:

./test_infra/scripts/delete-stack.sh base

./test_infra/scripts/delete-stack.sh databases

Recommended Visual Studio Code settings

{
  "python.formatting.provider": "black",
  "python.linting.enabled": true,
  "python.linting.flake8Enabled": true,
  "python.linting.mypyEnabled": true,
  "python.linting.pylintEnabled": false
}

Common Errors

See the ERRORS file for common errors and their solutions.

Bumping version

When there is a new release, you can use bump2version to update the version number in the relevant files. Run bump2version major|minor|patch in the top-level directory and the following steps will be executed:

  • The version number in all files which are listed in .bumpversion.cfg is updated
  • A new commit with message Bump version: {current_version} → {new_version} is created
  • A new Git tag {new_version} is created
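
For example, a patch release (the version numbers are hypothetical):

bump2version patch
# e.g. updates 2.10.0 to 2.10.1, commits "Bump version: 2.10.0 → 2.10.1", and tags 2.10.1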