
Update the Import module to actually download the raw files from AWS #10

Open
ChrisTheDBA opened this issue Apr 29, 2021 · 3 comments
Labels: bug (Something isn't working), enhancement (New feature or request), good first issue (Good for newcomers)

@ChrisTheDBA (Contributor)

No description provided.

@ChrisTheDBA added the bug, enhancement, and good first issue labels on Apr 29, 2021
@davidpeckham

I'd like to take this issue. Boto3 looks straightforward, but I'll need credentials for an S3 user with "programmatic access".

@davidpeckham commented Jul 12, 2021

This copies everything in a bucket. If we only need a subset of files in the bucket, perhaps we put that subset in a separate bucket, or add filtering here.

I tested this on my own S3 storage and an IAM user with AmazonS3ReadOnlyAccess.

$ pip install boto3

import boto3
from pathlib import Path

BUCKET_NAME = "nc-campaign-finance-storage"
LOCAL_DIR = Path.cwd() / 'data'

s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket(BUCKET_NAME)
for obj in bucket.objects.all():
    s3_file = obj.Object()
    local_file = LOCAL_DIR / s3_file.key
    # Skip files already downloaded (local size matches the S3 object size)
    if local_file.exists() and local_file.stat().st_size == s3_file.content_length:
        print(f'{s3_file.key} already downloaded')
        continue
    # Create any missing local subdirectories implied by the S3 key
    local_file.parent.mkdir(parents=True, exist_ok=True)
    s3_file.download_file(str(local_file))
    print(s3_file.key)

print("Done")

davidpeckham added a commit to davidpeckham/CampaignFinanceDataPipeline that referenced this issue Jul 24, 2021
@ChrisTheDBA
Copy link
Contributor Author

The change needs to be dynamic, downloading any and all files not already present in the Docker image (a static list of files is not sufficient), and it will require elevated privileges, i.e. AWS secrets.
