
s3-bigquery-conga

Piping AWS EC2/S3 files into BigQuery using Lambda and python-pandas

I'm putting this file out into the ether to help AWS customers experience Google's BigQuery.
You can think of this as a pipeline that looks like so:

CSV_File_on_EC2 --> S3 --> ObjectNotification --> AWS LAMBDA --> BQ Streaming Inserts --> BQ_Tables_organized_by_day

In other words, get your CSV files onto S3 (however you like) and have them show up in BigQuery moments later.
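
The first hop is the only piece that runs on your instances. As a minimal sketch (the bucket name, key and local path are placeholders), the push to S3 can be done with boto3; the 's3cmd' and 'aws s3' CLI tools mentioned below do the same job from a shell script or cron entry.

```python
# Minimal sketch: push a local CSV from an EC2 instance into S3 with boto3.
# The bucket name, key prefix and local path are placeholders.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="/var/data/events-2016-07-01.csv",  # CSV produced on the instance
    Bucket="my-landing-bucket",                  # bucket watched by the Lambda trigger
    Key="incoming/events-2016-07-01.csv",        # a new object here fires the notification
)
```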

There are a few key concepts you need to know about to get this snippet working. You need to understand and set up:

  • How to push data from your EC2 instances (or wherever) into S3 using a tool like 's3cmd' or 'aws s3'.
  • How to use S3's object notification, and how to prime it to tell Lambda about any new files arriving in an S3 bucket.
  • How to use python's 'virtualenv' command, so that you can create a self-contained zip file, structured the right way to include all the imported modules (like Pandas!) that we use here. A good overview can be found here.
  • How to deploy a Lambda function, and make sure it runs with a role suitable to read from your S3 bucket. (Alternately, how to just insert your AWS credentials so boto can use them directly.)
  • If you've used BigQuery before, you'll recognize that you need to change the attached script to point to your correct <project_id, dataset_name, table_name and service_key>. Obviously, your service key, in JSON format, gets packaged up in the root directory of your zip file. You'll forgive me for not posting our own!
  • Don't skimp on timeout or RAM when it comes to Lambda. It's dirt cheap, and you can always scale back later, once you're seeing successful completions in your CloudWatch logs.
  • Recognize that there are streaming limits on BQ; currently 100K rows per second per table. Then again, if you're going over that, you probably have large CSV files and will run into the Lambda timeout (300s). In other words, this approach is ideal when you have a lot of small files. (We run about 10K files/day of 8MB each without a problem. My guess is that those could scale up to 200MB before you hit a problem on the AWS side and/or the BQ streaming side.) As @felipehoffa has pointed out, you could adapt this script to do 'copy to gs, then load', but you'd run into yet another limit if you tried to do that more than 1K times/day. Yet another variant is to wire a Kinesis stream into the mix: stream your CSVs to Kinesis, then have Lambda 'sip' from the stream before inserting into BQ.
  • The essence of this whole project can be found in the 'stream-to-bq.py' file. If you've used Lambda for Python before, you can probably get started by adapting that file to your needs; a rough sketch of what such a handler can look like follows this list.
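
To make the moving parts concrete, here is a minimal, hypothetical sketch of such a handler. It is not the contents of 'stream-to-bq.py'; the project ID, dataset/table name and service-key filename are placeholders, and exactly how the service key is wired into pandas-gbq depends on your pandas version.

```python
# Hypothetical sketch of an S3-triggered handler (not the actual stream-to-bq.py).
# PROJECT_ID, DESTINATION and KEY_FILE are placeholders you would replace.
import io

import boto3
import pandas as pd

PROJECT_ID = "my-project"            # your BigQuery project id
DESTINATION = "my_dataset.my_table"  # dataset.table to stream into
KEY_FILE = "service_key.json"        # JSON service key packaged in the zip root

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # The S3 object notification carries the bucket and key of the new file.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Pull the CSV down and load it into a DataFrame.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    df = pd.read_csv(io.BytesIO(body))

    # Stream the rows into BigQuery. How the service key in KEY_FILE is passed
    # depends on your pandas/pandas-gbq version (e.g. a credentials argument
    # or the GOOGLE_APPLICATION_CREDENTIALS environment variable).
    df.to_gbq(DESTINATION, project_id=PROJECT_ID, if_exists="append")

    return "streamed %d rows from s3://%s/%s" % (len(df), bucket, key)
```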

Fortunately for us, Python-Pandas has done the heavy lifting of writing a '.to_gbq' extension to a DataFrame for BigQuery streaming. There are other approaches, including doing this with Kinesis or PubSub, as well as using something like fluentd to publish directly to BigQuery, but this approach is elegant for people who are a) used to python-pandas and b) prefer to keep things as simple as possible on the instance end: i.e. only having to push to S3 to get the ball rolling.

Since we're using Pandas here, you could also manipulate your dataframe however you like before sending the results on up to BigQuery.
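
For instance, a hypothetical cleanup pass (the column names here are made up) might drop malformed rows and stamp each row with a load time before the insert:

```python
# Hypothetical pre-insert cleanup; 'user_id' and 'event_time' are made-up column names.
import pandas as pd

df = pd.read_csv("events.csv")
df = df.dropna(subset=["user_id"])                   # drop rows missing a user id
df["event_time"] = pd.to_datetime(df["event_time"])  # normalize timestamps
df["loaded_at"] = pd.Timestamp.utcnow()              # stamp each row with the load time
df.to_gbq("my_dataset.my_table", project_id="my-project", if_exists="append")
```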

I previously wrote a post outlining why we went this route in the first place.

Comments welcome!

Twitter: @pmueller

Credits: Christopher Corus - ATS, Chris Brokes - ATS, Abhijit Chanda - Gamma, Felipe Hoffa - Google, Graham Polley - Shine
