A lambda function and CLI tool, written in Kotlin, for processing AWS Cloudfront access logs and dumping the useful data into postgres.
My blog (benasher.co) is a static site hosted via S3 and AWS Cloudfront. I built this tool so that I could keep page view data beyond the 60-day retention period that Cloudfront gives you in its reports.
Be able to write queries to assess:
- Views per day, week, month for the site and per page
- Top referers
- Sanitize AWS log data to prepare it to be queried
- Run serverless to keep costs low; the primary use case is occasional usage (the site owner occasionally runs queries)
Cloudfront gives you reports for some of this information, but the data only goes back 60 days. Processing logs into a database allows quer
- Not a goal: storing data that would allow tracking users or locations
This parses Cloudfront access logs and extracts:
- Access date and time in UTC
- Referer header
- User Agent
- Path component of the URL accessed
🚨 All paths are normalized to remove the trailing slash. Once a log is processed, the extracted data is dumped into postgres, and the log file is deleted from S3 🧹.
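The extraction and normalization described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: `PageView`, `parseLogLine`, and `normalizePath` are hypothetical names, the field positions assume the tab-separated Cloudfront standard log layout (date at index 0, time at 1, cs-uri-stem at 7, cs(Referer) at 9, cs(User-Agent) at 10), and leaving the root path `/` unchanged is an assumption.

```kotlin
import java.time.LocalDate
import java.time.LocalTime
import java.time.OffsetDateTime
import java.time.ZoneOffset

// Hypothetical container for the four extracted values.
data class PageView(
    val accessedAt: OffsetDateTime,
    val referer: String?,
    val userAgent: String?,
    val path: String
)

// Removes a single trailing slash; the root path "/" is kept as-is (assumption).
fun normalizePath(path: String): String =
    if (path.length > 1 && path.endsWith("/")) path.dropLast(1) else path

// Parses one log line, returning null for header lines (starting with '#'),
// blank lines, and lines with too few fields.
fun parseLogLine(line: String): PageView? {
    if (line.isBlank() || line.startsWith("#")) return null
    val fields = line.split("\t")
    if (fields.size < 11) return null

    // Cloudfront logs record the date and time in UTC.
    val accessedAt = OffsetDateTime.of(
        LocalDate.parse(fields[0]),
        LocalTime.parse(fields[1]),
        ZoneOffset.UTC
    )
    // A lone "-" marks an empty value in the log.
    fun orNull(value: String): String? = value.takeUnless { it == "-" }

    return PageView(
        accessedAt = accessedAt,
        referer = orNull(fields[9]),
        userAgent = orNull(fields[10]),
        path = normalizePath(fields[7])
    )
}
```

Note that the sketch never reads the client IP or edge location fields, which matches the intent of not storing data that could track users or locations.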
- SDK credentials
- PG_USER: The postgres database user
- PG_PASSWORD: The postgres database password
- PG_URL: The postgres database url in the format: postgresql://YOUR_DB_LOCATION/YOUR_DB_NAME
- LOG_BUCKET_REGION (unless supplied on the command line): The AWS region where your S3 bucket lives
- LOG_BUCKET (unless supplied on the command line): The name of the bucket where the logs to be parsed live
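As a rough sketch of how these settings might be read, the following resolves each value from the environment and lets command-line values take precedence for the bucket settings. The names `Config`, `loadConfig`, `cliBucket`, and `cliRegion` are hypothetical, and prefixing `PG_URL` with `jdbc:` assumes the database is reached through the PostgreSQL JDBC driver; the tool itself may do this differently.

```kotlin
// Hypothetical holder for the settings listed above.
data class Config(
    val pgUser: String,
    val pgPassword: String,
    val pgUrl: String,
    val bucketRegion: String,
    val bucket: String
) {
    // The PostgreSQL JDBC driver expects URLs of the form jdbc:postgresql://host/db.
    val jdbcUrl: String get() = "jdbc:$pgUrl"
}

// Reads settings from the given environment map. The bucket name and region
// fall back to the environment only when no command-line value is supplied.
fun loadConfig(
    env: Map<String, String> = System.getenv(),
    cliBucket: String? = null,
    cliRegion: String? = null
): Config {
    fun required(name: String, override: String? = null): String =
        override ?: env[name] ?: throw IllegalStateException("Missing required setting: $name")

    return Config(
        pgUser = required("PG_USER"),
        pgPassword = required("PG_PASSWORD"),
        pgUrl = required("PG_URL"),
        bucketRegion = required("LOG_BUCKET_REGION", cliRegion),
        bucket = required("LOG_BUCKET", cliBucket)
    )
}
```

Failing fast on a missing variable keeps a misconfigured lambda from silently processing nothing.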
The steps below assume you have the aws cli tool set up and AWS credentials configured for it.
./gradlew clean fatJar
- Command to create the function (pay attention to all caps variables that need substitution):
aws lambda create-function --function-name YOUR_FUNCTION_NAME --runtime java8 \
--zip-file fileb://build/libs/KloudfrontBlogStats-1.0-SNAPSHOT-fat.jar --handler com.benasher44.kloudfrontblogstats.AppKt::s3Handler \
--role YOUR_ROLE_FOR_LAMBDA \
--vpc-config YOUR_VPC_CONFIG \
--environment "Variables={LOG_BUCKET=YOUR_LOG_BUCKET,LOG_BUCKET_REGION=YOUR_S3_BUCKET_REGION,PG_URL=postgresql://YOUR_DB_LOCATION/YOUR_DB_NAME,PG_USER=YOUR_PG_USER,PG_PASSWORD=YOUR_PG_PASSWORD}" \
--timeout 300 \
--memory-size 512
./gradlew clean fatJar
java -jar build/libs/KloudfrontBlogStats-1.0-SNAPSHOT-fat.jar --help
This is mainly useful for testing, though you could run it locally and not pay for AWS Lambda at all. By default, the CLI tool does not delete logs from S3. See the help text for how to enable that.