A Python library for submitting log events from applications.
The logging client provides four things:

- A `Logger` class that manages submitting log events
- Managed log stream classes
- Classes for creating log entries
- Validation that everything is being used properly
Docker is required on the local system to run the tests.
- Install Python dependencies: `pipenv sync --dev`
- Run the test suite: `pytest`
- Run lint checks: `ruff .`
- Auto-format: `ruff format`
We use `moto` to test the various AWS moving parts. `moto` builds (almost) everything from the CDK CloudFormation template.

If you make changes to the Stack, or when you first install the project, you need to run `make cfn_template_for_tests`. This file isn't checked in to Git as it contains actual values from the deployment.
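For reference, `moto` works by faking AWS services in-process, so `boto3` calls hit a local stub rather than real infrastructure. The snippet below is only a minimal illustration of that pattern; it is not one of this repo's actual test fixtures, and it assumes moto >= 5 (which exposes the `mock_aws` decorator):

```python
# Minimal illustration of moto faking AWS in a test -- not this repo's fixtures.
import boto3
from moto import mock_aws  # assumes moto >= 5


@mock_aws
def test_can_create_a_fake_firehose_stream():
    client = boto3.client("firehose", region_name="eu-west-2")
    client.create_delivery_stream(
        DeliveryStreamName="example-stream",  # hypothetical name
        S3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/example-role",
            "BucketARN": "arn:aws:s3:::example-bucket",
        },
    )
    names = client.list_delivery_streams()["DeliveryStreamNames"]
    assert "example-stream" in names
```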
Install the desired version using pip or pipenv.

For pipenv, especially on projects deployed on AWS Lambda, it's advised to use the `zip` package from the release page:

```sh
pipenv install https://github.com/DemocracyClub/dc_logging/archive/refs/tags/[VERSION].zip
```
The library contains a single logger class per log stream. A log stream represents the category of log, and all logs for a single stream are stored together.
Currently, there is a single log stream defined: `DCWidePostcodeLoggingClient`. This is designed to log all postcodes entered from any DC site.
If the application in turn calls the developers.democracyclub.org.uk API then `calls_devs_dc_api` MUST be set to `True`. This will prevent double counting of usage when querying later (see the combined example after the `log` call below).
It's recommended that loggers are created globally within the application, for example in a Django settings module.
```python
# settings.py
from dc_logging_client.log_client import DCWidePostcodeLoggingClient

POSTCODE_LOGGER = DCWidePostcodeLoggingClient(function_arn="arn")
```
The ARN to pass in should be the correct one for the log stream (currently only `DCWidePostcodeLoggingClient`) and the environment (currently only `development` or `production`). That means at the moment there are only two possible ARNs here. Find them in the DC dev handbook.
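Since the development and production streams have different ARNs, one option (a sketch, not something the library requires) is to read the ARN from an environment variable so each deployment picks up the right value:

```python
# settings.py -- a sketch; the LOGGER_ARN variable name is an assumption,
# the ARN value itself comes from the DC dev handbook.
import os

from dc_logging_client.log_client import DCWidePostcodeLoggingClient

POSTCODE_LOGGER = DCWidePostcodeLoggingClient(
    function_arn=os.environ["LOGGER_ARN"]
)
```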
At the point you want to create a log entry:

```python
entry = POSTCODE_LOGGER.entry_class(
    postcode="SW1A 1AA",
    dc_product=POSTCODE_LOGGER.dc_product.wcivf
)
```
Note the `dc_product`. This is an Enum that is validated against a set of known and supported DC products. If you are trying to use this library in a DC product that's not supported then please make a PR to this repo.
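Because `dc_product` is a standard Python `Enum`, you can inspect the supported values at a REPL before picking one (a sketch; only `wcivf` is shown in this README, the other member names come from the library itself):

```python
# Print the supported DC product enum members and their values.
for product in POSTCODE_LOGGER.dc_product:
    print(product.name, product.value)
```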
And log it:

```python
POSTCODE_LOGGER.log(entry)
```
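Putting the pieces together, an application that also calls the developers.democracyclub.org.uk API might do the following. This is a sketch: it assumes `calls_devs_dc_api` is accepted as a field on the entry class, as described above.

```python
entry = POSTCODE_LOGGER.entry_class(
    postcode="SW1A 1AA",
    dc_product=POSTCODE_LOGGER.dc_product.wcivf,
    calls_devs_dc_api=True,  # this app also hits developers.democracyclub.org.uk
)
POSTCODE_LOGGER.log(entry)
```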
Logs are submitted initially to a Lambda ingest function and then to AWS Kinesis Firehose.
Understanding how Firehose works shouldn't be required, but some high level basics are useful:

- Firehose provides log streams that are essentially endpoints that accept data.
- Each stream can be configured to process the data in various ways: for example, by putting it in S3, calling an AWS Lambda ingest function, adding to a relational database, etc.
- Firehose doesn't validate the incoming data, so it's important that clients write consistently.

This library mainly attempts to manage this consistency.
The initial Lambda ingest function is needed for cross-account support: Firehose doesn't support organisation-wide permissions, meaning it's only possible to write to the account that hosts the log stream. To get around this, we have a Lambda ingest function per log stream (and environment) that can be called cross-account, and this function relays the log message on to Firehose.
```mermaid
graph TB
    dc_logging_client["DCWidePostcodeLoggingClient()"]
    put_log["DCWidePostcodeLoggingClient().log(entry)"]
    lambda_ingest["Lambda Ingest function"]

    subgraph application_account [" Application Account"]
        subgraph application ["Application"]
            direction TB
            dc_logging_client --> put_log
        end
    end

    subgraph aws_monitoring ["AWS Monitoring Account"]
        direction TB
        application --Validate PrincipalOrgID--> lambda_ingest --> firehose
        firehose["Firehose DataStream"]
        convert["Convert to ORC"]
        s3["S3 log storage"]
        glue["AWS Glue table definition"]
        athena["AWS Athena (SQL query logs)"]

        firehose --> convert --> s3
        s3 --> glue --> athena
    end
```
The end result of this is that the client needs two things:

- To have a PrincipalOrgID of the DC organisation. This means it is in an account in the DC org, or is an authenticated user in that organisation.
- The function ARN of the ingest function. Take this from the DC dev handbook, and ensure it matches the environment you're deploying to (currently either `development` or `production`). DO NOT LOG TO THE WRONG PLACE.
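If you're unsure which account the credentials your application uses actually belong to, a quick sanity check is to ask STS who you are (a sketch using boto3; it only prints the caller identity, it does not verify organisation membership by itself):

```python
import boto3

# Print the account and ARN of the current credentials so you can confirm
# they sit inside the DC organisation before pointing the logger at an ARN.
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])
```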
#### Deploying AWS Services
The AWS services are deployed using the CDK. The CDK is a framework for defining AWS infrastructure as code. It's written in Python, and the code is in the `dc_logging_aws` directory.
Deployment is done automatically by CircleCI. To deploy manually, you need to install the CDK and run `cdk deploy` in the root of this repo, with these two environment variables set for development (production deploys should be handled by CircleCI):

- `DC_ENVIRONMENT`: `development`
- `LOGS_BUCKET_NAME`: Run `aws s3 ls` to find this; it likely ends with `logging`.
The logs are stored in S3 in a format that can be queried using Athena. The logs are partitioned by day and hour, so in order to query them efficiently you need to also specify ranges to filter by. The day is a string in the format `YYYY/MM/DD` and the hour is an int. You can use `>`, `<`, `>=`, `<=` etc, and also `LIKE` to match a string prefix for the day.

The partitions are based on when the log was sent to S3 by Firehose, not when the log entry was created. This means that timestamps can be off versus the partitions by up to 5 minutes. For precise analysis, you should check the `timestamp` field in the log entry.
```sql
-- All of May and June 2023
SELECT dc_product, count(*) FROM "dc-wide-logs"."dc_postcode_searches_table"
WHERE day >= '2023/05' AND day < '2023/07'
GROUP BY 1
```

```sql
-- All of May 3rd 2023, using the timestamp field for precision
SELECT dc_product, count(*) FROM "dc-wide-logs"."dc_postcode_searches_table"
WHERE day IN ('2023/05/03', '2023/05/04')
AND timestamp >= cast('2023-05-03' AS timestamp)
AND timestamp < cast('2023-05-04' AS timestamp)
GROUP BY 1
```
```sql
-- All of 2023
SELECT dc_product, count(*) FROM "dc-wide-logs"."dc_postcode_searches_table"
WHERE day LIKE '2023/%'
GROUP BY 1
```
```sql
-- Joining on the local_authorities table
SELECT substr(nuts, 1, 1), count(*)
FROM "dc-wide-logs"."dc_postcode_searches_table"
JOIN (select distinct pcds, nuts FROM "local_authorities"."local_authorities") AS las
ON replace("postcode", ' ', '') = replace("las"."pcds", ' ', '')
WHERE "timestamp" >= cast('2023-04-01' AS timestamp)
AND "timestamp" <= cast('2023-05-04 22:00' AS timestamp)
AND day >= '2023/03/31' AND day <= '2023/05/05'
GROUP BY substr(nuts, 1, 1)
```
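These queries can also be run programmatically. Below is a minimal sketch with boto3; the results bucket name is an assumption, and in practice you would also poll `get_query_execution` until the query finishes before fetching results:

```python
import boto3

athena = boto3.client("athena")

query = """
SELECT dc_product, count(*)
FROM "dc-wide-logs"."dc_postcode_searches_table"
WHERE day >= '2023/05' AND day < '2023/07'
GROUP BY 1
"""

response = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # assumed bucket
)
print(response["QueryExecutionId"])
```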