Skip to content

Latest commit

 

History

History
220 lines (155 loc) · 7.43 KB

README.en.md

File metadata and controls

220 lines (155 loc) · 7.43 KB

Clickstream on AWS

繁體中文版說明

Getting started

We recommand that using Cloud9 environment to deploy, or you must ensure you had installed following requirements in local before starting

  • AWS CDK for constructing AWS environment
  • Poetry for Pythen dependency management
  • Docker runtime.

Deploying your AWS environment

Clone the repo

git clone https://github.com/gavinjwl/clickstream-on-aws

cd clickstream-on-aws

Activate Python virtual environment

poetry install

source .venv/bin/activate

Deploy CDK stacks

Deploy all stacks

cdk deploy --all \
    --parameters CoreStack:WriteKey='<define-your-write-key>' \
    --parameters CoreStack:RedshiftServerlessSubnetIds='<assign-subnets-to-redshift>' \
    --parameters CoreStack:RedshiftServerlessSecurityGroupIds='assign-security-groups-for-redshift'

Or, deploy ONLY CoreStack by

cdk deploy CoreStack \
    --parameters CoreStack:WriteKey='<define-your-write-key>' \
    --parameters CoreStack:RedshiftServerlessSubnetIds='<assign-subnets-to-redshift>' \
    --parameters CoreStack:RedshiftServerlessSecurityGroupIds='assign-security-groups-for-redshift'

Or, deploy ONLY CoreStack with Dashboard by

cdk deploy CoreStack Dashboard \
    --parameters CoreStack:WriteKey='<define-your-write-key>' \
    --parameters CoreStack:RedshiftServerlessSubnetIds='<assign-subnets-to-redshift>' \
    --parameters CoreStack:RedshiftServerlessSecurityGroupIds='assign-security-groups-for-redshift'

Or, deploy ONLY CoreStack with Scheduled Refresh feature by

cdk deploy CoreStack ScheduledRefreshStack \
    --parameters CoreStack:WriteKey='<define-your-write-key>' \
    --parameters CoreStack:RedshiftServerlessSubnetIds='<assign-subnets-to-redshift>' \
    --parameters CoreStack:RedshiftServerlessSecurityGroupIds='assign-security-groups-for-redshift'

Change Redshift Serverless Namespace password

After CDK deployments complete, we need to change Redshift Serverless Namespace password, so that we can connect.

redshift-change-namespace-password

Connect to Redshift Serverless Namespace

Use Username-Password to Connect to Redshift Serverless Namespace. Following show how to connect with QueryEditorV2

redshift-connect-with-password

Enable Redshift Streaming Ingestion

Create an external schema for Kinesis Stream

-- Create external schema for kinesis
CREATE EXTERNAL SCHEMA IF NOT EXISTS kinesis FROM KINESIS IAM_ROLE default;

Create clickstream schema

-- Create schema for clickstream
CREATE SCHEMA IF NOT EXISTS clickstream;

Create user and grant permissions

-- Create clickstream user and grant required permissions
-- Please do not change `IAMR:ClickstreamRedshiftRole`
CREATE USER "IAMR:ClickstreamRedshiftRole" PASSWORD DISABLE;

GRANT ALL ON SCHEMA kinesis TO "IAMR:ClickstreamRedshiftRole";

GRANT ALL ON SCHEMA clickstream TO "IAMR:ClickstreamRedshiftRole";
GRANT ALL ON ALL TABLES IN SCHEMA clickstream TO "IAMR:ClickstreamRedshiftRole";

Create a materialized view to consume the stream data

SET enable_case_sensitive_identifier TO true;
CREATE MATERIALIZED VIEW clickstream.mv_kinesisSource
AS
SELECT
    ApproximateArrivalTimestamp AS approximateArrivalTimestamp,
    PartitionKey AS partitionKey,
    ShardId AS shardId,
    SequenceNumber AS sequenceNumber,
    -- JSON_PARSE(from_varbyte(Data, 'utf-8')) as data,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'messageId')::VARCHAR AS messageId,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'timestamp')::VARCHAR AS event_timestamp,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'type')::VARCHAR AS type,
    -- Common
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'userId')::VARCHAR AS userId,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'anonymousId')::VARCHAR AS anonymousId,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'context')::SUPER AS context,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'integrations')::SUPER AS integrations,

    -- Identify
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'traits')::SUPER AS traits,

    -- Track
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'event')::VARCHAR AS event,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'properties')::SUPER AS properties,

    -- Alias
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'previousId')::VARCHAR AS previousId,

    -- Group
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'groupId')::VARCHAR AS groupId,

    -- Page
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'category')::VARCHAR AS category,
    json_extract_path_text(from_varbyte(data, 'utf-8'), 'name')::VARCHAR AS name
FROM kinesis."ClickstreamKinesisStream"
WHERE is_utf8(Data) AND is_valid_json(from_varbyte(Data, 'utf-8'));

Change materialized view owner to IAMR:ClickstreamRedshiftRole so that ScheduledRefreshStack can work.

SET enable_case_sensitive_identifier TO true;
ALTER TABLE clickstream.mv_kinesisSource OWNER TO "IAMR:ClickstreamRedshiftRole";

Simulate clickstream

  • The easiest way to simulate is doing follow command, for more detail

    # Enable your python venv, if not
    source .venv/bin/activate
    
    # Execute simulator
    python3 simulator.py --host <API Gateway URL> --writeKey <Your Write Key>
  • If you want to simulate more users, you can leverage Locust.

    Note You need to change HOST = '<API Gateway URL>' and WRITE_KEY = '<Your Write Key>' in main.py first.

    # Enable your python venv, if not
    source .venv/bin/activate
    
    # Start locust
    locust -f benchmark/main.py \
        --web-port 8089
    
    # Open your browser and input <API Gateway URL> and how many users you want.

Explore clickstream data

Open Redshift Query Editor V2

SET enable_case_sensitive_identifier TO true;

SELECT *
FROM clickstream.mv_kinesisSource
LIMIT 10
;

Install Tracking Code

Note You need to change HOST to your API Gateway url and WRITE_KEY to the value you defined in CDK deployment in any SDK.

Client Side based

Full List

Server Side based

Full List