Create SimplyE Analytics Testbed Manually

These are the steps required to create a testbed manually:

Creating an S3 bucket

Creating a bucket

  1. Sign in to the AWS Management Console, open the Amazon S3 console at https://console.aws.amazon.com/s3/ and click the Create bucket button.

  2. In the dialog that opens, enter a name for the new bucket.

ℹ️ Bucket names must be unique, so you may want to append a GUID to the name.

  3. Scroll to the end of the page and click Create bucket.

  4. After the bucket is created, you are redirected to the list of all available buckets. Find the new bucket in the list and click its name.
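The GUID suggestion in the note above can be sketched in a few lines of Python. This is only an illustration: the `cm-analytics` prefix and the function name are hypothetical, and the check covers the basic S3 naming rules (3 to 63 characters, lowercase letters, digits and hyphens) rather than every rule S3 enforces.

```python
import re
import uuid

# Basic S3 naming rules: 3-63 characters, lowercase letters, digits and
# hyphens, starting and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def unique_bucket_name(prefix: str = "cm-analytics") -> str:
    """Append a random GUID to `prefix` (a hypothetical base name)."""
    name = f"{prefix}-{uuid.uuid4()}".lower()
    if not _BUCKET_NAME.match(name):
        raise ValueError(f"not a valid S3 bucket name: {name!r}")
    return name


if __name__ == "__main__":
    print(unique_bucket_name())
```

The printed value can be pasted into the bucket name field of the Create bucket dialog.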

Creating required folders

  1. In the window showing the bucket's settings, click the Create folder button.

  2. In the folder settings window, enter the name json-input. This folder will store the JSON files containing Circulation Manager analytics events.

  3. Repeat the two steps above to create the following folder structure:

|- athena
|- glue
   |- scripts
   |- temporary
|- json-input
|- parquet-output

  4. After creating all the folders, check that the bucket's folder structure matches the tree above.
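For reference, the same folder layout could be created programmatically. S3 has no real folders; a folder is represented by an empty object whose key ends with `/`. The sketch below assumes a boto3-style S3 client is passed in (anything exposing `put_object(Bucket=..., Key=...)`); the function name is hypothetical.

```python
from typing import Any, List

# Folder layout from the tree above. Each folder is an empty placeholder
# object whose key ends with "/".
FOLDER_KEYS: List[str] = [
    "athena/",
    "glue/",
    "glue/scripts/",
    "glue/temporary/",
    "json-input/",
    "parquet-output/",
]


def create_folders(s3_client: Any, bucket: str) -> List[str]:
    """Create one empty placeholder object per folder and return the keys."""
    for key in FOLDER_KEYS:
        s3_client.put_object(Bucket=bucket, Key=key)
    return list(FOLDER_KEYS)
```

With boto3 installed and credentials configured, it might be called as `create_folders(boto3.client("s3"), "<cm-analytics-bucket>")`.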

Uploading test data to the bucket

  1. Now upload the test files to the bucket. Go to the json-input folder and click the Upload button.

  2. Drag and drop the files from the test-data folder.

  3. After adding all the files, scroll to the end of the page and click the Upload button.

Creating a Glue crawler for json-input folder

Creating a new crawler

  1. Open the AWS Glue console at https://console.aws.amazon.com/glue/, choose Crawlers in the navigation pane and then click Add crawler.

  2. Enter the name of the new crawler and click Next.

  3. Select Crawl new folders and click Next.

  4. Select S3 as the data store and choose the json-input folder as the Input path.

  5. Let Glue generate a new IAM role, specify its name and click Next.

  6. Choose Run on demand as the Frequency and click Next.

  7. Click Add database.

  8. Enter the name of the database and click Create.

  9. Check Update all new and existing partitions with metadata from the table.

  10. Click Next until you reach the last page of the wizard and then click Finish.

Running the crawler

  1. Select the newly created crawler in the list and click Run to trigger it. When the run finishes, you should see a message saying that the crawler completed and a new table was created.
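The crawler wizard roughly corresponds to two calls in the Glue API: `create_crawler` and `start_crawler`. The sketch below builds the arguments for them. It assumes a boto3-style Glue client and an IAM role that already exists (the wizard creates the role for you, this code does not); the helper names are hypothetical. Leaving out a schedule means the crawler runs on demand. The partition-metadata checkbox from the wizard is not reproduced here.

```python
from typing import Any, Dict


def crawler_settings(name: str, role: str, database: str, bucket: str,
                     folder: str = "json-input") -> Dict[str, Any]:
    """Build keyword arguments for a Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        # A single S3 target pointing at the folder to crawl.
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{folder}/"}]},
    }


def create_and_run(glue_client: Any, settings: Dict[str, Any]) -> None:
    """Create the crawler, then trigger one on-demand run."""
    glue_client.create_crawler(**settings)
    glue_client.start_crawler(Name=settings["Name"])
```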

Updating the schema created by the crawler

  1. Select Tables on the left, choose json_input and click Edit schema.

  2. Go through the columns and change the data type to timestamp for the following columns:

  • issued
  • end
  • availability_time
  • start
  • published

  3. When finished, scroll to the end of the page and click Save.
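The schema change above amounts to rewriting the type of five columns. In the Glue Data Catalog API, a table's columns are a list of dictionaries with `Name` and `Type` keys, so the edit can be sketched as a pure function. The function name is hypothetical, and fetching the table and writing it back through the Glue API is deliberately left out.

```python
from typing import Dict, Iterable, List

# Columns that the manual changes to the timestamp type.
TIMESTAMP_COLUMNS = ("issued", "end", "availability_time", "start", "published")


def retype_columns(columns: List[Dict[str, str]],
                   names: Iterable[str] = TIMESTAMP_COLUMNS,
                   new_type: str = "timestamp") -> List[Dict[str, str]]:
    """Return a copy of `columns` with the listed columns set to `new_type`."""
    wanted = set(names)
    return [
        {**col, "Type": new_type} if col["Name"] in wanted else dict(col)
        for col in columns
    ]
```

The input list is not modified, which makes it easy to compare the schema before and after the change.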

Creating a Glue job for converting json-input data to the Apache Parquet format

Creating an IAM policy for the Parquet converter

  1. Open the IAM console at https://console.aws.amazon.com/iam/ and choose Policies in the navigation pane on the left.

  2. Click Create policy.

  3. Switch to the JSON tab and insert the following:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<cm-analytics-bucket>/json-input*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<cm-analytics-bucket>/glue*",
                "arn:aws:s3:::<cm-analytics-bucket>/parquet-output*"
            ]
        }
    ]
}

where <cm-analytics-bucket> must be replaced with the name of the bucket created in Creating an S3 bucket.

  4. Enter the new policy's name on the review page and click Finish.
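To avoid replacing the `<cm-analytics-bucket>` placeholder by hand, the policy document above can be generated for a given bucket name. This is a minimal sketch; the function name and the `my-bucket` example value are hypothetical.

```python
import json
from typing import Any, Dict


def parquet_converter_policy(bucket: str) -> Dict[str, Any]:
    """Build the Parquet converter policy for the given bucket name."""
    arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read-only access to the raw JSON events.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{arn}/json-input*"],
            },
            {
                # Read/write access to Glue working folders and the output.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"{arn}/glue*", f"{arn}/parquet-output*"],
            },
        ],
    }


if __name__ == "__main__":
    # "my-bucket" is a placeholder; use the name of your own bucket.
    print(json.dumps(parquet_converter_policy("my-bucket"), indent=4))
```

The printed JSON can be pasted into the JSON tab of the policy editor.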

Creating an IAM role for the Parquet converter

  1. In the navigation pane on the left, choose Roles and click Create role.

  2. Choose Glue as the trusted entity and go to the next page.

  3. Choose the following policies:

  • AWSGlueServiceRole
  • AWSGlueServiceRole-CMAnalyticsParquetConverter, the policy created in Creating an IAM policy for the Parquet converter

  4. Enter the new role's name and click Create role.

Creating a Glue job

  1. Open the AWS Glue console at https://console.aws.amazon.com/glue/, choose Jobs in the navigation pane and then click Add job.

  2. Enter the name of the new job and select the role created in Creating an IAM role for the Parquet converter.

  3. Scroll down to Advanced properties, enable Job bookmark and proceed to the next page of the wizard.

  4. Choose json-input as the data source and click Next.

  5. Leave the transform type as is and click Next.

  6. Choose Parquet as the target type and parquet-output as the target path.
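The job settings chosen in the wizard can be approximated as arguments to the Glue `create_job` API. The sketch below only builds the arguments. It assumes the job script is stored under `glue/scripts` and temporary files under `glue/temporary`, which the folder layout suggests but the manual does not state; the script file name and the function name are hypothetical. `--job-bookmark-option` and `--TempDir` are Glue's special job parameters for bookmarks and the temporary directory.

```python
from typing import Any, Dict


def job_settings(name: str, role: str, bucket: str,
                 script_name: str = "parquet_converter.py") -> Dict[str, Any]:
    """Build keyword arguments for a Glue create_job call."""
    return {
        "Name": name,
        "Role": role,
        "Command": {
            # "glueetl" is the command name for Spark ETL jobs.
            "Name": "glueetl",
            # Assumed location; script_name is a hypothetical file name.
            "ScriptLocation": f"s3://{bucket}/glue/scripts/{script_name}",
        },
        "DefaultArguments": {
            # Equivalent of enabling Job bookmark in the wizard.
            "--job-bookmark-option": "job-bookmark-enable",
            # Assumed temporary directory, based on the folder layout.
            "--TempDir": f"s3://{bucket}/glue/temporary/",
        },
    }
```

With a boto3 Glue client, the result could be passed as `glue_client.create_job(**job_settings(...))`. The conversion script itself, which the wizard generates, is not covered by this sketch.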

Creating a Glue crawler for parquet-output

Creating a new Glue crawler

  1. In the navigation bar on the left, select Crawlers again and click Add crawler.

  2. Enter the name and click Next.

  3. Specify the source type and click Next.

  4. Specify parquet-output as the target data source.

  5. Let Glue create a new IAM role.

  6. Select cm-analytics as the database where the crawler will store the output table.

  7. Run the newly created crawler.

Setting up AWS Athena

  1. Open the Athena console at https://console.aws.amazon.com/athena/ and start setting it up.

  2. Set the athena folder as the query result location and click Save.

  3. Run a query to ensure that Athena has been set up correctly.
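A simple verification query can also be submitted through the Athena `start_query_execution` API. The sketch below builds its arguments. The table name `parquet_output` is an assumption, made by analogy with the `json_input` table that the first crawler created; check the actual name in the Glue console. The function name is hypothetical.

```python
from typing import Any, Dict


def athena_query_settings(bucket: str, database: str = "cm-analytics",
                          table: str = "parquet_output",
                          limit: int = 10) -> Dict[str, Any]:
    """Build keyword arguments for an Athena start_query_execution call."""
    return {
        # Read a few rows to confirm that the table is queryable.
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        "QueryExecutionContext": {"Database": database},
        # Query results are written to the athena folder of the bucket.
        "ResultConfiguration": {"OutputLocation": f"s3://{bucket}/athena/"},
    }
```

With a boto3 Athena client, this could be run as `athena_client.start_query_execution(**athena_query_settings("<cm-analytics-bucket>"))`.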

Setting up QuickSight

  1. Create a new analysis in QuickSight.

  2. Create a new dataset.

  3. Set up a new Athena dataset and select the cm-analytics database.

  4. Select the parquet-output table.

  5. Do not import the data into SPICE; choose to query the data directly.

  6. Change to the N. Virginia region and click Manage QuickSight.

  7. Click Security & permissions.

  8. Under QuickSight access to AWS services, click Add or remove.

  9. Scroll down to Amazon S3 and click Select buckets.

  10. Select the bucket created in Creating an S3 bucket and click Finish.

  11. Click Update.

  12. Try to create a new dashboard.