These are the steps required to create a testbed manually:
-
Sign in to the AWS Management Console, open the Amazon S3 console at https://console.aws.amazon.com/s3/ and click the Create bucket button:
ℹ️ Bucket names must be globally unique, so you may want to append a GUID to the name.
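One way to generate such a name is sketched below. The `cm-analytics` prefix and the helper name are assumptions made for illustration; the only hard facts relied on are that S3 bucket names are limited to 63 characters and must be lowercase.

```python
import uuid

MAX_BUCKET_NAME_LENGTH = 63  # S3 limit on the length of a bucket name


def unique_bucket_name(prefix: str = "cm-analytics") -> str:
    """Append a random GUID to the prefix so the name is unlikely to collide."""
    name = f"{prefix.lower()}-{uuid.uuid4()}"
    if len(name) > MAX_BUCKET_NAME_LENGTH:
        raise ValueError(f"bucket name is too long ({len(name)} characters)")
    return name
```

Calling `unique_bucket_name()` returns the prefix followed by a hyphen and a 36-character GUID, which can be pasted into the Bucket name field.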
-
After the bucket is created, you'll be redirected to the list of all available buckets. Find the newly created bucket in the list and click on its name:
-
In the window showing the bucket's settings, click the Create folder button:
-
In the folder's settings window, enter the name json-input. This folder will store the JSON files containing Circulation Manager analytics events:
Repeat steps 5 - 6 and create the following folder structure:
|- athena
|- glue
|- scripts
|- temporary
|- json-input
|- parquet-output
- After creating all the folders, the bucket's folder structure should look like the screenshot below:
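The same folders can also be created programmatically. The sketch below assumes a boto3 S3 client is passed in as `s3_client`; the helper names are hypothetical. It relies on the fact that S3 represents a folder as an empty object whose key ends with `/`. The folder names are listed as top-level entries exactly as they appear above; if some of them are meant to be nested, pass full paths instead.

```python
FOLDERS = ["athena", "glue", "scripts", "temporary", "json-input", "parquet-output"]


def folder_keys(folders=FOLDERS):
    """Return the object keys that S3 uses to represent the folders."""
    return [f"{name.strip('/')}/" for name in folders]


def create_folders(s3_client, bucket: str, folders=FOLDERS):
    """Create each folder by putting an empty object whose key ends with '/'."""
    keys = folder_keys(folders)
    for key in keys:
        s3_client.put_object(Bucket=bucket, Key=key)
    return keys
```

With boto3 installed and credentials configured, `create_folders(boto3.client("s3"), "<cm-analytics-bucket>")` would issue one `put_object` call per folder.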
-
Now upload the test files to the bucket. Go to the json-input folder and click the Upload button:
-
Drag and drop the files from the test-data folder:
-
After adding all the files, scroll down to the end of the page and click the Upload button:
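As an alternative to the console, the upload can be scripted. This is a minimal sketch that assumes a boto3 S3 client is passed in as `s3_client` and that the test files sit directly inside a local `test-data` directory; the function names are hypothetical.

```python
from pathlib import Path


def plan_uploads(local_dir: str, prefix: str = "json-input"):
    """Map every file in local_dir to an object key under the given prefix."""
    files = sorted(p for p in Path(local_dir).iterdir() if p.is_file())
    return [(str(path), f"{prefix}/{path.name}") for path in files]


def upload_test_data(s3_client, bucket: str, local_dir: str = "test-data"):
    """Upload each planned file and return the keys that were written."""
    uploads = plan_uploads(local_dir)
    for filename, key in uploads:
        # boto3's upload_file takes (Filename, Bucket, Key) in this order.
        s3_client.upload_file(filename, bucket, key)
    return [key for _, key in uploads]
```

Subdirectories of `test-data` are skipped in this sketch; only regular files at the top level are uploaded.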
-
Open the AWS Glue console at https://console.aws.amazon.com/glue/, choose Crawlers in the navigation pane and then click Add crawler:
-
Select the S3 data store and choose the json-input folder as the Input path:
-
Let Glue generate a new IAM role, specify its name and click Next:
-
Check Update all new and existing partitions with metadata from the table:
-
Click Next until you reach the last page of the wizard, and then click Finish.
- Select the newly created crawler in the list and click Run to trigger it. Once the run finishes, you should see a message saying that it completed and that a new table has been created:
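For reference, the crawler set up in the wizard can be approximated through the Glue API. The sketch below assumes a boto3 Glue client is passed in as `glue_client`; the function names and the argument values are hypothetical. Unlike the wizard, the API does not generate an IAM role, so an existing role must be supplied. The `Configuration` value is the API counterpart of the "Update all new and existing partitions with metadata from the table" option.

```python
import json


def crawler_request(name: str, role: str, database: str, bucket: str,
                    folder: str = "json-input") -> dict:
    """Build the keyword arguments for Glue's create_crawler call."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/{folder}/"}]},
        # Partitions inherit their metadata from the table.
        "Configuration": json.dumps({
            "Version": 1.0,
            "CrawlerOutput": {
                "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
            },
        }),
    }


def create_and_run_crawler(glue_client, **kwargs) -> None:
    """Create the crawler and trigger a run right away."""
    request = crawler_request(**kwargs)
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])
```

The run is asynchronous: `start_crawler` returns immediately, so the crawler state still has to be checked before the new table can be used.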
-
Select Tables on the left, choose json_input and click on Edit schema:
-
Walk through all the columns and change the data type to timestamp for the following columns:
- issued
- end
- availability_time
- start
- published

After finishing, scroll down to the end of the page and click Save.
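The schema edit can be expressed as a small transformation over the table's column definitions. This is a sketch only: it assumes the columns are in the shape the Glue API uses (a list of dictionaries with `Name` and `Type` keys, as found under `Table.StorageDescriptor.Columns` in a `get_table` response), and the function name is hypothetical.

```python
TIMESTAMP_COLUMNS = {"issued", "end", "availability_time", "start", "published"}


def with_timestamp_columns(columns, names=TIMESTAMP_COLUMNS):
    """Return a copy of the column definitions with the named columns typed as timestamp."""
    updated = []
    for column in columns:
        column = dict(column)  # copy so the caller's data is left untouched
        if column["Name"] in names:
            column["Type"] = "timestamp"
        updated.append(column)
    return updated
```

The result would then be written back with Glue's `update_table`, which expects a `TableInput` structure rather than the full `get_table` response, so read-only fields have to be dropped first.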
-
Open the IAM console at https://console.aws.amazon.com/iam/ and choose Policies in the navigation pane on the left:
-
Switch to the JSON tab and insert the following:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::<cm-analytics-bucket>/json-input*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<cm-analytics-bucket>/glue*",
                "arn:aws:s3:::<cm-analytics-bucket>/parquet-output*"
            ]
        }
    ]
}
where <cm-analytics-bucket> must be replaced with the name of the bucket created in steps 1 - 3.
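To avoid editing the placeholder by hand, the policy can be generated for a given bucket name. The sketch below reproduces the policy above; the function names are hypothetical.

```python
import json


def build_policy(bucket: str) -> dict:
    """Return the policy document with the bucket placeholder filled in."""
    arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read-only access to the raw JSON events.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{arn}/json-input*"],
            },
            {
                # Read/write access to Glue's working area and the Parquet output.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"{arn}/glue*", f"{arn}/parquet-output*"],
            },
        ],
    }


def policy_json(bucket: str) -> str:
    """Serialize the policy so it can be pasted into the JSON tab."""
    return json.dumps(build_policy(bucket), indent=2)
```

The string returned by `policy_json` can be pasted into the JSON tab, or passed as the `PolicyDocument` argument of IAM's `create_policy` call.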
-
In the navigation pane on the left choose Roles and click on Create role:
-
Choose the following policies:
-
Open the AWS Glue console at https://console.aws.amazon.com/glue/, choose Jobs in the navigation pane and then click Add job:
-
Enter the name of the new job and select the role created in steps 29 - 32:
-
Then scroll down to Advanced properties, enable Job bookmark and proceed to the next page of the wizard.
-
Choose Parquet and parquet-output as the target type and target path respectively:
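For comparison, a job with bookmarks enabled could be defined through the Glue API roughly as follows. This is a sketch under assumptions: the wizard generates the ETL script itself, whereas the API expects a script that already exists in S3, so `script_location` and `temp_dir` are hypothetical paths supplied by the caller. The `--job-bookmark-option` argument is the API counterpart of the Job bookmark setting.

```python
def job_request(name: str, role: str, script_location: str, temp_dir: str) -> dict:
    """Build the keyword arguments for Glue's create_job call."""
    return {
        "Name": name,
        "Role": role,
        # "glueetl" denotes a Spark ETL job.
        "Command": {"Name": "glueetl", "ScriptLocation": script_location},
        "DefaultArguments": {
            # Process only the data added since the previous run.
            "--job-bookmark-option": "job-bookmark-enable",
            "--TempDir": temp_dir,
        },
    }
```

The dictionary would be passed to a boto3 Glue client as `glue_client.create_job(**job_request(...))`. The Parquet target itself is defined inside the ETL script, not in these arguments.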
-
In the navigation bar on the left select Crawlers again and click on Add crawler:
-
Select cm-analytics as the database in which the crawler will create the output table:
-
Open the Athena console at https://console.aws.amazon.com/athena/ and start setting it up:
-
Set the athena folder as the query result location and click Save:
-
Run the query to ensure that Athena has been set up correctly:
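A sanity check of this kind can also be submitted through the Athena API. The sketch below assumes a boto3 Athena client is passed in as `athena_client`. The table name `parquet_output` is an assumption (the name a crawler would typically derive from the parquet-output folder), and the `SELECT ... LIMIT` statement is only a hypothetical example of a verification query.

```python
def athena_query_request(bucket: str, database: str = "cm-analytics",
                         table: str = "parquet_output", limit: int = 10) -> dict:
    """Build the keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        "QueryExecutionContext": {"Database": database},
        # Results are written to the athena folder of the bucket.
        "ResultConfiguration": {"OutputLocation": f"s3://{bucket}/athena/"},
    }


def run_sanity_query(athena_client, bucket: str, **kwargs) -> str:
    """Submit the query and return its execution id for later polling."""
    request = athena_query_request(bucket, **kwargs)
    response = athena_client.start_query_execution(**request)
    return response["QueryExecutionId"]
```

The query runs asynchronously, so the returned id has to be polled (for example with `get_query_execution`) before the results can be read.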