Merge branch 'devportal-730' of https://github.com/Teradata/quickstarts into devportal-730
ObedVega committed Mar 26, 2024
2 parents d16d4f3 + 7510109 commit 225f37e
Showing 35 changed files with 966 additions and 48 deletions.
1 change: 1 addition & 0 deletions modules/ROOT/nav.adoc
@@ -55,6 +55,7 @@
** xref::perform-time-series-analysis-using-teradata-vantage.adoc[Perform time series analysis]
** xref:modelops:deploy-and-monitor-machine-learning-models-with-teradata-modelops-and-byom.adoc[]
** xref:modelops:deploy-and-monitor-machine-learning-models-with-teradata-modelops-and-git.adoc[]
** xref:modelops:execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution.adoc[]
** xref:cloud-guides:sagemaker-with-teradata-vantage.adoc[]
** xref:cloud-guides:use-teradata-vantage-with-azure-machine-learning-studio.adoc[]
** xref:cloud-guides:integrate-teradata-jupyter-extensions-with-google-vertex-ai.adoc[]
51 changes: 48 additions & 3 deletions modules/ROOT/pages/airflow.adoc
@@ -14,7 +14,7 @@ This tutorial demonstrates how to use Airflow with Teradata Vantage.

== Prerequisites

* Ubuntu 22.x
* Access to a Teradata Vantage instance.
+
include::ROOT:partial$vantage_clearscape_analytics.adoc[]
@@ -62,7 +62,13 @@ airflow standalone
----
2. Access the Airflow UI. Visit http://localhost:8080 in the browser and log in with the admin account details shown in the terminal.


Teradata connections may be defined in Airflow in either of the following ways:

1. Using the Airflow Web UI
2. Using an environment variable

== Define a Teradata connection in the Airflow Web UI

1. Open the Admin -> Connections section of the UI. Click the Create link to create a new connection.
+
@@ -78,7 +84,46 @@ image::{dir}/airflow-newconnection.png[Airflow New Connection, width=75%]
* Password (required): Specify the password to connect.
* Click on Test and Save.

== Define a Teradata connection in an environment variable

Airflow connections may be defined in environment variables in either of the following formats:

1. JSON format
2. URI format

NOTE: The naming convention is AIRFLOW_CONN_{CONN_ID}, all uppercase (note the single underscores surrounding CONN).
So if your connection ID is `teradata_conn_id`, the variable name should be `AIRFLOW_CONN_TERADATA_CONN_ID`.


== JSON format example


[source, bash]
----
export AIRFLOW_CONN_TERADATA_CONN_ID='{
    "conn_type": "teradata",
    "login": "teradata_user",
    "password": "my-password",
    "host": "my-host",
    "schema": "my-schema",
    "extra": {
        "tmode": "TERA",
        "sslmode": "verify-ca"
    }
}'
----

== URI format example


[source, bash]
----
export AIRFLOW_CONN_TERADATA_CONN_ID='teradata://teradata_user:my-password@my-host/my-schema?tmode=TERA&sslmode=verify-ca'
----
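
To confirm that Airflow resolves a connection defined through an environment variable, a quick check with Airflow's `BaseHook` can help. This is a minimal sketch, assuming the variable above is exported in the environment where Airflow runs and that the connection ID is `teradata_conn_id`:

[source, python]
----
# Minimal check that Airflow resolves the connection defined in AIRFLOW_CONN_TERADATA_CONN_ID.
from airflow.hooks.base import BaseHook

conn = BaseHook.get_connection("teradata_conn_id")
print(conn.conn_type, conn.host, conn.login, conn.schema, conn.extra_dejson)
----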

Refer to https://airflow.apache.org/docs/apache-airflow-providers-teradata/stable/connections/teradata.html[Teradata Hook] for detailed information on Teradata connections in Airflow.

== Define a DAG in Airflow

@@ -135,7 +180,7 @@ with DAG(

== Summary

This tutorial demonstrated how to use Airflow and the Airflow Teradata provider with a Teradata Vantage instance. The example DAG provided creates the `my_users` table in the Teradata Vantage instance defined in the Airflow connection.

== Further reading
* link:https://airflow.apache.org/docs/apache-airflow/stable/start.html[airflow documentation]
@@ -0,0 +1,287 @@
= Ingest and Catalog Data from Teradata Vantage to Amazon S3 with AWS Glue Scripts
:experimental:
:page-author: Daniel Herrera
:page-email: [email protected]
:page-revdate: March 18, 2024
:description: Ingest and catalog data from Teradata Vantage to Amazon S3
:keywords: data warehouses, object storage, teradata, vantage, cloud data platform, data engineering, enterprise analytics, aws glue, aws lake formation, aws glue catalog
:dir: ingest-catalog-data-teradata-s3-with-glue

== Overview
This quickstart details the process of ingesting and cataloging data from Teradata Vantage to Amazon S3 with AWS Glue.

TIP: For ingesting data into Amazon S3 when cataloging is not a requirement, consider https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Data-Manipulation-Language/Working-with-External-Data/WRITE_NOS[Teradata Write NOS capabilities].

== Prerequisites
* Access to an https://aws.amazon.com[Amazon AWS account]
* Access to a Teradata Vantage instance
+
include::ROOT:partial$vantage_clearscape_analytics.adoc[]
* A database https://quickstarts.teradata.com/other-integrations/configure-a-teradata-vantage-connection-in-dbeaver.html[client] to send queries for loading the test data

== Loading of test data
* In your favorite database client, run the following queries:
+
[source, teradata-sql]
----
CREATE DATABASE teddy_retailers_inventory
AS PERMANENT = 110e6;

CREATE TABLE teddy_retailers_inventory.source_catalog AS
(
  SELECT product_id, product_name, product_category, price_cents
  FROM (
    LOCATION='/s3/dev-rel-demos.s3.amazonaws.com/demo-datamesh/source_products.csv') as products
) WITH DATA;

CREATE TABLE teddy_retailers_inventory.source_stock AS
(
  SELECT entry_id, product_id, product_quantity, purchase_price_cents, entry_date
  FROM (
    LOCATION='/s3/dev-rel-demos.s3.amazonaws.com/demo-datamesh/source_stock.csv') as stock
) WITH DATA;
----
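
To verify that the test data loaded correctly, you can also query the new tables from Python. This is a minimal sketch, assuming the `teradatasql` driver is installed (`pip install teradatasql`) and substituting your own host and password:

[source, python]
----
# Count the rows loaded into the source tables (hypothetical host and credentials).
import teradatasql

with teradatasql.connect(host="<your-host>", user="demo_user", password="<your-password>") as con:
    with con.cursor() as cur:
        for table in ("source_catalog", "source_stock"):
            cur.execute(f"SELECT COUNT(*) FROM teddy_retailers_inventory.{table}")
            print(table, cur.fetchone()[0])
----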

== Amazon AWS setup
In this section, we will cover in detail each of the steps below:

* Create an Amazon S3 bucket to ingest data
* Create an AWS Glue Catalog Database for storing metadata
* Store Teradata Vantage credentials in AWS Secrets Manager
* Create an AWS Glue Service Role to assign to ETL jobs
* Create a connection to a Teradata Vantage Instance in AWS Glue
* Create an AWS Glue Job
* Draft a script for automated ingestion and cataloging of Teradata Vantage data into Amazon S3

== Create an Amazon S3 Bucket to Ingest Data
* In Amazon S3, select `Create bucket`.
+
image::{dir}/Buckets-1.PNG[create bucket,align="center",width=80%]
* Assign a name to your bucket and take note of it.
+
image::{dir}/Buckets-2.PNG[name bucket,align="center",width=80%]
* Leave all settings at their default values.
* Click on `Create bucket`.
+
image::{dir}/Buckets-3.PNG[save bucket,align="center",width=80%]
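
If you prefer to script this step, the bucket can also be created with boto3. This is a sketch under assumptions: the bucket name and region below are placeholders, substitute your own.

[source, python]
----
# Hypothetical example: create the ingestion bucket with boto3 instead of the console.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="<your-bucket-name>")  # bucket names must be globally unique
# For regions other than us-east-1, add:
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"}
----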

== Create an AWS Glue Catalog Database for Storing Metadata

* In AWS Glue, select Data catalog, Databases.
* Click on `Add database`.
+
image::{dir}/Cat-1.PNG[add database,align="center",width=80%]
* Define a database name and click on `Create database`.
+
image::{dir}/Cat-2.PNG[add database name,align="center",width=80%]

== Store Teradata Vantage credentials in AWS Secrets Manager

* In AWS Secrets Manager, select `Create new secret`.
+
image::{dir}/secret-1.PNG[create secret,align="center",width=80%]
* The secret should be an `Other type of secret` with the following keys and values according to your Teradata Vantage Instance:
** USER
** PASSWORD
+
TIP: In the case of ClearScape Analytics Experience, the user is always `demo_user`, and the password is the one you defined when creating your ClearScape Analytics Experience environment.
+
image::{dir}/secret-2.PNG[secret values,align="center",width=80%]
* Assign a name to the secret.
* The rest of the steps can be left with the default values.
* Create the secret.
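
The secret can also be created programmatically. A minimal boto3 sketch, assuming a hypothetical secret name and placeholder credentials:

[source, python]
----
# Hypothetical example: store the Teradata Vantage credentials as a secret with boto3.
import boto3
import json

secrets = boto3.client("secretsmanager", region_name="us-east-1")
secrets.create_secret(
    Name="teradata-vantage-credentials",  # placeholder secret name
    SecretString=json.dumps({"USER": "demo_user", "PASSWORD": "<your-password>"}),
)
----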

== Create an AWS Glue Service Role to Assign to ETL Jobs
The role you create should have the typical permissions of a Glue service role, plus permission to read the secret and access the S3 bucket you've created.

* In AWS, go to the IAM service.
* Under Access Management, select `Roles`.
* In roles, click on `Create role`.
+
image::{dir}/Role-1.PNG[create role,align="center",width=80%]
* In select trusted entity, select `AWS service` and pick `Glue` from the dropdown.
+
image::{dir}/Role-2.PNG[role type,align="center",width=80%]
* In add permissions:
** Search for `AWSGlueServiceRole`.
** Click the related checkbox.
** Search for `SecretsManagerReadWrite`.
** Click the related checkbox.
* In Name, review, and create:
** Define a name for your role.
+
image::{dir}/Role-3.PNG[name role,align="center",width=80%]
* Click on `Create role`.
* Return to Access Management, Roles, and search for the role you've just created.
* Select your role.
* Click on `Add permissions`, then `Create inline policy`.
* Click on `JSON`.
* In the Policy editor, paste the JSON object below, substituting the name of the bucket you've created.
+
[source, json]
----
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FullAccessToSpecificBucket",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/*"
      ]
    }
  ]
}
----
* Click `Next`.
+
image::{dir}/Role-4.PNG[inline policy,align="center",width=80%]
* Assign a name to your policy.
* Click on `Create policy`.
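
For reference, the role, managed policies, and inline bucket policy can also be created with boto3. This is a sketch under assumptions: the role and policy names are placeholders, and the inline policy reuses the JSON shown above.

[source, python]
----
# Hypothetical example: create the Glue service role and attach the required policies with boto3.
import boto3
import json

iam = boto3.client("iam")
role_name = "glue-teradata-ingest-role"  # placeholder role name

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Principal": {"Service": "glue.amazonaws.com"}, "Action": "sts:AssumeRole"}
    ],
}
iam.create_role(RoleName=role_name, AssumeRolePolicyDocument=json.dumps(trust_policy))

# Managed policies selected in the console steps above
iam.attach_role_policy(RoleName=role_name, PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole")
iam.attach_role_policy(RoleName=role_name, PolicyArn="arn:aws:iam::aws:policy/SecretsManagerReadWrite")

# Inline policy granting access to the ingestion bucket
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "FullAccessToSpecificBucket",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": ["arn:aws:s3:::<bucket-name>", "arn:aws:s3:::<bucket-name>/*"],
        }
    ],
}
iam.put_role_policy(RoleName=role_name, PolicyName="FullAccessToSpecificBucket", PolicyDocument=json.dumps(bucket_policy))
----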

== Create a connection to a Teradata Vantage Instance in AWS Glue

* In AWS Glue, select `Data connections`.
+
image::{dir}/Glue-1.PNG[connection,align="center",width=80%]
* Under Connectors, select `Create connection`.
* Search for and select the Teradata Vantage data source.
+
image::{dir}/Glue-2.PNG[teradata type,align="center",width=80%]
* In the dialog box, enter the URL of your Teradata Vantage instance in JDBC format.
+
TIP: In the case of ClearScape Analytics Experience, the URL has the following structure:
`jdbc:teradata://<URL Host>/DATABASE=demo_user,DBS_PORT=1025`
* Select the AWS Secret created in the previous step.
* Name your connection and finish the creation process.
+
image::{dir}/Glue-3.PNG[connection configuration,align="center",width=80%]

== Create an AWS Glue Job
* In AWS Glue, select `ETL Jobs` and click on `Script editor`.
+
image::{dir}/Glue-script-1.PNG[script editor creation,align="center",width=80%]
* Select `Spark` as the engine and choose to start fresh.
+
image::{dir}/Glue-script-2.PNG[script editor type,align="center",width=80%]

== Draft a script for automated ingestion and cataloging of Teradata Vantage data into Amazon S3

* Copy the following script into the editor.
** The script requires the following modifications:
*** Substitute the name of your S3 bucket.
*** Substitute the name of your Glue catalog database.
*** If you are not following the example in the guide, modify the database name and the tables to be ingested and cataloged.
*** For cataloging purposes, only the first row of each table is ingested in the example. This query can be modified to ingest the whole table or to filter selected rows.

[source, python, id="glue-script-first-run" role="emits-gtm-events"]
----
# Import section
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import SQLContext

# PySpark Config Section
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ETL Job Parameters Section
# Source database
database_name = "teddy_retailers_inventory"
# Source tables
table_names = ["source_catalog", "source_stock"]
# Target S3 Bucket
target_s3_bucket = "s3://<your-bucket-name>"
# Target catalog database
catalog_database_name = "<your-catalog-database-name>"

# Job function abstraction
def process_table(table_name, transformation_ctx_prefix, catalog_database, catalog_table_name):
    dynamic_frame = glueContext.create_dynamic_frame.from_options(
        connection_type="teradata",
        connection_options={
            "dbtable": table_name,
            "connectionName": "Teradata connection default",
            "query": f"SELECT TOP 1 * FROM {table_name}",  # Modify to ingest the full table or rows that fulfill a specific condition
        },
        transformation_ctx=transformation_ctx_prefix + "_read",
    )
    s3_sink = glueContext.getSink(
        path=target_s3_bucket,
        connection_type="s3",
        updateBehavior="UPDATE_IN_DATABASE",
        partitionKeys=[],
        compression="snappy",
        enableUpdateCatalog=True,
        transformation_ctx=transformation_ctx_prefix + "_s3",
    )
    # Dynamically set catalog table name based on function parameter
    s3_sink.setCatalogInfo(
        catalogDatabase=catalog_database, catalogTableName=catalog_table_name
    )
    s3_sink.setFormat("csv")
    s3_sink.writeFrame(dynamic_frame)

# Job execution section
for table_name in table_names:
    full_table_name = f"{database_name}.{table_name}"
    transformation_ctx_prefix = f"{database_name}_{table_name}"
    catalog_table_name = f"{table_name}_catalog"
    # Call the process_table function for each table
    process_table(full_table_name, transformation_ctx_prefix, catalog_database_name, catalog_table_name)

job.commit()
----

* Assign a name to your script
+
image::{dir}/Glue-script-3.PNG[script in editor,align="center",width=80%]

* In Job details, Basic properties:
** Select the IAM role you created for the ETL job.
** For testing, select "2" as the Requested number of workers; this is the minimum allowed.
+
image::{dir}/Glue-script-4.PNG[script configurations,align="center",width=80%]
** In `Advanced properties`, `Connections` select your connection to Teradata Vantage.
+
TIP: The connection created must be referenced twice: once in the job configuration and once in the script itself.
+
image::{dir}/Glue-script-5.PNG[script configuration connection,align="center",width=80%]
* Click on `Save`.
* Click on `Run`.
** The ETL job takes a couple of minutes to complete; most of this time is spent starting the Spark cluster.
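
If you want to trigger and monitor the job outside the console, a boto3 sketch follows; `<your-job-name>` is a placeholder for the name you assigned to the script.

[source, python]
----
# Hypothetical example: start the Glue job and poll its status with boto3.
import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")
run = glue.start_job_run(JobName="<your-job-name>")
while True:
    state = glue.get_job_run(JobName="<your-job-name>", RunId=run["JobRunId"])["JobRun"]["JobRunState"]
    print(state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
----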

== Checking the Results

* After the job is finished:
** Go to Data Catalog, Databases.
** Click on the catalog database you created.
** In this location, you will see the tables extracted and cataloged through your Glue ETL job.
+
image::{dir}/Results.PNG[result tables,align="center",width=80%]

* All ingested tables are also present as compressed files in S3. These files are rarely queried directly; services such as Amazon Athena can query them using the catalog metadata.
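
The catalog entries can also be verified programmatically. A minimal boto3 sketch, assuming `<your-catalog-database-name>` is the Glue catalog database created earlier:

[source, python]
----
# Hypothetical example: list the tables cataloged by the ETL job and where their data lives in S3.
import boto3

glue = boto3.client("glue", region_name="us-east-1")
for table in glue.get_tables(DatabaseName="<your-catalog-database-name>")["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])
----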

== Summary

In this quickstart, we learned how to ingest and catalog data from Teradata Vantage to Amazon S3 with AWS Glue scripts.

== Further reading
* https://quickstarts.teradata.com/cloud-guides/integrate-teradata-vantage-with-google-cloud-data-catalog.html[Integrate Teradata Vantage with Google Cloud Data Catalog]

include::ROOT:partial$community_link.adoc[]
@@ -115,6 +115,7 @@ Use your connection and select a database. e.g "aoa_byom_models"
In this quick start, we have learned how to follow the full lifecycle of BYOM models in ModelOps, how to deploy them into Vantage, how to schedule batch scoring or test RESTful and on-demand scoring, and how to start monitoring Data Drift and Model Quality metrics.

== Further reading
* https://docs.teradata.com/search/documents?query=ModelOps&sort=last_update&virtual-field=title_only&content-lang=[+++ModelOps documentation+++].

include::ROOT:partial$community_link.adoc[]

@@ -196,6 +196,6 @@ Use your connection and select a database. e.g "aoa_byom_models"
In this quick start, we have learned how to follow the full lifecycle of Git models in ModelOps, how to deploy them into Vantage or into Docker containers for edge deployments, how to schedule batch scoring or test RESTful and on-demand scoring, and how to start monitoring Data Drift and Model Quality metrics.

== Further reading
* https://docs.teradata.com/search/documents?query=ModelOps&sort=last_update&virtual-field=title_only&content-lang=[+++ModelOps documentation+++].

include::ROOT:partial$community_link.adoc[]
