diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc
index 53738aa4d..9d00a9525 100644
--- a/modules/ROOT/nav.adoc
+++ b/modules/ROOT/nav.adoc
@@ -55,6 +55,7 @@
** xref::perform-time-series-analysis-using-teradata-vantage.adoc[Perform time series analysis]
** xref:modelops:deploy-and-monitor-machine-learning-models-with-teradata-modelops-and-byom.adoc[]
** xref:modelops:deploy-and-monitor-machine-learning-models-with-teradata-modelops-and-git.adoc[]
+** xref:modelops:execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution.adoc[]
** xref:cloud-guides:sagemaker-with-teradata-vantage.adoc[]
** xref:cloud-guides:use-teradata-vantage-with-azure-machine-learning-studio.adoc[]
** xref:cloud-guides:integrate-teradata-jupyter-extensions-with-google-vertex-ai.adoc[]
diff --git a/modules/ROOT/pages/airflow.adoc b/modules/ROOT/pages/airflow.adoc
index c035f3b31..1c1b456bf 100644
--- a/modules/ROOT/pages/airflow.adoc
+++ b/modules/ROOT/pages/airflow.adoc
@@ -14,7 +14,7 @@ This tutorial demonstrates how to use airflow with Teradata Vantage. Airflow wil
== Prerequisites
-* Ubuntu 22.x 
+* Ubuntu 22.x
* Access to a Teradata Vantage instance.
+
include::ROOT:partial$vantage_clearscape_analytics.adoc[]
@@ -62,7 +62,13 @@ airflow standalone
----
2. Access the Airflow UI. Visit https://localhost:8080 in the browser and log in with the admin account details shown in the terminal.
-== Define a Teradata connection in Airflow UI
+
+Teradata connections may be defined in Airflow in the following ways:
+
+1. Using the Airflow Web UI
+2. Using an environment variable
+
+== Define a Teradata connection in the Airflow Web UI
1. Open the Admin -> Connections section of the UI. Click the Create link to create a new connection.
+
@@ -78,7 +84,46 @@ image::{dir}/airflow-newconnection.png[Airflow New Connection, width=75%]
* Password (required): Specify the password to connect.
* Click on Test and Save.
+== Define a Teradata connection using an environment variable
+Airflow connections may be defined in environment variables in either of the following formats:
+
+1. JSON format
+2. URI format
+
+NOTE: The naming convention is AIRFLOW_CONN_{CONN_ID}, all uppercase (note the single underscores surrounding CONN).
+So if your connection ID is teradata_conn_id, then the variable name should be AIRFLOW_CONN_TERADATA_CONN_ID.
+
+
+== JSON format example
+
+
+[source, bash]
+----
+export AIRFLOW_CONN_TERADATA_CONN_ID='{
+    "conn_type": "teradata",
+    "login": "teradata_user",
+    "password": "my-password",
+    "host": "my-host",
+    "schema": "my-schema",
+    "extra": {
+        "tmode": "TERA",
+        "sslmode": "verify-ca"
+    }
+}'
+
+----
+
+== URI format example
+
+
+[source, bash]
+----
+export AIRFLOW_CONN_TERADATA_CONN_ID='teradata://teradata_user:my-password@my-host/my-schema?tmode=TERA&sslmode=verify-ca'
+----
+Refer to the https://airflow.apache.org/docs/apache-airflow-providers-teradata/stable/connections/teradata.html[Teradata Hook] documentation for detailed information on Teradata connections in Airflow.
== Define a DAG in Airflow
@@ -135,7 +180,7 @@ with DAG(
== Summary
-This tutorial demonstrated how to use Airflow and the Airflow Teradata provider with a Teradata Vantage instance. The example DAG provided creates `my_users` table in the Teradata Vantage instance defined in Connection UI. 
+This tutorial demonstrated how to use Airflow and the Airflow Teradata provider with a Teradata Vantage instance. The example DAG provided creates the `my_users` table in the Teradata Vantage instance defined in the Airflow connection.
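+
+TIP: Connections defined through AIRFLOW_CONN_* environment variables are not listed in the Airflow Web UI. If the example DAG cannot find the connection, you can verify that Airflow resolves it from the terminal. The snippet below is a minimal check, assuming an Airflow 2.x CLI and the `teradata_conn_id` connection ID used in the examples above:
+
+[source, bash]
+----
+# Environment-variable connections are resolved at runtime and do not appear
+# in the Web UI; the CLI can still look them up.
+export AIRFLOW_CONN_TERADATA_CONN_ID='teradata://teradata_user:my-password@my-host/my-schema'
+airflow connections get teradata_conn_id
+----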
== Further reading * link:https://airflow.apache.org/docs/apache-airflow/stable/start.html[airflow documentation] diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Buckets-1.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Buckets-1.PNG new file mode 100644 index 000000000..76345834b Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Buckets-1.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Buckets-2.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Buckets-2.PNG new file mode 100644 index 000000000..88b8009e4 Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Buckets-2.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Buckets-3.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Buckets-3.PNG new file mode 100644 index 000000000..47ea932a0 Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Buckets-3.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Cat-1.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Cat-1.PNG new file mode 100644 index 000000000..ee9c0a0e9 Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Cat-1.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Cat-2.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Cat-2.PNG new file mode 100644 index 000000000..43859d5b8 Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Cat-2.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-1.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-1.PNG new file mode 100644 index 000000000..0a58d302f Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-1.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-2.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-2.PNG new file mode 100644 index 000000000..365acc592 Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-2.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-3.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-3.PNG new file mode 100644 index 000000000..ec5fb8e11 Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-3.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-script-1.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-script-1.PNG new file mode 100644 index 000000000..f306fa191 Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-script-1.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-script-2.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-script-2.PNG new file mode 100644 index 000000000..70b14bc60 Binary files /dev/null and 
b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-script-2.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-script-3.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-script-3.PNG new file mode 100644 index 000000000..eeaeb1836 Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-script-3.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-script-4.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-script-4.PNG new file mode 100644 index 000000000..920d304df Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-script-4.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-script-5.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-script-5.PNG new file mode 100644 index 000000000..a291096bd Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Glue-script-5.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Results.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Results.PNG new file mode 100644 index 000000000..014a9e0fd Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Results.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Role-1.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Role-1.PNG new file mode 100644 index 000000000..d689b918f Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Role-1.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Role-2.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Role-2.PNG new file mode 100644 index 000000000..21e9c1551 Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Role-2.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Role-3.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Role-3.PNG new file mode 100644 index 000000000..c7afa089e Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Role-3.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Role-4.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Role-4.PNG new file mode 100644 index 000000000..c7d066d1c Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/Role-4.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/secret-1.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/secret-1.PNG new file mode 100644 index 000000000..acf39c710 Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/secret-1.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/secret-2.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/secret-2.PNG new file mode 100644 index 000000000..066121b43 Binary files /dev/null and 
b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/secret-2.PNG differ diff --git a/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/secret-3.PNG b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/secret-3.PNG new file mode 100644 index 000000000..1eb88d551 Binary files /dev/null and b/modules/cloud-guides/images/ingest-catalog-data-teradata-s3-with-glue/secret-3.PNG differ diff --git a/modules/cloud-guides/pages/ingest-catalog-data-teradata-s3-with-glue.adoc b/modules/cloud-guides/pages/ingest-catalog-data-teradata-s3-with-glue.adoc new file mode 100644 index 000000000..464575788 --- /dev/null +++ b/modules/cloud-guides/pages/ingest-catalog-data-teradata-s3-with-glue.adoc @@ -0,0 +1,287 @@ += Ingest and Catalog Data from Teradata Vantage to Amazon S3 with AWS Glue Scripts +:experimental: +:page-author: Daniel Herrera +:page-email: daniel.herrera2@teradata.com +:page-revdate: March 18, 2024 +:description: Ingest and catalog data from Teradata Vantage to Amazon S3 +:keywords: data warehouses, object storage, teradata, vantage, cloud data platform, data engineering, enterprise analytics, aws glue, aws lake formation, aws glue catalog +:dir: ingest-catalog-data-teradata-s3-with-glue + +== Overview +This quickstart details the process of ingesting and cataloging data from Teradata Vantage to Amazon S3 with AWS Glue. + +TIP: For ingesting data to Amazon S3 when cataloging is not a requirement consider https://docs.teradata.com/r/Enterprise_IntelliFlex_VMware/SQL-Data-Manipulation-Language/Working-with-External-Data/WRITE_NOS[Teradata Write NOS capabilities.] + +== Prerequisites +* Access to an https://aws.amazon.com[Amazon AWS account] +* Access to a Teradata Vantage instance ++ +include::ROOT:partial$vantage_clearscape_analytics.adoc[] +* A database https://quickstarts.teradata.com/other-integrations/configure-a-teradata-vantage-connection-in-dbeaver.html[client] to send queries for loading the test data + +== Loading of test data +* In your favorite database client run the following queries ++ +[source, teradata-sql] +---- +CREATE DATABASE teddy_retailers_inventory +AS PERMANENT = 110e6; + +CREATE TABLE teddy_retailers_inventory.source_catalog AS +( + SELECT product_id, product_name, product_category, price_cents + FROM ( + LOCATION='/s3/dev-rel-demos.s3.amazonaws.com/demo-datamesh/source_products.csv') as products +) WITH DATA; + +CREATE TABLE teddy_retailers_inventory.source_stock AS +( + SELECT entry_id, product_id, product_quantity, purchase_price_cents, entry_date + FROM ( + LOCATION='/s3/dev-rel-demos.s3.amazonaws.com/demo-datamesh/source_stock.csv') as stock +) WITH DATA; +---- + +== Amazon AWS setup +In this section, we will cover in detail each of the steps below: + +* Create an Amazon S3 bucket to ingest data +* Create an AWS Glue Catalog Database for storing metadata +* Store Teradata Vantage credentials in AWS Secrets Manager +* Create an AWS Glue Service Role to assign to ETL jobs +* Create a connection to a Teradata Vantage Instance in AWS Glue +* Create an AWS Glue Job +* Draft a script for automated ingestion and cataloging of Teradata Vantage data into Amazon S3 + +== Create an Amazon S3 Bucket to Ingest Data +* In Amazon S3, select `Create bucket`. ++ +image::{dir}/Buckets-1.PNG[create bucket,align="center",width=80%] +* Assign a name to your bucket and take note of it. ++ +image::{dir}/Buckets-2.PNG[name bucket,align="center",width=80%] +* Leave all settings at their default values. +* Click on `Create bucket`. 
++
+image::{dir}/Buckets-3.PNG[save bucket,align="center",width=80%]
+
+== Create an AWS Glue Catalog Database for Storing Metadata
+
+* In AWS Glue, select Data catalog, Databases.
+* Click on `Add database`.
++
+image::{dir}/Cat-1.PNG[add database,align="center",width=80%]
+* Define a database name and click on `Create database`.
++
+image::{dir}/Cat-2.PNG[add database name,align="center",width=80%]
+
+== Store Teradata Vantage credentials in AWS Secrets Manager
+
+* In AWS Secrets Manager, select `Create new secret`.
++
+image::{dir}/secret-1.PNG[create secret,align="center",width=80%]
+* The secret should be an `Other type of secret` with the following keys and values, according to your Teradata Vantage instance:
+** USER
+** PASSWORD
++
+TIP: In the case of ClearScape Analytics Experience, the user is always "demo_user", and the password is the one you defined when creating your ClearScape Analytics Experience environment.
++
+image::{dir}/secret-2.PNG[secret values,align="center",width=80%]
+* Assign a name to the secret.
+* The rest of the steps can be left with the default values.
+* Create the secret.
+
+== Create an AWS Glue Service Role to Assign to ETL Jobs
+The role you create should have the typical permissions of a Glue service role, plus read access to the secret and the S3 bucket you've created.
+
+* In AWS, go to the IAM service.
+* Under Access Management, select `Roles`.
+* In Roles, click on `Create role`.
++
+image::{dir}/Role-1.PNG[create role,align="center",width=80%]
+* In Select trusted entity, select `AWS service` and pick `Glue` from the dropdown.
++
+image::{dir}/Role-2.PNG[role type,align="center",width=80%]
+* In Add permissions:
+** Search for `AWSGlueServiceRole`.
+** Click the related checkbox.
+** Search for `SecretsManagerReadWrite`.
+** Click the related checkbox.
+* In Name, review, and create:
+** Define a name for your role.
++
+image::{dir}/Role-3.PNG[name role,align="center",width=80%]
+* Click on `Create role`.
+* Return to Access Management, Roles, and search for the role you've just created.
+* Select your role.
+* Click on `Add permissions`, then `Create inline policy`.
+* Click on `JSON`.
+* In the Policy editor, paste the JSON object below, substituting the name of the bucket you've created.
+[source,json]
+----
+{
+    "Version": "2012-10-17",
+    "Statement": [
+        {
+            "Sid": "FullAccessToSpecificBucket",
+            "Effect": "Allow",
+            "Action": "s3:*",
+            "Resource": [
+                "arn:aws:s3:::",
+                "arn:aws:s3:::/*"
+            ]
+        }
+    ]
+}
+----
+* Click `Next`.
++
+image::{dir}/Role-4.PNG[inline policy,align="center",width=80%]
+* Assign a name to your policy.
+* Click on `Create policy`.
+
+== Create a connection to a Teradata Vantage Instance in AWS Glue
+
+* In AWS Glue, select `Data connections`.
++
+image::{dir}/Glue-1.PNG[connection,align="center",width=80%]
+* Under Connectors, select `Create connection`.
+* Search for and select the Teradata Vantage data source.
++
+image::{dir}/Glue-2.PNG[teradata type,align="center",width=80%]
+* In the dialog box, enter the URL of your Teradata Vantage instance in JDBC format.
++
+TIP: In the case of ClearScape Analytics Experience, the URL has the following structure:
+`jdbc:teradata:///DATABASE=demo_user,DBS_PORT=1025`
+* Select the AWS Secret created in the previous step.
+* Name your connection and finish the creation process.
++
+image::{dir}/Glue-3.PNG[connection configuration,align="center",width=80%]
+
+== Create an AWS Glue Job
+* In AWS Glue, select `ETL Jobs` and click on `Script editor`.
++ +image::{dir}/Glue-script-1.PNG[script editor creation,align="center",width=80%] +* Select `Spark` as the engine and choose to start fresh. ++ +image::{dir}/Glue-script-2.PNG[script editor type,align="center",width=80%] + +== Draft a script for automated ingestion and cataloging of Teradata Vantage data into Amazon S3 + +* Copy the following script into the editor. +** The script requires the following modifications: +*** Substitute the name of your S3 bucket. +*** Substitute the name of your Glue catalog database. +*** If you are not following the example in the guide, modify the database name and the tables to be ingested and cataloged. +*** For cataloging purposes, only the first row of each table is ingested in the example. This query can be modified to ingest the whole table or to filter selected rows. + +[source, python, id="glue-script-first-run" role="emits-gtm-events"] +---- +# Import section +import sys +from awsglue.transforms import * +from awsglue.utils import getResolvedOptions +from pyspark.context import SparkContext +from awsglue.context import GlueContext +from awsglue.job import Job +from pyspark.sql import SQLContext + +# PySpark Config Section +args = getResolvedOptions(sys.argv, ["JOB_NAME"]) +sc = SparkContext() +glueContext = GlueContext(sc) +spark = glueContext.spark_session +job = Job(glueContext) +job.init(args["JOB_NAME"], args) + +#ETL Job Parameters Section +# Source database +database_name = "teddy_retailers_inventory" + +# Source tables +table_names = ["source_catalog","source_stock"] + +# Target S3 Bucket +target_s3_bucket = "s3://" + +#Target catalog database +catalog_database_name = "" + + +# Job function abstraction +def process_table(table_name, transformation_ctx_prefix, catalog_database, catalog_table_name): + dynamic_frame = glueContext.create_dynamic_frame.from_options( + connection_type="teradata", + connection_options={ + "dbtable": table_name, + "connectionName": "Teradata connection default", + "query": f"SELECT TOP 1 * FROM {table_name}", # This line can be modified to ingest the full table or rows that fulfill an specific condition + }, + transformation_ctx=transformation_ctx_prefix + "_read", + ) + + s3_sink = glueContext.getSink( + path=target_s3_bucket, + connection_type="s3", + updateBehavior="UPDATE_IN_DATABASE", + partitionKeys=[], + compression="snappy", + enableUpdateCatalog=True, + transformation_ctx=transformation_ctx_prefix + "_s3", + ) + # Dynamically set catalog table name based on function parameter + s3_sink.setCatalogInfo( + catalogDatabase=catalog_database, catalogTableName=catalog_table_name + ) + s3_sink.setFormat("csv") + s3_sink.writeFrame(dynamic_frame) + + +# Job execution section +for table_name in table_names: + full_table_name = f"{database_name}.{table_name}" + transformation_ctx_prefix = f"{database_name}_{table_name}" + catalog_table_name = f"{table_name}_catalog" + # Call your process_table function for each table + process_table(full_table_name, transformation_ctx_prefix, catalog_database_name, catalog_table_name) + +job.commit() +---- + +* Assign a name to your script ++ +image::{dir}/Glue-script-3.PNG[script in editor,align="center",width=80%] + +* In Job details, Basic properties: +** Select the IAM role you created for the ETL job. +** For testing, select "2" as the Requested number of workers, this is the minimum allowed. ++ +image::{dir}/Glue-script-4.PNG[script configurations,align="center",width=80%] +** In `Advanced properties`, `Connections` select your connection to Teradata Vantage. 
++ +TIP: The connection created must be referenced twice, once in the job configuration, once in the script itself. ++ +image::{dir}/Glue-script-5.PNG[script configuration connection,align="center",width=80%] +* Click on `Save`. +* Click on `Run`. +** The ETL job takes a couple of minutes to complete, most of this time is related to starting the Spark cluster. + +== Checking the Results + +* After the job is finished: +** Go to Data Catalog, Databases. +** Click on the catalog database you created. +** In this location, you will see the tables extracted and cataloged through your Glue ETL job. ++ +image::{dir}/Results.PNG[result tables,align="center",width=80%] + +* All tables ingested are also present as compressed files in S3. Rarely, these files would be queried directly. Services such as AWS Athena can be used to query the files relying on the catalog metadata. + +== Summary + +In this quick start, we learned how to ingest and catalog data in Teradata Vantage to Amazon S3 with AWS Glue Scripts. + +== Further reading +* https://quickstarts.teradata.com/cloud-guides/integrate-teradata-vantage-with-google-cloud-data-catalog.html[Integrate Teradata Vantage with Google Cloud Data Catalog] + +include::ROOT:partial$community_link.adoc[] \ No newline at end of file diff --git a/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/DAG_graph.png b/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/DAG_graph.png new file mode 100644 index 000000000..69dc4cddf Binary files /dev/null and b/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/DAG_graph.png differ diff --git a/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/DAGs.png b/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/DAGs.png new file mode 100644 index 000000000..6f17a8ea2 Binary files /dev/null and b/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/DAGs.png differ diff --git a/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/LoginPage.png b/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/LoginPage.png new file mode 100644 index 000000000..5ba39af71 Binary files /dev/null and b/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/LoginPage.png differ diff --git a/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/Workflow.png b/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/Workflow.png new file mode 100644 index 000000000..8d68be2d3 Binary files /dev/null and b/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/Workflow.png differ diff --git a/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/modelOps1.png b/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/modelOps1.png new file mode 100644 index 000000000..7d8fb964c Binary files /dev/null and 
b/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/modelOps1.png differ diff --git a/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/successTasks.png b/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/successTasks.png new file mode 100644 index 000000000..c2e0c8c6f Binary files /dev/null and b/modules/modelops/images/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution/successTasks.png differ diff --git a/modules/modelops/pages/deploy-and-monitor-machine-learning-models-with-teradata-modelops-and-byom.adoc b/modules/modelops/pages/deploy-and-monitor-machine-learning-models-with-teradata-modelops-and-byom.adoc index cb799cf97..6b6d0edcf 100644 --- a/modules/modelops/pages/deploy-and-monitor-machine-learning-models-with-teradata-modelops-and-byom.adoc +++ b/modules/modelops/pages/deploy-and-monitor-machine-learning-models-with-teradata-modelops-and-byom.adoc @@ -115,6 +115,7 @@ Use your connection and select a database. e.g "aoa_byom_models" In this quick start we have learned how to follow a full lifecycle of BYOM models into ModelOps and how to deploy it into Vantage. Then how we can schedule a batch scoring or test restful or on-demand scorings and start monitoring on Data Drift and Model Quality metrics. == Further reading -* link:https://docs.teradata.com/search/documents?query=ModelOps&sort=last_update&virtual-field=title_only&content-lang= +* https://docs.teradata.com/search/documents?query=ModelOps&sort=last_update&virtual-field=title_only&content-lang=[+++ModelOps documentation+++]. include::ROOT:partial$community_link.adoc[] + diff --git a/modules/modelops/pages/deploy-and-monitor-machine-learning-models-with-teradata-modelops-and-git.adoc b/modules/modelops/pages/deploy-and-monitor-machine-learning-models-with-teradata-modelops-and-git.adoc index 999c6dd1d..32b35fe2d 100644 --- a/modules/modelops/pages/deploy-and-monitor-machine-learning-models-with-teradata-modelops-and-git.adoc +++ b/modules/modelops/pages/deploy-and-monitor-machine-learning-models-with-teradata-modelops-and-git.adoc @@ -196,6 +196,6 @@ Use your connection and select a database. e.g "aoa_byom_models" In this quick start we have learned how to follow a full lifecycle of GIT models into ModelOps and how to deploy it into Vantage or into Docker containers for Edge deployments. Then how we can schedule a batch scoring or test restful or on-demand scorings and start monitoring on Data Drift and Model Quality metrics. == Further reading -* link:https://docs.teradata.com/search/documents?query=ModelOps&sort=last_update&virtual-field=title_only&content-lang= +* https://docs.teradata.com/search/documents?query=ModelOps&sort=last_update&virtual-field=title_only&content-lang=[+++ModelOps documentation+++]. 
-include::ROOT:partial$community_link.adoc[]
+include::ROOT:partial$community_link.adoc[]
\ No newline at end of file
diff --git a/modules/modelops/pages/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution.adoc b/modules/modelops/pages/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution.adoc
new file mode 100644
index 000000000..f009b2c9a
--- /dev/null
+++ b/modules/modelops/pages/execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution.adoc
@@ -0,0 +1,624 @@
+= Execute Airflow workflows with ModelOps - Model Factory Solution Accelerator
+:experimental:
+:page-author: Tayyaba Batool
+:page-email: tayyaba.batool@teradata.com
+:page-revdate: Mar 19th, 2024
+:description: Tutorial for Model Factory Solution - Executing Airflow workflows with ClearScape Analytics ModelOps
+:keywords: modelfactory, modelops, byom, python, clearscape analytics, teradata, data warehouses, vantage, cloud data platform, machine learning, artificial intelligence, business intelligence, enterprise analytics
+:dir: execute-airflow-workflows-with-clearscape-analytics-modelops-model-factory-solution
+
+
+== Overview
+
+The purpose of the *Model Factory Solution Accelerator* of *ClearScape Analytics* is to streamline and accelerate the end-to-end process of developing, deploying, and managing machine learning models within an organization at *horizontal scale* by operationalizing *hundreds of models for a business domain in a single effort*. It leverages the scalability of in-database analytics and the openness of supporting partner model formats such as H2O or Dataiku. This unique combination enhances efficiency, scalability, and consistency across the various stages of the machine learning lifecycle in enterprise environments.
+
+By incorporating best practices, automation, and standardized workflows, the Model Factory Solution Accelerator enables teams to rapidly select the data to be used, configure the required model, ensure reproducibility, and deploy an *unlimited* number of models seamlessly into production. Ultimately, it aims to reduce the time-to-value for machine learning initiatives and promote a more structured and efficient approach to building and deploying models at scale. Here is a diagram of the automated workflow:
+
+image::{dir}/Workflow.png[Workflow, width=75%]
+
+Here are the steps to implement the Model Factory Solution Accelerator using Airflow and ClearScape Analytics ModelOps. Apache Airflow is used for the scheduling and orchestration of data pipelines or workflows, so in this tutorial we create an Airflow DAG (Directed Acyclic Graph) that is executed to automate the ModelOps lifecycle.
+
+== Prerequisites
+
+* This tutorial is implemented on a local machine using the **Visual Studio Code** IDE.
+
+To execute shell commands, you can install the VS Code extension **"Remote Development"** using the following link. This extension pack includes the WSL extension, in addition to the Remote - SSH and Dev Containers extensions, enabling you to open any folder in a container, on a remote machine, or in WSL: https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.vscode-remote-extensionpack[+++VS code marketplace+++].
+
+* Access to a Teradata Vantage instance with ClearScape Analytics (includes ModelOps)
+
+include::ROOT:partial$vantage_clearscape_analytics.adoc[]
+
+
+== Configuring Visual Studio Code and Installing Airflow with Docker Compose
+
+* Open Visual Studio Code and select the option to open a remote window. Then select Connect to WSL-Ubuntu.
+
+* Select File > Open Folder. Then select the desired folder or create a new one using this command: mkdir [folder_name]
+
+* Set the AIRFLOW_HOME environment variable. Airflow requires a home directory and uses ~/airflow by default, but you can set a different location if you prefer. The AIRFLOW_HOME environment variable is used to inform Airflow of the desired location.
+
+[source, bash, id="set Airflow Home directory", role="content-editable emits-gtm-events"]
+----
+export AIRFLOW_HOME=./[folder_name]
+----
+
+* Install apache-airflow stable version 2.8.2 from the PyPI repository:
+
+[source, bash, id="Install Airflow", role="content-editable emits-gtm-events"]
+----
+AIRFLOW_VERSION=2.8.2
+
+PYTHON_VERSION="$(python3 --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
+
+CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
+
+pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}" --default-timeout=100
+----
+
+* Install the stable version of the Airflow Teradata provider from the PyPI repository:
+
+[source, bash, id="Install Airflow Teradata", role="content-editable emits-gtm-events"]
+----
+pip install "apache-airflow-providers-teradata" --default-timeout=100
+----
+
+* Install Docker Desktop so that you can use Docker containers for running Airflow. Ensure that Docker Desktop is running.
+
+* Check the Docker version using this command:
+
+[source, bash, id="Check Docker version", role="content-editable emits-gtm-events"]
+----
+docker --version
+----
+
+Check the version of Docker Compose. Docker Compose is a tool for defining and running multi-container applications.
+
+[source, bash, id="Check Docker compose version", role="content-editable emits-gtm-events"]
+----
+docker-compose --version
+----
+
+To deploy Airflow on Docker Compose, you need to fetch docker-compose.yaml using this curl command:
+
+[source, bash, id="Fetch docker-compose yaml", role="content-editable emits-gtm-events"]
+----
+curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.8.2/docker-compose.yaml'
+----
+
+Create these folders to use later, using the following command:
+
+[source, bash, id="Create Airflow folders", role="content-editable emits-gtm-events"]
+----
+mkdir -p ./dags ./logs ./plugins ./config
+----
+
+
+== Configuring Model Factory Solution Accelerator
+
+Create a config file, createConfig.py, and set the parameters to the corresponding values, depending on which model you want to train.
+.Click to reveal the Python code
+[%collapsible]
+====
+[source, python, id="Model Factory Solution Config File", role="content-editable emits-gtm-events"]
+----
+from configparser import ConfigParser
+import os
+
+config = ConfigParser()
+
+config['MAIN'] = {
+    "projectId": "23e1df4b-b630-47a1-ab80-7ad5385fcd8d",
+    "bearerToken": os.environ['BEARER_TOKEN'],
+    "trainDatasetId": "ba39e766-2fdf-426f-ba5c-4ca3e90955fc",
+    "evaluateDatasetId": "74489d62-2af5-4402-b264-715e151a420a",
+    "datasetConnectionId": "151abf05-1914-4d38-a90d-272d850f212c",
+    "datasetTemplateId": "d8a35d98-21ce-47d0-b9f2-00d355777de1"
+}
+
+config['HYPERPARAMETERS'] = {
+    "eta": 0.2,
+    "max_depth": 6
+}
+
+config['RESOURCES'] = {
+    "memory": "500m",
+    "cpu": "0.5"
+}
+
+config['MODEL'] = {
+    "modelId": "f937b5d8-02c6-5150-80c7-1e4ff07fea31",
+    "approvalComments": "Approving this model!",
+    "cron": "@once",
+    "engineType": "DOCKER_BATCH",
+    "engine": "python-batch",
+    "dockerImage": "artifacts.td.teradata.com/tdproduct-docker-snapshot/avmo/aoa-python-base:3.9.13-1"
+}
+
+
+with open('./config/modelOpsConfig.ini', 'w') as f:
+    config.write(f)
+----
+====
+
+Now copy the Bearer token from the ModelOps user interface (Left Menu -> Your Account -> Session Details) and set it as an environment variable using the following command:
+
+[source, bash, id="Bearer token", role="content-editable emits-gtm-events"]
+----
+export BEARER_TOKEN='your_token_here'
+----
+
+Now you can execute the previously created config file, which will create a new ini file inside the config folder containing all the required parameters that will be used in the DAG creation step.
+
+[source, bash, id="Create config ini", role="content-editable emits-gtm-events"]
+----
+python3 createConfig.py
+----
+
+== Create an Airflow DAG containing the full ModelOps lifecycle
+
+Now you can create a DAG using the following Python code. Add this Python code file inside the dags folder.
This DAG contains 5 tasks of ModelOps lifecycle (i.e., Train, Evaluate, Approve, Deploy and Retire) + +.Click to reveal the Python code +[%collapsible] +==== +[source, python, id="DAG Code", role="content-editable emits-gtm-events"] +---- +import base64 +from datetime import datetime, timedelta, date +import json +import os +import time + +from airflow import DAG +from airflow.operators.python import PythonOperator + +import requests + +from configparser import ConfigParser + +# Read from Config file +config = ConfigParser() +config.read('config/modelOpsConfig.ini') + +config_main = config["MAIN"] +config_hyper_params = config["HYPERPARAMETERS"] +config_resources = config["RESOURCES"] +config_model = config["MODEL"] + +# Default args for DAG +default_args = { + 'owner': 'Tayyaba', + 'retries': 5, + 'retry_delay': timedelta(minutes=2) +} + +def get_job_status(job_id): + + # Use the fetched Job ID to check Job Status + headers_for_status = { + 'AOA-PROJECT-ID': config_main['projectid'], + 'Authorization': 'Bearer ' + config_main['bearertoken'], + } + + status_response = requests.get('https://airflow-u9usja4twtauvt3s.env.clearscape.teradata.com:8443/modelops/core/api/jobs/' + job_id + '?projection=expandJob', headers=headers_for_status) + status_json = status_response.json() + job_status = status_json.get('status') + return job_status + + +def train_model(ti): + + headers = { + 'AOA-Project-ID': config_main['projectid'], + 'Accept': 'application/json, text/plain, */*', + 'Accept-Language': 'en-US,en;q=0.9', + 'Authorization': 'Bearer ' + config_main['bearertoken'], + 'Content-Type': 'application/json', + } + + json_data = { + 'datasetId': config_main['trainDatasetId'], + 'datasetConnectionId': config_main['datasetConnectionId'], + 'modelConfigurationOverrides': { + 'hyperParameters': { + 'eta': config_hyper_params['eta'], + 'max_depth': config_hyper_params['max_depth'], + }, + }, + 'automationOverrides': { + 'resources': { + 'memory': config_resources['memory'], + 'cpu': config_resources['cpu'], + }, + 'dockerImage': config_model['dockerImage'], + }, + } + + + response = requests.post('https://airflow-u9usja4twtauvt3s.env.clearscape.teradata.com:8443/modelops/core/api/models/' + config_model['modelid'] + '/train', headers=headers, json=json_data) + + json_data = response.json() + + # Get the Training Job ID + job_id = json_data.get('id') + ti.xcom_push(key='train_job_id', value=job_id) + + job_status = get_job_status(job_id) + print("Started - Training Job - Status: ", job_status) + + while job_status != "COMPLETED": + if job_status=="ERROR": + print("The training job is terminated due to an Error") + ti.xcom_push(key='trained_model_id', value='NONE') # Setting the Trained Model Id to None here and check in next step (Evaluate) + break + elif job_status=="CANCELLED": + ti.xcom_push(key='trained_model_id', value='NONE') + print("The training job is Cancelled !!") + break + print("Job is not completed yet. Current status", job_status) + time.sleep(5) #wait 5s + job_status = get_job_status(job_id) + + # Checking Job status at the end to push the correct trained_model_id + if(job_status == "COMPLETED"): + train_model_id = json_data['metadata']['trainedModel']['id'] + ti.xcom_push(key='trained_model_id', value=train_model_id) + print('Model Trained Successfully! 
Job ID is : ', job_id, 'Trained Model Id : ', train_model_id, ' Status : ', job_status) + else: + ti.xcom_push(key='trained_model_id', value='NONE') + print("Training Job is terminated !!") + + +def evaluate_model(ti): + + trained_model_id = ti.xcom_pull(task_ids = 'task_train_model', key = 'trained_model_id') + + headers = { + 'AOA-Project-ID': config_main['projectid'], + 'Accept': 'application/json, text/plain, */*', + 'Accept-Language': 'en-US,en;q=0.9', + 'Authorization': 'Bearer ' + config_main['bearertoken'], + 'Content-Type': 'application/json', + } + + json_data = { + 'datasetId': config_main['evaluatedatasetid'], + 'datasetConnectionId': config_main['datasetConnectionId'], + 'modelConfigurationOverrides': { + 'hyperParameters': { + 'eta': config_hyper_params['eta'], + 'max_depth': config_hyper_params['max_depth'], + }, + }, + 'automationOverrides': { + 'resources': { + 'memory': config_resources['memory'], + 'cpu': config_resources['cpu'], + }, + 'dockerImage': config_model['dockerImage'], + }, + } + + if trained_model_id == 'NONE': + ti.xcom_push(key='evaluated_model_status', value='FALIED') + print("Evaluation cannot be done as the Training Job was terminated !!") + else: + response = requests.post('https://airflow-u9usja4twtauvt3s.env.clearscape.teradata.com:8443/modelops/core/api/trainedModels/' + trained_model_id + '/evaluate', headers=headers, json=json_data) + json_data = response.json() + + # Get the Evaluation Job ID + eval_job_id = json_data.get('id') + ti.xcom_push(key='evaluate_job_id', value=eval_job_id) + + job_status = get_job_status(eval_job_id) + print("Started - Job - Status: ", job_status) + + while job_status != "COMPLETED": + if job_status=="ERROR": + print("The evaluation job is terminated due to an Error") + # Set the Trained Model Id to None here and check in next step (Evaluate) + break + elif job_status=="CANCELLED": + print("The evaluation job is Cancelled !!") + break + print("Job is not completed yet. Current status", job_status) + time.sleep(5) # wait 5s + job_status = get_job_status(eval_job_id) + + # Checking Job status at the end to push the correct evaluate_job_id + if(job_status == "COMPLETED"): + ti.xcom_push(key='evaluated_model_status', value='EVALUATED') + print('Model Evaluated Successfully! 
Job ID is : ', eval_job_id, ' Status : ', job_status) + else: + ti.xcom_push(key='evaluated_model_status', value='FAILED') + print("Evaluation Job is terminated !!") + + +def approve_model(ti): + + evaluated_model_status = ti.xcom_pull(task_ids = 'task_evaluate_model', key = 'evaluated_model_status') + + if evaluated_model_status == 'FAILED': + ti.xcom_push(key='approve_model_status', value='FALIED') + print("Approval cannot be done as the Evaluation was failed !!") + else: + trained_model_id = ti.xcom_pull(task_ids = 'task_train_model', key = 'trained_model_id') + + headers = { + 'AOA-Project-ID': config_main['projectid'], + 'Accept': 'application/json, text/plain, */*', + 'Accept-Language': 'en-US,en;q=0.9', + 'Authorization': 'Bearer ' + config_main['bearertoken'], + 'Content-Type': 'application/json', + } + + json_data = { + "comments": (base64.b64encode(config_model['approvalComments'].encode()).decode()) + } + + response = requests.post('https://airflow-u9usja4twtauvt3s.env.clearscape.teradata.com:8443/modelops/core/api/trainedModels/' + trained_model_id + '/approve' , headers=headers, json=json_data) + response_json = response.json() + approval_status = response_json['status'] + if(approval_status == 'APPROVED'): + ti.xcom_push(key='approve_model_status', value='EVALUATED') + print('Model Approved Successfully! Status: ', approval_status) + else: + ti.xcom_push(key='approve_model_status', value='FAILED') + print('Model not approved! Status: ', approval_status) + + +def deploy_model(ti): + + approve_model_status = ti.xcom_pull(task_ids = 'task_approve_model', key = 'approve_model_status') + + headers = { + 'AOA-Project-ID': config_main['projectid'], + 'Accept': 'application/json, text/plain, */*', + 'Accept-Language': 'en-US,en;q=0.9', + 'Authorization': 'Bearer ' + config_main['bearertoken'], + 'Content-Type': 'application/json', + } + + + json_data = { + 'engineType': config_model['engineType'], + 'engineTypeConfig': { + 'dockerImage': config_model['dockerImage'], + 'engine': "python-batch", + 'resources': { + 'memory': config_resources['memory'], + 'cpu': config_resources['cpu'], + } + }, + 'language':"python", + 'datasetConnectionId': config_main['datasetConnectionId'], + 'datasetTemplateId': config_main['datasetTemplateId'], + 'cron': config_model['cron'], + 'publishOnly': "false", + 'args':{} + } + + if approve_model_status == 'FAILED': + ti.xcom_push(key='deploy_model_status', value='FALIED') + print("Deployment cannot be done as the model is not approved !!") + else: + trained_model_id = ti.xcom_pull(task_ids = 'task_train_model', key = 'trained_model_id') + + response = requests.post('https://airflow-u9usja4twtauvt3s.env.clearscape.teradata.com:8443/modelops/core/api/trainedModels/' + trained_model_id + '/deploy', headers=headers, json=json_data) + json_data = response.json() + + # Get the Deployment Job ID + deploy_job_id = json_data.get('id') + ti.xcom_push(key='deploy_job_id', value=deploy_job_id) + + # deployed_model_id = json_data['metadata']['deployedModel']['id'] + + job_status = get_job_status(deploy_job_id) + print("Started - Deployment Job - Status: ", job_status) + + while job_status != "COMPLETED": + if job_status=="ERROR": + ti.xcom_push(key='deploy_model_status', value='FAILED') + print("The deployment job is terminated due to an Error") + break + elif job_status=="CANCELLED": + ti.xcom_push(key='deploy_model_status', value='FAILED') + print("The deployment job is Cancelled !!") + break + print("Job is not completed yet. 
Current status", job_status) + time.sleep(5) # wait 5s + job_status = get_job_status(deploy_job_id) + + # Checking Job status at the end to push the correct deploy_model_status + if(job_status == "COMPLETED"): + ti.xcom_push(key='deploy_model_status', value='DEPLOYED') + print('Model Deployed Successfully! Job ID is : ', deploy_job_id, ' Status : ', job_status) + else: + ti.xcom_push(key='deploy_model_status', value='FAILED') + print("Deployment Job is terminated !!") + + + +def retire_model(ti): + + deployed_model_status = ti.xcom_pull(task_ids = 'task_deploy_model', key = 'deploy_model_status') + + if deployed_model_status == 'FAILED': + ti.xcom_push(key='retire_model_status', value='FALIED') + print("Retirement cannot be done as the model is not deployed !!") + else: + trained_model_id = ti.xcom_pull(task_ids = 'task_train_model', key = 'trained_model_id') + + headers = { + 'AOA-Project-ID': config_main['projectid'], + 'Accept': 'application/json, text/plain, */*', + 'Accept-Language': 'en-US,en;q=0.9', + 'Authorization': 'Bearer ' + config_main['bearertoken'], + 'Content-Type': 'application/json', + } + + # Identifying the deployment ID + get_deployment_id_response = requests.get('https://airflow-u9usja4twtauvt3s.env.clearscape.teradata.com:8443/modelops/core/api/deployments/search/findByStatusAndTrainedModelId?projection=expandDeployment&status=DEPLOYED&trainedModelId=' + trained_model_id , headers=headers) + + get_deployment_id_json = get_deployment_id_response.json() + deployment_id = get_deployment_id_json['_embedded']['deployments'][0]['id'] + + json_data = { + "deploymentId": deployment_id + } + + # Retire the specific deployment + retire_model_response = requests.post('https://airflow-u9usja4twtauvt3s.env.clearscape.teradata.com:8443/modelops/core/api/trainedModels/' + trained_model_id + '/retire', headers=headers, json=json_data) + retire_model_response_json = retire_model_response.json() + + # Get the Evaluation Job ID + retire_job_id = retire_model_response_json.get('id') + ti.xcom_push(key='retire_job_id', value=retire_job_id) + + job_status = get_job_status(retire_job_id) + print("Started - Job - Status: ", job_status) + + while job_status != "COMPLETED": + if job_status=="ERROR": + print("The Retire job is terminated due to an Error") + # Set the Trained Model Id to None here and check in next step (Evaluate) + break + elif job_status=="CANCELLED": + print("The Retire job is Cancelled !!") + break + print("Job is not completed yet. Current status", job_status) + time.sleep(5) # wait 5s + job_status = get_job_status(retire_job_id) + + # Checking Job status at the end to push the correct evaluate_job_id + if(job_status == "COMPLETED"): + ti.xcom_push(key='retire_model_status', value='RETIRED') + print('Model Retired Successfully! 
Job ID is : ', retire_job_id, ' Status : ', job_status)
+        else:
+            ti.xcom_push(key='retire_model_status', value='FAILED')
+            print("Retire Job is terminated !!")
+
+
+
+with DAG(
+    dag_id = 'ModelOps_Accelerator_v1',
+    default_args=default_args,
+    description = 'ModelOps lifecycle accelerator for Python Diabetes Prediction model',
+    start_date=datetime.now(), # Set the start_date as per requirement
+    schedule_interval='@daily'
+) as dag:
+    task1 = PythonOperator(
+        task_id='task_train_model',
+        python_callable=train_model
+    )
+    task2 = PythonOperator(
+        task_id='task_evaluate_model',
+        python_callable=evaluate_model
+    )
+    task3 = PythonOperator(
+        task_id='task_approve_model',
+        python_callable=approve_model
+    )
+    task4 = PythonOperator(
+        task_id='task_deploy_model',
+        python_callable=deploy_model
+    )
+    task5 = PythonOperator(
+        task_id='task_retire_model',
+        python_callable=retire_model
+    )
+
+
+task1.set_downstream(task2)
+task2.set_downstream(task3)
+task3.set_downstream(task4)
+task4.set_downstream(task5)
+----
+====
+
+== Initialize Airflow in Docker Compose
+
+While initializing Airflow services such as the internal Airflow database, on operating systems other than Linux you may get a warning that AIRFLOW_UID is not set. You can safely ignore it, or suppress the warning by setting the environment variable with the following command:
+
+[source, bash, id="UID Airflow variable", role="content-editable emits-gtm-events"]
+----
+echo -e "AIRFLOW_UID=5000" > .env
+----
+
+To run internal database migrations and create the first user account, initialize the database using this command:
+
+[source, bash, id="", role="content-editable emits-gtm-events"]
+----
+docker compose up airflow-init
+----
+
+After initialization is complete, you should see a message like this:
+
+[source, bash, id="Check Airflow init", role="content-editable emits-gtm-events"]
+----
+airflow-init_1 | Upgrades done
+airflow-init_1 | Admin user airflow created
+airflow-init_1 | 2.8.2
+start_airflow-init_1 exited with code 0
+----
+
+== Clean up the Airflow demo environment
+
+You can clean up the environment, which removes the preloaded example DAGs, using this command:
+
+[source, bash, id="Docker compose down", role="content-editable emits-gtm-events"]
+----
+docker-compose down -v
+----
+
+Then update this parameter in the docker-compose.yaml file as given below:
+
+[source, bash, id="Docker compose yaml", role="content-editable emits-gtm-events"]
+----
+AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
+----
+
+== Launch Airflow with Model Factory Solution Accelerator
+
+Launch Airflow using this command:
+
+[source, bash, id="Docker compose up", role="content-editable emits-gtm-events"]
+----
+docker-compose up -d
+----
+
+
+== Run the Airflow DAG of Model Factory Solution with ModelOps
+
+* Now you can access the Airflow UI using the following link: http://localhost:8080/
+
+image::{dir}/LoginPage.png[Airflow login, width=75%]
+
+* Log in with Username: airflow and Password: airflow. In the DAGs menu you will be able to see your created DAGs.
+
+image::{dir}/DAGs.png[DAGs, width=75%]
+
+* Select your newly created DAG, and the graph will look like this:
+
+image::{dir}/DAG_graph.png[DAGs, width=75%]
+
+* Now you can trigger the DAG using the play icon on the top right side.
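+
+* Alternatively, you can trigger the DAG from the command line. The snippet below is a sketch that assumes the default service names from docker-compose.yaml and the DAG ID used in the example code above:
++
+[source, bash]
+----
+# Unpause the DAG and start a run through the webserver container.
+# The service name (airflow-webserver) comes from the default docker-compose.yaml.
+docker-compose exec airflow-webserver airflow dags unpause ModelOps_Accelerator_v1
+docker-compose exec airflow-webserver airflow dags trigger ModelOps_Accelerator_v1
+----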
+
+* You can check the logs by selecting any task and then clicking on the Logs menu:
+
+* In the ClearScape Analytics ModelOps Jobs section, you can see that the jobs have started running:
+
+image::{dir}/modelOps1.png[DAGs, width=75%]
+
+* Now you can see that all the tasks have executed successfully.
+
+image::{dir}/successTasks.png[DAGs, width=75%]
+
+== Summary
+
+This tutorial provided a hands-on exercise on how to install an Airflow environment using Docker Compose and how to use Airflow to interact with ClearScape Analytics ModelOps and a Teradata Vantage database.
+
+== Further reading
+* https://docs.teradata.com/search/documents?query=ModelOps&sort=last_update&virtual-field=title_only&content-lang=[+++ModelOps documentation+++].
diff --git a/modules/other-integrations/images/getting.started.dbt-feast-teradata-pipeline/dbt-feast.png b/modules/other-integrations/images/getting.started.dbt-feast-teradata-pipeline/dbt-feast.png
new file mode 100644
index 000000000..dd7d8a8d6
Binary files /dev/null and b/modules/other-integrations/images/getting.started.dbt-feast-teradata-pipeline/dbt-feast.png differ
diff --git a/modules/other-integrations/pages/getting.started.dbt-feast-teradata-pipeline.adoc b/modules/other-integrations/pages/getting.started.dbt-feast-teradata-pipeline.adoc
index a929a0614..80d2664e7 100644
--- a/modules/other-integrations/pages/getting.started.dbt-feast-teradata-pipeline.adoc
+++ b/modules/other-integrations/pages/getting.started.dbt-feast-teradata-pipeline.adoc
@@ -5,6 +5,7 @@
:page-revdate: August 4th, 2023
:description: dbt Feast integration with Teradata
:keywords: data warehouses, compute storage separation, teradata, vantage, cloud data platform, object storage, business intelligence, enterprise analytics, AI/ML, AI, ML, feature engineering, feature store, FEAST
+:page-image-directory: getting.started.dbt-feast-teradata-pipeline
:tabs:
== Overview
@@ -122,48 +123,7 @@ raw_accounts 1--* raw_transactions
dbt takes this raw data and builds the following model, which is more suitable for ML modeling and analytics tools:
-[erd, format=svg, width=100%]
-....
-# Entities - -[`fact: Analytic_Dataset`] {bgcolor: "#f37843", color: "#ffffff", border: "0", border-color: "#ffffff"} - *`cust_id ` {bgcolor: "#f9d6cd", color: "#000000", label: "INTEGER", border: "1", border-color: "#ffffff"} - `income ` {bgcolor: "#fcece8", color: "#868686", label: "DECIMAL(15, 1)", border: "1", border-color: "#ffffff"} - `age ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `years_with_bank ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `nbr_children ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `marital_status_0 ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `marital_status_1 ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `marital_status_2 ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `marital_status_other ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `gender_0 ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `gender_1 ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `gender_other ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `state_code_0 ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `state_code_1 ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `state_code_2 ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `state_code_3 ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `state_code_4 ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `state_code_5 ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `state_code_other ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `acct_type_0 ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `acct_type_1 ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `acct_type_2 ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `acct_type_other ` {bgcolor: "#fcece8", color: "#868686", label: "INTEGER", border: "1", border-color: "#ffffff"} - `CK_avg_bal ` {bgcolor: "#fcece8", color: "#868686", label: "FLOAT", border: "1", border-color: "#ffffff"} - `CK_avg_tran_amt ` {bgcolor: "#fcece8", color: "#868686", label: "FLOAT", border: "1", border-color: "#ffffff"} - `CC_avg_bal ` {bgcolor: "#fcece8", color: "#868686", label: "FLOAT", border: "1", border-color: "#ffffff"} - `CC_avg_tran_amt ` {bgcolor: "#fcece8", color: "#868686", label: "FLOAT", border: "1", border-color: "#ffffff"} - `SV_avg_bal ` {bgcolor: "#fcece8", color: "#868686", label: "FLOAT", border: "1", border-color: "#ffffff"} - `SV_avg_tran_amt ` {bgcolor: "#fcece8", color: "#868686", label: "FLOAT", border: "1", border-color: "#ffffff"} - `q1_trans_cnt ` {bgcolor: "#fcece8", color: "#868686", label: "DECIMAL(15, 0)", border: "1", border-color: "#ffffff"} - `q2_trans_cnt ` {bgcolor: "#fcece8", 
color: "#868686", label: "DECIMAL(15, 0)", border: "1", border-color: "#ffffff"} - `q3_trans_cnt ` {bgcolor: "#fcece8", color: "#868686", label: "DECIMAL(15, 0)", border: "1", border-color: "#ffffff"} - `q4_trans_cnt ` {bgcolor: "#fcece8", color: "#868686", label: "DECIMAL(15, 0)", border: "1", border-color: "#ffffff"} - `event_timestamp ` {bgcolor: "#fcece8", color: "#868686", label: "TIMESTAMP(0)", border: "1", border-color: "#ffffff"} - `created ` {bgcolor: "#fcece8", color: "#868686", label: "TIMESTAMP(0)", border: "1", border-color: "#ffffff"} -.... - +image::{page-image-directory}//dbt-feast.png[dbt feast,align="center",width=50%] == Configure dbt Create file `$HOME/.dbt/profiles.yml` with the following content. Adjust ``, ``, `` to match your Teradata instance.