This Python program utilizes the Striim API to orchestrate the creation, deployment, starting, reviewing status, undeploying, and dropping of Striim Applications. It helps parallelize and automate the load by splitting data into pieces, such as utilizing reading huge Oracle tables, computing read ranges for parallel reading, or splitting based on primary key values or date ranges. The resulting output is a set of queries.
This app currently utilizes either a BQ table or local TinyDB (current_position.json file) as both a historical record and a place to orchestrate progress.
- Automation: Handles the entire lifecycle of Striim applications, including creation, deployment, starting, monitoring, stopping, undeploying, and dropping.
- Parallelization: Divides the data load into smaller units to run multiple Striim applications concurrently, optimizing throughput.
- Query-Based Processing: Reads queries from a file (
queryfile.txt
) where each line contains a query and its target table. - TQL-Driven Deployment: Uses a TQL template file (
admin.SW.tql
) to generate and deploy Striim applications for each query. - State Management: Utilizes a database (BigQuery or TinyDB) to track the progress and status of each query execution.
- Monitoring and Logging: Continuously monitors the status of applications and logs events for debugging and analysis.
- Python 3.6 or higher: The program is written in Python and requires a compatible version.
- Striim Environment: Access to a Striim cluster with API connectivity.
- Required Python Libraries:
requests
,google-cloud-bigquery
(if using BigQuery),tinydb
. - Configuration File: A
config.py
file to store Striim credentials, database settings, and other parameters. - Input Files:
queryfile.txt
: Contains the queries to be executed, one per line, with the target table separated by a delimiter (configurable inconfig.py
).admin.SW.tql
: A TQL file that serves as a template for creating Striim applications. This file should include placeholders for the query and the target table.
The program's behavior can be customized through the config.py
file. Here are some of the key configuration options:
- Striim Connection Details:
STRIIM_URL_PREFIX
,STRIIM_NODE
,STRIIM_ADMIN_USER
,STRIIM_ADMIN_PWD
,STRIIM_API_TOKEN
. - Database Selection:
STAGE_DB_LOCATION
(choose betweenBQ
for BigQuery orTinyDB
for a local file-based database). - BigQuery Settings:
BQ_KEYFILE_LOCATION
,PROJECT_ID
,DATASET_ID
,TABLE_ID
(if using BigQuery). - Concurrency Control:
CONCURRENT_APPS_MAX
(maximum number of concurrent Striim applications). - Monitoring Interval:
APP_MONITOR_INTERVAL_SECONDS
(how often to check the status of applications). - Deployment Delay:
DEPLOY_WAIT_TIME_SECONDS
(minimum time to wait between deploying new applications). - Logging:
LOG_OUTPUT_NAME
,LOG_OUTPUT_PATH
.
This file is simply a two-column file, containing the query to run and the target table name. The delimeter is definied in config.py and is pipe (|) by default. Here is an example:
SELECT * FROM QATEST.WF_PENDING_ACTIVITY WHERE ID < 1000 |QATEST2.WF_PENDING_ACTIVITY
SELECT * FROM QATEST.WF_PENDING_ACTIVITY WHERE ID BETWEEN 1000 AND 2000|QATEST2.WF_PENDING_ACTIVITY
SELECT * FROM QATEST.WF_PENDING_ACTIVITY WHERE ID BETWEEN 2001 AND 3000|QATEST2.WF_PENDING_ACTIVITY
SELECT * FROM QATEST.WF_PENDING_ACTIVITY WHERE ID > 3000 |QATEST2.WF_PENDING_ACTIVITY
SELECT * FROM QATEST.BIGTABLE WHERE ID < 1000 |QATEST2.BIGTABLE
SELECT * FROM QATEST.BIGTABLE WHERE ID BETWEEN 1000 AND 2000|QATEST2.BIGTABLE
SELECT * FROM QATEST.BIGTABLE WHERE ID BETWEEN 2001 AND 3000|QATEST2.BIGTABLE
SELECT * FROM QATEST.BIGTABLE WHERE ID > 3000 |QATEST2.BIGTABLE
The TQL template file should utilize Property Variables (for connection string, username, and password), and the following placeholder variables:
Query: "~QUERYTEXT~"
: This portion must exist in your TQL Sample App, on your Source Reader.Query: "Tables: 'QUERY,~TARGETTABLE~'"
: This portion must exist in your TQL Sample App, on your Target Writer.
- Install Dependencies: Ensure that all required Python libraries are installed.
- Configure: Update the
config.py
file with your Striim credentials, database settings, and other desired parameters. - Prepare Input Files: Create the
queryfile.txt
andadmin.SW.tql
files as described in the Requirements section. - Execute: Run the
main.py
script to start the orchestration process.
The program will then read the queries from queryfile.txt
, generate TQL files for each query using the admin.SW.tql
template, create and deploy Striim applications, monitor their execution, and finally undeploy and drop them.
- The program includes basic error handling and retries to ensure robust operation.
- The database (BigQuery or TinyDB) is used to persist the state of the orchestration process, allowing for resuming from interruptions.
- The program provides logging capabilities for monitoring and debugging purposes.
This README provides a comprehensive overview of the Striim API Orchestration Python program, its features, requirements, and usage instructions.
The output is stored in the following BigQuery table:
CREATE TABLE `striimfieldproject.Daniel.striim_orchestration` (
id INTEGER NOT NULL,
roworder INTEGER,
uniquerunid INTEGER,
query STRING,
appname STRING,
targettbl STRING,
status STRING,
namespace STRING,
started_datetime TIMESTAMP,
finished_datetime TIMESTAMP,
notes STRING,
iscurrentrow BOOL
);
- id: INTEGER, NOT NULL. Unique identifier for each record.
- roworder: INTEGER. Order of the row in the sequence.
- uniquerunid: INTEGER. Unique identifier for each run.
- query: STRING. The actual query text.
- appname: STRING. Keeps track of the full app name created.
- targettbl: STRING. The target table (full schema.tablename) that the query results will write to. Include ColumnMap or KeyColumns if needed.
- status: STRING. Keeps track of the status.
- namespace: STRING. Keeps track of the Namespace used in deployment.
- started_datetime: TIMESTAMP. Tracks when the app was confirmed started (may not be exact).
- finished_datetime: TIMESTAMP. Tracks when the app was confirmed completed (may not be exact).
- notes: STRING. Any additional notes related to the output.
- iscurrentrow: BOOL. Indicates if this is the current row.
<blank>
→ Not yet started to process yetRUNNING
→ Has been created, deployed, and startedCOMPLETED
→ Has been detected as completed successfully.FAILED
→ Has been detected as failed, and added any failure messages to notes.