This script fetches and processes job run data from an Azure Databricks instance using the Databricks REST API. It extracts relevant information about job runs, processes it, and outputs both a Pandas DataFrame and a CSV file.
Before running this script, ensure you have the following:
- Azure Databricks Instance: You need access to an Azure Databricks instance.
- API Token: Generate an API token from your Databricks instance with appropriate permissions to access job run data.
- Install the required libraries using the following command:
pip install requests pandas
- Replace the placeholders in the code with your actual values:
baseURI: Replace with your Azure Databricks instance URL.
apiToken: Replace with your API token.
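As a minimal sketch, the placeholder values might look like the following. The workspace URL, token string, and the epoch-millisecond timestamps are illustrative assumptions, not real credentials or required formats:

```python
# Placeholder configuration -- replace every value with your own.
baseURI = "https://adb-1234567890123456.7.azuredatabricks.net"  # your workspace URL
apiToken = "dapiXXXXXXXXXXXXXXXXXXXXXXXXXXXX"                   # your personal access token

# Query parameters passed to the job-runs endpoint (timestamp format assumed
# to be epoch milliseconds, matching the Databricks Jobs API convention).
params = {
    "start_time_from": 1704067200000,
    "start_time_to": 1706745600000,
    "expand_tasks": "true",
}
```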
The script starts by importing necessary libraries: requests, pandas, math, datetime, and json.
The script defines the function fetch_and_process_job_runs, which is responsible for fetching job run data from the Databricks API. The function takes three arguments:
- base_uri: the base URL of your Databricks instance.
- api_token: your API token for authentication.
- params: a dictionary of query parameters, including start_time_from, start_time_to, and expand_tasks.
Inside the function:
- An API request is made to the specified endpoint.
- The response is processed to extract job run details.
- Processed data is accumulated and transformed into a Pandas DataFrame.
- Pagination is managed using the has_more field in the response.
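The steps above can be sketched as follows. This is not the author's exact implementation: the endpoint path (/api/2.1/jobs/runs/list), the flattened field names in run_to_record, and the offset-based pagination are assumptions based on the Databricks Jobs API (newer API versions paginate with a page token instead of an offset):

```python
import requests
import pandas as pd

def run_to_record(run: dict) -> dict:
    """Flatten one job-run object into a flat record (field names assumed)."""
    return {
        "job_id": run.get("job_id"),
        "run_id": run.get("run_id"),
        "state": run.get("state", {}).get("result_state"),
        # execution_duration is reported in milliseconds; convert to minutes
        "execution_duration_in_mins": round(run.get("execution_duration", 0) / 60000),
    }

def fetch_and_process_job_runs(base_uri: str, api_token: str, params: dict) -> pd.DataFrame:
    endpoint = f"{base_uri}/api/2.1/jobs/runs/list"
    headers = {"Authorization": f"Bearer {api_token}"}
    records, offset = [], 0
    while True:
        resp = requests.get(endpoint, headers=headers, params={**params, "offset": offset})
        resp.raise_for_status()
        payload = resp.json()
        runs = payload.get("runs", [])
        records.extend(run_to_record(r) for r in runs)
        # Pagination: keep requesting pages while has_more is true
        if not payload.get("has_more"):
            break
        offset += len(runs)
    return pd.DataFrame(records)
```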
After fetching and processing the job run data:
- The resulting DataFrame is sorted by the execution_duration_in_mins column in descending order.
- The total execution time across all job runs is calculated and appended as a row in the DataFrame.
- The processed DataFrame is saved as a CSV file named jobs.csv.
- The sorted DataFrame is printed to the console.
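The post-processing steps above can be sketched like this. The DataFrame contents are illustrative; the column names follow the text, and the "TOTAL" label for the summary row is an assumption:

```python
import pandas as pd

# Illustrative stand-in for the fetched job-run data
df = pd.DataFrame({
    "job_id": [101, 102, 103],
    "execution_duration_in_mins": [5, 42, 17],
})

# Sort by execution duration, longest first
df = df.sort_values("execution_duration_in_mins", ascending=False).reset_index(drop=True)

# Append a summary row holding the total execution time
total = df["execution_duration_in_mins"].sum()
df.loc[len(df)] = ["TOTAL", total]

# Persist the result and print it to the console
df.to_csv("jobs.csv", index=False)
print(df)
```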
Make sure you have fulfilled the prerequisites and replaced the placeholder values in the code.
Run the script. It fetches and processes the job run data, displays the sorted results, saves them to a CSV file, and prints a Markdown table.
Note: This script provides a basic example of how to fetch and process job runs data from Azure Databricks using the Databricks REST API. You can further enhance and customize the script to suit your specific use case and requirements.
## Output
- Total Jobs: 160
- Total Tasks: 214
- Successful Tasks: 174
- Failed Tasks: 15
- Total Execution Time (mins): 1158
- Average Execution Time (mins): 10.82
- Min Execution Time (mins): 0
- Max Execution Time (mins): 1158
Key Insights:
- Task Status Distribution:
{
"SUCCESS": 174,
"CANCELED": 24,
"FAILED": 15
}
- Execution Duration Distribution:
- Min: 0 mins
- Max: 1158 mins
- Average: 10.82 mins
- Jobs with Longest Execution Time:

| job_id | execution_duration_in_mins |
| --- | --- |
| 260792223809789 | 140 |
| 74519312719017 | 93 |
| 371241484431340 | 88 |
| 655421446142082 | 85 |
| 887636488212750 | 65 |