# Installation and deployment instructions (using Postgres as example)

Below are the instructions for connecting a Postgres server. The installation steps are the same for connecting any kind of server; different servers only require different configurations in the .yaml or DAG files. See https://docs.open-metadata.org/integrations/connectors for your connector's configuration.

# Goal: Run Postgres metadata ingestion and quality tests with OpenMetadata using the Airflow scheduler

Note: This procedure does not support Windows, because Windows does not implement `signal.SIGALRM`. **It is highly recommended to use WSL 2 if you are on Windows.**
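If you are unsure whether your environment is affected, here is a quick hypothetical check (POSIX systems and WSL 2 will pass; native Windows will not):

```
import signal

# The ingestion workflow's timeout handling relies on SIGALRM,
# which native Windows does not provide.
if hasattr(signal, "SIGALRM"):
    print("SIGALRM available: this environment should work")
else:
    raise SystemExit("signal.SIGALRM is unavailable; use WSL 2 or a POSIX system")
```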

## Requirements:
See the "Requirements" section of https://docs.open-metadata.org/overview/run-openmetadata-with-prefect

## Installation:
1. Clone the OpenMetadata GitHub repo:
`git clone https://github.com/open-metadata/OpenMetadata.git`

2. `cd` to ~/.../openmetadata/docker/metadata

3. Start the OpenMetadata containers. This will let you run OpenMetadata in Docker:
`docker compose up -d`
- To check the status of the services, run `docker compose ps` (see the check below)
- To access the UI, go to http://localhost:8585
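For example, to confirm the containers came up cleanly before moving on (the service name in the logs command is an assumption; verify it against the output of `docker compose ps`):

```
# list the OpenMetadata services and their states
docker compose ps

# tail the server logs if http://localhost:8585 is not responding
# (service name assumed; check your compose file)
docker compose logs -f openmetadata-server
```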

4. Install the OpenMetadata ingestion package.
- (Optional but highly recommended) Create and activate a virtual environment before installing the package. To do this, run:
`python -m venv env` and `source env/bin/activate`

- To install the OpenMetadata ingestion package (specify the release version to ensure compatibility):
`pip install --upgrade "openmetadata-ingestion[docker]==0.10.3"`

5. Install Airflow:
- 5A: Install the Airflow Lineage Backend: `pip3 install "openmetadata-ingestion[airflow-container]"==0.10.3`
- 5B: Install the Airflow Postgres connector module: `pip3 install "openmetadata-ingestion[postgres]"==0.10.3`
- 5C: Install the Airflow APIs: `pip3 install "openmetadata-airflow-managed-apis"==0.10.3`
- 5D: Install the necessary Airflow plugins (see the sketch after this list):
  - 1) Download the latest openmetadata-airflow-apis-plugins release from https://github.com/open-metadata/OpenMetadata/releases
  - 2) Untar it under your {AIRFLOW_HOME} directory (usually c/Users/Yourname/airflow). This will create and set up a plugins directory under {AIRFLOW_HOME}.
  - 3) `cp -r {AIRFLOW_HOME}/plugins/dag_templates {AIRFLOW_HOME}`
  - 4) `mkdir -p {AIRFLOW_HOME}/dag_generated_configs`
  - 5) (Re)start the Airflow webserver and scheduler
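A minimal sketch of step 5D as shell commands, assuming `$AIRFLOW_HOME` is set and the downloaded tarball is named `openmetadata-airflow-apis-plugins.tar.gz` (check the actual asset name on the releases page):

```
# untar the plugins release under AIRFLOW_HOME (creates the plugins directory)
tar -xzf openmetadata-airflow-apis-plugins.tar.gz -C "$AIRFLOW_HOME"

# copy the DAG templates and create the generated-configs directory
cp -r "$AIRFLOW_HOME/plugins/dag_templates" "$AIRFLOW_HOME"
mkdir -p "$AIRFLOW_HOME/dag_generated_configs"

# (re)start the webserver and scheduler as daemons
airflow webserver -D
airflow scheduler -D
```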

6. Configure Airflow:
- 6A: Configure airflow.cfg in your AIRFLOW_HOME directory. Check that all the folder settings point to the right places; for instance, dags_folder = YOUR_AIRFLOW_HOME/dags. See the example below.
- 6B: Configure openmetadata.yaml and update the airflowConfiguration section. See: https://docs.open-metadata.org/integrations/airflow/configure-airflow-in-the-openmetadata-server
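For step 6A, the relevant airflow.cfg entries could look like this (paths are illustrative; substitute your own AIRFLOW_HOME):

```
[core]
# point these at directories under your actual AIRFLOW_HOME
dags_folder = /home/yourname/airflow/dags
plugins_folder = /home/yourname/airflow/plugins
```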

## To run a metadata ingestion workflow with Airflow ingestion DAGs on Postgres data:

1. Prepare the ingestion DAG:
For a more complete tutorial on ingestion DAGs, see https://docs.open-metadata.org/integrations/connectors/postgres/run-postgres-connector-with-the-airflow-sdk
In brief, below is my own DAG. Copy and paste the following into a Python file (postgres_demo.py), replacing the username, password, hostPort, and database values with your own:

```
import json
from datetime import timedelta

from airflow import DAG

try:
    from airflow.operators.python import PythonOperator
except ModuleNotFoundError:
    from airflow.operators.python_operator import PythonOperator

from airflow.utils.dates import days_ago

from metadata.ingestion.api.workflow import Workflow

default_args = {
    "owner": "user_name",
    "email": ["username@org.com"],
    "email_on_failure": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "execution_timeout": timedelta(minutes=60),
}

# Change username, password, hostPort, and database to your own values.
# Every flag under sourceConfig accepts "true" or "false".
config = """
{
    "source": {
        "type": "postgres",
        "serviceName": "postgres_demo",
        "serviceConnection": {
            "config": {
                "type": "Postgres",
                "username": "postgres",
                "password": "postgres",
                "hostPort": "192.168.1.55:5432",
                "database": "surveillance_hub"
            }
        },
        "sourceConfig": {
            "config": {
                "enableDataProfiler": "true",
                "markDeletedTables": "true",
                "includeTables": "true",
                "includeViews": "true",
                "generateSampleData": "true"
            }
        }
    },
    "sink": {
        "type": "metadata-rest",
        "config": {}
    },
    "workflowConfig": {
        "openMetadataServerConfig": {
            "hostPort": "http://localhost:8585/api",
            "authProvider": "no-auth"
        }
    }
}
"""


def metadata_ingestion_workflow():
    workflow_config = json.loads(config)
    workflow = Workflow.create(workflow_config)
    workflow.execute()
    workflow.raise_from_status()
    workflow.print_status()
    workflow.stop()


with DAG(
    "sample_data",
    default_args=default_args,
    description="An example DAG which runs an OpenMetadata ingestion workflow",
    start_date=days_ago(1),
    is_paused_upon_creation=False,
    schedule_interval="*/5 * * * *",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(
        task_id="ingest_using_recipe",
        python_callable=metadata_ingestion_workflow,
    )


if __name__ == "__main__":
    metadata_ingestion_workflow()
```

2. Run the DAG:
`python postgres_demo.py`
Running the file directly like this executes the workflow once, via the `__main__` block. To have Airflow schedule it every 5 minutes as configured, place the file in your Airflow dags folder, as sketched below.
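A sketch of deploying the DAG so the scheduler picks it up, assuming a default Airflow 2.x setup with `$AIRFLOW_HOME` set (the DAG id is "sample_data", as defined in the file):

```
# copy the DAG file into the Airflow dags folder
cp postgres_demo.py "$AIRFLOW_HOME/dags/"

# confirm Airflow has parsed the DAG
airflow dags list | grep sample_data

# trigger a run immediately instead of waiting for the 5-minute schedule
airflow dags trigger sample_data
```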

- Alternatively, we could run without the Airflow SDK and with OpenMetadata's own CLI. Run `metadata ingest -c /Your_Path_To_Json/.json`
The JSON configuration is exactly the same as the JSON configuration in the DAG.
- Or, we could also run it with `metadata ingest -c /Your_Path_To_Yaml/.yaml`
The YAML configuration carries the exact same content, just without the curly brackets and double quotes; see the translation below.
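For illustration, here is the DAG's JSON config from above translated to YAML, with the same keys and placeholder values:

```
source:
  type: postgres
  serviceName: postgres_demo
  serviceConnection:
    config:
      type: Postgres
      username: postgres
      password: postgres
      hostPort: 192.168.1.55:5432
      database: surveillance_hub
  sourceConfig:
    config:
      enableDataProfiler: "true"
      markDeletedTables: "true"
      includeTables: "true"
      includeViews: "true"
      generateSampleData: "true"
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: no-auth
```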

## To run a profiler workflow on Postgres data
1. Prepare the DAG OR configure the yaml/json:
- To configure the quality tests in json/yaml, see https://docs.open-metadata.org/data-quality/data-quality-overview/tests
- To prepare the DAG, see https://github.com/open-metadata/OpenMetadata/tree/0.10.3-release/data-quality/data-quality-overview

Example yaml I was using (replace the your_* placeholders with your own values):
```
source:
  type: postgres
  serviceName: your_service_name
  serviceConnection:
    config:
      type: Postgres
      username: your_username
      password: your_password
      hostPort: "your_host:port"
      database: your_database
  sourceConfig:
    config:
      type: Profiler

processor:
  type: orm-profiler
  config:
    test_suite:
      name: demo_test
      tests:
        - table: your_table_name  # must be the table's fully qualified name (FQN)
          column_tests:
            - columnName: id
              testCase:
                columnTestType: columnValuesToBeBetween
                config:
                  minValue: 0
                  maxValue: 10
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: no-auth
```
Note that the table name must be the FQN and must match exactly the table path shown on the OpenMetadata UI, as illustrated below.
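For example (hypothetical names; the exact levels depend on how the table appears in your UI), an FQN typically looks like:

```
# service.database.schema.table
postgres_demo.surveillance_hub.public.your_table_name
```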

2. Run it with:
`metadata profile -c /path_to_yaml/.yaml`

Make sure to refresh the OpenMetadata UI and click on the Data Quality tab to see the results.