diff --git a/.github/workflows/develop.yml b/.github/workflows/develop.yml
index f433b7b..b1c13f8 100644
--- a/.github/workflows/develop.yml
+++ b/.github/workflows/develop.yml
@@ -28,7 +28,7 @@ env:
   SOLR_API_URL: ${{ secrets.DEV_SOLR_API_URL }}
   SOLR_USER: ${{ secrets.DEV_SOLR_USER }}
   SOLR_PASSWORD: ${{ secrets.DEV_SOLR_PASSWORD }}
-  SOLR_PARALLEL_PROCESSES: 10
+  SOLR_PARALLEL_PROCESSES: ${{ vars.DEV_SOLR_PARALLEL_PROCESSES }}
   DB_USER: ${{ secrets.DB_USER }}
   DB_PASS: ${{ secrets.DB_PASS }}
   DB_HOST: ${{ secrets.DB_HOST }}
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index 785ac2b..1974b1a 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -23,7 +23,7 @@ env:
   SOLR_API_URL: ${{ secrets.PROD_SOLR_API_URL }}
   SOLR_USER: ${{ secrets.PROD_SOLR_USER }}
   SOLR_PASSWORD: ${{ secrets.PROD_SOLR_PASSWORD }}
-  SOLR_PARALLEL_PROCESSES: 5
+  SOLR_PARALLEL_PROCESSES: ${{ vars.PROD_SOLR_PARALLEL_PROCESSES }}
   DB_USER: ${{ secrets.PROD_DB_USER }}
   DB_PASS: ${{ secrets.PROD_DB_PASS }}
   DB_HOST: ${{ secrets.PROD_DB_HOST }}
diff --git a/IATI_Data_Flow.drawio.svg b/IATI_Data_Flow.drawio.svg
index 1ce3c64..36c9175 100644
--- a/IATI_Data_Flow.drawio.svg
+++ b/IATI_Data_Flow.drawio.svg
@@ -1,4 +1,1744 @@
- - - -

[Removed SVG markup not shown: previous version of the "Unified Platform Data Flow" diagram. Text labels in document order: Refresh; Sync Publishers; Download Documents; Sync Documents; Validate; File Level Schema Check; Validate - Codelist, Ruleset, Activity Level Schema validate; 1; 2; Safety Valve / Publisher Flag; Clean; Copy all valid documents and activities to "clean" container; Flatten; 3; Lakify; Solrize; PSQL; Source XML; Clean XML; Solr; Lake; File Level Schema Validator API Function; Full Validator API Function; Flattener API Function; 1; 2; 3; Blob Storage; Database; API; Legend. Fallback text: "Text is not SVG - cannot display".]
\ No newline at end of file
[Added SVG markup not shown: new version of the "Unified Platform Data Flow" diagram (image/svg+xml; the hunk adds 1744 lines). Text labels in document order: Refresh; Sync Publishers; Download Documents; Sync Documents; Validate; File Level Schema Check; Validate - Codelist, Ruleset, Activity Level Schema validate; 1; 2; Safety Valve / Publisher Flag; Clean; Copy all valid documents and activities to "clean" container; Flatten; Lakify; Solrize; PSQL; Source XML; Clean XML; Solr; Lake; File Level Schema Validator API Function; Full Validator API Function; 1; 2; Blob Storage; Database; API; Legend. Fallback text: "Text is not SVG - cannot display". The new version contains no "Flattener API Function" label and no "3" marker.]
diff --git a/README.md b/README.md
index 6f04ee5..0107150 100644
--- a/README.md
+++ b/README.md
@@ -134,7 +134,8 @@ Service Loop (when container starts)
   - Checks for `stale_datasets` - `document.last_seen` is from a previous run (so no longer in registry)
 - `clean_datasets()`
   - Removes `stale_datasets` from Activity lake, decided it wasn't worth updating `changed_datasets` from activity lake because filenames are hash of `iati_identifier` so less likely to change.
-  - Removes `changed_datasets` and `stale_datasets` from source xml blob container and Solr.
+  - Removes `stale_datasets` from source and clean xml blob container and Solr.
+  - Removes `changed_datasets` from source and clean xml blob container. Not Solr as this will be removed later, and we want the older data to be available to data store users during processing.
   - Removes `stale_datasets` from DB documents table
 - `reload(retry_errors)`
   - `retry_errors` is True after RETRY_ERRORS_AFTER_LOOP refreshes.
@@ -222,27 +223,23 @@ Service Loop (when container starts)

 ## Functions

-- `main()` - Sends XML to the [iati-flattener](https://github.com/IATI/iati-flattener) which transforms it into a flat JSON document, then stores it in the database (`document.flattened_activities`) in JSONB format.
+- `main()` - Flattens XML into a flat JSON document, then stores it in the database (`document.flattened_activities`) in JSONB format.
+
+This previously used the [iati-flattener service](https://github.com/IATI/iati-flattener), but flattening is now done by a Python class in the same process.
 ## Logic
 - `main()`
-  - Reset unfinished flattens
+  - Reset unfinished and errored flattens
   - Get unflattened (`db.getUnflattenedDatasets`)
   - process_hash_list()
-    - If prior_error = 422, 400, 413, break out of loop for this file
     - Start flatten in db (db.startFlatten)
     - Download source XML from Azure blobs
       - If charset error, breaks out of loop for file
-    - POST's to flattener API
-    - Update solrize_start column
-    - If status code != 200
-      - `404` - update DB `document.flatten_api_error`, pause 1min, continue loop
-      - `400 - 499` - update DB `document.flatten_api_error`, break out of loop
-      - `500 +` - update DB `document.flatten_api_error`, break out of loop
-      - else - log warning, continue
+    - Uses Python class `Flattener` to flatten.
+    - Mark done and store results in DB (db.completeFlatten)
     - If exception
       - Can't download BLOB, then `"UPDATE document SET downloaded = null WHERE id = %(id)s"`, to force re-download
-      - Other Exception, log message, no change to DB
+      - Other Exception, log message, `UPDATE document SET flatten_api_error = %(error)s WHERE id = %(doc_id)s`

 # Lakify
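
Review note (not part of the patch): the per-document flow described in the updated Logic list can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the repository's code. `InMemoryDb`, `BlobDownloadError`, `flatten_one`, and the `download` / `flatten` callables are hypothetical stand-ins; the real `Flattener` class interface is not shown in this diff. Only the step order and the two `UPDATE` statements quoted in the comments come from the README text, and treating a charset error as "skip without writing to the DB" is an assumption.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("flatten-sketch")


class BlobDownloadError(Exception):
    """Hypothetical stand-in for the storage client's 'cannot download blob' error."""


@dataclass
class InMemoryDb:
    """Hypothetical stand-in for the document table; it only records each write."""
    events: list = field(default_factory=list)

    def start_flatten(self, doc_id):
        self.events.append(("start", doc_id))

    def complete_flatten(self, doc_id, activities):
        self.events.append(("complete", doc_id, activities))

    def reset_download(self, doc_id):
        # Mirrors: UPDATE document SET downloaded = null WHERE id = %(id)s
        self.events.append(("reset_download", doc_id))

    def record_error(self, doc_id, error):
        # Mirrors: UPDATE document SET flatten_api_error = %(error)s WHERE id = %(doc_id)s
        self.events.append(("error", doc_id, error))


def flatten_one(db, doc_id, download, flatten):
    """Process one document and return a short status string."""
    try:
        db.start_flatten(doc_id)
        raw = download(doc_id)  # bytes of the source XML
        try:
            xml_text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Charset error: stop working on this file (assumed: no DB write).
            logger.warning("charset error for %s, skipping", doc_id)
            return "charset_error"
        db.complete_flatten(doc_id, flatten(xml_text))
        return "done"
    except BlobDownloadError:
        # Force a re-download on a later run.
        db.reset_download(doc_id)
        return "redownload"
    except Exception as exc:
        # Any other failure is logged and stored against the document.
        logger.warning("flatten failed for %s: %s", doc_id, exc)
        db.record_error(doc_id, str(exc))
        return "error"
```

With the flattener running in-process there are no HTTP status codes to branch on, so the sketch has only two failure paths: a blob that cannot be downloaded, and everything else. That matches the removal of the `404` / `400 - 499` / `500 +` branches in the hunk above.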