System Design Document: GEEST
1. Introduction and Overview
This system performs geospatial analysis on small nations, focusing on women’s empowerment. Multiple workflows are run in parallel, with each indicator processed independently, and the results stored in a GeoPackage.
2. Architecture Overview
We follow a hierarchy of dimensions, factors, and indicators, with values aggregated at each level. The system efficiently processes these layers using parallel tasks on a single machine.
3. Study Area and Grid System
The system generates a 100m x 100m grid for the study area, using the boundaries of the relevant regions or islands. This grid is stored in the study_area_grid table, with each grid cell corresponding to a polygon.
Each grid cell is assigned a Likert-scale value (0-5) by the indicator workflows.
graph TD;
A[Study Area] --> B[Bounding Box]
B --> C[100m x 100m Grid]
C --> D[STUDY_AREA_GRID]
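As a rough illustration of the flow in the diagram, the sketch below tiles an already-aligned bounding box with 100m cells in pure Python. The names `grid_cells` and `cell_to_wkt`, the tuple layout, and the sample coordinates are assumptions made for this example only. It does not clip cells to the actual region or island boundaries and does not write to the study_area_grid table.

```python
from typing import Iterator, Tuple

CELL_SIZE = 100  # metres; the grid resolution stated in this document

Cell = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def grid_cells(minx: float, miny: float, maxx: float, maxy: float,
               cell: int = CELL_SIZE) -> Iterator[Cell]:
    """Yield square cells covering a bounding box that is already aligned to `cell`."""
    cols = round((maxx - minx) / cell)
    rows = round((maxy - miny) / cell)
    for row in range(rows):
        for col in range(cols):
            x0 = minx + col * cell
            y0 = miny + row * cell
            yield (x0, y0, x0 + cell, y0 + cell)


def cell_to_wkt(c: Cell) -> str:
    """Format one cell as a closed WKT polygon ring."""
    x0, y0, x1, y1 = c
    return f"POLYGON(({x0} {y0}, {x1} {y0}, {x1} {y1}, {x0} {y1}, {x0} {y0}))"


# Hypothetical projected coordinates: a 300m x 200m box gives 3 x 2 cells.
cells = list(grid_cells(500000, 9000000, 500300, 9000200))
```

Each tuple can be converted with `cell_to_wkt` before being stored as a polygon feature.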
4. Data Model
4.1 Bounding Box Table (study_area_bbox)
The study_area_bbox table stores the bounding box for the entire study area, aligned to a 100-meter grid. This table serves as a reference for the spatial extent of the study area. There will only ever be one feature in this table.
| id | bbox_polygon | area_name | area_id |
|----|--------------|-----------|---------|
| 1 | POLYGON((...)) | Nation 1 | 1 |
| ... | ... | ... | ... |
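One way to obtain a bounding box aligned to a 100-meter grid is to expand the raw extent outward to the nearest multiples of 100. The sketch below assumes that alignment is measured in the units of a projected CRS (metres) relative to the CRS origin; the function name `snap_bbox` and the sample extent are hypothetical.

```python
import math
from typing import Tuple


def snap_bbox(minx: float, miny: float, maxx: float, maxy: float,
              cell: int = 100) -> Tuple[int, int, int, int]:
    """Expand a bounding box outward so every edge lies on a multiple of `cell`."""
    return (
        math.floor(minx / cell) * cell,  # move left edge down to the grid
        math.floor(miny / cell) * cell,  # move bottom edge down to the grid
        math.ceil(maxx / cell) * cell,   # move right edge up to the grid
        math.ceil(maxy / cell) * cell,   # move top edge up to the grid
    )


# Hypothetical raw extent of a study area in projected metres.
snapped = snap_bbox(500123.4, 9000077.0, 500951.2, 9000410.9)
```

Expanding outward rather than rounding to the nearest multiple guarantees that the snapped box still contains the whole study area.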
4.2 Local Area Boxes Table (local_area_boxes)
The local_area_boxes table has a similar structure to study_area_bbox, but instead of representing the overall bounding box for the entire region, it stores the bounding box of each individual polygon that the system analyzes. Multipart polygons in the original input data will be broken out into single parts.
| id | polygon_box | area_name | area_id |
|----|-------------|-----------|---------|
| 1 | POLYGON((...)) | Island_1 | 101 |
| 2 | POLYGON((...)) | Island_2 | 102 |
| ... | ... | ... | ... |
The bounding boxes in this table will all be aligned to a 100m interval consistent with the origin of the study_area_bbox.
4.3 Output Raster Structure (study_area_mask)
The study_area_mask is a VRT raster composed of one mask raster per feature.
We create one raster per input polygon part.
Each raster is aligned to the grid of the entire study area: the offset of its origin from the study area origin is always a multiple of 100m.
The mask rasters will be written as bit-format TIFFs.
The VRT combines all of the constituent mask rasters into a single virtual grid whose extent coincides with the study area.
Each pixel corresponds to one 100m x 100m grid cell.
The study_area_mask will be used to exclude any pixels outside the study area from analysis outputs.
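The alignment rule above can be checked mechanically. The following sketch computes where a per-feature raster sits inside the study area grid, in whole pixels, and rejects boxes that are off-grid. It assumes north-up rasters whose origin is the top-left corner and integer coordinates in metres; `pixel_window` and the sample boxes are illustrative names and values, not part of the system.

```python
from typing import Tuple

BBox = Tuple[int, int, int, int]  # (minx, miny, maxx, maxy) in metres


def pixel_window(study: BBox, local: BBox, cell: int = 100) -> Tuple[int, int, int, int]:
    """Return (col_off, row_off, width, height) of `local` within the `study` grid."""
    s_minx, _, _, s_maxy = study
    l_minx, l_miny, l_maxx, l_maxy = local
    dx = l_minx - s_minx   # offset east of the study area origin
    dy = s_maxy - l_maxy   # offset south of the study area origin (top-left)
    width = l_maxx - l_minx
    height = l_maxy - l_miny
    if any(v % cell for v in (dx, dy, width, height)):
        raise ValueError("local raster is not aligned to the study area grid")
    return (dx // cell, dy // cell, width // cell, height // cell)


# Hypothetical boxes: the local raster starts 2 columns in and 1 row down.
window = pixel_window((500100, 9000000, 501000, 9000500),
                      (500300, 9000200, 500600, 9000400))
```

A window expressed in whole pixels is what allows the per-feature rasters to be mosaicked into one virtual grid without resampling.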
4.4 Output Raster Structure (intermediate_result)
The _result layer is a VRT raster composed of one byte raster per feature.
We create one raster per input polygon part.
Each raster is aligned to the grid of the entire study area: the offset of its origin from the study area origin is always a multiple of 100m.
The _result rasters will be written as byte-format TIFFs.
Their file names will be suffixed with the normalized area name.
The VRT combines all of the constituent _result rasters into a single virtual grid whose extent coincides with the study area.
Each pixel corresponds to one 100m x 100m grid cell.
The _result layer will be used to store intermediate analysis outputs, e.g. index scores.
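To make the naming and mosaicking steps concrete, here is a minimal sketch. The normalization rule (lowercase, non-alphanumeric runs collapsed to underscores), the `<indicator>_result_<area>.tif` pattern, and all function names are assumptions for illustration; the document does not specify them. The command list uses the basic `gdalbuildvrt <output.vrt> <inputs...>` form of the GDAL utility and is only assembled here, not executed.

```python
import re
from typing import List, Sequence


def normalize_area_name(name: str) -> str:
    """Lowercase the name and collapse anything that is not a-z or 0-9 to '_'."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def result_path(indicator: str, area_name: str) -> str:
    """Hypothetical file name for one per-feature _result raster."""
    return f"{indicator}_result_{normalize_area_name(area_name)}.tif"


def build_vrt_command(vrt_path: str, tif_paths: Sequence[str]) -> List[str]:
    """Assemble a gdalbuildvrt call that mosaics the per-feature rasters."""
    return ["gdalbuildvrt", vrt_path, *tif_paths]


tifs = [result_path("indicator_a", n) for n in ("Island_1", "Island 2")]
cmd = build_vrt_command("indicator_a_result.vrt", tifs)
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where GDAL is installed.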
4.5 Output Raster Structure (final result)
The _final layer is a VRT raster composed of one byte raster per feature.
We create one raster per input polygon part.
Each raster is aligned to the grid of the entire study area: the offset of its origin from the study area origin is always a multiple of 100m.
The _final rasters will be written as byte-format TIFFs.
Their file names will be suffixed with the normalized area name.
The VRT combines all of the constituent _final rasters into a single virtual grid whose extent coincides with the study area.
Each pixel corresponds to one 100m x 100m grid cell.
The _final layer will be used to store final analysis outputs, with values scaled to the Likert scale.
5. Workflows
Each indicator has its own workflow, which calculates values and assigns them to the grid cells.
For example:
Raster Layer Input Workflow: This workflow processes raster data and assigns scaled values (0-5) based on raster statistics to the grid cells.
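The document does not say which raster statistics drive the scaling, so the sketch below assumes a simple linear min-max rescale to the 0-5 range with half-up rounding, treating `None` as nodata. The function name `scale_to_likert` and the sample values are hypothetical.

```python
import math
from typing import List, Optional


def scale_to_likert(values: List[Optional[float]], top: int = 5) -> List[Optional[int]]:
    """Linearly rescale valid values to integers in 0..top; None stays None."""
    valid = [v for v in values if v is not None]
    if not valid:
        return [None] * len(values)
    lo, hi = min(valid), max(valid)
    if hi == lo:
        # Assumption: a raster with no variation maps every valid cell to 0.
        return [None if v is None else 0 for v in values]
    span = hi - lo
    return [
        None if v is None else math.floor((v - lo) / span * top + 0.5)
        for v in values
    ]


# Hypothetical raster values for five cells, one of them nodata.
scaled = scale_to_likert([10.0, 20.0, None, 40.0, 60.0])
```

Other schemes (quantiles, fixed class breaks) would fit the same function signature.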
A workflow comprises all of the steps needed to calculate that indicator. Logic that is common to several workflows will be factored into shared functions.
A workflow is a QgsTask, meaning we can submit it to the QgsTaskManager and either queue it or run multiple workflows in parallel.
6. Factor and Dimension Calculation
Factors are computed by aggregating indicators, and dimensions are calculated by aggregating factors. Weights can be applied to indicators and factors to determine their influence on the final calculation.
Indicators and factors do not need to be weighted equally, but the weights being combined must sum to 1.
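A minimal sketch of this weighted aggregation follows, assuming each layer is a flat list of per-cell scores and that the aggregate is a weighted sum. The function name `weighted_score` and the sample weights are illustrative only.

```python
import math
from typing import Dict, List


def weighted_score(layers: Dict[str, List[float]],
                   weights: Dict[str, float]) -> List[float]:
    """Combine per-cell scores from several layers using weights that sum to 1."""
    if layers.keys() != weights.keys():
        raise ValueError("every layer needs exactly one weight")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1")
    n_cells = len(next(iter(layers.values())))
    return [
        sum(weights[name] * layers[name][i] for name in layers)
        for i in range(n_cells)
    ]


# Two hypothetical indicators over three cells, weighted 75% / 25%.
factor = weighted_score(
    {"indicator_a": [5, 3, 0], "indicator_b": [1, 3, 4]},
    {"indicator_a": 0.75, "indicator_b": 0.25},
)
```

The same function can be applied a second time to combine factors into a dimension.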
7. Visualization
The system uses QGIS-style formatting to visualize the results. Each indicator, factor, and dimension has its own styling rules, which are applied when rendering the data in QGIS. We will use style definitions with no outline so that they create the effect of a raster layer when visualized. This will also give us additional visualization options such as extruding cells in 3D, adding labels to cells, and producing charts per indicator.
8. Workflow Optimization and Parallelization
As mentioned above, we can accelerate the workflow processing by parallelizing indicator computations. Each indicator is processed by a separate task that runs concurrently.
8.1 Example of Parallel Indicator Processing
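The sketch below illustrates the one-task-per-indicator pattern. In the system itself each workflow is a QgsTask submitted to the QgsTaskManager; here the standard library's `ThreadPoolExecutor` is used as a stand-in so the example runs outside QGIS. The names `run_workflows`, `failing`, and the indicator keys are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Tuple

Workflow = Callable[[], List[int]]  # returns per-cell Likert values


def run_workflows(workflows: Dict[str, Workflow],
                  max_workers: int = 4) -> Tuple[Dict[str, List[int]], Dict[str, str]]:
    """Run every indicator workflow concurrently; collect results and failures."""
    results: Dict[str, List[int]] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn): name for name, fn in workflows.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # one failed indicator must not stop the rest
                errors[name] = str(exc)
    return results, errors


def failing() -> List[int]:
    raise RuntimeError("input raster missing")


results, errors = run_workflows({
    "indicator_a": lambda: [0, 3, 5],
    "indicator_b": lambda: [1, 1, 4],
    "indicator_c": failing,
})
```

Because each workflow owns its own indicator, a failure is recorded against that indicator alone and the remaining results are still available for aggregation.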
8.2 Optimizing Workflow Execution
We use several optimization techniques to ensure efficient parallel execution:
Example Python Code: Parallel GeoPackage Writing Using QGIS
📃 Please note the above needs to be tested and verified.
8.4 Error Handling and Workflow Recovery
Final System Workflow Integration
Here’s how the overall system executes indicator workflows in parallel, then aggregates the results into factors and dimensions:
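What follows is a compact sketch of that end-to-end flow, not the system's actual code. It assumes a hypothetical model with one dimension, two factors, and three indicators, represents each layer as a flat list of per-cell scores, and again substitutes `ThreadPoolExecutor` for the QgsTaskManager so it can run outside QGIS.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Grid = List[float]  # one score per grid cell

# Hypothetical model: dimension -> factor -> (factor weight, indicator weights).
MODEL: Dict[str, Dict[str, Tuple[float, Dict[str, float]]]] = {
    "dimension_x": {
        "factor_1": (0.5, {"ind_a": 0.5, "ind_b": 0.5}),
        "factor_2": (0.5, {"ind_c": 1.0}),
    }
}

# Stand-in workflows; real ones would read input layers and score each cell.
WORKFLOWS: Dict[str, Callable[[], Grid]] = {
    "ind_a": lambda: [4.0, 2.0],
    "ind_b": lambda: [2.0, 0.0],
    "ind_c": lambda: [5.0, 1.0],
}


def combine(grids: Dict[str, Grid], weights: Dict[str, float]) -> Grid:
    """Weighted per-cell sum of the grids named in `weights`."""
    size = len(next(iter(grids.values())))
    return [sum(w * grids[name][i] for name, w in weights.items()) for i in range(size)]


def run_model(model, workflows) -> Dict[str, Grid]:
    # Step 1: run all indicator workflows in parallel.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in workflows.items()}
        indicators = {name: f.result() for name, f in futures.items()}
    # Step 2: aggregate indicators into factors, then factors into dimensions.
    dimensions: Dict[str, Grid] = {}
    for dim, factors in model.items():
        factor_grids = {f: combine(indicators, iw) for f, (_, iw) in factors.items()}
        factor_weights = {f: fw for f, (fw, _) in factors.items()}
        dimensions[dim] = combine(factor_grids, factor_weights)
    return dimensions


dimensions = run_model(MODEL, WORKFLOWS)
```

Aggregation only starts once every indicator has finished, which keeps the parallel stage free of dependencies between tasks.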
Conclusion
By parallelizing indicator workflows and running each one independently, the system efficiently processes multiple indicators at the same time on a single machine. This method avoids grid-based partitioning (i.e. we parallelize by iterating over indicators rather than grid cells) and allows each indicator to be processed as its own unit, updating the corresponding columns in the GeoPackage. Individual workflows may choose to process cell by cell or feature by feature depending on what is expedient.