
System Design #300

Open
timlinux opened this issue Sep 23, 2024 · 0 comments

timlinux commented Sep 23, 2024

System Design Document: GEEST


1. Introduction and Overview

This system performs geospatial analysis on small nations, focusing on women’s empowerment. Multiple workflows are run in parallel, with each indicator processed independently, and the results stored in a GeoPackage.


2. Architecture Overview

We follow a hierarchy of dimensions, factors, and indicators, with values aggregated at each level. The system efficiently processes these layers using parallel tasks on a single machine.

```mermaid
classDiagram
    class dimension {
      +id: string
      +name: string
    }
    class factor {
      +id: string
      +name: string
      +weight: float
    }
    class indicator {
      +id: string
      +name: string
      +weight: float
    }

    dimension "1" --> "1..*" factor
    factor "1" --> "1..*" indicator
```

3. Study Area and Grid System

The system generates a 100m x 100m grid for the study area, using the boundaries of the relevant regions or islands. This grid is stored in the study_area_grid table, with each grid cell corresponding to a polygon.

  • Grid cells are assigned Likert-scale values (0-5) based on workflows.
```mermaid
graph TD;
    A[Study Area] --> B[Bounding Box]
    B --> C[100m x 100m Grid]
    C --> D[STUDY_AREA_GRID]
```
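The cell enumeration described above can be sketched in plain Python. The function name `generate_grid` is illustrative (the real implementation would build QGIS polygon features); the key point is that every cell edge falls on a multiple of the cell size measured from the study-area origin:

```python
def generate_grid(origin_x, origin_y, max_x, max_y, cell_size=100):
    """Yield (minx, miny, maxx, maxy) tuples for each grid cell.

    Cells start at the study-area origin, so every cell edge falls on
    a multiple of cell_size relative to that origin.
    """
    x = origin_x
    while x < max_x:
        y = origin_y
        while y < max_y:
            yield (x, y, x + cell_size, y + cell_size)
            y += cell_size
        x += cell_size
```

Each tuple can then be converted into a polygon feature and written to the study_area_grid table.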

4. Data Model

4.1 Bounding Box Table (study_area_bbox)

The study_area_bbox table stores the bounding box for the entire study area, aligned to a 100-meter grid. This table serves as a reference for the spatial extent of the study area. There will only ever be one feature in this table.

| id | bbox_polygon | area_name | area_id |
|----|--------------|-----------|---------|
| 1 | POLYGON((...)) | Nation 1 | 1 |
| ... | ... | ... | ... |

4.2 Local Area Boxes Table (local_area_boxes)

The local_area_boxes table has the same structure as study_area_bbox, but instead of representing the overall bounding box for the entire region, it stores the bounding boxes for each individual polygon that the system analyzes. Multipart polygons in the original input data will be broken out into single parts.

| id | polygon_box | area_name | area_id |
|----|-------------|-----------|---------|
| 1 | POLYGON((...)) | Island_1 | 101 |
| 2 | POLYGON((...)) | Island_2 | 102 |
| ... | ... | ... | ... |

The bounding boxes in this table will all be aligned to a 100m interval consistent with the origin of the study_area_bbox.
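The alignment rule can be sketched as a pure function (the name `align_bbox` is illustrative): each edge of a local bounding box is pushed outward to the nearest 100m multiple measured from the study-area origin.

```python
import math

def align_bbox(minx, miny, maxx, maxy, origin_x, origin_y, interval=100):
    """Expand a bounding box outward so that every edge lies on a
    multiple of `interval` metres from the study-area origin."""
    aligned_minx = origin_x + math.floor((minx - origin_x) / interval) * interval
    aligned_miny = origin_y + math.floor((miny - origin_y) / interval) * interval
    aligned_maxx = origin_x + math.ceil((maxx - origin_x) / interval) * interval
    aligned_maxy = origin_y + math.ceil((maxy - origin_y) / interval) * interval
    return (aligned_minx, aligned_miny, aligned_maxx, aligned_maxy)
```

For example, a box of (130, 270, 460, 390) with the study-area origin at (0, 0) aligns to (100, 200, 500, 400); because rounding is always outward, the aligned box always contains the original polygon.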

4.3 Output Raster Structure (study_area_mask)

The study_area_mask is a vrt raster compsed of one mask raster per feature.

  • We create one raster per input polygon part.
  • The raster will always be grid aligned to the origin of the entire study area such that the raster origin is divisible by 100 when offset from the study area origin.
  • The mask rasters will be written as Bit format tiffs.
  • The vrt will combine all of these constituent mask rasters into a single virtual grid with extents coincident with the study area.
  • Grid cells (100m x 100m).

The study_area_mask will be used to exclude any non study area pixels from analysis outputs.

4.4 Output Raster Structure (intermediate_result)

The _result is a VRT raster composed of one byte raster per feature.

  • We create one raster per input polygon part.
  • Each raster is grid-aligned to the origin of the entire study area, such that the raster origin's offset from the study area origin is divisible by 100.
  • The _result rasters will be written as byte-format TIFFs.
  • They will be suffixed by the normalized area name.
  • The VRT will combine all of these constituent _result rasters into a single virtual grid with extents coincident with the study area.
  • Grid cells are 100m x 100m.

The _result will be used to store intermediate analysis outputs, e.g. index scores.

4.5 Output Raster Structure (final result)

The _final is a VRT raster composed of one byte raster per feature.

  • We create one raster per input polygon part.
  • Each raster is grid-aligned to the origin of the entire study area, such that the raster origin's offset from the study area origin is divisible by 100.
  • The _final rasters will be written as byte-format TIFFs.
  • They will be suffixed by the normalized area name.
  • The VRT will combine all of these constituent _final rasters into a single virtual grid with extents coincident with the study area.
  • Grid cells are 100m x 100m.

The _final will be used to store final analysis outputs, with values scaled to the Likert scale.
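The rescaling to the 0–5 Likert range can be sketched as a small pure function (the name `scale_to_likert` and the linear-rescale-with-clamping behaviour are illustrative assumptions; individual workflows may use their own classification rules):

```python
def scale_to_likert(value, min_value, max_value):
    """Linearly rescale a raw score onto the 0-5 Likert range,
    clamping out-of-range inputs to the ends of the scale."""
    if max_value == min_value:
        return 0  # degenerate range: nothing to scale
    fraction = (value - min_value) / (max_value - min_value)
    fraction = max(0.0, min(1.0, fraction))
    return round(fraction * 5)
```

In the pipeline this would be applied per pixel when writing the _final byte rasters.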


5. Workflows

Each indicator has its own workflow, which calculates values and assigns them to the grid cells.

For example:

  • Raster Layer Input Workflow: This workflow processes raster data and assigns scaled values (0-5) based on raster statistics to the grid cells.

A workflow comprises all of the steps needed to do the calculation for that indicator. We will re-use common logic between workflows by factoring it into shared functions.

A workflow is a QgsTask, meaning we can submit it to the QgsTaskManager and either queue it or run multiple workflows in parallel.

6. Factor and Dimension Calculation

Factors are computed by aggregating indicators, and dimensions are calculated by aggregating factors. Weights can be applied to indicators and factors to determine their influence on the final calculation.

Factors and indicators need not be weighted equally, but their combined weights must sum to 1.
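The aggregation step can be sketched as a weighted sum with a sanity check on the weights (the function name `aggregate` is illustrative; in practice the same logic applies per grid cell or per pixel):

```python
def aggregate(values, weights, tolerance=1e-6):
    """Combine child scores (indicators into a factor, or factors
    into a dimension) as a weighted sum. Weights must total 1."""
    if abs(sum(weights) - 1.0) > tolerance:
        raise ValueError("Weights must sum to 1")
    return sum(v * w for v, w in zip(values, weights))
```

For example, two equally weighted indicators scoring 4 and 2 yield a factor score of 3.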


7. Visualization

The system uses QGIS-style formatting to visualize the results. Each indicator, factor, and dimension has its own styling rules, which are applied when rendering the data in QGIS. We will use style definitions with no outline so that the cells create the effect of a raster layer when visualized. This also gives us additional visualization options, such as extruding cells in 3D, adding labels to cells, producing charts per indicator, and so on.


8. Workflow Optimization and Parallelization

As mentioned above, we can accelerate the workflow processing by parallelizing indicator computations. Each indicator is processed by a separate task that runs concurrently.

```mermaid
graph LR
    I1[Indicator 1 Task] --> updateGeo1[Update i_literacy Column]
    I2[Indicator 2 Task] --> updateGeo2[Update i_health Column]
    I3[Indicator 3 Task] --> updateGeo3[Update i_economic Column]
    updateGeo1 --> finalGeo[Final GeoPackage]
    updateGeo2 --> finalGeo
    updateGeo3 --> finalGeo
```

8.1 Example of Parallel Indicator Processing

```mermaid
sequenceDiagram
    participant User
    participant RasterTask
    participant TravelTask
    participant GeoPackage

    User->>RasterTask: Start Raster Indicator Workflow
    User->>TravelTask: Start Travel Time Indicator Workflow
    RasterTask->>GeoPackage: Update i_raster_indicator Column
    TravelTask->>GeoPackage: Update i_travel_indicator Column
    GeoPackage->>User: Both Workflows Complete
```

8.2 Optimizing Workflow Execution

We use several optimization techniques to ensure efficient parallel execution:

```mermaid
graph TD
    task1[Indicator Workflow 1] --> spatial[Spatial Indexing]
    task2[Indicator Workflow 2] --> cache[In-memory Caching]
    task3[Indicator Workflow 3] --> batch[Batch Updates to GeoPackage]
    spatial --> finalGeo[Efficient GeoPackage Updates]
    cache --> finalGeo
    batch --> finalGeo
```

Example Python Code: Parallel GeoPackage Writing Using QGIS

```python
from qgis.core import QgsApplication, QgsTask, QgsVectorLayer, edit

# Function to update an attribute across all features of a GeoPackage layer
def update_geopackage(task_name, gpkg_layer_path, attribute_name, update_value):
    layer = QgsVectorLayer(gpkg_layer_path, "GeoPackage Layer", "ogr")

    if not layer.isValid():
        print(f"Layer {gpkg_layer_path} is not valid!")
        return

    with edit(layer):  # Start a transaction for safe editing
        for feature in layer.getFeatures():
            feature[attribute_name] = update_value
            layer.updateFeature(feature)
    print(f"{task_name}: Update complete.")

# Task class so the update can be queued on the QgsTaskManager
class UpdateGpkgTask(QgsTask):
    def __init__(self, name, gpkg_layer_path, attribute_name, update_value):
        super().__init__(name)
        self.gpkg_layer_path = gpkg_layer_path
        self.attribute_name = attribute_name
        self.update_value = update_value

    def run(self):
        update_geopackage(
            self.description(), self.gpkg_layer_path,
            self.attribute_name, self.update_value
        )
        return True

    def finished(self, result):
        if result:
            print(f"Task {self.description()} finished successfully.")
        else:
            print(f"Task {self.description()} failed.")

# Create the tasks
gpkg_path = "/path/to/your/geopackage.gpkg|layername=your_layer"
task1 = UpdateGpkgTask("Task 1", gpkg_path, "attribute_name", "new_value_1")
task2 = UpdateGpkgTask("Task 2", gpkg_path, "attribute_name", "new_value_2")

# Add tasks to the Task Manager
QgsApplication.taskManager().addTask(task1)
QgsApplication.taskManager().addTask(task2)
```

📃 Please note the above needs to be tested and verified.

8.3 Error Handling and Workflow Recovery

```mermaid
graph LR
    start[Start Indicator Task] --> checkpoint[Checkpoint Progress]
    checkpoint --> logErrors[Log Errors]
    logErrors --> retry[Retry Task if Needed]
    retry --> success[Task Completed Successfully]
```
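The log-and-retry loop above can be sketched as a plain wrapper (the name `run_with_retry` and the attempt limit are illustrative; in the real system this logic would live inside a QgsTask's run method):

```python
import logging

logging.basicConfig(level=logging.WARNING)

def run_with_retry(task, max_attempts=3):
    """Run a callable, logging each failure and retrying up to
    max_attempts times. Returns True on success, False otherwise."""
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return True
        except Exception as exc:  # log the error, then retry
            logging.warning("Attempt %d failed: %s", attempt, exc)
    return False
```

Checkpointing would additionally record the last completed step so a retried task can resume rather than restart.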

Final System Workflow Integration

Here’s how the overall system executes indicator workflows in parallel, then aggregates the results into factors and dimensions:

```mermaid
graph LR
    user[User Input] --> indicator[Parallel Indicator Workflow Tasks]
    indicator --> factor[Compute Factor Values from Indicators]
    factor --> dimension[Compute Dimension Values from Factors]
    dimension --> updateGeo[Update GeoPackage]
    updateGeo --> taskComplete[All Tasks Completed]
```

Conclusion

By parallelizing indicator workflows and running each one independently, the system efficiently processes multiple indicators at the same time on a single machine. This method avoids grid-based partitioning (i.e. we parallelize by iterating over indicators rather than grid cells) and allows each indicator to be processed as its own unit, updating the corresponding columns in the GeoPackage. Individual workflows may choose to process cell by cell or feature by feature depending on what is expedient.
