Skip to content

Refactored Water Data Pipeline #336

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

Wegatriespython
Copy link

@Wegatriespython Wegatriespython commented Apr 23, 2025

Refactored Water Data Pipeline

Short Desc
Improve readability & testability via DSL(Domain Specific Language)

Long Desc

  • Isolate “what” (rules) from “how” (engine) with a DSL
  • Replace nested procedural code with declarative Rule objects
  • Centralize execution in a single engine for targeted unit tests
  • Remove magic numbers, hard-coded constants, and duplicated logic
  • Introduce CitationWrapper utility for citations

Example Before
Procedural code with repeated make_df/broadcast steps and chained concat on the same variable names, making it hard to follow flow and test individual pieces:

# surface water
output_df = (
  make_df("output", technology="extract_surfacewater", …)
    .pipe(broadcast, yv_ya_sw, time=…)
    .pipe(same_time)
)
# groundwater (duplicated structure)
output_df = pd.concat([output_df, make_df("output", technology="extract_groundwater", …)
    .pipe(broadcast, yv_ya_gw, time=…)
    .pipe(same_time)])
# saline water (slightly different args, more duplication)

Example After

def _process_extraction_output_rules(...):
    """Processes extraction output rules declaratively via DSL."""
    extraction_outputs = []
    base = {"df_node": df_node, "runtime_vals": {"year_wat": year_wat}, "node_loc": node_region}
    for r in EXTRACTION_OUTPUT_RULES.get_rule():
        r["type"] = "output"
        bcast = yv_ya_gw if r["technology"].startswith("extract_gw") else yv_ya_sw
        df_rule = (
            run_standard(r, {**base, **({"sub_time": sub_time_series} if "saline" not in r["technology"] else {})}, broadcast_year=bcast)
        )
        extraction_outputs.append(df_rule)
    return safe_concat(extraction_outputs)

How to review

  • Feedback on the rules and clarify any questions/ambiguity
  • Verify DSL engine unit-tests cover all rule types and hidden control flow
  • Confirm readability: each rule object maps clearly to its output
  • Ensure no remaining large procedural functions or magic constants

PR checklist

  • CI checks pass
  • GDX file comparison
  • Tests added/expanded; coverage ✅
  • Documentation updated (whatsnew, docstrings)
  • Delete Parity Tests after review

@Wegatriespython Wegatriespython added enh New features or functionality water MESSAGEix-Nexus (water) variant safe to test PRs from forks that do not pose security risks ai PR includes some AI generated Code labels Apr 23, 2025
@Wegatriespython Wegatriespython self-assigned this Apr 23, 2025
Copy link

codecov bot commented Apr 23, 2025

Codecov Report

Attention: Patch coverage is 82.52205% with 535 lines in your changes missing coverage. Please review.

Project coverage is 78.0%. Comparing base (557d739) to head (4143a18).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...essage_ix_models/model/water/data/water_for_ppl.py 20.6% 185 Missing ⚠️
message_ix_models/model/water/data/water_supply.py 50.5% 97 Missing ⚠️
...ssage_ix_models/model/water/data/demands_legacy.py 74.7% 78 Missing ⚠️
message_ix_models/model/water/data/demands.py 83.4% 39 Missing ⚠️
..._ix_models/model/water/data/water_supply_legacy.py 78.5% 39 Missing ⚠️
message_ix_models/util/citation_wrapper.py 53.2% 29 Missing ⚠️
message_ix_models/model/water/dsl_engine.py 86.8% 19 Missing ⚠️
...ix_models/model/water/data/water_for_ppl_legacy.py 94.7% 15 Missing ⚠️
...age_ix_models/tests/model/water/test_dsl_engine.py 86.9% 12 Missing ⚠️
...ests/model/water/data/test_water_for_ppl_parity.py 95.4% 5 Missing ⚠️
... and 7 more
Additional details and impacted files
@@           Coverage Diff           @@
##            main    #336     +/-   ##
=======================================
- Coverage   78.5%   78.0%   -0.5%     
=======================================
  Files        216     236     +20     
  Lines      17409   19951   +2542     
=======================================
+ Hits       13679   15578   +1899     
- Misses      3730    4373    +643     
Files with missing lines Coverage Δ
message_ix_models/model/water/data/demand_rules.py 100.0% <100.0%> (ø)
...ssage_ix_models/model/water/data/infrastructure.py 100.0% <100.0%> (ø)
...x_models/model/water/data/infrastructure_legacy.py 100.0% <100.0%> (ø)
...ix_models/model/water/data/infrastructure_rules.py 100.0% <100.0%> (ø)
message_ix_models/model/water/data/irrigation.py 100.0% <100.0%> (ø)
...ge_ix_models/model/water/data/irrigation_legacy.py 100.0% <100.0%> (ø)
...age_ix_models/model/water/data/irrigation_rules.py 100.0% <100.0%> (ø)
..._ix_models/model/water/data/water_for_ppl_rules.py 100.0% <100.0%> (ø)
...e_ix_models/model/water/data/water_supply_rules.py 100.0% <100.0%> (ø)
...e_ix_models/tests/model/water/data/test_demands.py 100.0% <100.0%> (ø)
... and 21 more

... and 17 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@@ -0,0 +1,356 @@
from message_ix_models.model.water.utils import Rule

WD_CONST = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add some description for each series/rule on what are they for, what should be included and what no?
Does each _rule.py script need to have a xx_CONST series with a different name?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add the descriptions for the rules.

The constants are just to hold the magic numbers which were used in the rules. Some of them are common, but I have left them for now.
WD : Water Demand
WS : Water Supply



@minimum_version("python 3.10")
def run_standard(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you could rename this function, as the word "run" has in our context a different meaning. Something like "build" "fill" "parametrize", might be more understandable. not sure

Copy link
Contributor

@adrivinca adrivinca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just commenting, did not want to make a full review

Wegatriespython and others added 10 commits May 5, 2025 19:17
Introduces a string-based DSL engine that can
evaluate Python expressions within strings,
plus a full suite of water-data operations. Incl
unit tests for engine and operations.

BREAKING CHANGE: requires Python 3.10+;
minimum-version markers added to functions.
Migrated water demans to use the new DSL engine.
Includes the creation of demand_rules.py. Added
a new test for parity with the legacy code
and updated demands test.
Migrated water supply to use the new DSL engine.
Includes the creation of supply_rules. Added
a new test for parity with the legacy code
and updated supply test.
Similar to previous for infrastructure now.
Helpful utility to wrap citations,
used to replace citations in comments with a
consistent and uniform format.
Similar to previous, uses citation wrapper.
Additional to DSL migration, this breaks file into
two files: water_for_ppl_rules.py and
water_for_ppl_support.py. Cyclomatic complexity
is reduced.
@adrivinca adrivinca requested a review from glatterf42 May 7, 2025 15:02
@adrivinca
Copy link
Contributor

Thanks @Wegatriespython to polish the PR, actually make a new one.
As far as I understood, tests pass apart from python < 3.10
the GDX give identical results

Vignesh still has to:

  • edit the documentation pages
  • remove _legacy script if we agree with it
    but apart from this it's ready for other reviews

@Wegatriespython Wegatriespython removed the safe to test PRs from forks that do not pose security risks label May 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ai PR includes some AI generated Code enh New features or functionality water MESSAGEix-Nexus (water) variant
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants