Keep (some) input data ordered in MUSE2 #395

alexdewar · 2025-02-17T10:45:28Z

Description

We make heavy use of HashMaps to store data in MUSE2, usually with the key being some kind of user-specified string ID (representing a commodity, process etc.). HashMaps are not ordered data structures, meaning if you iterate over their contents you will get a different ordering every time you run the program, even when using the same input files.

While the dispatch optimisation model doesn't care about ordering (even time slices within a year are independent of one another), a consequence of this design is that the output files will be effectively randomly shuffled every time the user runs the program with the same input data, which is annoying and potentially confusing. As it stands, if a user was to e.g. make a bar chart of commodity prices by time slice in Excel, unless they sorted the data properly beforehand they would get a different-looking plot every time they run the program. You'd have the same problem with Python too: users will end up having to write a whole bunch of logic to sort the input data just in order to plot it. Note that sorting time slices alphabetically will give a confusing order, so they'll have to have an ordered list of time slices -- and if they add/remove time slices to the model then they'll end up having to change the script. In addition, this will also make it really annoying to debug the code: every time you start the program under a debugger, your decision variables will be in a different randomised order.

So let's use an ordered map where it makes sense (i.e. if the ordering will be user-visible). There's an external crate (indexmap) which maintains the insertion order of elements. We're using it already, so I've just replaced various HashMaps with IndexMaps (as well as using a few ordered sets in the time slice code). While I was at it, I defined type aliases for various common map types we're using to make the code a bit cleaner: e.g. a CommodityMap as an alias for IndexMap<Rc<str>, Rc<Commodity>>.

Closes #327.

Type of change

Bug fix (non-breaking change to fix an issue)
New feature (non-breaking change to add functionality)
Refactoring (non-breaking, non-functional change to improve maintainability)
Optimization (non-breaking change to speed up the code)
Breaking change (whatever its nature)
Documentation (improve or add documentation)

Key checklist

All tests pass: $ cargo test
The documentation builds and looks OK: $ cargo doc

Further checks

Code is commented, particularly in hard-to-understand areas
Tests added that prove fix is effective or that feature works

codecov · 2025-02-17T10:46:17Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.43%. Comparing base (0c12284) to head (4dd7e1b).
Report is 8 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #395   +/-   ##
=======================================
  Coverage   95.42%   95.43%           
=======================================
  Files          31       31           
  Lines        4004     4007    +3     
  Branches     4004     4007    +3     
=======================================
+ Hits         3821     3824    +3     
  Misses         96       96           
  Partials       87       87

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

TinyMarsh

I think deterministic outputs making debugging easier is enough justification for this. I'm very on the fence with this in principle though.

It's possibly an interesting discussion for another time, but there's something about separation of concerns, and not baking in order to the output so as not to impose arbitrary structure. Data processing pipelines should be robust enough to handle unordered data.

Code changes look good.

alexdewar · 2025-02-26T10:39:37Z

I think deterministic outputs making debugging easier is enough justification for this. I'm very on the fence with this in principle though.

It's possibly an interesting discussion for another time, but there's something about separation of concerns, and not baking in order to the output so as not to impose arbitrary structure. Data processing pipelines should be robust enough to handle unordered data.

I agree in principle, but this is mostly about debugability/convenience really -- I think it will confuse users if they get different-looking data files every time they run the program, for no obvious reason. Users won't necessarily be particularly technically savvy either and their Excel-based workflows might not be that robust 😛. It will also make it annoying to write regression tests, for example, if the outputs are in a different order on every run. This probably wouldn't be worth doing if it was annoying to implement, but given it's an easy fix, I think we should just do it to make life a bit easier.

Anyhoo, thanks for the review!

alexdewar added 6 commits February 17, 2025 10:26

Make read_csv_id_file return ordered map + keep regions ordered

59b0489

Keep agents ordered

aca0843

Keep commodities ordered

cd03f90

Keep processes ordered

a8c3a2e

Keep time slice info ordered

d38f6d4

Keep commodity prices ordered

4119ab3

alexdewar requested a review from TinyMarsh February 17, 2025 10:45

TinyMarsh approved these changes Feb 25, 2025

View reviewed changes

Merge branch 'main' into keep-input-data-ordered

4dd7e1b

alexdewar enabled auto-merge February 26, 2025 10:39

alexdewar merged commit 7733c8d into main Feb 26, 2025
7 checks passed

alexdewar deleted the keep-input-data-ordered branch February 26, 2025 10:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Keep (some) input data ordered in MUSE2 #395

Keep (some) input data ordered in MUSE2 #395

alexdewar commented Feb 17, 2025

codecov bot commented Feb 17, 2025 •

edited

Loading

TinyMarsh left a comment

alexdewar commented Feb 26, 2025

Keep (some) input data ordered in MUSE2 #395

Keep (some) input data ordered in MUSE2 #395

Conversation

alexdewar commented Feb 17, 2025

Description

Type of change

Key checklist

Further checks

codecov bot commented Feb 17, 2025 • edited Loading

Codecov Report

TinyMarsh left a comment

Choose a reason for hiding this comment

alexdewar commented Feb 26, 2025

codecov bot commented Feb 17, 2025 •

edited

Loading