Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep (some) input data ordered in MUSE2 #395

Merged
merged 7 commits into from
Feb 26, 2025
Merged

Conversation

alexdewar
Copy link
Collaborator

Description

We make heavy use of HashMaps to store data in MUSE2, usually with the key being some kind of user-specified string ID (representing a commodity, process etc.). HashMaps are not ordered data structures, meaning if you iterate over their contents you will get a different ordering every time you run the program, even when using the same input files.

While the dispatch optimisation model doesn't care about ordering (even time slices within a year are independent of one another), a consequence of this design is that the output files will be effectively randomly shuffled every time the user runs the program with the same input data, which is annoying and potentially confusing. As it stands, if a user was to e.g. make a bar chart of commodity prices by time slice in Excel, unless they sorted the data properly beforehand they would get a different-looking plot every time they run the program. You'd have the same problem with Python too: users will end up having to write a whole bunch of logic to sort the input data just in order to plot it. Note that sorting time slices alphabetically will give a confusing order, so they'll have to have an ordered list of time slices -- and if they add/remove time slices to the model then they'll end up having to change the script. In addition, this will also make it really annoying to debug the code: every time you start the program under a debugger, your decision variables will be in a different randomised order.

So let's use an ordered map where it makes sense (i.e. if the ordering will be user-visible). There's an external crate (indexmap) which maintains the insertion order of elements. We're using it already, so I've just replaced various HashMaps with IndexMaps (as well as using a few ordered sets in the time slice code). While I was at it, I defined type aliases for various common map types we're using to make the code a bit cleaner: e.g. a CommodityMap as an alias for IndexMap<Rc<str>, Rc<Commodity>>.

Closes #327.

Type of change

  • Bug fix (non-breaking change to fix an issue)
  • New feature (non-breaking change to add functionality)
  • Refactoring (non-breaking, non-functional change to improve maintainability)
  • Optimization (non-breaking change to speed up the code)
  • Breaking change (whatever its nature)
  • Documentation (improve or add documentation)

Key checklist

  • All tests pass: $ cargo test
  • The documentation builds and looks OK: $ cargo doc

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added that prove fix is effective or that feature works

@alexdewar alexdewar requested a review from TinyMarsh February 17, 2025 10:45
Copy link

codecov bot commented Feb 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.43%. Comparing base (0c12284) to head (4dd7e1b).
Report is 8 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #395   +/-   ##
=======================================
  Coverage   95.42%   95.43%           
=======================================
  Files          31       31           
  Lines        4004     4007    +3     
  Branches     4004     4007    +3     
=======================================
+ Hits         3821     3824    +3     
  Misses         96       96           
  Partials       87       87           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@TinyMarsh TinyMarsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think deterministic outputs making debugging easier is enough justification for this. I'm very on the fence with this in principle though.

It's possibly an interesting discussion for another time, but there's something about separation of concerns, and not baking in order to the output so as not to impose arbitrary structure. Data processing pipelines should be robust enough to handle unordered data.

Code changes look good.

@alexdewar
Copy link
Collaborator Author

I think deterministic outputs making debugging easier is enough justification for this. I'm very on the fence with this in principle though.

It's possibly an interesting discussion for another time, but there's something about separation of concerns, and not baking in order to the output so as not to impose arbitrary structure. Data processing pipelines should be robust enough to handle unordered data.

I agree in principle, but this is mostly about debugability/convenience really -- I think it will confuse users if they get different-looking data files every time they run the program, for no obvious reason. Users won't necessarily be particularly technically savvy either and their Excel-based workflows might not be that robust 😛. It will also make it annoying to write regression tests, for example, if the outputs are in a different order on every run. This probably wouldn't be worth doing if it was annoying to implement, but given it's an easy fix, I think we should just do it to make life a bit easier.

Anyhoo, thanks for the review!

@alexdewar alexdewar enabled auto-merge February 26, 2025 10:39
@alexdewar alexdewar merged commit 7733c8d into main Feb 26, 2025
7 checks passed
@alexdewar alexdewar deleted the keep-input-data-ordered branch February 26, 2025 10:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Maintain input order of various types of ID
2 participants