Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow objects (sensors, loops, zones) to be designated as non-essential and quietly skipped over if missing #31

Open
Krellan opened this issue Jan 21, 2023 · 3 comments

Comments

@Krellan
Copy link
Member

Krellan commented Jan 21, 2023

Here's a feature request I am seeing come in from the field.

We need to allow objects to be designated as non-essential.

This will allow them to be quietly skipped over during parsing of the configuration. They are designated as non-essential, so if they are missing, that is acceptable. They will be quietly skipped over, as if they never existed.

I propose adding this flag to input sensors, PID loops, and thermal zones.

What should the syntax be? It has to be something that will work both for the old-style (directly consuming the config.json file) and the new-style (Inventory objects on D-Bus created by entity-manager from JSON files) configuration paths.

This means that if we decorate the elements of the sensors array (the sensor objects) in the old-style JSON, we also need to think of a different way to do this for the new-style configuration path, because we can't decorate sensor objects in the new-style (because they are controlled by entity-manager, not us). However, we could augment the PID loop that uses them, like what was done for TempToMargin to flag certain sensors as being eligible for temperature-to-margin conversion.

PID loops and thermal zones need to also have a flag in this way. There was one attempt to add a catch-all flags parameter, but this patch ran into problems during review, because it's difficult/bureaucratic to make any changes to an established interface. Should we bring this feature back? https://gerrit.openbmc.org/c/openbmc/phosphor-pid-control/+/41802

Interestingly, the thermal zone parameters were successfully augmented later, to add a few more optional parameters to control timing, and interestingly, there was no corresponding upstream objection then: https://gerrit.openbmc.org/c/openbmc/phosphor-pid-control/+/57660

Anyway, once we have agreement on the mechanism, and the flags are added for sensors, loops, and zones, here's the behavioral changes that will result:

  • If sensor is flagged as non-essential, and it isn't found during loading, it will be silently skipped over, as if it never existed in the configuration.

  • If sensor is flagged as non-essential, and it fails during reading, or returns a NaN value during reading, this error will be ignored, and the sensor's contribution to PID loop input will be ignored, as if the sensor never existed.

  • If a PID loop has no inputs at all, because all of the inputs were non-essential and they weren't there, the zone will still be thrown into failsafe mode, unless the PID loop itself is also designated as non-essential.

  • If a PID loop is designated as non-essential, if it is unable to produce an output value for any reason (all input sensors are missing or failed), then this PID loop will be ignored, as if it never existed in the configuration.

  • If a zone is flagged as non-essential, it will not result in any errors if it fails. No fan output values will be written, if there was a problem with the zone. As with other objects, it will simply be ignored, as if it never existed.

How's this? Anything missing here?

@Krellan
Copy link
Member Author

Krellan commented Jan 25, 2023

Some thoughts.

There might already be an existing Availability interface we can use. However, this is only for D-Bus sensor objects. It does not extend to PID loop objects, nor to thermal zone objects.

Also, what should the name of this feature be? Unimportant? Nonessential? MissingIsOK? MissingIsAcceptable?

@Krellan
Copy link
Member Author

Krellan commented Feb 8, 2023

I like using the name MissingIsAcceptable.

Also, this is a trait from PID control's point of view, not a trait of the sensor itself, so it should not be in the Sensor interface. For this and other reasons, this rules out using the existing Availability flag.

How about this:

In a PID loop (Entity Manager type Pid), how about this:

{
  "Name": "NecessaryTempPidLoop",
  "Type": "Pid",
  ...
  "Inputs": [ "FlakyTemp1", "DependableTemp1", "FlakyTemp2", "DependableTemp2" ],
  ...
  "MissingIsAcceptable": [ "FlakyTemp1", "FlakyTemp2" ],
  ...
}

In this example, there are 4 temperature inputs. Of these, 2 are flaky, and 2 are dependable. The flaky sensors are listed in MissingIsAcceptable so if something happens to them (disappeared, reading unparseable, reading is a floating-point irrational such as NaN, any other error), they will be simply ignored. They will not contribute to the PID loop calculations, as if they never existed.

If FlakyTemp11 or FlakyTemp2 are NaN or missing, nothing happens. If DependableTemp1 or DependableTemp2 are NaN or missing, however, the entire PID loop output will be set to NaN, to correctly propagate the error upward.

This concept extends to thermal zones as well:

{
  "Type": "Pid.Zone",
  "Name": "MyCoolZone",
  "Inputs": [ "NecessaryTempPidLoop", "OptionalTempPidLoop", "MyFanPidLoop" ],
  ...
  "MissingIsAcceptable": [ "OptionalTempPidLoop" ],
  ...
}

In this example, NecessaryTempPidLoop will cause the entire zone to be thrown into failsafe if the output of this PID loop is NaN, as expected. However, because OptionalTempPidLoop is listed in the MissingIsAcceptable array, it is treated as optional, so if it is NaN or any other error happens with it, that's no big deal, and the thermal zone will just behave as if that PID loop never existed.

The use case is to allow some optional sensors and PID loops to help react faster to thermal spikes, and keep the system cool, without having to run the main fans as often. However, the mandatory sensors and PID loops will be there as a backup. Even though they are slower to react, they are enough to keep the system cool, albeit at the cost of having to run the fans more often than is strictly necessary. So, this system is not in any danger of overheating if we omit the optional stuff, but it really is nice to able to have the optional sensors working, as it saves power and runs quieter.

bradbishop pushed a commit that referenced this issue Oct 25, 2023
This is a partial implementation of the ideas here:
#31

A new configuration item is supported in the PID object, named
"MissingIsAcceptable" (for D-Bus) or "missingIsAcceptable" (for the old
config.json). The value is an array of strings. If these strings match
sensor names, those sensors will be flagged as "missing is acceptable",
that is, they can go missing and the zone will not be thrown into
failsafe mode as a result.

This can be handy for sensors that are not always available on your
particular machine. It is independent of the existing Availability
interface, because the decision to go into failsafe mode or not is a
property of the PID loop, not of the sensor itself.

If a PID loop consists of all sensors that are missing, the output
will be deemed to be the setpoint, thus essentially making the PID
loop a no-op. Now initializing sensor values to NaN, not zero, as zero
is not a good default if PID loop is margin, undoing a bug I made:
https://gerrit.openbmc.org/c/openbmc/phosphor-pid-control/+/38228

Tested: It worked for me. Also, added a unit test case.

Change-Id: Idc7978ab06fcc9ed8c6c9df9483101376e5df4d1
Signed-off-by: Josh Lehan <[email protected]>
@Krellan
Copy link
Member Author

Krellan commented Oct 30, 2023

The MissingIsAcceptable feature has been merged. This should provide most, but not all, of what we want. There are still a few edge cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant