Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dynamic sensor logic for fixed and psu presence/state checking in thermalctld #401

Merged
merged 9 commits into from
Nov 28, 2023

Conversation

gregoryboudreau
Copy link
Contributor

@gregoryboudreau gregoryboudreau commented Oct 17, 2023

Add dynamic sensor logic for fixed and psu presence/state checking in thermalctld

Description

Extends the existing dynamic sensor logic of modular chassis to fixed and adds a check for PSU presence/status in thermalctld

Motivation and Context

In current form, if a PSU or other device with sensors is removed from the device or disabled during runtime on a fixed chassis, they will become stale and remain in the temperature sensor DB which should not be correct. This change brings the existing logic checks that modular chassis' have to fixed platform as well as checks for PSU presence/status to account for scenario where PSU may be disabled during runtime.

How Has This Been Tested?

On both fixed and modular platforms (Cisco 8800RP and Cisco 8101_32FH), I confirmed presence of all thermal sensors. Then disable the PSU and confirmed its associated sensors were removed from the DB while it was either inactive and present or not present. Reenabled PSU and confirmed its associated sensors were updated back into the DB.

On test_thermalctld.py added change in test_update_module_thermals to simulate scenario where PSU which had thermals added was disabled (status set to false in this case) at runtime but still present.

Additional Information (Optional)

@gregoryboudreau gregoryboudreau marked this pull request as ready for review October 23, 2023 17:52
@gregoryboudreau
Copy link
Contributor Author

This PR is being introduced as was described in #387

@bmridul
Copy link
Collaborator

bmridul commented Nov 3, 2023

@gechiang , Can u review this PR.

Copy link
Contributor

@gechiang gechiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@gregoryboudreau
Copy link
Contributor Author

gregoryboudreau commented Nov 8, 2023

This definitely needs to go in 202305 and should also probably go into 202205 as well (and the other branches moving forward though I assume that is expected). Fixes bug in both where stale thermal sensors will be left in temp db and reported on until device power cycle as well as undefined behavior on fixed platforms where dependent on the sensors, a device that is not on but present can still report sensor information which is not behavior we would expect.

@gechiang
Copy link
Contributor

gechiang commented Nov 8, 2023

@Junchao-Mellanox , please help review this PR. Thanks!

@kevinskwang
Copy link

/azp run

Copy link

Commenter does not have sufficient privileges for PR 401 in repo sonic-net/sonic-platform-daemons

@keboliu
Copy link
Collaborator

keboliu commented Nov 10, 2023

If my understanding is correct, this PR will change the output of the "show platform psustatus" in some cases(when some PSU is not present), some sonic-mgmt test cases is relying on analyzing the CLI output. I would like suggest providing a test result on those cases against this PR, to make sure no case is broken. If some test case is broken, as a full solution, new PR shall also be raised to fix those test cases according to the new logic.
I would suggest at least running the following test case to verify the change.
https://github.com/sonic-net/sonic-mgmt/blob/master/tests/platform_tests/cli/test_show_platform.py

Besides the potential impaction to the test cases, we should also update the CLI command reference: https://github.com/sonic-net/sonic-utilities/blob/master/doc/Command-Reference.md

@Junchao-Mellanox
Copy link
Collaborator

Same concern as Kebo mentioned.

@gregoryboudreau
Copy link
Contributor Author

@keboliu @Junchao-Mellanox
To my understanding, this has no impact on show platform psustatus. This change should only effect show platform temperature as the modification changes how temperatures are stored within the temperature DB. Specifically making it 1) so that if a device is removed during runtime (in a fixed platform) that the temperature sensors associated with it are not left stale within the DB, in the exact same manner as is done already within modular chassis and 2) so that if a PSU is disabled/turned off during runtime but is still present in the chassis, that it's sensors are not included in the temperature DB either.

I have confirmed that this DOES pass platform_tests/cli/test_show_platform.py::test_show_platform_temperature[mth64-m5-2] PASSED and the behavior in the test case for test_show_platform_psustatus is also unchanged from before changes are applied. Additionally, this change should not warrant/require any changes to the CLI reference as the overall functionality and display format of show platform temperature remains completely untouched. To reiterate from above, all this change does is change the behavior of when sensors are removed/included from the temperature database. All other components (especially its display and formatting) of show platform temperature remain unchanged.

@gechiang
Copy link
Contributor

@prgeor , when you have a chance, please help review this PR.
Thanks!

@gechiang
Copy link
Contributor

MSFT ADO: 25872447

@gregoryboudreau
Copy link
Contributor Author

@prgeor
Needed for other branches as it fixes bug where stale thermal sensors will be left in temp db and reported on until device power cycle on fixed platform. (extension of the dynamic sensor logic from modular platform to fixed platforms)

@prgeor
Copy link
Collaborator

prgeor commented Nov 21, 2023

@prgeor Needed for other branches as it fixes bug where stale thermal sensors will be left in temp db and reported on until device power cycle on fixed platform. (extension of the dynamic sensor logic from modular platform to fixed platforms)

@gregoryboudreau CLI/DB part I understand. Which functionality is broken? ... like what is broken that mandates this to be ported to other stable branches 202205?

@gregoryboudreau
Copy link
Contributor Author

Because of the stale sensors, if a device is removed and tests are run without a power cycle, it can lead to the failure of mgmt tests such as platform_tests/test_thermal_state_db.py. It is also incorrect functionality for these stale sensors to remain in the first place (though not sure how pertinent that is to the branch discussion). These changes were begun due to the above described test failure being reported to us by MSFT on 202205 devices.

If this is unable to get into 202205, is it still up for 202305?

@gregoryboudreau
Copy link
Contributor Author

@prgeor Pending the decision for potential inclusion in the past release branches, can this be merged into master?

@prgeor prgeor merged commit e2d9f87 into sonic-net:master Nov 28, 2023
5 checks passed
@gechiang
Copy link
Contributor

@yxieca , @StormLiangMS , Please help review/approve this for 202205 and 202305.
MSFT ADO: 25872447
Thanks!

@gechiang
Copy link
Contributor

@gregoryboudreau please raise a separate PR under 202205 to resolve the conflict as this POR cannot be cherry-picked directly...
Thanks!

@StormLiangMS
Copy link

@gregoryboudreau @gechiang could you pls have a test on top of 202305 with this PR? to avoid regression at this stage.

@gregoryboudreau
Copy link
Contributor Author

@gechiang Please find a double commit PR for 202205 for these changes here: #409

@gechiang @StormLiangMS On 202305 based systems, tests are passing, specifically reran platform_tests/cli/test_show_platform.py::test_show_platform_temperature[mth64-m5-2] which is passing. Manual testing of thermals also has not shown any issues in 202305 system. Please let me know if a separate PR is needed, thanks.

StormLiangMS pushed a commit that referenced this pull request Dec 5, 2023
… thermalctld (#401)

* add modular sensor logic even for fixed devices and presence/status checking for PSUs

* test changes

* fixing accidental removal of name

* logic correction

* isolating key error

* psu runtime change

* fixing whitespace addition

* remove powergood check from thermalctld logic
@rlhui rlhui added P0 and removed P0 labels Dec 15, 2023
@mssonicbld
Copy link
Collaborator

@gregoryboudreau this PR already included in master Branch. Please remove Request for master label.

@gregoryboudreau
Copy link
Contributor Author

gregoryboudreau commented Dec 21, 2023

This PR looks like it is already in 202311. Also the auto-reply seems to be calling the wrong branch name.

@gechiang
Copy link
Contributor

@gregoryboudreau This PR is not possible to be picked cleanly into 202311 directly. Please create a PR on top of the 202311 branch. Thanks!

@gregoryboudreau
Copy link
Contributor Author

@gechiang I might be missing something, what part of this PR isn't in 202311? https://github.com/sonic-net/sonic-platform-daemons/blob/202311/sonic-thermalctld/scripts/thermalctld#L569, it looks like its there?

@gechiang
Copy link
Contributor

@gechiang I might be missing something, what part of this PR isn't in 202311? https://github.com/sonic-net/sonic-platform-daemons/blob/202311/sonic-thermalctld/scripts/thermalctld#L569, it looks like its there?

You are absolutely correct. So it looks like this PR got merged at the border line when the 202311 got branched out which picked it up already. Nice! One less thing to worry about.
I will go ahead remove the request for 202311 as it is already there.
Thanks for responding promptly!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.