Investigate rasdaemon for hardware error reporting #518

Open
mweinelt opened this issue Dec 6, 2024 · 3 comments

@mweinelt
Member

mweinelt commented Dec 6, 2024

Haumea is currently exhibiting correctable memory errors, and it would be great to a) monitor and b) log these events.

For logging, I found hardware.rasdaemon, which can listen for these kinds of events:

# rasdaemon -f
rasdaemon: Improper PAGE_CE_ACTION, set to default soft
rasdaemon: Page offline choice on Corrected Errors is soft
rasdaemon: Improper PAGE_CE_THRESHOLD, set to default 50.
rasdaemon: Improper PAGE_CE_REFRESH_CYCLE, set to default 24h.
rasdaemon: Threshold of memory Corrected Errors is 50 / 24h
rasdaemon: ras:mc_event event enabled
rasdaemon: Enabled event ras:mc_event
rasdaemon: ras:aer_event event enabled
rasdaemon: Enabled event ras:aer_event
rasdaemon: ras:non_standard_event event enabled
rasdaemon: Enabled event ras:non_standard_event
rasdaemon: ras:arm_event event enabled
rasdaemon: Enabled event ras:arm_event
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu0/online failed
rasdaemon: Cpu fault isolation is disabled
rasdaemon: mce:mce_record event enabled
rasdaemon: Enabled event mce:mce_record
rasdaemon: ras:extlog_mem_event event enabled
rasdaemon: Enabled event ras:extlog_mem_event
rasdaemon: net:net_dev_xmit_timeout event enabled
rasdaemon: Enabled event net:net_dev_xmit_timeout
rasdaemon: devlink:devlink_health_report event enabled
rasdaemon: Enabled event devlink:devlink_health_report
rasdaemon: block:block_rq_error event enabled
rasdaemon: Enabled event block:block_rq_error
rasdaemon: ras:memory_failure_event event enabled
rasdaemon: Enabled event ras:memory_failure_event
rasdaemon: Listening to events for cpus 0 to 15
           <...>-1268491 [000] .....     0.026543 mce_record 2024-12-06 15:35:41 +0000 Unified Memory Controller (bank=17), status= 9c2040000000011b, Corrected error, no action required., mci=CECC, mca= DRAM ECC error.
 Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=1, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 319deb440, synd= b22c00100a800301, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
           <...>-1276690 [000] .....     0.026608 mce_record 2024-12-06 15:46:37 +0000 Unified Memory Controller (bank=17), status= 9c2040000000011b, Corrected error, no action required., mci=CECC, mca= DRAM ECC error.
 Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=1, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 319deb440, synd= b22c00100a800301, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
           <...>-1285769 [000] .....     0.026739 mce_record 2024-12-06 16:08:27 +0000 Unified Memory Controller (bank=17), status= 9c2040000000011b, Corrected error, no action required., mci=CECC, mca= DRAM ECC error.
 Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=1, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 319deb440, synd= b22c00100a800301, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
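
For the monitoring side, the mce_record lines above could be turned into per-location counts. The following is only a minimal Python sketch under stated assumptions, not how rasdaemon or any exporter works internally: the helper names (`parse_events`, `count_corrected`) are hypothetical, the regular expressions assume exactly the line layout shown in the log above, and "corrected" is derived solely from the UC flag (bit 61) of the MCA status value being clear.

```python
import re
from collections import Counter

# Header line of an mce_record event, in the layout shown in the log above.
HEADER_RE = re.compile(
    r"mce_record\s+(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4})\s+"
    r"(?P<unit>.+?) \(bank=(?P<bank>\d+)\), status=\s*(?P<status>[0-9a-f]+)"
)
# "key=value" or "key= value" pairs with numeric values on the detail line.
FIELD_RE = re.compile(r"(\w+)=\s*([0-9a-f]+)")

UC_BIT = 1 << 61  # "uncorrected error" flag in the x86 MCA status register


def parse_events(lines):
    """Hypothetical helper: pair each mce_record header with its detail line."""
    events = []
    for line in lines:
        header = HEADER_RE.search(line)
        if header:
            status = int(header["status"], 16)
            events.append({
                "timestamp": header["timestamp"],
                "unit": header["unit"],
                "bank": int(header["bank"]),
                "status": status,
                "corrected": not (status & UC_BIT),
            })
        elif events and "Memory Error" in line:
            fields = dict(FIELD_RE.findall(line))
            events[-1].update(
                memory_channel=int(fields["memory_channel"]),
                csrow=int(fields["csrow"]),
                addr=int(fields["addr"], 16),
            )
    return events


def count_corrected(events):
    """Count corrected errors per (memory_channel, csrow, addr)."""
    return Counter(
        (e.get("memory_channel"), e.get("csrow"), e.get("addr"))
        for e in events
        if e["corrected"]
    )


# First event from the log above, copied verbatim.
SAMPLE = [
    "           <...>-1268491 [000] .....     0.026543 mce_record 2024-12-06 15:35:41 +0000 Unified Memory Controller (bank=17), status= 9c2040000000011b, Corrected error, no action required., mci=CECC, mca= DRAM ECC error.",
    " Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=1, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 319deb440, synd= b22c00100a800301, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0",
]

events = parse_events(SAMPLE)
counts = count_corrected(events)
for (channel, csrow, addr), n in counts.items():
    print(f"channel={channel} csrow={csrow} addr={addr:#x} corrected={n}")
```

Grouping by channel, csrow, and address makes it easy to see that all three events in the log point at the same location (addr 319deb440), which is the kind of pattern a monitoring rule would want to surface.
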
@mweinelt
Member Author

Found an exporter and created a package and a module for it upstream in nixpkgs.

NixOS/nixpkgs#366417

@mweinelt
Member Author

https://nixpk.gs/pr-tracker.html?pr=366558 for nixos-24.11

@mweinelt
Member Author

mweinelt commented Jan 6, 2025

Deployed to all bare-metal hosts, but for some reason the exporter does not export the error count on haumea.
