Haumea is currently exhibiting correctable memory errors, and it would be great to a) monitor and b) log these events.
For logging, I found hardware.rasdaemon, which can listen for these kinds of events:
# rasdaemon -f
rasdaemon: Improper PAGE_CE_ACTION, set to default soft
rasdaemon: Page offline choice on Corrected Errors is soft
rasdaemon: Improper PAGE_CE_THRESHOLD, set to default 50.
rasdaemon: Improper PAGE_CE_REFRESH_CYCLE, set to default 24h.
rasdaemon: Threshold of memory Corrected Errors is 50 / 24h
rasdaemon: ras:mc_event event enabled
rasdaemon: Enabled event ras:mc_event
rasdaemon: ras:aer_event event enabled
rasdaemon: Enabled event ras:aer_event
rasdaemon: ras:non_standard_event event enabled
rasdaemon: Enabled event ras:non_standard_event
rasdaemon: ras:arm_event event enabled
rasdaemon: Enabled event ras:arm_event
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu0/online failed
rasdaemon: Cpu fault isolation is disabled
rasdaemon: mce:mce_record event enabled
rasdaemon: Enabled event mce:mce_record
rasdaemon: ras:extlog_mem_event event enabled
rasdaemon: Enabled event ras:extlog_mem_event
rasdaemon: net:net_dev_xmit_timeout event enabled
rasdaemon: Enabled event net:net_dev_xmit_timeout
rasdaemon: devlink:devlink_health_report event enabled
rasdaemon: Enabled event devlink:devlink_health_report
rasdaemon: block:block_rq_error event enabled
rasdaemon: Enabled event block:block_rq_error
rasdaemon: ras:memory_failure_event event enabled
rasdaemon: Enabled event ras:memory_failure_event
rasdaemon: Listening to events for cpus 0 to 15
<...>-1268491 [000] ..... 0.026543 mce_record 2024-12-06 15:35:41 +0000 Unified Memory Controller (bank=17), status= 9c2040000000011b, Corrected error, no action required., mci=CECC, mca= DRAM ECC error. Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=1, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 319deb440, synd= b22c00100a800301, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
<...>-1276690 [000] ..... 0.026608 mce_record 2024-12-06 15:46:37 +0000 Unified Memory Controller (bank=17), status= 9c2040000000011b, Corrected error, no action required., mci=CECC, mca= DRAM ECC error. Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=1, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 319deb440, synd= b22c00100a800301, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
<...>-1285769 [000] ..... 0.026739 mce_record 2024-12-06 16:08:27 +0000 Unified Memory Controller (bank=17), status= 9c2040000000011b, Corrected error, no action required., mci=CECC, mca= DRAM ECC error. Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=1, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 319deb440, synd= b22c00100a800301, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
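For the monitoring side, a rough sketch of how the mce_record lines in this output could be counted per memory location. It assumes one record per line in the format shown above; the names `parse_mce_record` and `count_corrected_errors` are hypothetical helpers for illustration, not part of rasdaemon or of any exporter.

```python
import re
from collections import Counter

# Matches the timestamp that follows "mce_record" in `rasdaemon -f` output,
# and captures everything after it as free text.
RECORD_RE = re.compile(
    r"mce_record\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4})\s+"
    r"(?P<rest>.*)"
)

# key=value pairs; only the first token of a value is kept, so a
# multi-word value such as "mca= DRAM ECC error" is truncated to "DRAM".
FIELD_RE = re.compile(r"(\w+)=\s*([^,\s)]+)")


def parse_mce_record(line):
    """Return a dict of fields for an mce_record line, or None for other lines."""
    match = RECORD_RE.search(line)
    if match is None:
        return None
    rest = match.group("rest")
    fields = dict(FIELD_RE.findall(rest))
    fields["timestamp"] = match.group("timestamp")
    fields["corrected"] = "Corrected error" in rest
    return fields


def count_corrected_errors(lines):
    """Count corrected errors keyed by (memory_channel, csrow, addr)."""
    counts = Counter()
    for line in lines:
        record = parse_mce_record(line)
        if record and record["corrected"]:
            key = (
                record.get("memory_channel"),
                record.get("csrow"),
                record.get("addr"),
            )
            counts[key] += 1
    return counts
```

Fed the three records above, this would attribute all of them to the same channel, csrow and address, which is the kind of signal that suggests a single failing location rather than scattered errors.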
Found an exporter and created a package and module for it upstream in nixpkgs:
NixOS/nixpkgs#366417
https://nixpk.gs/pr-tracker.html?pr=366558 for nixos-24.11
Deployed to all bare metal hosts, but for some reason the exporter does not export the error count on haumea.