coreboot-4.11: add fixes to KGPE-D16 raminit #1760

Merged

tlaurion
Collaborator

The added patches fix bugs in fam15h ram DQS timing and configure the motherboard to restart gracefully if raminit fails instead of booting into an unstable state and/or crashing.

Supersedes #1709 with signed commits to pass the DCO check in CI.

The added patches fix bugs in fam15h ram DQS timing and configure the motherboard to restart
gracefully if raminit fails instead of booting into an unstable state and/or crashing.

Signed-off-by: Thierry Laurion <[email protected]>
@tlaurion
Collaborator Author

Related post in the D16 club Matrix room: https://matrix.to/#/!uNiXsBMseUsZDsZgDt:dodoid.com/$VfffoFicd5HtkQS5XP1_e2L6KidgZ4i6Un6754kvcMQ?via=dodoid.com&via=matrix.org&via=envs.net

As per #692 and the current https://github.com/linuxboot/heads/blob/master/BOARD_TESTERS.md, I am tagging board owners interested in testing D16 builds. Please report improvements or regressions here. I need at least one approval to merge. Thank you.

TESTING NEEDED:
kgpe-d16 (AMD fam15h) (dropped in coreboot 4.12): @tlaurion @Tonux599 @zifxify @arhabd

@tlaurion tlaurion mentioned this pull request Aug 26, 2024
Contributor

@arhabd arhabd left a comment

Since it's just a stability update and my system was stable before this update, I can't comment on whether it's a big improvement, but I can confirm it boots and works just as well as master when booting and doing an OEM reset.

@tlaurion
Collaborator Author

So no observed regression on at least one supported HCL. Merging.

@tlaurion tlaurion merged commit 51ade5b into linuxboot:master Aug 27, 2024
42 checks passed
Subject: [PATCH 1/2] northbridge/amd: Fixed errors in fam15h DQS timing

Fixed two errors in determining whether valid values were
found for read DQS delays in raminit.
Contributor

Nice. Since raminit is a complex and long-known problem, and while you still remember your debugging and analysis, could you please elaborate? What was the problem? What type of RAM was affected? How can it be verified? Which read DQS delay values were considered incorrect and are now correct?

Contributor

Nice. Since raminit is a complex and long-known problem, and while you still remember your debugging and analysis, could you please elaborate? What was the problem? What type of RAM was affected? How can it be verified? Which read DQS delay values were considered incorrect and are now correct?

Three fixes are included in this patch that improve the reliability of raminit for all RAM.

  1. A logic error required only a single memory lane to pass DQS timing for read delays. This patch fixes it so that all lanes must pass before the timing values are added to the list of potential timing configurations.
  2. When raminit is searching for timing values, it now discards negative DQS values instead of adding the faulty parameters to the list of candidate configurations.
  3. Previously, when raminit failed, coreboot continued to boot into an extremely unstable state (often resetting or freezing before coreboot could even finish booting). The patch simply has the board reset and try again when raminit fails. This is a very lazy workaround. After adding fixes 1 and 2, this patch only got triggered ~1 in 50 boots, IIRC.
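
To make the three fixes concrete, here is a minimal sketch of the logic described in the list above. It is not the coreboot code: the lane count, the struct, and every function name are hypothetical and chosen only for illustration, and the "reset" is modeled as a return value rather than an actual board reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LANES 8 /* hypothetical lane count, for illustration only */

/* Hypothetical result of one read DQS training attempt. */
struct dqs_trial {
    bool lane_pass[LANES];   /* did this lane pass DQS timing? */
    int16_t delay[LANES];    /* measured read DQS delay per lane */
};

/* Fixes 1 and 2: a trial is a usable candidate only if EVERY lane
 * passed (not just one) and no lane reports a negative delay. */
static bool trial_is_candidate(const struct dqs_trial *t)
{
    for (int lane = 0; lane < LANES; lane++) {
        if (!t->lane_pass[lane])
            return false;    /* fix 1: one failing lane rejects the trial */
        if (t->delay[lane] < 0)
            return false;    /* fix 2: negative delays are discarded */
    }
    return true;
}

/* Store the indices of usable trials in out[]; returns how many were kept. */
static int collect_candidates(const struct dqs_trial *trials, int n,
                              int *out, int max_out)
{
    assert(max_out > 0);
    int count = 0;
    for (int i = 0; i < n && count < max_out; i++) {
        if (trial_is_candidate(&trials[i]))
            out[count++] = i;
    }
    return count;
}

/* Fix 3 (modeled): with no usable candidate, request a reset and retry
 * instead of continuing to boot with bad timing values. */
enum train_action { TRAIN_CONTINUE_BOOT, TRAIN_RESET_AND_RETRY };

static enum train_action decide_after_training(int candidate_count)
{
    return candidate_count > 0 ? TRAIN_CONTINUE_BOOT : TRAIN_RESET_AND_RETRY;
}
```

A caller would fill an array of `dqs_trial` results, pass it to `collect_candidates()`, and hand the returned count to `decide_after_training()`. The point of the sketch is only the ordering of the checks: a trial is rejected before it ever reaches the candidate list, so the later selection step never sees a configuration with a failing lane or a negative delay.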

These patches improve boot consistency. At times I'd have as many as 5 out of 10 boots fail under stock coreboot-4.11 (8xCT16G3ERSLD4160B, 2x6328); with these patches I had 1000+ boots without any issue. I do not know if these patches enable any new RAM configurations. It is possible that the "bad" RAM configs were more prone to incorrectly passing DQS timing with bad read delays (Fix 1). If that is the case, this could enable previously unusable RAM configs.
