Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coreboot-4.11: add fixes to KGPE-D16 raminit #1760

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
From f6c818898b3f978bd22ed2829a881322e0eadaf9 Mon Sep 17 00:00:00 2001
From: Mike Rothfuss <[email protected]>
Date: Fri, 23 Aug 2024 19:54:54 -0600
Subject: [PATCH 1/2] northbridge/amd: Fixed errors in fam15h DQS timing

Fixed two errors in determining whether valid values were
found for read DQS delays in raminit.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. As raminit is quite complex issue and a long known problem, as long as you have memory about your debugging and analysis, could you please elaborate. What was the problem? What type of RAM was affected? How can it be verified? What read DQS delay values were considered incorrect and are now correct?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. As raminit is quite complex issue and a long known problem, as long as you have memory about your debugging and analysis, could you please elaborate. What was the problem? What type of RAM was affected? How can it be verified? What read DQS delay values were considered incorrect and are now correct?

Three fixes are included in this patch that improve the reliability of raminit for all RAM.

  1. A logic error required only a single memory lane to pass DQS timing for read delays, this patch fixes it so all lanes must pass for the timing values to be added to the list of potential timing configurations.
  2. When raminit is searching for timing values, it now discards negative DQS values instead of adding the faulty parameters to the list of candidate configurations.
  3. Previously when raminit fails, coreboot continues to boot into an extremely unstable state (often resetting or freezing before coreboot can even finish booting). The patch just has the board reset and try again when raminit fails. This is a very lazy work around. After adding fixes 1&2, this patch only got triggered ~1 in 50 boots IIRC.

These patches improve the boot consistency. At times I'd have as many as 5 out of 10 boots fail under stock coreboot-4.11 (8xCT16G3ERSLD4160B, 2x6328), with these patches I had 1000+ boots without any issue. I do not know if these patches enable any new RAM configurations. It is possible that the "bad" RAM configs were more prone to incorrectly passing DQS timing with bad read delays (Fix 1). If that is the case, this could enable previously unusable RAM configs.

---
src/northbridge/amd/amdmct/mct_ddr3/mctdqs_d.c | 17 ++++++-----------
1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/src/northbridge/amd/amdmct/mct_ddr3/mctdqs_d.c b/src/northbridge/amd/amdmct/mct_ddr3/mctdqs_d.c
index d34b2dc2ba..6cf67afa4f 100644
--- a/src/northbridge/amd/amdmct/mct_ddr3/mctdqs_d.c
+++ b/src/northbridge/amd/amdmct/mct_ddr3/mctdqs_d.c
@@ -21,6 +21,7 @@
#include <arch/cpu.h>
#include <cpu/amd/msr.h>
#include <cpu/amd/mtrr.h>
+#include <southbridge/amd/common/reset.h>
#include "mct_d.h"
#include "mct_d_gcc.h"

@@ -1287,6 +1288,7 @@ static uint8_t TrainDQSRdWrPos_D_Fam15(struct MCTStatStruc *pMCTstat,
uint8_t cur_count = 0;
uint8_t best_pos = 0;
uint8_t best_count = 0;
+ uint16_t region_center;

uint32_t index_reg = 0x98;
uint32_t dev = pDCTstat->dev_dct;
@@ -1455,23 +1457,16 @@ static uint8_t TrainDQSRdWrPos_D_Fam15(struct MCTStatStruc *pMCTstat,
last_pos = 0;
}

- if (best_count > 2) {
- uint16_t region_center = (best_pos + (best_count / 2));
-
- if (region_center < 16) {
- printk(BIOS_WARNING, "TrainDQSRdWrPos: negative DQS recovery delay detected!"
- " Attempting to continue but your system may be unstable...\n");
- region_center = 0;
- } else {
- region_center -= 16;
- }
+ region_center = (best_pos + (best_count / 2));
+ if ((best_count > 2) && (region_center >= 16)) {
+ region_center -= 16;

/* Restore current settings of other (previously trained) lanes to the active array */
memcpy(current_read_dqs_delay, initial_read_dqs_delay, sizeof(current_read_dqs_delay));

/* Program the Read DQS Timing Control register with the center of the passing window */
current_read_dqs_delay[lane] = region_center;
- passing_dqs_delay_found[lane] = 1;
+ passing_read_dqs_delay_found = 1;

/* Commit the current Read DQS Timing Control settings to the hardware registers */
write_dqs_read_data_timing_registers(current_read_dqs_delay, dev, dct, dimm, index_reg);
--
2.39.2

Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
From ce1c7a35fa11b46d0478e97c4a4001179ab9d1bf Mon Sep 17 00:00:00 2001
From: Mike Rothfuss <[email protected]>
Date: Fri, 23 Aug 2024 19:59:09 -0600
Subject: [PATCH 2/2] northbridge/amd: Added resets for ram training failures

Instead of booting into an unstable state (and crashing), the board
resets to re-attempt raminit.
---
src/northbridge/amd/amdmct/mct_ddr3/mcthwl.c | 7 +++++--
src/northbridge/amd/amdmct/mct_ddr3/mctsrc.c | 7 +++++--
2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/src/northbridge/amd/amdmct/mct_ddr3/mcthwl.c b/src/northbridge/amd/amdmct/mct_ddr3/mcthwl.c
index 1ee10608b9..9a53bd352d 100644
--- a/src/northbridge/amd/amdmct/mct_ddr3/mcthwl.c
+++ b/src/northbridge/amd/amdmct/mct_ddr3/mcthwl.c
@@ -18,6 +18,7 @@
#include <stdint.h>
#include <console/console.h>
#include <string.h>
+#include <southbridge/amd/common/reset.h>
#include "mct_d.h"
#include "mct_d_gcc.h"

@@ -265,11 +266,13 @@ static void WriteLevelization_HW(struct MCTStatStruc *pMCTstat,

pDCTstat->TargetFreq = final_target_freq;

- if (global_phy_training_status)
+ if (global_phy_training_status) {
printk(BIOS_WARNING,
"%s: Uncorrectable invalid value(s) detected in second phase of write levelling; "
- "continuing but system may be unstable!\n",
+ "Restarting system\n",
__func__);
+ soft_reset();
+ }

uint8_t dct;
for (dct = 0; dct < 2; dct++) {
diff --git a/src/northbridge/amd/amdmct/mct_ddr3/mctsrc.c b/src/northbridge/amd/amdmct/mct_ddr3/mctsrc.c
index dbb989fe3d..c4cb53442d 100644
--- a/src/northbridge/amd/amdmct/mct_ddr3/mctsrc.c
+++ b/src/northbridge/amd/amdmct/mct_ddr3/mctsrc.c
@@ -26,6 +26,7 @@
#include <string.h>
#include <cpu/x86/msr.h>
#include <cpu/amd/msr.h>
+#include <southbridge/amd/common/reset.h>
#include "mct_d.h"
#include "mct_d_gcc.h"

@@ -1698,8 +1699,10 @@ void dqsTrainMaxRdLatency_SW_Fam15(struct MCTStatStruc *pMCTstat,
Set_NB32_index_wait_DCT(dev, Channel, index_reg, 0x00000050, 0x13131313);
}
dword = Get_NB32_DCT(dev, Channel, 0x268) & 0x3ffff;
- if (dword)
- printk(BIOS_ERR, "WARNING: MaxRdLatency training FAILED! Attempting to continue but your system may be unstable...\n");
+ if (dword) {
+ printk(BIOS_ERR, "WARNING: MaxRdLatency training FAILED! Restarting system\n");
+ soft_reset();
+ }

/* 2.10.5.8.5.1.5 */
nb_pstate = 0;
--
2.39.2