Releases: open-power/skiboot

v5.9.7

23 Feb 03:39

skiboot-5.9.7

skiboot 5.9.7 was released on Friday December 22nd, 2017. It replaces
skiboot-5.9.6 as the current stable release in the 5.9.x series.

Over skiboot-5.9.6, we have two bug fixes, they are:

  • phb4: Change PCI MMIO timers

    Currently we have a mismatch between the NCU and PCI timers for MMIO
    accesses. The PCI timers must be lower than the NCU timers otherwise
    it may cause checkstops.

    This changes PCI timeouts controlled by skiboot to 33-50ms. It
    should be forwards and backwards compatible with expected hostboot
    changes to the NCU timer.

  • p8-i2c: Limit number of retry attempts

    Currently we will attempt to start an I2C transaction until it
    succeeds. In the event that the OCC does not release the lock on an
    I2C bus this results in an async token being held forever and the
    kernel thread that started the transaction will block forever while
    waiting for an async completion message. Fix this by limiting the
    number of attempts to start the transaction.
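
The retry limit described in the p8-i2c item can be pictured with the sketch below. It is not the skiboot code: `start_xfer_bounded`, `MAX_START_ATTEMPTS`, the status codes and the `try_lock_fn` callback are hypothetical names, and the cap of 5 is an arbitrary placeholder since the release note does not state the real limit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cap; the release note does not state the real limit. */
#define MAX_START_ATTEMPTS 5

enum xfer_status { XFER_STARTED = 0, XFER_TIMEOUT = -1 };

/* Hypothetical callback: returns true once the bus lock was obtained. */
typedef bool (*try_lock_fn)(void *ctx);

/*
 * Try to start a transaction at most MAX_START_ATTEMPTS times.
 * On failure the caller can complete the async token with an error
 * instead of leaving the requester blocked forever.
 */
static enum xfer_status start_xfer_bounded(try_lock_fn try_lock, void *ctx)
{
    assert(try_lock);

    for (int attempt = 0; attempt < MAX_START_ATTEMPTS; attempt++) {
        if (try_lock(ctx))
            return XFER_STARTED;
    }
    return XFER_TIMEOUT;
}

/* Test doubles: a bus that is never released, and one freed on the 3rd try. */
static bool never_free(void *ctx)    { (*(int *)ctx)++; return false; }
static bool free_on_third(void *ctx) { return ++(*(int *)ctx) >= 3; }
```

With `never_free` the loop gives up after `MAX_START_ATTEMPTS` calls and reports `XFER_TIMEOUT`, which is the point at which an error completion could be sent back to the waiting kernel thread.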

v5.9.6

23 Feb 03:39

skiboot-5.9.6

skiboot 5.9.6 was released on Friday December 15th, 2017. It replaces
skiboot-5.9.5 as the current stable release in the 5.9.x series.

Over skiboot-5.9.5, we have a few bug fixes, they are:

  • sensors: occ: Skip counter type of sensors

    Don't add counter type of sensors to device-tree as they don't fit
    into hwmon sensor interface.

  • p9_stop_api updates to support IMC across deep stop states.

  • opal/xscom: Add recovery for lost core wakeup scom failures.

    A hardware issue can delay a core's response to a scom while its
    threads are being reconfigured. This leaves the SCOM logic in a
    state where subsequent scoms to that core can get errors. The Core
    PC scom registers in the range 20010A80-20010ABF are affected.

    The solution: if an xscom timeout occurs on one of the Core PC scom
    registers in the range 20010A80-20010ABF, a clearing scom write is
    done to 0x20010800 with data of '0x00000000'. This write will also
    get a timeout, but it clears the scom logic errors. After the
    clearing write is done, the original scom operation can be retried.

    The scom timeout is reported as status 0x4 (Invalid address) in
    HMER[21-23].
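
The recovery sequence for lost core wakeup scom failures can be sketched as follows. This is an illustration under stated assumptions, not the skiboot implementation: the addresses are used literally as quoted in the note (real code would presumably account for the core chiplet ID), and `scom_write_fn`, `scom_write_with_recovery` and the fake accessor are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CORE_PC_FIRST  0x20010A80u  /* range quoted in the release note */
#define CORE_PC_LAST   0x20010ABFu
#define CLEARING_ADDR  0x20010800u  /* target of the clearing write */
#define XSCOM_OK       0
#define XSCOM_TIMEOUT  4            /* status 0x4 as reported in HMER[21-23] */

/* Hypothetical low-level accessor supplied by the caller. */
typedef int (*scom_write_fn)(uint32_t addr, uint64_t val);

static bool is_core_pc_reg(uint32_t addr)
{
    return addr >= CORE_PC_FIRST && addr <= CORE_PC_LAST;
}

static int scom_write_with_recovery(scom_write_fn wr, uint32_t addr,
                                    uint64_t val)
{
    assert(wr);

    int rc = wr(addr, val);
    if (rc != XSCOM_TIMEOUT || !is_core_pc_reg(addr))
        return rc;

    /* The clearing write is itself expected to time out; ignore its status. */
    (void)wr(CLEARING_ADDR, 0);

    /* Retry the original operation once. */
    return wr(addr, val);
}

/* Test double: times out until the clearing write has been seen. */
static int fake_calls, fake_cleared;

static int fake_scom_write(uint32_t addr, uint64_t val)
{
    (void)val;
    fake_calls++;
    if (addr == CLEARING_ADDR) {
        fake_cleared = 1;
        return XSCOM_TIMEOUT;
    }
    return fake_cleared ? XSCOM_OK : XSCOM_TIMEOUT;
}
```

A timeout on an address outside the quoted range is returned to the caller unchanged, so the extra write is only issued where the note says it is needed.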

v5.9.5

23 Feb 03:39

skiboot-5.9.5

skiboot 5.9.5 was released on Wednesday December 13th, 2017. It replaces
skiboot-5.9.4 as the current stable release in the 5.9.x series.

Over skiboot-5.9.4, we have a few bug fixes, they are:

  • Fix extremely rare race in timer code.

  • xive: Ensure VC informational FIRs are masked

    Some HostBoot versions leave those as checkstop. They are harmless
    and can sometimes occur during normal operations.

  • xive: Fix occasional VC checkstops in xive_reset

    The current workaround for the scrub bug described in
    __xive_cache_scrub() has an issue in that it can leave dirty
    invalid entries in the cache.

    When cleaning up EQs or VPs during reset, if we then remove the
    underlying indirect page for these entries, the XIVE will checkstop
    when trying to flush them out of the cache.

    This replaces the existing workaround with a new pair of workarounds
    for VPs and EQs:

    • The VP one does the dummy watch on another entry than the one we
      scrubbed (which does the job of pushing old stores out) using an
      entry that is known to be backed by a permanent indirect page.

    • The EQ one switches to a more efficient workaround, which
      consists of doing a non-side-effect ESB load from the EQ's ESe
      control bits.

  • io: Add load_wait() helper

    This uses the standard form twi/isync pair to ensure a load is
    consumed by the core before continuing. This can be necessary under
    some circumstances for example when having the following sequence:

    • Store reg A
    • Load reg A (ensure above store pushed out)
    • delay loop
    • Store reg A

    I.e., a mandatory delay between two stores. In theory the first
    store is only guaranteed to reach the device after the load from the
    same location has completed. However, the processor will start
    executing the delay loop without waiting for the return value from
    the load.

    This construct enforces that the delay loop isn't executed until the
    load value has been returned.

  • xive: Do not return a trigger page for an escalation interrupt

    This is bogus, we don't support them. (Thankfully the callers didn't
    actually try to use this on escalation interrupts).

  • xive: Mark a freed IRQ's IVE as valid and masked

    Removing the valid bit means a FIR will trip if it's accessed
    inadvertently. Under some circumstances, the XIVE will speculatively
    access an IVE for a masked interrupt and trip it. So make sure that
    freed entries are still marked valid (but masked).

  • hw/nx: Fix NX BAR assignments

    The NX rng BAR is used by each core to source random numbers for the
    DARN instruction. Currently we configure each core to use the NX rng
    of the chip that it exists on. Unfortunately, the NX can be
    deconfigured by hostboot and in this case we need to use the NX of a
    different chip.

    This patch moves the BAR assignments for the NX into the normal
    nx-rng init path. This lets us check if the normal (chip local) NX
    is active when configuring which NX a core should use so that we can
    fallback gracefully.
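
The fallback described in the hw/nx item amounts to a small selection rule, sketched below. `struct chip`, `nx_rng_active` and `pick_nx_rng_chip` are hypothetical names for illustration, and choosing the first active chip found is an assumption; the note only says a different chip's NX is used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a chip: only what the sketch needs. */
struct chip {
    int  id;
    bool nx_rng_active;  /* false if hostboot deconfigured this chip's NX */
};

/*
 * Pick the chip whose NX rng a core on chips[local] should use:
 * the local one when active, otherwise the first active one found.
 * Returns -1 if no chip has a usable NX rng.
 */
static int pick_nx_rng_chip(const struct chip *chips, size_t n, size_t local)
{
    assert(local < n);

    if (chips[local].nx_rng_active)
        return chips[local].id;

    for (size_t i = 0; i < n; i++)
        if (chips[i].nx_rng_active)
            return chips[i].id;

    return -1;
}
```

Doing this check at nx-rng init time, rather than when the cores are set up, is what makes the state of every chip's NX available to the decision.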

v5.9.4

23 Feb 03:41

skiboot-5.9.4

skiboot 5.9.4 was released on Wednesday November 29th, 2017. It replaces
skiboot-5.9.3 as the current stable release in the 5.9.x series.

Over skiboot-5.9.3, we have one NPU2/NVLink2 fix that works around a
potential glitch (the one skiboot-5.9.3 would hard crash on rather than
let a system continue to run until it mysteriously crashed later on).

That fix is in two parts:

  • npu2: hw-procedures: Change phy_rx_clock_sel values to recover
    from a potential glitch.

  • npu2: hw-procedures: Manipulate IOVALID during training

    Ensure that the IOVALID bit for this brick is raised at the start of
    link training, in the reset_ntl procedure.

    Then, to protect us from a glitch when the PHY clock turns off or
    gets chopped, lower IOVALID for the duration of the phy_reset and
    phy_rx_dccal procedures.

v5.9.3

23 Feb 03:40

skiboot-5.9.3

skiboot 5.9.3 was released on Wednesday November 22nd, 2017. It replaces
skiboot-5.9.2 as the current stable release in the 5.9.x series.

Over skiboot-5.9.2, we have one NPU2/NVLink2 fix that causes the machine
to crash hard in the event of hardware error rather than crash
mysteriously later on whenever the NVLink2 links are used.

That fix is:

  • npu2: hw-procedures: Add check_credits procedure

    As an immediate mitigator for a current hardware glitch, add a
    procedure that can be used to validate NTL credit values. This will
    be called as a safeguard to check that link training succeeded.

    Assert that things are exactly as we expect, because if they aren't,
    the system will experience a catastrophic failure shortly after the
    start of link traffic.

v5.9.2

23 Feb 03:41

skiboot-5.9.2

skiboot 5.9.2 was released on Thursday November 16th, 2017. It replaces
skiboot-5.9.1 as the current stable release in the 5.9.x series.

Over skiboot-5.9.1, we have a few PHB4 (PCI) fixes, an i2c fix for
POWER9 platforms to avoid conflicting with the OCC use and an important
NPU2 (NVLink2) fix.

  • phb4: Fix lane equalisation setting

    Fix cut and paste from phb3. The sizes have changed now that we
    have GEN4, so the check here needs to change too.

    Without this we end up with the default settings (all '7') rather
    than what's in HDAT.

  • phb4: Fix PE mapping of M32 BAR

    The M32 BAR is the PHB4 region used to map all the non-prefetchable
    or 32-bit device BARs. It's supposed to have its segments remapped
    via the MDT and Linux relies on that to assign them individual PE#.

    However, we weren't configuring that properly and instead used the
    mode where PE# == segment#, thus causing EEH to freeze the wrong
    device or PE#.

  • phb4: Fix lost bit in PE number on config accesses

    A PE number can be up to 9 bits, so using a uint8_t won't fly.

    That was causing errors on config accesses to freeze the wrong PE.

  • phb4: Update inits

    New init value from HW folks for the fence enable register.

    This clears bit 17 (CFG Write Error CA or UR response) and bit 22
    (MMIO Write DAT_ERR Indication) and sets bit 21 (MMIO CFG Pending
    Error)

  • npu2: Move to new GPU memory map

    There are three different ways we configure the MCD and memory map.

    1. Old way (current way): skiboot configures the MCD and puts GPUs
      at 4TB and below
    2. New way with MCD: hostboot configures the MCD and skiboot puts
      GPUs at 4TB and above
    3. New way without MCD: no one configures the MCD and skiboot puts
      GPUs at 4TB and below

    The change keeps option 1 and adds options 2 and 3.

    The different configurations are detected using certain scoms (see
    patch).

    Option 1 will go away eventually as it's a configuration that can
    cause xstops or data integrity problems. We are keeping it around to
    support existing hostboot.

    Option 2 supports only 4 GPUs and 512GB of memory per socket.

    Option 3 supports 6 GPUs and 4TB of memory but may have some
    performance impact.

  • p8-i2c: Don't write the watermark register at init

    On P9 the I2C master is shared with the OCC. Currently the watermark
    values are set once at init time which is bad for two reasons:

    a) We don't take the OCC master lock before setting it, which may
    cause issues if the OCC is currently using the master.
    b) The OCC might change the watermark levels and we need to reset
    them.

    Change this so that we set the watermark value when a new
    transaction is started rather than at init time.
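
The "lost bit in PE number" fix in the list above comes down to integer width. The sketch below shows the effect in isolation; the helper names are hypothetical and the code is not taken from phb4.

```c
#include <assert.h>
#include <stdint.h>

#define PE_NUM_BITS 9
#define PE_NUM_MAX  ((1u << PE_NUM_BITS) - 1)   /* 511 */

/* Buggy shape: an 8-bit variable silently drops bit 8 of the PE number. */
static uint8_t pe_as_u8(unsigned int pe)
{
    return (uint8_t)pe;
}

/* Fixed shape: a 16-bit variable keeps all 9 bits. */
static uint16_t pe_as_u16(unsigned int pe)
{
    assert(pe <= PE_NUM_MAX);
    return (uint16_t)pe;
}
```

With an 8-bit type, PE 0x101 aliases to PE 0x01, which is how an error on one PE could end up freezing a different one.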

v5.9.1

23 Feb 03:41

skiboot-5.9.1

skiboot 5.9.1 was released on Tuesday November 14th, 2017. It replaces
skiboot-5.9 as the current stable release in the 5.9.x series.

Over skiboot-5.9, we have two NPU2 (NVLink2) fixes and two XIVE bug
fixes:

  • npu2: hw-procedures: Refactor reset_ntl procedure

    Change the implementation of reset_ntl to match the latest
    programming guide documentation.

  • npu2: hw-procedures: Add phy_rx_clock_sel()

    Change the RX clk mux control to be done by software instead of HW.
    This avoids glitches caused by changing the mux setting.

  • xive: Fix ability to clear some EQ flags

    We could never clear "unconditional notify" and "escalate"

  • xive: Update inits for DD2.0

    This updates some inits based on information from the HW designers.
    This includes enabling some new DD2.0 features that we don't yet
    exploit.

v5.9

23 Feb 03:42

skiboot-5.9

skiboot v5.9 was released on Tuesday October 31st 2017. It is the first
release of skiboot 5.9 and becomes the new stable release of skiboot
following the 5.8 release, first released August 31st 2017. In this cycle
we have had five release candidate releases, mostly centered around bug
fixing for POWER9 platforms.

This release should be considered suitable for early-access POWER9
systems.

skiboot v5.9 contains all bug fixes as of skiboot-5.4.8 and
skiboot-5.1.21 (the currently maintained stable releases). There may be
some 5.9.x stable releases, depending on what issues are found.

For how the skiboot stable releases work, see stable-rules for details.

Over skiboot-5.8, we have the following changes:

New Features

POWER8

  • fast-reset by default (if possible)

    Currently, this is limited to POWER8 systems.

    A normal reboot will, rather than doing a full IPL, go through a
    fast reboot procedure. This reduces the "reboot to petitboot" time
    from minutes to a handful of seconds.

POWER9

Since skiboot-5.9-rc3:

  • occ-sensors : Add OCC inband sensor region to exports (useful for
    debugging)

Two SRESET fixes (see below for feature description):

  • core: direct-controls: Fix clearing of special wakeup

    'special_wakeup_count' is incremented on successfully asserting
    special wakeup, so we will never clear the special wakeup if we
    check for 'special_wakeup_count' being zero. Fix this issue by
    checking for 'special_wakeup_count' being 1 in
    dctl_clear_special_wakeup().

  • core/direct-controls: increase special wakeup timeout on POWER9

    Some instances have been observed where the special wakeup assert
    times out. The current timeout is too short for deeper sleep states.
    Hostboot uses 100ms, so match that.
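
The special wakeup counting fix can be illustrated with a simplified model. The structure and function names below are hypothetical, `hw_asserted` stands in for the hardware state, and treating the counter as a reference count shared by several users is an assumption based on the description.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-core state; hw_asserted stands in for the hardware bit. */
struct core {
    int  special_wakeup_count;
    bool hw_asserted;
};

static void set_special_wakeup(struct core *c)
{
    if (c->special_wakeup_count == 0)
        c->hw_asserted = true;      /* first user asserts in hardware */
    c->special_wakeup_count++;      /* counted only after success */
}

static void clear_special_wakeup(struct core *c)
{
    assert(c->special_wakeup_count > 0);

    /*
     * After a successful assert the count is at least 1, so a test
     * for 0 here could never fire. The last user is count == 1.
     */
    if (c->special_wakeup_count == 1)
        c->hw_asserted = false;
    c->special_wakeup_count--;
}
```

In this model two users can each call `set_special_wakeup()`; the hardware state is only dropped when the second of them calls `clear_special_wakeup()`.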

Since skiboot-5.9-rc2:

  • cpu: Add OPAL_REINIT_CPUS_TM_SUSPEND_DISABLED

    Add a new CPU reinit flag, "TM Suspend Disabled", which requests
    that CPUs be configured so that TM (Transactional Memory) suspend
    mode is disabled.

    Currently this always fails, because skiboot has no way to query
    the state. A future hostboot change will add a mechanism for
    skiboot to determine the status and return an appropriate error
    code.

Since skiboot-5.8:

  • POWER9 power management during boot

    Less power should be consumed during boot.

  • OPAL_SIGNAL_SYSTEM_RESET for POWER9

    This implements OPAL_SIGNAL_SYSTEM_RESET, using scom registers to
    quiesce the target thread and raise a system reset exception on it.
    It has been tested on DD2 with stop0 ESL=0 and ESL=1 shallow power
    saving modes.

    DD1 is not implemented because it is sufficiently different as to
    make support difficult.

  • Enable deep idle states for POWER9

    • SLW: Add support for p9_stop_api

      p9_stop_api's are used to set SPR state on a core wakeup from
      a deeper low power state. p9_stop_api uses low level platform
      firmware and self-restore microcode to restore the SPRs to
      requested values.

      Code is taken from :
      https://github.com/open-power/hostboot/tree/master/src/import/chips/p9/procedures/utils/stopreg

    • SLW: Removing timebase related flags for stop4

      When a core enters stop4, it does not lose decrementer and time
      base. Hence removing flags OPAL_PM_DEC_STOP and
      OPAL_PM_TIMEBASE_STOP.

    • SLW: Allow deep states if homer address is known

      Use a common variable has_wakeup_engine instead of has_slw to
      tell if the:

      • SLW image is populated in case of power8
      • CME image is populated in case of power9

      Currently we expect CME to be loaded if homer address is known
      (except for simulators)

    • SLW: Configure self-restore for HRMOR

      Make a stop api call using libpore to restore HRMOR register.
      HRMOR needs to be cleared so that when threads exit stop, they
      arrive at the Linux system_reset vector (0x100).

    • SLW: Add opal_slw_set_reg support for power9

    This OPAL call is made from Linux to OPAL to configure values in
    various SPRs after wakeup from a deep idle state.

  • PHB4: CAPP recovery

    CAPP recovery is initiated when a CAPP Machine Check is detected.
    The capp recovery procedure is initiated via a Hypervisor
    Maintenance interrupt (HMI).

    CAPP Machine Check may arise from either an error that results in a
    PHB freeze or from an internal CAPP error with CAPP checkstop FIR
    action. An error that causes a PHB freeze will result in the link
    down signal being asserted. The system continues running and the
    CAPP and PSL will be re-initialized.

    This implements CAPP recovery for POWER9 systems

  • Add wafer-location property for POWER9

    Extract wafer-location from ECID and add property under xscom node.

    • bits 64:71 are the chip x location (7:0)
    • bits 72:79 are the chip y location (7:0)

    Sample output:

    [root@wsp xscom@623fc00000000]# lsprop ecid
    ecid             019a00d4 03100718 852c0000 00fd7911
    [root@wsp xscom@623fc00000000]# lsprop wafer-location
    wafer-location   00000085 0000002c
    
  • Add wafer-id property for POWER9

    Wafer id is derived from ECID data.

    • bits 4:63 are the wafer id (ten 6-bit fields each containing a
      code)

    Sample output:

    [root@wsp xscom@623fc00000000]# lsprop ecid
    ecid             019a00d4 03100718 852c0000 00fd7911
    [root@wsp xscom@623fc00000000]# lsprop wafer-id
    wafer-id         "6Q0DG340SO"
    
  • Add ecid property under xscom node for POWER9. Sample output:

    [root@wsp xscom@623fc00000000]# lsprop ecid
    ecid             019a00d4 03100718 852c0000 00fd7911
    
  • Add ibm,firmware-versions device tree node

    In P8, hostboot provides a mini device tree. It contains the
    /ibm,firmware-versions node, which has various firmware component
    version details.

    In P9, OPAL builds the device tree. This patch adds support to
    parse the VERSION section of PNOR and create the
    /ibm,firmware-versions device tree node.

    Sample output:

    /sys/firmware/devicetree/base/ibm,firmware-versions # lsprop .
    occ              "6a00709"
    skiboot          "v5.7-rc1-p344fb62"
    buildroot        "2017.02.2-7-g23118ce"
    capp-ucode       "9c73e9f"
    petitboot        "v1.4.3-p98b6d83"
    sbe              "02021c6"
    open-power       "witherspoon-v1.17-128-gf1b53c7-dirty"
    ....
    ....
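
The wafer-location and wafer-id bit fields listed above can be checked against the sample ecid with a short sketch. The bit positions come from the notes; the function names are hypothetical, and the code-to-character mapping for wafer-id (0-9, then A-Z) is inferred from the sample output rather than taken from a specification, so codes 36 to 63 are simply shown as '?'.

```c
#include <assert.h>
#include <stdint.h>

/* ECID as four 32-bit cells; bit 0 is the most significant bit of cell 0. */
struct ecid {
    uint32_t cell[4];
};

/* Bits 64:71 are the first byte of cell 2, bits 72:79 the second. */
static uint32_t wafer_x(const struct ecid *e)
{
    return (e->cell[2] >> 24) & 0xff;
}

static uint32_t wafer_y(const struct ecid *e)
{
    return (e->cell[2] >> 16) & 0xff;
}

/*
 * Bits 4:63 hold ten 6-bit codes. The mapping below (0-9 then A-Z) is
 * inferred from the sample output; codes 36..63 are shown as '?'.
 */
static void wafer_id(const struct ecid *e, char out[11])
{
    uint64_t hi = ((uint64_t)e->cell[0] << 32) | e->cell[1];

    assert(out);

    for (int i = 0; i < 10; i++) {
        unsigned int code = (unsigned int)(hi >> (54 - 6 * i)) & 0x3f;

        if (code < 10)
            out[i] = (char)('0' + code);
        else if (code < 36)
            out[i] = (char)('A' + code - 10);
        else
            out[i] = '?';
    }
    out[10] = '\0';
}
```

For the sample ecid `019a00d4 03100718 852c0000 00fd7911`, this yields x = 0x85, y = 0x2c and the string "6Q0DG340SO", matching the lsprop output shown in the notes.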
    

POWER9

Since skiboot-5.9-rc5:

  • Suppress XSCOM chiplet-offline errors on P9

    Workaround on P9: PRD does operations it knows will fail with this
    error to work around a hardware issue where accesses via the PIB
    (FSI or OCC) work as expected but accesses via the ADU (what xscom
    goes through) do not. The chip logic will always return all FFs if
    there is any error on the scom.

  • asm/head: initialize preferred DSCR value

    POWER7/8 use DSCR=0. POWER9 preferred value has "stride-N" enabled.

Since skiboot-5.9-rc4:

  • opal/hmi: Workaround Power9 hw logic bug for couple of TFMR TB
    errors.

  • opal/hmi: Fix TB reside and HDEC parity error recovery for power9

Since skiboot-5.9-rc2:

  • hw/imc: Fix IMC Catalog load for DD2.X processors

Since skiboot-5.9-rc1:

  • xive: Fix VP free block group mode false-positive parameter check

    The check to ensure the buddy allocation idx is aligned to its
    allocation order was not taking into account the allocation split.
    This would result in opal_xive_free_vp_block failures despite
    giving the same value as returned by opal_xive_alloc_vp_block.

    E.g., starting then stopping 4 KVM guests gives the following
    pattern in the host:

    opal_xive_alloc_vp_block(5)=0x45000020
    opal_xive_alloc_vp_block(5)=0x45000040
    opal_xive_alloc_vp_block(5)=0x45000060
    opal_xive_alloc_vp_block(5)=0x45000080
    opal_xive_free_vp_block(0x45000020)=-1
    opal_xive_free_vp_block(0x45000040)=0
    opal_xive_free_vp_block(0x45000060)=-1
    opal_xive_free_vp_block(0x45000080)=0

  • hw/imc: pause microcode at boot

    IMC nest counters can be accessed both in-band (via ucode) and out
    of band. Since not all nest counter configurations are supported by
    ucode, out of band tools are used to characterize the other
    configurations.

    So it is preferable to pause the nest microcode at boot to aid the
    nest out of band tools. If the ucode is not paused and the OS does
    not have IMC driver support, then out of band tools will race with
    ucode and end up getting undesirable values. This patch checks and
    pauses the ucode at boot.

    OPAL provides APIs to control IMC counters.
    OPAL_IMC_COUNTERS_INIT is used to initialize these counters at
    boot. OPAL_IMC_COUNTERS_START and OPAL_IMC_COUNTERS_STOP API
    calls should be used to start and pause these IMC engines.
    doc/opal-api/opal-imc-counters.rst details the OPAL APIs and their
    usage.

  • hdata/i2c: update the list of known i2c devs

    This updates the list of known i2c devices - as of HDAT spec
    v10.5e - so that they can be properly identified during the hdat
    parsing.

  • hdata/i2c: log unknown i2c devices

    An i2c device is unknown if either the i2c device list is outdated
    or the device is marked as unknown (0xFF) in the hdat.

Since skiboot-5.8:

  • Disable Transactional Memory on Power9 DD 2.1

    Update pa_features_p9[] to disable TM (Transactional Memory). On
    DD 2.1 TM is not usable by Linux without other workarounds, so
    skiboot must disable it.

  • xscom: Do not print error me...

v5.9-rc5

23 Feb 03:42
Pre-release

skiboot-5.9-rc5

skiboot v5.9-rc5 was released on Monday October 23rd 2017 approximately
32,000ft above somewhere north of Tucson, Arizona. It is the fifth
release candidate of skiboot 5.9, which will become the new stable
release of skiboot following the 5.8 release, first released August 31st
2017.

skiboot v5.9-rc5 contains all bug fixes as of skiboot-5.4.8 and
skiboot-5.1.21 (the currently maintained stable releases). We do not
currently expect to do any 5.8.x stable releases.

For how the skiboot stable releases work, see stable-rules for details.

The current plan is to cut the final 5.9 very shortly, with skiboot 5.9
being for all POWER8 and POWER9 platforms in op-build v1.20 (Due October
18th, so we're running a bit behind there). This release will be
targeted to early POWER9 systems.

Over skiboot-5.9-rc4, we have the following changes:

  • opal/hmi: Workaround Power9 hw logic bug for couple of TFMR TB
    errors.

  • opal/hmi: Fix TB reside and HDEC parity error recovery for power9

  • phb4: Escalate freeze to fence to avoid checkstop

    Freeze events such as MMIO loads can cause the PHB to lose its
    limited powerbus credits. If all credits are used, a further MMIO
    will cause a checkstop.

    To work around this, we escalate the troublesome freeze events to a
    fence. The fence will cause a full PHB reset which resets the
    powerbus credits and avoids the checkstop.

  • phb4: Update some init registers

    New inits based on next PHB4 workbook. Increases some timeouts to
    avoid some spurious error conditions.

  • phb4: Enable PHB MMIO in phb4_root_port_init()

    Linux EEH flow is somewhat broken. It saves the PCIe config space of
    the PHB on boot, which it then uses to restore on EEH recovery. It
    does this to restore MMIO bars and some other pieces.

    Unfortunately this save is done before any drivers are bound to
    devices under the PHB. A number of other things are configured in
    the PHB after drivers start, hence some configuration space settings
    aren't saved correctly. These include bus master and MMIO bits in
    the command register.

    Linux tried to hack around this in Linux commit bf898ec5cb
    ("powerpc/eeh: Enable PCI_COMMAND_MASTER for PCI bridges"). This
    sets the bus master bit but ignores the MMIO bit.

    Hence we lose MMIO after a full PHB reset. This causes the next MMIO
    access to the device to fail and for us to perform a PE freeze
    recovery, which still doesn't set the MMIO bit and hence we still
    fail.

    This works around this by forcing MMIO on during
    phb4_root_port_init().

    With this we can recover from a PHB fence event on POWER9.

  • phb4: Reduce link degraded message log level to debug

    If we hit this message we'll retry and fix the problem. If we run
    out of retries and can't fix the problem, we'll still print a log
    message at error level indicating a problem.

  • phb4: Fix GEN3 for DD2.00

    In this fix:

      62ac763 phb4: Fix PCIe GEN4 on DD2.1 and above

    We fixed DD2.1 GEN4 but broke DD2.00 as GEN3.

    This fixes DD2.00 back to GEN3. This time for sure!
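
The command register issue behind "phb4: Enable PHB MMIO in phb4_root_port_init()" in the list above can be shown with plain bit operations. The bit values are the standard PCI command register bits as named in Linux's pci_regs.h; the two helper functions are hypothetical and only model what each side does according to the note.

```c
#include <assert.h>
#include <stdint.h>

/* Standard PCI command register bits (names as in Linux pci_regs.h). */
#define PCI_COMMAND_IO      0x0001u
#define PCI_COMMAND_MEMORY  0x0002u  /* memory space (MMIO) decode enable */
#define PCI_COMMAND_MASTER  0x0004u  /* bus master enable */

/* What the Linux workaround restores, per the note: bus master only. */
static uint16_t restore_bus_master_only(uint16_t cmd)
{
    return cmd | PCI_COMMAND_MASTER;
}

/* What the firmware workaround adds: force memory decode on as well. */
static uint16_t force_mmio_on(uint16_t cmd)
{
    return cmd | PCI_COMMAND_MEMORY;
}
```

Starting from a command register of 0 after a full PHB reset, `restore_bus_master_only()` leaves `PCI_COMMAND_MEMORY` clear, which is the state in which the next MMIO access fails; applying `force_mmio_on()` as well gives 0x0006.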

v5.9-rc4

23 Feb 03:42
Pre-release

skiboot-5.9-rc4

skiboot v5.9-rc4 was released on Thursday October 19th 2017. It is the
fourth release candidate of skiboot 5.9, which will become the new
stable release of skiboot following the 5.8 release, first released
August 31st 2017.

skiboot v5.9-rc4 contains all bug fixes as of skiboot-5.4.8 and
skiboot-5.1.21 (the currently maintained stable releases). We do not
currently expect to do any 5.8.x stable releases.

For how the skiboot stable releases work, see stable-rules for details.

The current plan is to cut the final 5.9 by October 20th, with skiboot
5.9 being for all POWER8 and POWER9 platforms in op-build v1.20 (Due
October 18th, so we're running a bit behind there). This release will be
targeted to early POWER9 systems.

Over skiboot-5.9-rc3, we have the following changes:

  • phb4: Fix PCIe GEN4 on DD2.1 and above

    In this change:

      eef0e19 PHB4: Default to PCIe GEN3 on POWER9 DD2.00

    We clamped DD2.00 parts to GEN3 but unfortunately this change also
    applies to DD2.1 and above.

    This fixes this to only apply to DD2.00.

  • occ-sensors : Add OCC inband sensor region to exports (useful for
    debugging)

Two SRESET fixes:

  • core: direct-controls: Fix clearing of special wakeup

    'special_wakeup_count' is incremented on successfully asserting
    special wakeup, so we will never clear the special wakeup if we
    check for 'special_wakeup_count' being zero. Fix this issue by
    checking for 'special_wakeup_count' being 1 in
    dctl_clear_special_wakeup().

  • core/direct-controls: increase special wakeup timeout on POWER9

    Some instances have been observed where the special wakeup assert
    times out. The current timeout is too short for deeper sleep states.
    Hostboot uses 100ms, so match that.