Retrieve and print retry_rd_err_log registers like the earlier change:
commit e80634a75a ("EDAC, skx: Retrieve and print retry_rd_err_log registers")
This is a little trickier than on Skylake because of potential
interference with BIOS use of the same registers. The default
behavior is to ignore these registers.
A module parameter retry_rd_err_log(default=0) controls the mode of operation:
- 0=off : Default.
- 1=bios : Linux doesn't reset any control bits, but just reports values.
This is "no harm" mode, but it may miss reporting some data.
- 2=linux: Linux tries to take control and resets mode bits,
clears valid/UC bits after reading. This should be
more reliable (especially if BIOS interference is reduced
by disabling eMCA reporting mode in BIOS setup).
Co-developed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210818175701.1611513-3-tony.luck@intel.com
A future Xeon processor will include in-package HBM (high bandwidth
memory). The in-package HBM memory controller shares the same
architecture with the regular DDR memory controller.
Add the HBM memory controller devices for EDAC support.
Tested-by: Hongyu Ning <hongyu.ning@linux.intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210611170123.1057025-4-tony.luck@intel.com
Current i10nm_edac driver is only for system configured in 1-level
memory. If the system is configured in 2-level memory, the driver
doesn't report the 1st level memory DIMM for the error address, even
if the error occurs in the 1st level memory.
Both Ice Lake servers and Sapphire Rapids servers can be configured
in 2-level memory. Add detection of memory levels to i10nm_edac for
the two kinds of servers so that the driver can report the 2nd level
memory DIMM or the 1st level memory DIMM according to error source.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210611170123.1057025-3-tony.luck@intel.com
Some Intel servers may configure memory in 2 levels, using
fast "near" memory (e.g. DDR) as a cache for larger, slower,
"far" memory (e.g. 3D X-point).
In these configurations the BIOS ADXL address translation for
an address in a 2-level memory range will provide details of
both the "near" and far components.
Current exported ADXL components are only for 1-level memory
system or for 2nd level memory of 2-level memory system. So
add new ADXL components for 1st level memory of 2-level memory
system to fully support 2-level memory system and the detection
of memory error source(1st level memory or 2nd level memory).
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210611170123.1057025-2-tony.luck@intel.com
The Sapphire Rapids CPU model shares the same memory controller
architecture with Ice Lake server. There are some configurations
different from Ice Lake server as below:
- The device ID for configuration agent.
- The size for per channel memory-mapped I/O.
- The DDR5 memory support.
So add the above configurations and the Sapphire Rapids CPU model
ID for EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The skx_edac driver wrongly uses the mtr register to retrieve two fields
close_pg and bank_xor_enable. Fix it by using the correct mcmtr register
to get the two fields.
Cc: <stable@vger.kernel.org>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reported-by: Matthew Riley <mattdr@google.com>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20200515210146.1337-1-tony.luck@intel.com
The device ID for configuration agent PCI device and the offset for
bus number configuration register can be CPU model specific. So add
a new structure res_config to make them configurable and pass res_config
to {skx,i10nm}_init() and skx_get_all_bus_mappings() for use.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20200427083246.GB11036@zn.tnic
Skylake logs some additional useful information in per-channel
registers in addition the the architectural status/addr/misc
logged in the machine check bank.
Pick up this information and add it to the EDAC log:
retry_rd_err_[five 32-bit register values]
Sorry, no definitions for these registers. OEMs and DIMM vendors
will be able to use them to isolate which cells in the DIMM are
causing problems.
correrrcnt[per rank corrected error counts]
Note that if additional errors are logged while these registers are
being read, you may see a jumble of values some from earlier errors,
others from later errors (since the registers report the most recent
logged error). The correrrcnt registers provide error counts per possible
rank. If these counts only change by one since the previous error logged
for this channel, then it is safe to assume that the registers logged
provide a coherent view of one error.
With this change EDAC logs look like this:
EDAC MC4: 1 CE memory read error on CPU_SrcID#2_MC#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x8f26018 offset:0x0 grain:32 syndrome:0x0 - err_code:0x0101:0x0091 socket:2 imc:0 rank:0 bg:0 ba:0 row:0x1f880 col:0x200 retry_rd_err_log[0001a209 00000000 00000001 04800001 0001f880] correrrcnt[0001 0000 0000 0000 0000 0000 0000 0000])
Acked-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The source ID register offset for Skylake server is 0xf0, while for
Icelake server is 0xf8. Pass the correct offset to get the source ID.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
The following Kconfig constellations fail randconfig builds:
CONFIG_ACPI_NFIT=y
CONFIG_EDAC_DEBUG=y
CONFIG_EDAC_SKX=m
CONFIG_EDAC_I10NM=y
or
CONFIG_ACPI_NFIT=y
CONFIG_EDAC_DEBUG=y
CONFIG_EDAC_SKX=y
CONFIG_EDAC_I10NM=m
with:
...
CC [M] drivers/edac/skx_common.o
...
.../skx_common.o:.../skx_common.c:672: undefined reference to `__this_module'
That is because if one of the two drivers - skx_edac or i10nm_edac - is
built-in and the other one is a module, the shared file skx_common.c
gets linked into a module object by kbuild. Therefore, when linking that
same file into vmlinux, the '__this_module' symbol used in debugfs isn't
defined, leading to the above error.
Fix it by moving all debugfs code from skx_common.c to both skx_base.c
and i10nm_base.c respectively. Thus, skx_common.c doesn't refer to the
'__this_module' symbol anymore.
Clarify skx_common.c's purpose at the top of the file for future
reference, while at it.
[ bp: Make text more readable. ]
Fixes: d4dc89d069 ("EDAC, i10nm: Add a driver for Intel 10nm server processors")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: https://lkml.kernel.org/r/20190321221339.GA32323@agluck-desk
Parts of skx_edac can be shared with the Intel 10nm server EDAC driver.
Carve out the common parts from skx_edac in preparation to support both
skx_edac driver and i10nm_edac drivers.
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: James Morse <james.morse@arm.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: https://lkml.kernel.org/r/20190130191519.15393-3-tony.luck@intel.com