A "DPA invalidation event" is any scenario where the contents of a DPA
(Device Physical Address) is modified in a way that is incoherent with
CPU caches, or if the HPA (Host Physical Address) to DPA association
changes due to a remapping event.
PMEM security events like Unlock and Passphrase Secure Erase already
manage caches through LIBNVDIMM, so that leaves HPA to DPA remap events
that need cache management by the CXL core. Those only happen when the
boot time CXL configuration has changed. That event occurs when
userspace attaches an endpoint decoder to a region configuration, and
that region is subsequently activated.
The implications of not invalidating caches between remap events is that
reads from the region at different points in time may return different
results due to stale cached data from the previous HPA to DPA mapping.
Without a guarantee that the region contents after cxl_region_probe()
are written before being read (a layering-violation assumption that
cxl_region_probe() can not make) the CXL subsystem needs to ensure that
reads that precede writes see consistent results.
A CONFIG_CXL_REGION_INVALIDATION_TEST option is added to support debug
and unit testing of the CXL implementation in QEMU or other environments
where cpu_cache_has_invalidate_memregion() returns false. This may prove
too restrictive for QEMU where the HDM decoders are emulated, but in
that case the CXL subsystem needs some new mechanism / indication that
the HDM decoder is emulated and not a passthrough of real hardware.
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://lore.kernel.org/r/166993222098.1995348.16604163596374520890.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
After a region's interleave parameters (ways and granularity) are set,
add a way for regions to allocate HPA (host physical address space) from
the free capacity in their parent root-decoder. The allocator for this
capacity reuses the 'struct resource' based allocator used for
CONFIG_DEVICE_PRIVATE.
Once the tuple of "ways, granularity, [uuid], and size" is set the
region configuration transitions to the CXL_CONFIG_INTERLEAVE_ACTIVE
state which is a precursor to allowing endpoint decoders to be added to
a region.
Co-developed-by: Ben Widawsky <bwidawsk@kernel.org>
Signed-off-by: Ben Widawsky <bwidawsk@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/165784335630.1758207.420216490941955417.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
CXL 2.0 allows for dynamic provisioning of new memory regions (system
physical address resources like "System RAM" and "Persistent Memory").
Whereas DDR and PMEM resources are conveyed statically at boot, CXL
allows for assembling and instantiating new regions from the available
capacity of CXL memory expanders in the system.
Sysfs with an "echo $region_name > $create_region_attribute" interface
is chosen as the mechanism to initiate the provisioning process. This
was chosen over ioctl() and netlink() to keep the configuration
interface entirely in a pseudo-fs interface, and it was chosen over
configfs since, aside from this one creation event, the interface is
read-mostly. I.e. configfs supports cases where an object is designed to
be provisioned each boot, like an iSCSI storage target, and CXL region
creation is mostly for PMEM regions which are created usually once
per-lifetime of a server instance. This is an improvement over nvdimm
that pre-created "seed" devices that tended to confuse users looking to
determine which devices are active and which are idle.
Recall that the major change that CXL brings over previous persistent
memory architectures is the ability to dynamically define new regions.
Compare that to drivers like 'nfit' where the region configuration is
statically defined by platform firmware.
Regions are created as a child of a root decoder that encompasses an
address space with constraints. When created through sysfs, the root
decoder is explicit. When created from an LSA's region structure a root
decoder will possibly need to be inferred by the driver.
Upon region creation through sysfs, a vacant region is created with a
unique name. Regions have a number of attributes that must be configured
before the region can be bound to the driver where HDM decoder program
is completed.
An example of creating a new region:
- Allocate a new region name:
region=$(cat /sys/bus/cxl/devices/decoder0.0/create_pmem_region)
- Create a new region by name:
while
region=$(cat /sys/bus/cxl/devices/decoder0.0/create_pmem_region)
! echo $region > /sys/bus/cxl/devices/decoder0.0/create_pmem_region
do true; done
- Region now exists in sysfs:
stat -t /sys/bus/cxl/devices/decoder0.0/$region
- Delete the region, and name:
echo $region > /sys/bus/cxl/devices/decoder0.0/delete_region
Signed-off-by: Ben Widawsky <bwidawsk@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/165784333909.1758207.794374602146306032.stgit@dwillia2-xfh.jf.intel.com
[djbw: simplify locking, reword changelog]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
DOE mailbox objects will be needed for various mailbox communications
with each memory device.
Iterate each DOE mailbox capability and create PCI DOE mailbox objects
as found.
It is not anticipated that this is the final resting place for the
iteration of the DOE devices. The support of switch ports will drive
this code into the PCIe side. In this imagined architecture the CXL
port driver would then query into the PCI device for the DOE mailbox
array.
For now creating the mailboxes in the CXL port is good enough for the
endpoints. Later PCIe ports will need to support this to support switch
ports more generically.
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Lukas Wunner <lukas@wunner.de>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20220719205249.566684-5-ira.weiny@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The CXL specification claims S3 support at a hardware level, but at a
system software level there are some missing pieces. Section 9.4 (CXL
2.0) rightly claims that "CXL mem adapters may need aux power to retain
memory context across S3", but there is no enumeration mechanism for the
OS to determine if a given adapter has that support. Moreover the save
state and resume image for the system may inadvertantly end up in a CXL
device that needs to be restored before the save state is recoverable.
I.e. a circular dependency that is not resolvable without a third party
save-area.
Arrange for the cxl_mem driver to fail S3 attempts. This still nominaly
allows for suspend, but requires unbinding all CXL memory devices before
the suspend to ensure the typical DRAM flow is taken. The cxl_mem unbind
flow is intended to also tear down all CXL memory regions associated
with a given cxl_memdev.
It is reasonable to assume that any device participating in a System RAM
range published in the EFI memory map is covered by aux power and
save-area outside the device itself. So this restriction can be
minimized in the future once pre-existing region enumeration support
arrives, and perhaps a spec update to clarify if the EFI memory map is
sufficent for determining the range of devices managed by
platform-firmware for S3 support.
Per Rafael, if the CXL configuration prevents suspend then it should
fail early before tasks are frozen, and mem_sleep should stop showing
'mem' as an option [1]. Effectively CXL augments the platform suspend
->valid() op since, for example, the ACPI ops are not aware of the CXL /
PCI dependencies. Given the split role of platform firmware vs OS
provisioned CXL memory it is up to the cxl_mem driver to determine if
the CXL configuration has elements that platform firmware may not be
prepared to restore.
Link: https://lore.kernel.org/r/CAJZ5v0hGVN_=3iU8OLpHY3Ak35T5+JcBM-qs8SbojKrpd0VXsA@mail.gmail.com [1]
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <len.brown@intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/165066828317.3907920.5690432272182042556.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
At this point the subsystem can enumerate all CXL ports (CXL.mem decode
resources in upstream switch ports and host bridges) in a system. The
last mile is connecting those ports to endpoints.
The cxl_mem driver connects an endpoint device to the platform CXL.mem
protoctol decode-topology. At ->probe() time it walks its
device-topology-ancestry and adds a CXL Port object at every Upstream
Port hop until it gets to CXL root. The CXL root object is only present
after a platform firmware driver registers platform CXL resources. For
ACPI based platform this is managed by the ACPI0017 device and the
cxl_acpi driver.
The ports are registered such that disabling a given port automatically
unregisters all descendant ports, and the chain can only be registered
after the root is established.
Given ACPI device scanning may run asynchronously compared to PCI device
scanning the root driver is tasked with rescanning the bus after the
root successfully probes.
Conversely if any ports in a chain between the root and an endpoint
becomes disconnected it subsequently triggers the endpoint to
unregister. Given lock depenedencies the endpoint unregistration happens
in a workqueue asynchronously. If userspace cares about synchronizing
delayed work after port events the /sys/bus/cxl/flush attribute is
available for that purpose.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
[djbw: clarify changelog, rework hotplug support]
Link: https://lore.kernel.org/r/164398782997.903003.9725273241627693186.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The need for a CXL port driver and a dedicated cxl_bus_type is driven by
a need to simultaneously support 2 independent physical memory decode
domains (cache coherent CXL.mem and uncached PCI.mmio) that also
intersect at a single PCIe device node. A CXL Port is a device that
advertises a CXL Component Register block with an "HDM Decoder
Capability Structure".
>From Documentation/driver-api/cxl/memory-devices.rst:
Similar to how a RAID driver takes disk objects and assembles them into
a new logical device, the CXL subsystem is tasked to take PCIe and ACPI
objects and assemble them into a CXL.mem decode topology. The need for
runtime configuration of the CXL.mem topology is also similar to RAID in
that different environments with the same hardware configuration may
decide to assemble the topology in contrasting ways. One may choose
performance (RAID0) striping memory across multiple Host Bridges and
endpoints while another may opt for fault tolerance and disable any
striping in the CXL.mem topology.
The port driver identifies whether an endpoint Memory Expander is
connected to a CXL topology. If an active (bound to the 'cxl_port'
driver) CXL Port is not found at every PCIe Switch Upstream port and an
active "root" CXL Port then the device is just a plain PCIe endpoint
only capable of participating in PCI.mmio and DMA cycles, not CXL.mem
coherent interleave sets.
The 'cxl_port' driver lets the CXL subsystem leverage driver-core
infrastructure for setup and teardown of register resources and
communicating device activation status to userspace. The cxl_bus_type
can rendezvous the async arrival of platform level CXL resources (via
the 'cxl_acpi' driver) with the asynchronous enumeration of Memory
Expander endpoints, while also implementing a hierarchical locking model
independent of the associated 'struct pci_dev' locking model. The
locking for dport and decoder enumeration is now handled in the core
rather than callers.
For now the port driver only enumerates and registers CXL resources
(downstream port metadata and decoder resources) later it will be used
to take action on its decoders in response to CXL.mem region
provisioning requests.
Note1: cxlpci.h has long depended on pci.h, but port.c was the first to
not include pci.h. Carry that dependency in cxlpci.h.
Note2: cxl port enumeration and probing complicates CXL subsystem init
to the point that it helps to have centralized debug logging of probe
events in cxl_bus_probe().
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/164374948116.464348.1772618057599155408.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The cxl_mem module was renamed cxl_pci in commit 21e9f76733 ("cxl:
Rename mem to pci"). In preparation for adding an ancillary driver for
cxl_memdev devices (registered on the cxl bus by cxl_pci), go ahead and
rename CONFIG_CXL_MEM to CONFIG_CXL_PCI. Free up the CXL_MEM name for
that new driver to manage CXL.mem endpoint operations.
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Link: https://lore.kernel.org/r/164298412409.3018233.12407355692407890752.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The cxl_acpi driver originally open-coded its table parsing since the
ACPI subtable helpers were marked __init and only used in early NUMA
initialization. Now that those helpers have been exported for driver
usage replace the open-coded solution with the common one.
Cc: Alison Schofield <alison.schofield@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/163553710810.2509508.14686373989517930921.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Register an 'nvdimm-bridge' device to act as an anchor for a libnvdimm
bus hierarchy. Also, flesh out the cxl_bus definition to allow a
cxl_nvdimm_bridge_driver to attach to the bridge and trigger the
nvdimm-bus registration.
The creation of the bridge is gated on the detection of a PMEM capable
address space registered to the root. The bridge indirection allows the
libnvdimm module to remain unloaded on platforms without PMEM support.
Given that the probing of ACPI0017 is asynchronous to CXL endpoint
devices, and the expectation that CXL endpoint devices register other
PMEM resources on the 'CXL' nvdimm bus, a workqueue is added. The
workqueue is needed to run bus_rescan_devices() outside of the
device_lock() of the nvdimm-bridge device to rendezvous nvdimm resources
as they arrive. For now only the bus is taken online/offline in the
workqueue.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/162379909706.2993820.14051258608641140169.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
CONFIG_CXL_BUS is default 'n' as expected for new functionality. When
that is enabled do not make the end user hunt for all the expected
sub-options to enable. For example CONFIG_CXL_BUS without CONFIG_CXL_MEM
is an odd/expert configuration, so is CONFIG_CXL_MEM without
CONFIG_CXL_ACPI (on ACPI capable platforms). Default CONFIG_CXL_MEM and
CONFIG_CXL_ACPI to CONFIG_CXL_BUS.
Acked-by: Ben Widawsky <ben.widawsky@intel.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/162325450105.2293126.17046356425194082921.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
While CXL builds upon the PCI software model for enumeration and
endpoint control, a static platform component is required to bootstrap
the CXL memory layout. Similar to how ACPI identifies root-level PCI
memory resources, ACPI data enumerates the address space and interleave
configuration for CXL Memory.
In addition to identifying host bridges, ACPI is responsible for
enumerating the CXL memory space that can be addressed by downstream
decoders. This is similar to the requirement for ACPI to publish
resources via the _CRS method for PCI host bridges. Specifically, ACPI
publishes a table, CXL Early Discovery Table (CEDT), which includes a
list of CXL Memory resources, CXL Fixed Memory Window Structures
(CFMWS).
For now, introduce the core infrastructure for a cxl_port hierarchy
starting with a root level anchor represented by the ACPI0017 device.
Follow on changes model support for the configurable decode capabilities
of cxl_port instances, i.e. CXL switch support.
Co-developed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Link: https://lore.kernel.org/r/162325449515.2293126.15303270193010154608.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
As the driver has undergone development, it's become clear that the
majority [entirety?] of the current functionality in mem.c is actually a
layer encapsulating functionality exposed through PCI based
interactions. This layer can be used either in isolation or to provide
functionality for higher level functionality.
CXL capabilities exist in a parallel domain to PCIe. CXL devices are
enumerable and controllable via "legacy" PCIe mechanisms; however, their
CXL capabilities are a superset of PCIe. For example, a CXL device may
be connected to a non-CXL capable PCIe root port, and therefore will not
be able to participate in CXL.mem or CXL.cache operations, but can still
be accessed through PCIe mechanisms for CXL.io operations.
To properly represent the PCI nature of this driver, and in preparation for
introducing a new driver for the CXL.mem / HDM decoder (Host-managed Device
Memory) capabilities of a CXL memory expander, rename mem.c to pci.c so that
mem.c is available for this new driver.
The result of the change is that there is a clear layering distinction
in the driver, and a systems administrator may load only the cxl_pci
module and gain access to such operations as, firmware update, offline
provisioning of devices, and error collection. In addition to freeing up
the file name for another purpose, there are two primary reasons this is
useful,
1. Acting upon devices which don't have full CXL capabilities. This
may happen for instance if the CXL device is connected in a CXL
unaware part of the platform topology.
2. Userspace-first provisioning for devices without kernel driver
interference. This may be useful when provisioning a new device
in a specific manner that might otherwise be blocked or prevented
by the real CXL mem driver.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Link: https://lore.kernel.org/r/20210526174413.802913-1-ben.widawsky@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The CXL memory device send interface will have a number of supported
commands. The raw command is not such a command. Raw commands allow
userspace to send a specified opcode to the underlying hardware and
bypass all driver checks on the command. The primary use for this
command is to [begrudgingly] allow undocumented vendor specific hardware
commands.
While not the main motivation, it also allows prototyping new hardware
commands without a driver patch and rebuild.
While this all sounds very powerful it comes with a couple of caveats:
1. Bug reports using raw commands will not get the same level of
attention as bug reports using supported commands (via taint).
2. Supported commands will be rejected by the RAW command.
With this comes new debugfs knob to allow full access to your toes with
your weapon of choice.
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com> (v2)
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Ariel Sibley <Ariel.Sibley@microchip.com>
Link: https://lore.kernel.org/r/20210217040958.1354670-6-ben.widawsky@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The CXL.mem protocol allows a device to act as a provider of "System
RAM" and/or "Persistent Memory" that is fully coherent as if the memory
was attached to the typical CPU memory controller.
With the CXL-2.0 specification a PCI endpoint can implement a "Type-3"
device interface and give the operating system control over "Host
Managed Device Memory". See section 2.3 Type 3 CXL Device.
The memory range exported by the device may optionally be described by
the platform firmware memory map, or by infrastructure like LIBNVDIMM to
provision persistent memory capacity from one, or more, CXL.mem devices.
A pre-requisite for Linux-managed memory-capacity provisioning is this
cxl_mem driver that can speak the mailbox protocol defined in section
8.2.8.4 Mailbox Registers.
For now just land the initial driver boiler-plate and Documentation/
infrastructure.
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: David Rientjes <rientjes@google.com> (v1)
Cc: Jonathan Corbet <corbet@lwn.net>
Link: https://www.computeexpresslink.org/download-the-specification
Link: https://lore.kernel.org/r/20210217040958.1354670-2-ben.widawsky@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>