There are new types and helpers that are supposed to be used in new code.
As a preparation to get rid of legacy types and API functions do
the conversion here.
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently, the EEH recovery process considers passed-through devices
as if they were not EEH-aware, which can cause them to be removed as
part of recovery. Because device removal requires cooperation from
the guest, this may lead to the process stalling or deadlocking.
Also, if devices are removed on the host side, they will be removed
from their IOMMU group, making recovery in the guest impossible.
Therefore, alter the recovery process so that passed-through devices
are not removed but are instead left frozen (and marked isolated)
until the guest performs it's own recovery. If firmware thaws a
passed-through PE because it's parent PE has been thawed (because it
was not passed through), re-freeze it.
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Using the {0} construct as a generic initializer is perfectly fine in C,
however due to a bug in old gcc there is a warning:
+ /kisskb/src/drivers/vfio/pci/vfio_pci_nvlink2.c: warning: (near
initialization for 'cap.header') [-Wmissing-braces]: => 181:9
Since for whatever reason we still want to compile the modern kernel
with such an old gcc without warnings, this changes the capabilities
initialization.
The gcc bugzilla: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119
Fixes: 7f92891778 ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The recently added nvlink2 VFIO driver introduced a license conflict in two
files. In both cases the SPDX license identifier is:
SPDX-License-Identifier: GPL-2.0+
but the files contain also the following license boiler plate text:
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation
The latter is GPL-2.9-only and not GPL-2.0=.
Looking deeper. The nvlink source file is derived from vfio_pci_igd.c which
is also licensed under GPL-2.0-only and it can be assumed that the file was
copied and modified. As the original file is licensed GPL-2.0-only it's not
possible to relicense derivative work to GPL-2.0-or-later.
Fix the SPDX identifier and remove the boiler plate as it is redundant.
Fixes: 7f92891778 ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: kvm@vger.kernel.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The below referenced commit adds a test for integer overflow, but in
doing so prevents the unmap ioctl from ever including the last page of
the address space. Subtract one to compare to the last address of the
unmap to avoid the overflow and wrap-around.
Fixes: 71a7d3d78e ("vfio/type1: silence integer overflow warning")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
Cc: stable@vger.kernel.org # v4.15+
Reported-by: Pei Zhang <pezhang@redhat.com>
Debugged-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Tested-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
drivers/vfio/pci/vfio_pci_nvlink2.c cannot be compiled for in-tree
building.
CC drivers/vfio/pci/vfio_pci_nvlink2.o
In file included from drivers/vfio/pci/trace.h:102,
from drivers/vfio/pci/vfio_pci_nvlink2.c:29:
./include/trace/define_trace.h:89:42: fatal error: ./trace.h: No such file or directory
#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)
^
compilation terminated.
make[1]: *** [scripts/Makefile.build;277: drivers/vfio/pci/vfio_pci_nvlink2.o] Error 1
To fix the build error, let's tell include/trace/define_trace.h the
location of drivers/vfio/pci/trace.h
Fixes: 7f92891778 ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver")
Reported-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Including (in no particular order):
- Page table code for AMD IOMMU now supports large pages where
smaller page-sizes were mapped before. VFIO had to work around
that in the past and I included a patch to remove it (acked by
Alex Williamson)
- Patches to unmodularize a couple of IOMMU drivers that would
never work as modules anyway.
- Work to unify the the iommu-related pointers in
'struct device' into one pointer. This work is not finished
yet, but will probably be in the next cycle.
- NUMA aware allocation in iommu-dma code
- Support for r8a774a1 and r8a774c0 in the Renesas IOMMU driver
- Scalable mode support for the Intel VT-d driver
- PM runtime improvements for the ARM-SMMU driver
- Support for the QCOM-SMMUv2 IOMMU hardware from Qualcom
- Various smaller fixes and improvements
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJcKkEoAAoJECvwRC2XARrjCCoQAJxsgaAF5Z0s7z8j2A9SkaGp
SIMnUAI5mDOdyhTOAI+eehpRzg5UVYt/JjFYnHz8HWqbSc8YOvDvHafmhMFIwYvO
hq5knbs6ns2jJNFO+M4dioDq+3THdqkGIF5xoHdGTP7cn9+XyQ8lAoHo0RuL122U
PJGqX7Cp4XnFP4HMb3uQYhVeBV7mU+XqAdB+4aDnQkzI5LkQCRr74GcqOm+Rlnyc
cmQWc2arUMjgc1TJIrex8dx9dT6lq8kOmhyEg/IjHeGaZyJ3HqA+30XDDLEExN0G
MeVawuxJz40HgXlkXr+iZTQtIFYkXdKvJH6rptMbOfbDeDz+YZ01TbtAMMH9o4jX
yxjjMjdcWTsWYQ/MHHdsoMP34cajCi/EYPMNksbycw+E3Y+X/bSReCoWC0HUK8/+
Z4TpZ9mZVygtJR+QNZ+pE9oiJpb4sroM10zTnbMoVHNnvfsO01FYk7FMPkolSKLw
zB4MDswQYgchoFR9Z4ZB4PycYTzeafLKYgDPDoD1vIJgDavuidwvDWDRTDc+aMWM
siIIewq19To9jDJkVjX4dsT/p99KVKgAR/Ps6jjWkAroha7g6GcmlYZHIJnyop04
jiaSXUsk8aRucP/CRz5xdMmaGoN7BsNmpUjcrquc6Povk/6gvXvpY04oCs1+gNMX
ipL9E3GTFCVBubRFrksv
=DT9A
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
- Page table code for AMD IOMMU now supports large pages where smaller
page-sizes were mapped before. VFIO had to work around that in the
past and I included a patch to remove it (acked by Alex Williamson)
- Patches to unmodularize a couple of IOMMU drivers that would never
work as modules anyway.
- Work to unify the the iommu-related pointers in 'struct device' into
one pointer. This work is not finished yet, but will probably be in
the next cycle.
- NUMA aware allocation in iommu-dma code
- Support for r8a774a1 and r8a774c0 in the Renesas IOMMU driver
- Scalable mode support for the Intel VT-d driver
- PM runtime improvements for the ARM-SMMU driver
- Support for the QCOM-SMMUv2 IOMMU hardware from Qualcom
- Various smaller fixes and improvements
* tag 'iommu-updates-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (78 commits)
iommu: Check for iommu_ops == NULL in iommu_probe_device()
ACPI/IORT: Don't call iommu_ops->add_device directly
iommu/of: Don't call iommu_ops->add_device directly
iommu: Consolitate ->add/remove_device() calls
iommu/sysfs: Rename iommu_release_device()
dmaengine: sh: rcar-dmac: Use device_iommu_mapped()
xhci: Use device_iommu_mapped()
powerpc/iommu: Use device_iommu_mapped()
ACPI/IORT: Use device_iommu_mapped()
iommu/of: Use device_iommu_mapped()
driver core: Introduce device_iommu_mapped() function
iommu/tegra: Use helper functions to access dev->iommu_fwspec
iommu/qcom: Use helper functions to access dev->iommu_fwspec
iommu/of: Use helper functions to access dev->iommu_fwspec
iommu/mediatek: Use helper functions to access dev->iommu_fwspec
iommu/ipmmu-vmsa: Use helper functions to access dev->iommu_fwspec
iommu/dma: Use helper functions to access dev->iommu_fwspec
iommu/arm-smmu: Use helper functions to access dev->iommu_fwspec
ACPI/IORT: Use helper functions to access dev->iommu_fwspec
iommu: Introduce wrappers around dev->iommu_fwspec
...
- Replace global vfio-pci lock with per bus lock to allow concurrent
open and release (Alex Williamson)
- Declare mdev function as static (Paolo Cretaro)
- Convert char to u8 in mdev/mtty sample driver (Nathan Chancellor)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJcHQpWAAoJECObm247sIsi1x0QAIZ+XICmk+cWiYJBOH64IIqS
ZjjzKj0QtIBnAtecK9OjCeVSbcaPeL10DPNmg/QF0/TmO4Ch71XZZ2D9DOW1neVv
7FPrK9pXDIy/vDq/5yGXr9d/x5LvqiWFG3DmUp+lmUEbd8wFbdBep1/aSr6HY+78
LGoX8TAGuNfj1+GLhL/jXanFE1Az1+dTh63umI9TptRf+HGxQZmRY5JqfZ8OMWrl
vNmiugmkS5quhIKa4PO2ZFBndodgI5hMxezkEDqXzOe0bzQ8uBzs43UVa8/V+Ais
05F4Ya5CbeQYTOkSnwN0DC4Vy6WWUtz6TJBB2x8P57ZYb8PYQg7yJluBxZVYmR4S
0c8/Dqi95fUcVSUmDmW/q46YEqFBcXQZEmHP2lap2BLI4H/It7ft/cGacfUuvIq5
g9H7tZBF8g1MsjYRm4D86kC1AcstbBgZeywmmtNwdTVYpeGbIv7DLvWMlVP0Jk/J
gKKAO1RHo/tnpZv/W+ge5pXDSOxZOgHRlXnnlj8jMFjQh0J15dtSmQM13MZhaxnZ
5Q7emt2jv7kvYqsBbkTcS27ii4kmjMlnsX0YPL8OK/QlQbW8jJwIraFoPZvUklkZ
FrqU9DIJlaGbtMwB04b9wEvWtfPKl/6ycFWFLu+9rFMVaOVT5oeHUQFV1PWOErw8
/l0qJqiTy9y1K4DJEU1P
=HOtX
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.21-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Replace global vfio-pci lock with per bus lock to allow concurrent
open and release (Alex Williamson)
- Declare mdev function as static (Paolo Cretaro)
- Convert char to u8 in mdev/mtty sample driver (Nathan Chancellor)
* tag 'vfio-v4.21-rc1' of git://github.com/awilliam/linux-vfio:
vfio-mdev/samples: Use u8 instead of char for handle functions
vfio/mdev: add static modifier to add_mdev_supported_type
vfio/pci: Parallelize device open and release
POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
pluggable PCIe devices but still have PCIe links which are used
for config space and MMIO. In addition to that the GPUs have 6 NVLinks
which are connected to other GPUs and the POWER9 CPU. POWER9 chips
have a special unit on a die called an NPU which is an NVLink2 host bus
adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
These systems also support ATS (address translation services) which is
a part of the NVLink2 protocol. Such GPUs also share on-board RAM
(16GB or 32GB) to the system via the same NVLink2 so a CPU has
cache-coherent access to a GPU RAM.
This exports GPU RAM to the userspace as a new VFIO device region. This
preregisters the new memory as device memory as it might be used for DMA.
This inserts pfns from the fault handler as the GPU memory is not onlined
until the vendor driver is loaded and trained the NVLinks so doing this
earlier causes low level errors which we fence in the firmware so
it does not hurt the host system but still better be avoided; for the same
reason this does not map GPU RAM into the host kernel (usual thing for
emulated access otherwise).
This exports an ATSD (Address Translation Shootdown) register of NPU which
allows TLB invalidations inside GPU for an operating system. The register
conveniently occupies a single 64k page. It is also presented to
the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
each of them can be used for TLB invalidation in a GPU linked to this NPU.
This allocates one ATSD register per an NVLink bridge allowing passing
up to 6 registers. Due to the host firmware bug (just recently fixed),
only 1 ATSD register per NPU was actually advertised to the host system
so this passes that alone register via the first NVLink bridge device in
the group which is still enough as QEMU collects them all back and
presents to the guest via vPHB to mimic the emulated NPU PHB on the host.
In order to provide the userspace with the information about GPU-to-NVLink
connections, this exports an additional capability called "tgt"
(which is an abbreviated host system bus address). The "tgt" property
tells the GPU its own system address and allows the guest driver to
conglomerate the routing information so each GPU knows how to get directly
to the other GPUs.
For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
know LPID (a logical partition ID or a KVM guest hardware ID in other
words) and PID (a memory context ID of a userspace process, not to be
confused with a linux pid). This assigns a GPU to LPID in the NPU and
this is why this adds a listener for KVM on an IOMMU group. A PID comes
via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
This requires coherent memory and ATSD to be available on the host as
the GPU vendor only supports configurations with both features enabled
and other configurations are known not to work. Because of this and
because of the ways the features are advertised to the host system
(which is a device tree with very platform specific properties),
this requires enabled POWERNV platform.
The V100 GPUs do not advertise any of these capabilities via the config
space and there are more than just one device ID so this relies on
the platform to tell whether these GPUs have special abilities such as
NVLinks.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
VFIO regions already support region capabilities with a limited set of
fields. However the subdriver might have to report to the userspace
additional bits.
This adds an add_capability() hook to vfio_pci_regops.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
So far we only allowed mapping of MMIO BARs to the userspace. However
there are GPUs with on-board coherent RAM accessible via side
channels which we also want to map to the userspace. The first client
for this is NVIDIA V100 GPU with NVLink2 direct links to a POWER9
NPU-enabled CPU; such GPUs have 16GB RAM which is coherently mapped
to the system address space, we are going to export these as an extra
PCI region.
We already support extra PCI regions and this adds support for mapping
them to the userspace.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This new memory does not have page structs as it is not plugged to
the host so gup() will fail anyway.
This adds 2 helpers:
- mm_iommu_newdev() to preregister the "memory device" memory so
the rest of API can still be used;
- mm_iommu_is_devmem() to know if the physical address is one of thise
new regions which we must avoid unpinning of.
This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
if the memory is device memory to avoid pfn_to_page().
This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which
does delayed pages dirtying.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Normally mm_iommu_get() should add a reference and mm_iommu_put() should
remove it. However historically mm_iommu_find() does the referencing and
mm_iommu_get() is doing allocation and referencing.
We are going to add another helper to preregister device memory so
instead of having mm_iommu_new() (which pre-registers the normal memory
and references the region), we need separate helpers for pre-registering
and referencing.
This renames:
- mm_iommu_get to mm_iommu_new;
- mm_iommu_find to mm_iommu_get.
This changes mm_iommu_get() to reference the region so the name now
reflects what it does.
This removes the check for exact match from mm_iommu_new() as we want it
to fail on existing regions; mm_iommu_get() should be used instead.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As a part of cleanup, the SPAPR TCE IOMMU subdriver releases preregistered
memory. If there is a bug in memory release, the loop in
tce_iommu_release() becomes infinite; this actually happened to me.
This makes the loop finite and prints a warning on every failure to make
the code more bug prone.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Set add_mdev_supported_type as static since it is only used within
mdev_sysfs.c. This fixes -Wmissing-prototypes gcc warning.
Signed-off-by: Paolo Cretaro <paolocretaro@gmail.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In commit 61d792562b ("vfio-pci: Use mutex around open, release, and
remove") a mutex was added to freeze the refcnt for a device so that
we can handle errors and perform bus resets on final close. However,
bus resets can be rather slow and a global mutex here is undesirable.
Evaluating the potential locking granularity, a per-device mutex
provides the best resolution but with multiple devices on a bus all
released concurrently, they'll race to acquire each other's mutex,
likely resulting in no reset at all if we use trylock. We therefore
lock at the granularity of the bus/slot reset as we're only attempting
a single reset for this group of devices anyway. This allows much
greater scaling as we're bounded in the number of devices protected by
a single reflck object.
Reported-by: Christian Ehrhardt <christian.ehrhardt@canonical.com>
Tested-by: Christian Ehrhardt <christian.ehrhardt@canonical.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The AMD IOMMU driver can now map a huge-page where smaller
mappings existed before, so this code-path is no longer
triggered.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The powernv platform maintains 2 TCE tables for VFIO - a hardware TCE
table and a table with userspace addresses. These tables are radix trees,
we allocate indirect levels when they are written to. Since
the memory allocation is problematic in real mode, we have 2 accessors
to the entries:
- for virtual mode: it allocates the memory and it is always expected
to return non-NULL;
- fr real mode: it does not allocate and can return NULL.
Also, DMA windows can span to up to 55 bits of the address space and since
we never have this much RAM, such windows are sparse. However currently
the SPAPR TCE IOMMU driver walks through all TCEs to unpin DMA memory.
Since we maintain a userspace addresses table for VFIO which is a mirror
of the hardware table, we can use it to know which parts of the DMA
window have not been mapped and skip these so does this patch.
The bare metal systems do not have this problem as they use a bypass mode
of a PHB which maps RAM directly.
This helps a lot with sparse DMA windows, reducing the shutdown time from
about 3 minutes per 1 billion TCEs to a few seconds for 32GB sparse guest.
Just skipping the last level seems to be good enough.
As non-allocating accessor is used now in virtual mode as well, rename it
from IOMMU_TABLE_USERSPACE_ENTRY_RM (real mode) to _RO (read only).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently the type-1 IOMMU instantiation depends on "ARM_SMMU ||
ARM_SMMU_V3", while it applies to other ARM/ARM64 platforms with an
IOMMU (e.g. Renesas VMSA-compatible IPMMUs).
Instead of extending the list of IOMMU types on ARM platforms, replace
the list by "ARM || ARM64", like other architectures do. The feature is
still restricted to ARM/ARM64 platforms with an IOMMU by the dependency
on IOMMU_API.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The SR-IOV spec requires that VFs must report zero for the INTx pin
register as VFs are precluded from INTx support. It's much easier for
the host kernel to understand whether a device is a VF and therefore
whether a non-zero pin register value is bogus than it is to do the
same in userspace. Override the INTx count for such devices and
virtualize the pin register to provide a consistent view of the device
to the user.
As this is clearly a spec violation, warn about it to support hardware
validation, but also provide a known whitelist as it doesn't do much
good to continue complaining if the hardware vendor doesn't plan to
fix it.
Known devices with this issue: 8086:270c
Tested-by: Gage Eads <gage.eads@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Free allocated vdev->msi_perm in error path.
Signed-off-by: Li Qiang <liq3ea@gmail.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Notable changes:
- A fix for a bug in our page table fragment allocator, where a page table page
could be freed and reallocated for something else while still in use, leading
to memory corruption etc. The fix reuses pt_mm in struct page (x86 only) for
a powerpc only refcount.
- Fixes to our pkey support. Several are user-visible changes, but bring us in
to line with x86 behaviour and/or fix outright bugs. Thanks to Florian Weimer
for reporting many of these.
- A series to improve the hvc driver & related OPAL console code, which have
been seen to cause hardlockups at times. The hvc driver changes in particular
have been in linux-next for ~month.
- Increase our MAX_PHYSMEM_BITS to 128TB when SPARSEMEM_VMEMMAP=y.
- Remove Power8 DD1 and Power9 DD1 support, neither chip should be in use
anywhere other than as a paper weight.
- An optimised memcmp implementation using Power7-or-later VMX instructions
- Support for barrier_nospec on some NXP CPUs.
- Support for flushing the count cache on context switch on some IBM CPUs
(controlled by firmware), as a Spectre v2 mitigation.
- A series to enhance the information we print on unhandled signals to bring it
into line with other arches, including showing the offending VMA and dumping
the instructions around the fault.
Thanks to:
Aaro Koskinen, Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Alexey
Spirkov, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar,
Arnd Bergmann, Bartosz Golaszewski, Benjamin Herrenschmidt, Bharat Bhushan,
Bjoern Noetel, Boqun Feng, Breno Leitao, Bryant G. Ly, Camelia Groza,
Christophe Leroy, Christoph Hellwig, Cyril Bur, Dan Carpenter, Daniel Klamt,
Darren Stevens, Dave Young, David Gibson, Diana Craciun, Finn Thain, Florian
Weimer, Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geoff Levand,
Guenter Roeck, Gustavo Romero, Haren Myneni, Hari Bathini, Joel Stanley,
Jonathan Neuschäfer, Kees Cook, Madhavan Srinivasan, Mahesh Salgaonkar, Markus
Elfring, Mathieu Malaterre, Mauro S. M. Rodrigues, Michael Hanselmann, Michael
Neuling, Michael Schmitz, Mukesh Ojha, Murilo Opsfelder Araujo, Nicholas
Piggin, Parth Y Shah, Paul Mackerras, Paul Menzel, Ram Pai, Randy Dunlap,
Rashmica Gupta, Reza Arbab, Rodrigo R. Galvao, Russell Currey, Sam Bobroff,
Scott Wood, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stan Johnson,
Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vasant Hegde, Venkat Rao
B, zhong jiang.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAlt2O6cTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgC7hD/4+cj796Df7GsVsIMxzQm7SS9dklIdO
JuKj2Nr5HRzTH59jWlXukLG9mfTNCFgFJB4gEpK1ArDOTcHTCI9RRsLZTZ/kum66
7Pd+7T40dLYXB5uecuUs0vMXa2fI3syKh1VLzACSXv3Dh9BBIKQBwW/aD2eww4YI
1fS5LnXZ2PSxfr6KNAC6ogZnuaiD0sHXOYrtGHq+S/TFC7+Z6ySa6+AnPS+hPVoo
/rHDE1Khr66aj7uk+PP2IgUrCFj6Sbj6hTVlS/iAuwbMjUl9ty6712PmvX9x6wMZ
13hJQI+g6Ci+lqLKqmqVUpXGSr6y4NJGPS/Hko4IivBTJApI+qV/tF2H9nxU+6X0
0RqzsMHPHy13n2torA1gC7ttzOuXPI4hTvm6JWMSsfmfjTxLANJng3Dq3ejh6Bqw
76EMowpDLexwpy7/glPpqNdsP4ySf2Qm8yq3mR7qpL4m3zJVRGs11x+s5DW8NKBL
Fl5SqZvd01abH+sHwv6NLaLkEtayUyohxvyqu2RU3zu5M5vi7DhqstybTPjKPGu0
icSPh7b2y10WpOUpC6lxpdi8Me8qH47mVc/trZ+SpgBrsuEmtJhGKszEnzRCOqos
o2IhYHQv3lQv86kpaAFQlg/RO+Lv+Lo5qbJ209V+hfU5nYzXpEulZs4dx1fbA+ze
fK8GEh+u0L4uJg==
=PzRz
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- A fix for a bug in our page table fragment allocator, where a page
table page could be freed and reallocated for something else while
still in use, leading to memory corruption etc. The fix reuses
pt_mm in struct page (x86 only) for a powerpc only refcount.
- Fixes to our pkey support. Several are user-visible changes, but
bring us in to line with x86 behaviour and/or fix outright bugs.
Thanks to Florian Weimer for reporting many of these.
- A series to improve the hvc driver & related OPAL console code,
which have been seen to cause hardlockups at times. The hvc driver
changes in particular have been in linux-next for ~month.
- Increase our MAX_PHYSMEM_BITS to 128TB when SPARSEMEM_VMEMMAP=y.
- Remove Power8 DD1 and Power9 DD1 support, neither chip should be in
use anywhere other than as a paper weight.
- An optimised memcmp implementation using Power7-or-later VMX
instructions
- Support for barrier_nospec on some NXP CPUs.
- Support for flushing the count cache on context switch on some IBM
CPUs (controlled by firmware), as a Spectre v2 mitigation.
- A series to enhance the information we print on unhandled signals
to bring it into line with other arches, including showing the
offending VMA and dumping the instructions around the fault.
Thanks to: Aaro Koskinen, Akshay Adiga, Alastair D'Silva, Alexey
Kardashevskiy, Alexey Spirkov, Alistair Popple, Andrew Donnellan,
Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Bartosz Golaszewski,
Benjamin Herrenschmidt, Bharat Bhushan, Bjoern Noetel, Boqun Feng,
Breno Leitao, Bryant G. Ly, Camelia Groza, Christophe Leroy, Christoph
Hellwig, Cyril Bur, Dan Carpenter, Daniel Klamt, Darren Stevens, Dave
Young, David Gibson, Diana Craciun, Finn Thain, Florian Weimer,
Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geoff Levand,
Guenter Roeck, Gustavo Romero, Haren Myneni, Hari Bathini, Joel
Stanley, Jonathan Neuschäfer, Kees Cook, Madhavan Srinivasan, Mahesh
Salgaonkar, Markus Elfring, Mathieu Malaterre, Mauro S. M. Rodrigues,
Michael Hanselmann, Michael Neuling, Michael Schmitz, Mukesh Ojha,
Murilo Opsfelder Araujo, Nicholas Piggin, Parth Y Shah, Paul
Mackerras, Paul Menzel, Ram Pai, Randy Dunlap, Rashmica Gupta, Reza
Arbab, Rodrigo R. Galvao, Russell Currey, Sam Bobroff, Scott Wood,
Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stan Johnson, Thiago
Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vasant Hegde, Venkat
Rao, zhong jiang"
* tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (234 commits)
powerpc/mm/book3s/radix: Add mapping statistics
powerpc/uaccess: Enable get_user(u64, *p) on 32-bit
powerpc/mm/hash: Remove unnecessary do { } while(0) loop
powerpc/64s: move machine check SLB flushing to mm/slb.c
powerpc/powernv/idle: Fix build error
powerpc/mm/tlbflush: update the mmu_gather page size while iterating address range
powerpc/mm: remove warning about ‘type’ being set
powerpc/32: Include setup.h header file to fix warnings
powerpc: Move `path` variable inside DEBUG_PROM
powerpc/powermac: Make some functions static
powerpc/powermac: Remove variable x that's never read
cxl: remove a dead branch
powerpc/powermac: Add missing include of header pmac.h
powerpc/kexec: Use common error handling code in setup_new_fdt()
powerpc/xmon: Add address lookup for percpu symbols
powerpc/mm: remove huge_pte_offset_and_shift() prototype
powerpc/lib: Use patch_site to patch copy_32 functions once cache is enabled
powerpc/pseries: Fix endianness while restoring of r3 in MCE handler.
powerpc/fadump: merge adjacent memory ranges to reduce PT_LOAD segements
powerpc/fadump: handle crash memory ranges array index overflow
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAlt1f9AUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxbdhAArnhRvkwOk4m4/LCuKF6HpmlxbBNC
TjnBCenNf+lFXzWskfDFGFl/Wif4UzGbRTSCNQrwMzj3Ww3f/6R2QIq9rEJvyNC4
VdxQnaBEZSUgN87q5UGqgdjMTo3zFvlFH6fpb5XDiQ5IX/QZeXeYqoB64w+HvKPU
M+IsoOvnA5gb7pMcpchrGUnSfS1e6AqQbbTt6tZflore6YCEA4cH5OnpGx8qiZIp
ut+CMBvQjQB01fHeBc/wGrVte4NwXdONrXqpUb4sHF7HqRNfEh0QVyPhvebBi+k1
kquqoBQfPFTqgcab31VOcQhg70dEx+1qGm5/YBAwmhCpHR/g2gioFXoROsr+iUOe
BtF6LZr+Y8cySuhJnkCrJBqWvvBaKbJLg0KMbI+7p4o9MZpod2u7LS5LFrlRDyKW
3nz3o+b1+v3tCCKVKIhKo0ljolgkweQtR1f6KIHvq93wBODHVQnAOt9NlPfHVyks
ryGBnOhMjoU5hvfexgIWFk9Ph9MEVQSffkI+TeFPO/tyGBfGfQyGtESiXuEaMQaH
FGdZHX2RLkY3pWHOtWeMzRHzOnr2XjpDFcAqL3HBGPdJ30K3Umv3WOgoFe2SaocG
0gaddPjKSwwM4Sa/VP+O5cjGuzi7QnczSDdpYjxIGZzBav32hqx4/rsnLw7bHH8y
XkEme7cYJc8MGsA=
=2Dmn
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci updates from Bjorn Helgaas:
- Decode AER errors with names similar to "lspci" (Tyler Baicar)
- Expose AER statistics in sysfs (Rajat Jain)
- Clear AER status bits selectively based on the type of recovery (Oza
Pawandeep)
- Honor "pcie_ports=native" even if HEST sets FIRMWARE_FIRST (Alexandru
Gagniuc)
- Don't clear AER status bits if we're using the "Firmware-First"
strategy where firmware owns the registers (Alexandru Gagniuc)
- Use sysfs_match_string() to simplify ASPM sysfs parsing (Andy
Shevchenko)
- Remove unnecessary includes of <linux/pci-aspm.h> (Bjorn Helgaas)
- Defer DPC event handling to work queue (Keith Busch)
- Use threaded IRQ for DPC bottom half (Keith Busch)
- Print AER status while handling DPC events (Keith Busch)
- Work around IDT switch ACS Source Validation erratum (James
Puthukattukaran)
- Emit diagnostics for all cases of PCIe Link downtraining (Links
operating slower than they're capable of) (Alexandru Gagniuc)
- Skip VFs when configuring Max Payload Size (Myron Stowe)
- Reduce Root Port Max Payload Size if necessary when hot-adding a
device below it (Myron Stowe)
- Simplify SHPC existence/permission checks (Bjorn Helgaas)
- Remove hotplug sample skeleton driver (Lukas Wunner)
- Convert pciehp to threaded IRQ handling (Lukas Wunner)
- Improve pciehp tolerance of missed events and initially unstable
links (Lukas Wunner)
- Clear spurious pciehp events on resume (Lukas Wunner)
- Add pciehp runtime PM support, including for Thunderbolt controllers
(Lukas Wunner)
- Support interrupts from pciehp bridges in D3hot (Lukas Wunner)
- Mark fall-through switch cases before enabling -Wimplicit-fallthrough
(Gustavo A. R. Silva)
- Move DMA-debug PCI init from arch code to PCI core (Christoph
Hellwig)
- Fix pci_request_irq() usage of IRQF_ONESHOT when no handler is
supplied (Heiner Kallweit)
- Unify PCI and DMA direction #defines (Shunyong Yang)
- Add PCI_DEVICE_DATA() macro (Andy Shevchenko)
- Check for VPD completion before checking for timeout (Bert Kenward)
- Limit Netronome NFP5000 config space size to work around erratum
(Jakub Kicinski)
- Set IRQCHIP_ONESHOT_SAFE for PCI MSI irqchips (Heiner Kallweit)
- Document ACPI description of PCI host bridges (Bjorn Helgaas)
- Add "pci=disable_acs_redir=" parameter to disable ACS redirection for
peer-to-peer DMA support (we don't have the peer-to-peer support yet;
this is just one piece) (Logan Gunthorpe)
- Clean up devm_of_pci_get_host_bridge_resources() resource allocation
(Jan Kiszka)
- Fixup resizable BARs after suspend/resume (Christian König)
- Make "pci=earlydump" generic (Sinan Kaya)
- Fix ROM BAR access routines to stay in bounds and check for signature
correctly (Rex Zhu)
- Add DMA alias quirk for Microsemi Switchtec NTB (Doug Meyer)
- Expand documentation for pci_add_dma_alias() (Logan Gunthorpe)
- To avoid bus errors, enable PASID only if entire path supports
End-End TLP prefixes (Sinan Kaya)
- Unify slot and bus reset functions and remove hotplug knowledge from
callers (Sinan Kaya)
- Add Function-Level Reset quirks for Intel and Samsung NVMe devices to
fix guest reboot issues (Alex Williamson)
- Add function 1 DMA alias quirk for Marvell 88SS9183 PCIe SSD
Controller (Bjorn Helgaas)
- Remove Xilinx AXI-PCIe host bridge arch dependency (Palmer Dabbelt)
- Remove Aardvark outbound window configuration (Evan Wang)
- Fix Aardvark bridge window sizing issue (Zachary Zhang)
- Convert Aardvark to use pci_host_probe() to reduce code duplication
(Thomas Petazzoni)
- Correct the Cadence cdns_pcie_writel() signature (Alan Douglas)
- Add Cadence support for optional generic PHYs (Alan Douglas)
- Add Cadence power management ops (Alan Douglas)
- Remove redundant variable from Cadence driver (Colin Ian King)
- Add Kirin MSI support (Xiaowei Song)
- Drop unnecessary root_bus_nr setting from exynos, imx6, keystone,
armada8k, artpec6, designware-plat, histb, qcom, spear13xx (Shawn
Guo)
- Move link notification settings from DesignWare core to individual
drivers (Gustavo Pimentel)
- Add endpoint library MSI-X interfaces (Gustavo Pimentel)
- Correct signature of endpoint library IRQ interfaces (Gustavo
Pimentel)
- Add DesignWare endpoint library MSI-X callbacks (Gustavo Pimentel)
- Add endpoint library MSI-X test support (Gustavo Pimentel)
- Remove unnecessary GFP_ATOMIC from Hyper-V "new child" allocation
(Jia-Ju Bai)
- Add more devices to Broadcom PAXC quirk (Ray Jui)
- Work around corrupted Broadcom PAXC config space to enable SMMU and
GICv3 ITS (Ray Jui)
- Disable MSI parsing to work around broken Broadcom PAXC logic in some
devices (Ray Jui)
- Hide unconfigured functions to work around a Broadcom PAXC defect
(Ray Jui)
- Lower iproc log level to reduce console output during boot (Ray Jui)
- Fix mobiveil iomem/phys_addr_t type usage (Lorenzo Pieralisi)
- Fix mobiveil missing include file (Lorenzo Pieralisi)
- Add mobiveil Kconfig/Makefile support (Lorenzo Pieralisi)
- Fix mvebu I/O space remapping issues (Thomas Petazzoni)
- Use generic pci_host_bridge in mvebu instead of ARM-specific API
(Thomas Petazzoni)
- Whitelist VMD devices with fast interrupt handlers to avoid sharing
vectors with slow handlers (Keith Busch)
* tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (153 commits)
PCI/AER: Don't clear AER bits if error handling is Firmware-First
PCI: Limit config space size for Netronome NFP5000
PCI/MSI: Set IRQCHIP_ONESHOT_SAFE for PCI-MSI irqchips
PCI/VPD: Check for VPD access completion before checking for timeout
PCI: Add PCI_DEVICE_DATA() macro to fully describe device ID entry
PCI: Match Root Port's MPS to endpoint's MPSS as necessary
PCI: Skip MPS logic for Virtual Functions (VFs)
PCI: Add function 1 DMA alias quirk for Marvell 88SS9183
PCI: Check for PCIe Link downtraining
PCI: Add ACS Redirect disable quirk for Intel Sunrise Point
PCI: Add device-specific ACS Redirect disable infrastructure
PCI: Convert device-specific ACS quirks from NULL termination to ARRAY_SIZE
PCI: Add "pci=disable_acs_redir=" parameter for peer-to-peer support
PCI: Allow specifying devices using a base bus and path of devfns
PCI: Make specifying PCI devices in kernel parameters reusable
PCI: Hide ACS quirk declarations inside PCI core
PCI: Delay after FLR of Intel DC P3700 NVMe
PCI: Disable Samsung SM961/PM961 NVMe before FLR
PCI: Export pcie_has_flr()
PCI: mvebu: Drop bogus comment above mvebu_pcie_map_registers()
...
We expect to receive PFs with SR-IOV disabled, however some host
drivers leave SR-IOV enabled at unbind. This puts us in a state where
we can potentially assign both the PF and the VF, leading to both
functionality as well as security concerns due to lack of managing the
SR-IOV state as well as vendor dependent isolation from the PF to VF.
If we were to attempt to actively disable SR-IOV on driver probe, we
risk VF bound drivers blocking, potentially risking live lock
scenarios. Therefore simply refuse to bind to PFs with SR-IOV enabled
with a warning message indicating the issue. Users can resolve this
by re-binding to the host driver and disabling SR-IOV before
attempting to use the device with vfio-pci.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Two regression fixes, one for xmon disassembly formatting and the other to fix
the E500 build.
Two commits to fix a potential security issue in the VFIO code under obscure
circumstances.
And finally a fix to the Power9 idle code to restore SPRG3, which is user
visible and used for sched_getcpu().
Thanks to:
Alexey Kardashevskiy, David Gibson. Gautham R. Shenoy, James Clarke.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJbUrFqExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYCW
mQ//eYpaYIkEthXH0uHUN2wpWZlhbg0wUtjclsT5RUonDwMC2PM9BUuLk61RtIlD
jbi39GrUISHf13U1Ydhb3e1I5B+TpDe6hgb/dGUMAXPe6rt+jFREogZR+vgIU0ep
Q8ta1GIgbI6La0zXn5o3apUtqR7bAQ3cWD2T8vN4tQmAQZPw1fV13cZZ2kFs5JFO
aYX8pD76wkAUz8Im+bweziRSRYdAIi8Oxt1Cdyg9Oti5y4fp4LuQKa/qbfkswkkk
2ycG3TWr4Ln8RM/GaUPW1UPh1Zd8b6et7vUnhxO1g/JdEXlnm9A+DifJFrUk/+y8
DsyofdXUj+u+LXX/H8uqqse7ysfvkC53R15Jo8irISsIgYZcv4yKsZKGJR27wHGV
h9KBDA0ZK+czU2jK4gZAdMTs1IICgXGUVIL+nI8U1sRBep3CI3HguqBVoC/MGes3
WR2+8i2diL1m2jAHR5Nd3+ArBpI876ZalhDmM1Mv3GRUAvjo9nsehk32/nfBDYUM
nPQd1vi3F6w1MSlj7aqA3quTeGANwlnOcdC4QB++tz7oO1sY6ZYZtii3jsGSluXL
vP/Mxvz2AG3v9KPCDQJqzx0n3vMAmGphetddxYl1X1dSwbwLNx0GggKNe0xw6sjp
6lbin0w/Ot79VoSEncuRdm1hrtIktwubsH5CGB7Gxke5Kgk=
=Sfmf
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Two regression fixes, one for xmon disassembly formatting and the
other to fix the E500 build.
Two commits to fix a potential security issue in the VFIO code under
obscure circumstances.
And finally a fix to the Power9 idle code to restore SPRG3, which is
user visible and used for sched_getcpu().
Thanks to: Alexey Kardashevskiy, David Gibson. Gautham R. Shenoy,
James Clarke"
* tag 'powerpc-4.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/powernv: Fix save/restore of SPRG3 on entry/exit from stop (idle)
powerpc/Makefile: Assemble with -me500 when building for E500
KVM: PPC: Check if IOMMU page is contained in the pinned physical page
vfio/spapr: Use IOMMU pageshift rather than pagesize
powerpc/xmon: Fix disassembly since printf changes
Now that the old implementation of pci_reset_bus() is gone, replace
pci_try_reset_bus() with pci_reset_bus().
Compared to the old implementation, new code will fail immmediately with
-EAGAIN if object lock cannot be obtained.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Drivers are expected to call pci_try_reset_slot() or pci_try_reset_bus() by
querying if a system supports hotplug or not. A survey showed that most
drivers don't do this and we are leaking hotplug capability to the user.
Hide pci_try_slot_reset() from drivers and embed into pci_try_bus_reset().
Change pci_try_reset_bus() parameter from struct pci_bus to struct pci_dev.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
info.index can be indirectly controlled by user-space, hence leading
to a potential exploitation of the Spectre variant 1 vulnerability.
This issue was detected with the help of Smatch:
drivers/vfio/pci/vfio_pci.c:734 vfio_pci_ioctl()
warn: potential spectre issue 'vdev->region'
Fix this by sanitizing info.index before indirectly using it to index
vdev->region
Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
Cc: stable@vger.kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
A VM which has:
- a DMA capable device passed through to it (eg. network card);
- running a malicious kernel that ignores H_PUT_TCE failure;
- capability of using IOMMU pages bigger that physical pages
can create an IOMMU mapping that exposes (for example) 16MB of
the host physical memory to the device when only 64K was allocated to the VM.
The remaining 16MB - 64K will be some other content of host memory, possibly
including pages of the VM, but also pages of host kernel memory, host
programs or other VMs.
The attacking VM does not control the location of the page it can map,
and is only allowed to map as many pages as it has pages of RAM.
We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
an IOMMU page is contained in the physical page so the PCI hardware won't
get access to unassigned host memory; however this check is missing in
the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
did not hit this yet as the very first time when the mapping happens
we do not have tbl::it_userspace allocated yet and fall back to
the userspace which in turn calls VFIO IOMMU driver, this fails and
the guest does not retry,
This stores the smallest preregistered page size in the preregistered
region descriptor and changes the mm_iommu_xxx API to check this against
the IOMMU page size.
This calculates maximum page size as a minimum of the natural region
alignment and compound page size. For the page shift this uses the shift
returned by find_linux_pte() which indicates how the page is mapped to
the current userspace - if the page is huge and this is not a zero, then
it is a leaf pte and the page is mapped within the range.
Fixes: 121f80ba68 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The size is always equal to 1 page so let's use this. Later on this will
be used for other checks which use page shifts to check the granularity
of access.
This should cause no behavioral change.
Cc: stable@vger.kernel.org # v4.12+
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the moment we allocate the entire TCE table, twice (hardware part and
userspace translation cache). This normally works as we normally have
contigous memory and the guest will map entire RAM for 64bit DMA.
However if we have sparse RAM (one example is a memory device), then
we will allocate TCEs which will never be used as the guest only maps
actual memory for DMA. If it is a single level TCE table, there is nothing
we can really do but if it a multilevel table, we can skip allocating
TCEs we know we won't need.
This adds ability to allocate only first level, saving memory.
This changes iommu_table::free() to avoid allocating of an extra level;
iommu_table::set() will do this when needed.
This adds @alloc parameter to iommu_table::exchange() to tell the callback
if it can allocate an extra level; the flag is set to "false" for
the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns
H_TOO_HARD.
This still requires the entire table to be counted in mm::locked_vm.
To be conservative, this only does on-demand allocation when
the usespace cache table is requested which is the case of VFIO.
The example math for a system replicating a powernv setup with NVLink2
in a guest:
16GB RAM mapped at 0x0
128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000
the table to cover that all with 64K pages takes:
(((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB
If we allocate only necessary TCE levels, we will only need:
(((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect
levels).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We want to support sparse memory and therefore huge chunks of DMA windows
do not need to be mapped. If a DMA window big enough to require 2 or more
indirect levels, and a DMA window is used to map all RAM (which is
a default case for 64bit window), we can actually save some memory by
not allocation TCE for regions which we are not going to map anyway.
The hardware tables alreary support indirect levels but we also keep
host-physical-to-userspace translation array which is allocated by
vmalloc() and is a flat array which might use quite some memory.
This converts it_userspace from vmalloc'ed array to a multi level table.
As the format becomes platform dependend, this replaces the direct access
to it_usespace with a iommu_table_ops::useraddrptr hook which returns
a pointer to the userspace copy of a TCE; future extension will return
NULL if the level was not allocated.
This should not change non-KVM handling of TCE tables and it_userspace
will not be allocated for non-KVM tables.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We are going to reuse multilevel TCE code for the userspace copy of
the TCE table and since it is big endian, let's make the copy big endian
too.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The patch noted in the fixes below converted get_user_pages_fast() to
get_user_pages_longterm(), however the two calls differ in a few ways.
First _fast() is documented to not require the mmap_sem, while _longterm()
is documented to need it. Hold the mmap sem as required.
Second, _fast accepts an 'int write' while _longterm uses 'unsigned int
gup_flags', so the expression '!!(prot & IOMMU_WRITE)' is only working by
luck as FOLL_WRITE is currently == 0x1. Use the expected FOLL_WRITE
constant instead.
Fixes: 94db151dc8 ("vfio: disable filesystem-dax page pinning")
Cc: <stable@vger.kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Allow the code which provides extensions to support direct assignment
of Intel IGD (GVT-d) to be compiled out of the kernel if desired. The
config option for this was previously automatically enabled on X86,
therefore the default remains Y. This simply provides the option to
disable it even for X86.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If a device is part of a PM Domain (e.g. power and/or clock domain), its
power state is managed using Runtime PM. Without Runtime PM, the device
may not be powered up or clocked, causing subtle failures, crashes, or
system lock-ups when the device is accessed by the guest.
Fix this by adding Runtime PM support, powering the device when the VFIO
device is opened by the guest.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Acked-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
match_string() returns the index of an array for a matching string,
which can be used intead of open coded variant.
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There exists a gap at the end of mdev_device_create() where the device
is visible to userspace, but we're not yet ready to handle removal, as
triggered through the 'remove' attribute. We handle this properly in
mdev_device_remove() with an -EAGAIN return, but we can marginally
reduce this gap by adding this attribute as a final step of our sysfs
setup.
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When we create an mdev device, we check for duplicates against the
parent device and return -EEXIST if found, but the mdev device
namespace is global since we'll link all devices from the bus. We do
catch this later in sysfs_do_create_link_sd() to return -EEXIST, but
with it comes a kernel warning and stack trace for trying to create
duplicate sysfs links, which makes it an undesirable response.
Therefore we should really be looking for duplicates across all mdev
parent devices, or as implemented here, against our mdev device list.
Using mdev_list to prevent duplicates means that we can remove
mdev_parent.lock, but in order not to serialize mdev device creation
and removal globally, we add mdev_device.active which allows UUIDs to
be reserved such that we can drop the mdev_list_lock before the mdev
device is fully in place.
Two behavioral notes; first, mdev_parent.lock had the side-effect of
serializing mdev create and remove ops per parent device. This was
an implementation detail, not an intentional guarantee provided to
the mdev vendor drivers. Vendor drivers can trivially provide this
serialization internally if necessary. Second, review comments note
the new -EAGAIN behavior when the device, and in particular the remove
attribute, becomes visible in sysfs. If a remove is triggered prior
to completion of mdev_device_create() the user will see a -EAGAIN
error. While the errno is different, receiving an error during this
period is not, the previous implementation returned -ENODEV for the
same condition. Furthermore, the consistency to the user is improved
in the case where mdev_device_remove_ops() returns error. Previously
concurrent calls to mdev_device_remove() could see the device
disappear with -ENODEV and return in the case of error. Now a user
would see -EAGAIN while the device is in this transitory state.
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Acked-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
- Capitalize the first word of error messages,
- Unwrap statements that fit on a single line,
- Use "VFIO" instead of "vfio" as the error message prefix.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If the IOMMU group setup fails, the reset module is not released.
Fixes: b5add544d6 ("vfio, platform: make reset driver a requirement by default")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Acked-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
MAP_DMA ioctls might be called from various threads within a process,
for example when using QEMU, the vCPU threads are often generating
these calls and we therefore take a reference to that vCPU task.
However, QEMU also supports vCPU hotplug on some machines and the task
that called MAP_DMA may have exited by the time UNMAP_DMA is called,
resulting in the mm_struct pointer being NULL and thus a failure to
match against the existing mapping.
To resolve this, we instead take a reference to the thread
group_leader, which has the same mm_struct and resource limits, but
is less likely exit, at least in the QEMU case. A difficulty here is
guaranteeing that the capabilities of the group_leader match that of
the calling thread, which we resolve by tracking CAP_IPC_LOCK at the
time of calling rather than at an indeterminate time in the future.
Potentially this also results in better efficiency as this is now
recorded once per MAP_DMA ioctl.
Reported-by: Xu Yandong <xuyandong2@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Pull aio updates from Al Viro:
"Majority of AIO stuff this cycle. aio-fsync and aio-poll, mostly.
The only thing I'm holding back for a day or so is Adam's aio ioprio -
his last-minute fixup is trivial (missing stub in !CONFIG_BLOCK case),
but let it sit in -next for decency sake..."
* 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
aio: sanitize the limit checking in io_submit(2)
aio: fold do_io_submit() into callers
aio: shift copyin of iocb into io_submit_one()
aio_read_events_ring(): make a bit more readable
aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way
aio: take list removal to (some) callers of aio_complete()
aio: add missing break for the IOCB_CMD_FDSYNC case
random: convert to ->poll_mask
timerfd: convert to ->poll_mask
eventfd: switch to ->poll_mask
pipe: convert to ->poll_mask
crypto: af_alg: convert to ->poll_mask
net/rxrpc: convert to ->poll_mask
net/iucv: convert to ->poll_mask
net/phonet: convert to ->poll_mask
net/nfc: convert to ->poll_mask
net/caif: convert to ->poll_mask
net/bluetooth: convert to ->poll_mask
net/sctp: convert to ->poll_mask
net/tipc: convert to ->poll_mask
...
Bisection by Amadeusz Sławiński implicates this commit leading to bad
page state issues after VM shutdown, likely due to unbalanced page
references. The original commit was intended only as a performance
improvement, therefore revert for offline rework.
Link: https://lkml.org/lkml/2018/6/2/97
Fixes: 356e88ebe4 ("vfio/type1: Improve memory pinning process for raw PFN mapping")
Cc: Jason Cai (Xiang Feng) <jason.cai@linux.alibaba.com>
Reported-by: Amadeusz Sławiński <amade@asmblr.net>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
These abstract out calls to the poll method in preparation for changes
in how we poll.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
The ioeventfd here is actually irqfd handling of an ioeventfd such as
supported in KVM. A user is able to pre-program a device write to
occur when the eventfd triggers. This is yet another instance of
eventfd-irqfd triggering between KVM and vfio. The impetus for this
is high frequency writes to pages which are virtualized in QEMU.
Enabling this near-direct write path for selected registers within
the virtualized page can improve performance and reduce overhead.
Specifically this is initially targeted at NVIDIA graphics cards where
the driver issues a write to an MMIO register within a virtualized
region in order to allow the MSI interrupt to re-trigger.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The iowriteXX/ioreadXX functions assume little endian hardware and
convert to little endian on a write and from little endian on a read.
We currently do our own explicit conversion to negate this. Instead,
add some endian dependent defines to avoid all byte swaps. There
should be no functional change other than big endian systems aren't
penalized with wasted swaps.
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This creates a common helper that we'll use for ioeventfd setup.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When using vfio to pass through a PCIe device (e.g. a GPU card) that
has a huge BAR (e.g. 16GB), a lot of cycles are wasted on memory
pinning because PFNs of PCI BAR are not backed by struct page, and
the corresponding VMA has flag VM_PFNMAP.
With this change, when pinning a region which is a raw PFN mapping,
it can skip unnecessary user memory pinning process, and thus, can
significantly improve VM's boot up time when passing through devices
via VFIO. In my test on a Xeon E5 2.6GHz, the time mapping a 16GB
BAR was reduced from about 0.4s to 1.5us.
Signed-off-by: Jason Cai (Xiang Feng) <jason.cai@linux.alibaba.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This reverts commit 2170dd0431
The intent of commit 2170dd0431 ("vfio-pci: Mask INTx if a device is
not capabable of enabling it") was to disallow the user from seeing
that the device supports INTx if the platform is incapable of enabling
it. The detection of this case however incorrectly includes devices
which natively do not support INTx, such as SR-IOV VFs, and further
discussions reveal gaps even for the target use case.
Reported-by: Arjun Vynipadath <arjun@chelsio.com>
Fixes: 2170dd0431 ("vfio-pci: Mask INTx if a device is not capabable of enabling it")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIO IOMMU type1 currently upmaps IOVA pages synchronously, which requires
IOTLB flushing for every unmapping. This results in large IOTLB flushing
overhead when handling pass-through devices has a large number of mapped
IOVAs. This can be avoided by using the new IOTLB flushing interface.
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
[aw - use LIST_HEAD]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Filesystem-DAX is incompatible with 'longterm' page pinning. Without
page cache indirection a DAX mapping maps filesystem blocks directly.
This means that the filesystem must not modify a file's block map while
any page in a mapping is pinned. In order to prevent the situation of
userspace holding of filesystem operations indefinitely, disallow
'longterm' Filesystem-DAX mappings.
RDMA has the same conflict and the plan there is to add a 'with lease'
mechanism to allow the kernel to notify userspace that the mapping is
being torn down for block-map maintenance. Perhaps something similar can
be put in place for vfio.
Note that xfs and ext4 still report:
"DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
...at mount time, and resolving the dax-dma-vs-truncate problem is one
of the last hurdles to remove that designation.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: kvm@vger.kernel.org
Cc: <stable@vger.kernel.org>
Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
Tested-by: Haozhong Zhang <haozhong.zhang@intel.com>
Fixes: d475c6346a ("dax,ext2: replace XIP read and write with DAX I/O")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:
for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done
with de-mangling cleanups yet to come.
NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.
The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Mask INTx from user if pdev->irq is zero (Alexey Kardashevskiy)
- Capability helper cleanup (Alex Williamson)
- Allow mmaps overlapping MSI-X vector table with region capability
exposing this feature (Alexey Kardashevskiy)
- mdev static cleanups (Xiongwei Song)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJac2tQAAoJECObm247sIsiUr8P/2zrK0G/mVrHWGEjlnlvgYjg
FSozgN0fmc9mJ+Hg8ntfXn4GXuTaCq0uz96eMl9Wy1tMpLs4hoQf20IRojIpqgpb
aVUHKg+7sJUWNsd+u1jSBb64SvQLdTbesfGgL8WLcJB90jWxzaZjEZ3mgKO4Lb88
rNEpiTVoOURDgJo+bOMEnJJ6okmpRLBgw5pqrdLT2BButZg3QfLtcuoY18pFYxc7
INy4YPWPe93aiDGloUrjj6xKRKfTaL7L8KGZBlk4FR5JENDRoOtGEguGcQRl0u8w
IYLIkIE9p172S5bkeCYqawyxAPgQQIk5Wd4buFArg9w7tdWJOkEiCMwSu/LCocR8
CiwZHKOWp7mJWv3XxJGY+rU3nHIuB+IaeKknE6rLXgkLVXED5Ta4pPzrpoUZezsT
yyI5U2BnWNL5ISaWY3i+YQKvFUgPe8Az9Zw4p3zGUJ+zu2QheN7x+BVXT0xENMfk
2sdNFeZwkxmB18pdsJdb+/DpL9yCrS7VxP3IDnAdfR8VIyLE3QFRTnsfNX7F1Uvr
zKBZChuah2osiiM3k3ncwkovsM6iN1ZLVnHUg7xRBmp6WQnBYN02wrTXM/e1UqsS
nKIBYA9fIcpLZ4hXMuiMmk7LFlScl/HngW+h+jWAyoP/j3X0gw/YM5hrPEJEV3fN
WuIPpBYN36+0V9cUDMnN
=Ww3n
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.16-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Mask INTx from user if pdev->irq is zero (Alexey Kardashevskiy)
- Capability helper cleanup (Alex Williamson)
- Allow mmaps overlapping MSI-X vector table with region capability
exposing this feature (Alexey Kardashevskiy)
- mdev static cleanups (Xiongwei Song)
* tag 'vfio-v4.16-rc1' of git://github.com/awilliam/linux-vfio:
vfio: mdev: make a couple of functions and structure vfio_mdev_driver static
vfio-pci: Allow mapping MSIX BAR
vfio: Simplify capability helper
vfio-pci: Mask INTx if a device is not capabable of enabling it
The functions vfio_mdev_probe, vfio_mdev_remove and the structure
vfio_mdev_driver are only used in this file, so make them static.
Clean up sparse warnings:
drivers/vfio/mdev/vfio_mdev.c:114:5: warning: no previous prototype
for 'vfio_mdev_probe' [-Wmissing-prototypes]
drivers/vfio/mdev/vfio_mdev.c:121:6: warning: no previous prototype
for 'vfio_mdev_remove' [-Wmissing-prototypes]
Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
Reviewed-by: Quan Xu <quan.xu0@gmail.com>
Reviewed-by: Liu, Yi L <yi.l.liu@intel.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
By default VFIO disables mapping of MSIX BAR to the userspace as
the userspace may program it in a way allowing spurious interrupts;
instead the userspace uses the VFIO_DEVICE_SET_IRQS ioctl.
In order to eliminate guessing from the userspace about what is
mmapable, VFIO also advertises a sparse list of regions allowed to mmap.
This works fine as long as the system page size equals to the MSIX
alignment requirement which is 4KB. However with a bigger page size
the existing code prohibits mapping non-MSIX parts of a page with MSIX
structures so these parts have to be emulated via slow reads/writes on
a VFIO device fd. If these emulated bits are accessed often, this has
serious impact on performance.
This allows mmap of the entire BAR containing MSIX vector table.
This removes the sparse capability for PCI devices as it becomes useless.
As the userspace needs to know for sure whether mmapping of the MSIX
vector containing data can succeed, this adds a new capability -
VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - which explicitly tells the userspace
that the entire BAR can be mmapped.
This does not touch the MSIX mangling in the BAR read/write handlers as
we are doing this just to enable direct access to non MSIX registers.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw - fixup whitespace, trim function name]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio_info_add_capability() helper requires the caller to pass a
capability ID, which it then uses to fill in header fields, assuming
hard coded versions. This makes for an awkward and rigid interface.
The only thing we want this helper to do is allocate sufficient
space in the caps buffer and chain this capability into the list.
Reduce it to that simple task.
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
At the moment VFIO rightfully assumes that INTx is supported if
the interrupt pin is not set to zero in the device config space.
However if that is not the case (the pin is not zero but pdev->irq is),
vfio_intx_enable() fails.
In order to prevent the userspace from trying to enable INTx when we know
that it cannot work, let's mask the PCI_INTERRUPT_PIN register.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
__poll_t is also used as wait key in some waitqueues.
Verify that wait_..._poll() gets __poll_t as key and
provide a helper for wakeup functions to get back to
that __poll_t value.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds Broadcom FlexRM low-level reset for
VFIO platform.
It will do the following:
1. Disable/Deactivate each FlexRM ring
2. Flush each FlexRM ring
The cleanup sequence for FlexRM rings is adapted from
Broadcom FlexRM mailbox driver.
Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Oza Oza <oza.oza@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
I get a static checker warning about the potential integer overflow if
we add "unmap->iova + unmap->size". The integer overflow isn't really
harmful, but we may as well fix it. Also unmap->size gets truncated to
size_t when we pass it to vfio_find_dma() so we could check for too high
values of that as well.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Clearing very big IOMMU tables can trigger soft lockups. This adds
cond_resched() to allow the scheduler to do context switching when
it decides to.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
MRRS defines the maximum read request size a device is allowed to
make. Drivers will often increase this to allow more data transfer
with a single request. Completions to this request are bound by the
MPS setting for the bus. Aside from device quirks (none known), it
doesn't seem to make sense to set an MRRS value less than MPS, yet
this is a likely scenario given that user drivers do not have a
system-wide view of the PCI topology. Virtualize MRRS such that the
user can set MRRS >= MPS, but use MPS as the floor value that we'll
write to hardware.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With virtual PCI-Express chipsets, we now see userspace/guest drivers
trying to match the physical MPS setting to a virtual downstream port.
Of course a lone physical device surrounded by virtual interconnects
cannot make a correct decision for a proper MPS setting. Instead,
let's virtualize the MPS control register so that writes through to
hardware are disallowed. Userspace drivers like QEMU assume they can
write anything to the device and we'll filter out anything dangerous.
Since mismatched MPS can lead to AER and other faults, let's add it
to the kernel side rather than relying on userspace virtualization to
handle it.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
amba_id are not supposed to change at runtime. All functions
working with const amba_id. So mark the non-const structs as const.
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When the user unbinds the last device of a group from a vfio bus
driver, the devices within that group should be available for other
purposes. We currently have a race that makes this generally, but
not always true. The device can be unbound from the vfio bus driver,
but remaining IOMMU context of the group attached to the container
can result in errors as the next driver configures DMA for the device.
Wait for the group to be detached from the IOMMU backend before
allowing the bus driver remove callback to complete.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In vfio_iommu_group_get() we want to increase the reference
count of the iommu group.
In noiommu case, the group does not exist and is allocated.
iommu_group_add_device() increases the group ref count. However we
then call iommu_group_put() which decrements it.
This leads to a "refcount_t: underflow WARN_ON".
Only decrement the ref count in case of iommu_group_add_device
failure.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If the IOMMU driver advertises 'real' reserved regions for MSIs, but
still includes the software-managed region as well, we are currently
blind to the former and will configure the IOMMU domain to map MSIs into
the latter, which is unlikely to work as expected.
Since it would take a ridiculous hardware topology for both regions to
be valid (which would be rather difficult to support in general), we
should be safe to assume that the presence of any hardware regions makes
the software region irrelevant. However, the IOMMU driver might still
advertise the software region by default, particularly if the hardware
regions are filled in elsewhere by generic code, so it might not be fair
for VFIO to be super-strict about not mixing them. To that end, make
vfio_iommu_has_sw_msi() robust against the presence of both region types
at once, so that we end up doing what is almost certainly right, rather
than what is almost certainly wrong.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
For ARM-based systems with a GICv3 ITS to provide interrupt isolation,
but hardware limitations which are worked around by having MSIs bypass
SMMU translation (e.g. HiSilicon Hip06/Hip07), VFIO neglects to check
for the IRQ_DOMAIN_FLAG_MSI_REMAP capability, (and thus erroneously
demands unsafe_interrupts) if a software-managed MSI region is absent.
Fix this by always checking for isolation capability at both the IRQ
domain and IOMMU domain levels, rather than predicating that on whether
MSIs require an IOMMU mapping (which was always slightly tenuous logic).
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Root complex integrated endpoints do not have a link and therefore may
use a smaller PCIe capability in config space than we expect when
building our config map. Add a case for these to avoid reporting an
erroneous overlap.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Device lock bites again; if a device .remove() callback races a user
calling ioctl(VFIO_GROUP_GET_DEVICE_FD), the unbind request will hold
the device lock, but the user ioctl may have already taken a vfio_device
reference. In the case of a PCI device, the initial open will attempt
to reset the device, which again attempts to get the device lock,
resulting in deadlock. Use the trylock PCI reset interface and return
error on the open path if reset fails due to lock contention.
Link: https://lkml.org/lkml/2017/7/25/381
Reported-by: Wen Congyang <wencongyang2@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
- Include Intel XXV710 in INTx workaround (Alex Williamson)
- Make use of ERR_CAST() for error return (Dan Carpenter)
- Fix vfio_group release deadlock from iommu notifier (Alex Williamson)
- Unset KVM-VFIO attributes only on group match (Alex Williamson)
- Fix release path group/file matching with KVM-VFIO (Alex Williamson)
- Remove unnecessary lock uses triggering lockdep splat (Alex Williamson)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJZZ71TAAoJECObm247sIsi1mUP/1/5kubhHRE/Y11SnX4Bpie0
X6YVbV3WLGuV9jaFMI7EgYJLZ6pBvCRX0CLUzEdrbS4LAlTLQu2GSY5tqhcLJumE
0mV1Wj1jOO9BAur13/sJ+IohbZeK10dtbTkWv+YhrVSpRaLP+ituaKsajEV12WRH
6dxho2bqfvNcTC8yjE+vqe9mU3jXYPGuCx8oYGaDXEsCjzrdbOFnAG0/s/2cWpb+
D1ADHd3020VAHZRnHRBLFfMczza1jqllhSAUfdMw1gRGCQDq3k1XenzVNLLTbtYy
VmEWHa+R/OCfbKVxaDqPsgOTK7x7DKn+Pzb3lWCdQ8v5X+2ubHpVIYjxTDSSTbt3
YJ7a+hNk8AHkFgwS7x8BdOT8mmNGb1NZldjS4dv2VWkfcTnMQnubnYSCzGztto9h
P2THKBil6djPb9S3pCvtKUiHSIZedYZlKofUldrOdGDAZmzLLlf8lTzijGjDYKFM
pQeZC+xeEhZXURipgkH+a+paYgDtKEfwSlABODghjCcJf7S/GbyVPLOKLXzoVb2y
Ml8eGlo4O/cNniQK5faH447ilM7hzS1aG83uGnHTfe8VgKjI7Z5ZSxKOtoEq5bDz
bb91E6GVLKHqT0LVS1YZfrnqK0hX/QAd/sK1REM5nN8JNmPLyLpjv8FaJEhpk1vC
z4At5+pfKM8DYrW3EGmc
=3A4K
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.13-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Include Intel XXV710 in INTx workaround (Alex Williamson)
- Make use of ERR_CAST() for error return (Dan Carpenter)
- Fix vfio_group release deadlock from iommu notifier (Alex Williamson)
- Unset KVM-VFIO attributes only on group match (Alex Williamson)
- Fix release path group/file matching with KVM-VFIO (Alex Williamson)
- Remove unnecessary lock uses triggering lockdep splat (Alex Williamson)
* tag 'vfio-v4.13-rc1' of git://github.com/awilliam/linux-vfio:
vfio: Remove unnecessary uses of vfio_container.group_lock
vfio: New external user group/file match
kvm-vfio: Decouple only when we match a group
vfio: Fix group release deadlock
vfio: Use ERR_CAST() instead of open coding it
vfio/pci: Add Intel XXV710 to hidden INTx devices
The original intent of vfio_container.group_lock is to protect
vfio_container.group_list, however over time it's become a crutch to
prevent changes in container composition any time we call into the
iommu driver backend. This introduces problems when we start to have
more complex interactions, for example when a user's DMA unmap request
triggers a notification to an mdev vendor driver, who responds by
attempting to unpin mappings within that request, re-entering the
iommu backend. We incorrectly assume that the use of read-locks here
allow for this nested locking behavior, but a poorly timed write-lock
could in fact trigger a deadlock.
The current use of group_lock seems to fall into the trap of locking
code, not data. Correct that by removing uses of group_lock that are
not directly related to group_list. Note that the vfio type1 iommu
backend has its own mutex, vfio_iommu.lock, which it uses to protect
itself for each of these interfaces anyway. The group_lock appears to
be a redundancy for these interfaces and type1 even goes so far as to
release its mutex to allow for exactly the re-entrant code path above.
Reported-by: Chuanxiao Dong <chuanxiao.dong@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: stable@vger.kernel.org # v4.10+
At the point where the kvm-vfio pseudo device wants to release its
vfio group reference, we can't always acquire a new reference to make
that happen. The group can be in a state where we wouldn't allow a
new reference to be added. This new helper function allows a caller
to match a file to a group to facilitate this. Given a file and
group, report if they match. Thus the caller needs to already have a
group reference to match to the file. This allows the deletion of a
group without acquiring a new reference.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Cc: stable@vger.kernel.org
If vfio_iommu_group_notifier() acquires a group reference and that
reference becomes the last reference to the group, then vfio_group_put
introduces a deadlock code path where we're trying to unregister from
the iommu notifier chain from within a callout of that chain. Use a
work_struct to release this reference asynchronously.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Cc: stable@vger.kernel.org
Rename:
wait_queue_t => wait_queue_entry_t
'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.
Start sorting this out by renaming it to 'wait_queue_entry_t'.
This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's a small cleanup to use ERR_CAST() here.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
XXV710 has the same broken INTx behavior as the rest of the X/XL710
series, the interrupt status register is not wired to report pending
INTx interrupts, thus we never associate the interrupt to the device.
Extend the device IDs to include these so that we hide that the
device supports INTx at all to the user.
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Highlights include:
- Larger virtual address space on 64-bit server CPUs. By default we use a 128TB
virtual address space, but a process can request access to the full 512TB by
passing a hint to mmap().
- Support for the new Power9 "XIVE" interrupt controller.
- TLB flushing optimisations for the radix MMU on Power9.
- Support for CAPI cards on Power9, using the "Coherent Accelerator Interface
Architecture 2.0".
- The ability to configure the mmap randomisation limits at build and runtime.
- Several small fixes and cleanups to the kprobes code, as well as support for
KPROBES_ON_FTRACE.
- Major improvements to handling of system reset interrupts, correctly treating
them as NMIs, giving them a dedicated stack and using a new hypervisor call
to trigger them, all of which should aid debugging and robustness.
Many fixes and other minor enhancements.
Thanks to:
Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan,
Aneesh Kumar K.V, Anshuman Khandual, Anton Blanchard, Balbir Singh, Ben
Hutchings, Benjamin Herrenschmidt, Bhupesh Sharma, Chris Packham, Christian
Zigotzky, Christophe Leroy, Christophe Lombard, Daniel Axtens, David Gibson,
Gautham R. Shenoy, Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli,
Hamish Martin, Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan,
Mahesh J Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran,
Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell Currey, Sukadev
Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C. Harding, Tyrel Datwyler,
Uma Krishnan, Vaibhav Jain, Vipin K Parashar, Yang Shi.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZDHUMAAoJEFHr6jzI4aWAT7oQALkE2Nj3gjcn1z0SkFhq/1iO
Py9Elmqm4E+L6NKYtBY5dS8xVAJ088ffzERyqJ1FY1LHkB8tn8bWRcMQmbjAFzTI
V4TAzDNI890BN/F4ptrYRwNFxRBHAvZ4NDunTzagwYnwmTzW9PYHmOi4pvWTo3Tw
KFUQ0joLSEgHzyfXxYB3fyj41u8N0FZvhfazdNSqia2Y5Vwwv/ION5jKplDM+09Y
EtVEXFvaKAS1sjbM/d/Jo5rblHfR0D9/lYV10+jjyIokjzslIpyTbnj3izeYoM5V
I4h99372zfsEjBGPPXyM3khL3zizGMSDYRmJHQSaKxjtecS9SPywPTZ8ufO/aSzV
Ngq6nlND+f1zep29VQ0cxd3Jh40skWOXzxJaFjfDT25xa6FbfsWP2NCtk8PGylZ7
EyqTuCWkMgIP02KlX3oHvEB2LRRPCDmRU2zECecRGNJrIQwYC2xjoiVi7Q8Qe8rY
gr7Ib5Jj/a+uiTcCIy37+5nXq2s14/JBOKqxuYZIxeuZFvKYuRUipbKWO05WDOAz
m/pSzeC3J8AAoYiqR0gcSOuJTOnJpGhs7zrQFqnEISbXIwLW+ICumzOmTAiBqOEY
Rt8uW2gYkPwKLrE05445RfVUoERaAjaE06eRMOWS6slnngHmmnRJbf3PcoALiJkT
ediqGEj0/N1HMB31V5tS
=vSF3
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights include:
- Larger virtual address space on 64-bit server CPUs. By default we
use a 128TB virtual address space, but a process can request access
to the full 512TB by passing a hint to mmap().
- Support for the new Power9 "XIVE" interrupt controller.
- TLB flushing optimisations for the radix MMU on Power9.
- Support for CAPI cards on Power9, using the "Coherent Accelerator
Interface Architecture 2.0".
- The ability to configure the mmap randomisation limits at build and
runtime.
- Several small fixes and cleanups to the kprobes code, as well as
support for KPROBES_ON_FTRACE.
- Major improvements to handling of system reset interrupts,
correctly treating them as NMIs, giving them a dedicated stack and
using a new hypervisor call to trigger them, all of which should
aid debugging and robustness.
- Many fixes and other minor enhancements.
Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple,
Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton
Blanchard, Balbir Singh, Ben Hutchings, Benjamin Herrenschmidt,
Bhupesh Sharma, Chris Packham, Christian Zigotzky, Christophe Leroy,
Christophe Lombard, Daniel Axtens, David Gibson, Gautham R. Shenoy,
Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli, Hamish Martin,
Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mahesh J
Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver
O'Halloran, Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell
Currey, Sukadev Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C.
Harding, Tyrel Datwyler, Uma Krishnan, Vaibhav Jain, Vipin K Parashar,
Yang Shi"
* tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits)
powerpc/64s: Power9 has no LPCR[VRMASD] field so don't set it
powerpc/powernv: Fix TCE kill on NVLink2
powerpc/mm/radix: Drop support for CPUs without lockless tlbie
powerpc/book3s/mce: Move add_taint() later in virtual mode
powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body
powerpc/smp: Document irq enable/disable after migrating IRQs
powerpc/mpc52xx: Don't select user-visible RTAS_PROC
powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset()
powerpc/eeh: Clean up and document event handling functions
powerpc/eeh: Avoid use after free in eeh_handle_special_event()
cxl: Mask slice error interrupts after first occurrence
cxl: Route eeh events to all drivers in cxl_pci_error_detected()
cxl: Force context lock during EEH flow
powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TEST
powerpc/xmon: Teach xmon oops about radix vectors
powerpc/mm/hash: Fix off-by-one in comment about kernel contexts ids
powerpc/pseries: Enable VFIO
powerpc/powernv: Fix iommu table size calculation hook for small tables
powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc
powerpc: Add arch/powerpc/tools directory
...
vfio_pin_pages_remote() is typically called to iterate over a range
of memory. Testing CAP_IPC_LOCK is relatively expensive, so it makes
sense to push it up to the caller, which can then repeatedly call
vfio_pin_pages_remote() using that value. This can show nearly a 20%
improvement on the worst case path through VFIO_IOMMU_MAP_DMA with
contiguous page mapping disabled. Testing RLIMIT_MEMLOCK is much more
lightweight, but we bring it along on the same principle and it does
seem to show a marginal improvement.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With vfio_lock_acct() testing the locked memory limit under mmap_sem,
it's redundant to do it here for a single page. We can also reorder
our tests such that we can avoid testing for reserved pages if we're
not doing accounting and let vfio_lock_acct() test the process
CAP_IPC_LOCK. Finally, this function oddly returns 1 on success.
Update to return zero on success, -errno on error. Since the function
only pins a single page, there's no need to return the number of pages
pinned.
N.B. vfio_pin_pages_remote() can pin a large contiguous range of pages
before calling vfio_lock_acct(). If we were to similarly remove the
extra test there, a user could temporarily pin far more pages than
they're allowed.
Suggested-by: Kirti Wankhede <kwankhede@nvidia.com>
Suggested-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If the mmap_sem is contented then the vfio type1 IOMMU backend will
defer locked page accounting updates to a workqueue task. This has a
few problems and depending on which side the user tries to play, they
might be over-penalized for unmaps that haven't yet been accounted or
race the workqueue to enter more mappings than they're allowed. The
original intent of this workqueue mechanism seems to be focused on
reducing latency through the ioctl, but we cannot do so at the cost
of correctness. Remove this workqueue mechanism and update the
callers to allow for failure. We can also now recheck the limit under
write lock to make sure we don't exceed it.
vfio_pin_pages_remote() also now necessarily includes an unwind path
which we can jump to directly if the consecutive page pinning finds
that we're exceeding the user's memory limits. This avoids the
current lazy approach which does accounting and mapping up to the
fault, only to return an error on the next iteration to unwind the
entire vfio_dma.
Cc: stable@vger.kernel.org
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This adds missing checking for kzalloc() return value.
Fixes: 4b6fad7097 ("powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The existing SPAPR TCE driver advertises both VFIO_SPAPR_TCE_IOMMU and
VFIO_SPAPR_TCE_v2_IOMMU types to the userspace and the userspace usually
picks the v2.
Normally the userspace would create a container, attach an IOMMU group
to it and only then set the IOMMU type (which would normally be v2).
However a specific IOMMU group may not support v2, in other words
it may not implement set_window/unset_window/take_ownership/
release_ownership and such a group should not be attached to
a v2 container.
This adds extra checks that a new group can do what the selected IOMMU
type suggests. The userspace can then test the return value from
ioctl(VFIO_SET_IOMMU, VFIO_SPAPR_TCE_v2_IOMMU) and try
VFIO_SPAPR_TCE_IOMMU.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
So far iommu_table obejcts were only used in virtual mode and had
a single owner. We are going to change this by implementing in-kernel
acceleration of DMA mapping requests. The proposed acceleration
will handle requests in real mode and KVM will keep references to tables.
This adds a kref to iommu_table and defines new helpers to update it.
This replaces iommu_free_table() with iommu_tce_table_put() and makes
iommu_free_table() static. iommu_tce_table_get() is not used in this patch
but it will be in the following patch.
Since this touches prototypes, this also removes @node_name parameter as
it has never been really useful on powernv and carrying it for
the pseries platform code to iommu_free_table() seems to be quite
useless as well.
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the moment iommu_table can be disposed by either calling
iommu_table_free() directly or it_ops::free(); the only implementation
of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
iommu_table_free() anyway.
As we are going to have reference counting on tables, we need an unified
way of disposing tables.
This moves it_ops::free() call into iommu_free_table() and makes use
of the latter. The free() callback now handles only platform-specific
data.
As from now on the iommu_free_table() calls it_ops->free(), we need
to have it_ops initialized before calling iommu_free_table() so this
moves this initialization in pnv_pci_ioda2_create_table().
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The introduction of reserved regions has left a couple of rough edges
which we could do with sorting out sooner rather than later. Since we
are not yet addressing the potential dynamic aspect of software-managed
reservations and presenting them at arbitrary fixed addresses, it is
incongruous that we end up displaying hardware vs. software-managed MSI
regions to userspace differently, especially since ARM-based systems may
actually require one or the other, or even potentially both at once,
(which iommu-dma currently has no hope of dealing with at all). Let's
resolve the former user-visible inconsistency ASAP before the ABI has
been baked into a kernel release, in a way that also lays the groundwork
for the latter shortcoming to be addressed by follow-up patches.
For clarity, rename the software-managed type to IOMMU_RESV_SW_MSI, use
IOMMU_RESV_MSI to describe the hardware type, and document everything a
little bit. Since the x86 MSI remapping hardware falls squarely under
this meaning of IOMMU_RESV_MSI, apply that type to their regions as well,
so that we tell the same story to userspace across all platforms.
Secondly, as the various region types require quite different handling,
and it really makes little sense to ever try combining them, convert the
bitfield-esque #defines to a plain enum in the process before anyone
gets the wrong impression.
Fixes: d30ddcaa7b ("iommu: Add a new type field in iommu_resv_region")
Reviewed-by: Eric Auger <eric.auger@redhat.com>
CC: Alex Williamson <alex.williamson@redhat.com>
CC: David Woodhouse <dwmw2@infradead.org>
CC: kvm@vger.kernel.org
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The intent of the original warning is make sure that the mdev vendor
driver has removed any group notifiers at the point where the group
is closed by the user. Theoretically this would be through an
orderly shutdown where any devices are release prior to the group
release. We can't always count on an orderly shutdown, the user can
close the group before the notifier can be removed or the user task
might be killed. We'd like to add this sanity test when the group is
idle and the only references are from the devices within the group
themselves, but we don't have a good way to do that. Instead check
both when the group itself is removed and when the group is opened.
A bit later than we'd prefer, but better than the current over
aggressive approach.
Fixes: ccd46dbae7 ("vfio: support notifier chain in vfio_group")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: <stable@vger.kernel.org> # v4.10
Cc: Jike Song <jike.song@intel.com>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
The APIs that are going to be moved first are:
mm_alloc()
__mmdrop()
mmdrop()
mmdrop_async_fn()
mmdrop_async()
mmget_not_zero()
mmput()
mmput_async()
get_task_mm()
mm_access()
mm_release()
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Kconfig fixes for SPAPR_TCE_IOMMU=n (Michael Ellerman)
- Module softdep rather than request_module to simplify usage from
initrd (Alex Williamson)
- Comment typo fix (Changbin Du)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJYry1kAAoJECObm247sIsieQ8P/0iydvBdW3AACPADQIvfV3+K
PfAj3RwO1AvKB2Uiiql8t+S+qr5WO2Z4oeyOsaapmaOq8fxN76nTtIPUPg9SzkUx
DDFNT+Zl/wFGA+vDfPcDFaFAf35mNTLN2hmqcrkTwLL7flA0ICYZzpT0U3nP6aEt
go10DkQXQ1aGJkTfBluIqe0FTBd2EZ3+/mJbjiaHRlV5CcXGEtAAe8IC+XJoJJds
AZ1RQiLauE6Rj4t2JQegZJH+ntX7zPk329jD0DjxZq9y08uEqdyeXWj31O8SRucB
hD4QGN6NS8Qv+TlpdMVnsv6E3oZhUZIObH2p0nhUJ8ZpdEkTZoR/73rJKipz6svX
9TuuiIJL2ibZ40rEpABIx1N0ykHiUmXgtCuVfjiKSXmsdZYSu7bK/CbOVBGvS+WE
dKbgob44QozgD34wt5mv9by1jnxxeDodbIfkXE3p4b2TKaaFFnpCG2Yex6XOjjCU
T4peLwxTyszqVDdghJ5l8S2661+UqS+VZQXSXMQRg4fLLx+HjB0JPo/NcfjNFhUE
/RwXvOoVRRn0EKTX5v5Ty5YVx+/KWGm3wAsa1Ar+YGATKn3tS91Vz4cKYZA/bdUc
GKkGI5FrARjtWF860cBtq/KWk2qYy7EzCdVfkk56e5s4clrZwjbet6W/RmZGSvVn
R+XAN7GucRg/aMJyrwJM
=6Gze
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.11-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Kconfig fixes for SPAPR_TCE_IOMMU=n (Michael Ellerman)
- Module softdep rather than request_module to simplify usage from
initrd (Alex Williamson)
- Comment typo fix (Changbin Du)
* tag 'vfio-v4.11-rc1' of git://github.com/awilliam/linux-vfio:
vfio: fix a typo in comment of function vfio_pin_pages
vfio: Replace module request with softdep
vfio/mdev: Use a module softdep for vfio_mdev
vfio: Fix build break when SPAPR_TCE_IOMMU=n
Correct the description that 'unpinned' -> 'pinned'.
Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The changes include:
* KVM PCIe/MSI passthrough support on ARM/ARM64
* Introduction of a core representation for individual hardware
iommus
* Support for IOMMU privileged mappings as supported by some
ARM IOMMUS
* 16-bit SID support for ARM-SMMUv2
* Stream table optimization for ARM-SMMUv3
* Various fixes and other small improvements
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJYqw3hAAoJECvwRC2XARrjPy0P/35ykfHIAESJuF+72ziaoAYA
ZvMrli8rGq7n+ntaIGPx9rV+hZTUSF8V2bfsYV7SAn5iYuViXZqvOtC3BAEp6GNC
cdMeQfqXoHiWVMdXdOihzk+6YCQvBxqPOvUtYFqVhOo3Yrz8Dc71KsKvrTndEUVY
f7bXHKssVONkWMga9lIVDgEefG5VyJPEQaxJXB9ymLHXbwWOcISe1lgtkrzFSxSH
H9YNI07Tfcxfn6rN8jGmcYFYM58xwBicpB4HBw5uytMBYAsxqTEsx4X5dGpOF6RH
cFW9nby+9ZlcTMyuWXKAck3o8df2ZC1xiSjnz+DHQdBPFiFNqIL3PVUcaz9PnF2e
e6Y+DA3s+jykeiCvi2K0Z9RwTg7t8S5spel+UCeNVSnIjE9pqZNLF8vsDjF17zuR
+zcFm7RVI397QVQGp0dbqhtxnwqt/3CX/wlzpvuNdEZa4vwujpcnM9tfl6gyFrF8
awK9Fj5ryAn4DEiM+8yiRHwLrU5ij1cfc8jQdqleEB2ca7Wv3g1uhhS0QTXOFY9u
A7ygOna25U1EcOwjC6ebjiEL115ZEOrXo+eChhzCHoUEHCVxL+L/NAMEsUcMqPIw
3XsHhru0HbXgd5O5wHX39s2je8G3+ElqQwy8Ja3DimV6tvon7yaKCXy9QU+2aa1u
3r53R/0mW1ijtOfK+I0b
=5b3I
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU UPDATES from Joerg Roedel:
- KVM PCIe/MSI passthrough support on ARM/ARM64
- introduction of a core representation for individual hardware iommus
- support for IOMMU privileged mappings as supported by some ARM IOMMUS
- 16-bit SID support for ARM-SMMUv2
- stream table optimization for ARM-SMMUv3
- various fixes and other small improvements
* tag 'iommu-updates-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (61 commits)
vfio/type1: Fix error return code in vfio_iommu_type1_attach_group()
iommu: Remove iommu_register_instance interface
iommu/exynos: Make use of iommu_device_register interface
iommu/mediatek: Make use of iommu_device_register interface
iommu/msm: Make use of iommu_device_register interface
iommu/arm-smmu: Make use of the iommu_register interface
iommu: Add iommu_device_set_fwnode() interface
iommu: Make iommu_device_link/unlink take a struct iommu_device
iommu: Add sysfs bindings for struct iommu_device
iommu: Introduce new 'struct iommu_device'
iommu: Rename struct iommu_device
iommu: Rename iommu_get_instance()
iommu: Fix static checker warning in iommu_insert_device_resv_regions
iommu: Avoid unnecessary assignment of dev->iommu_fwspec
iommu/mediatek: Remove bogus 'select' statements
iommu/dma: Remove bogus dma_supported() implementation
iommu/ipmmu-vmsa: Restrict IOMMU Domain Geometry to 32-bit address space
iommu/vt-d: Don't over-free page table directories
iommu/vt-d: Tylersburg isoch identity map check is done too late.
iommu/vt-d: Fix some macros that are incorrectly specified in intel-iommu
...
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.
Fixes: 5d70499218 ("vfio/type1: Allow transparent MSI IOVA allocation")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Rather than doing a module request from within the init function, add
a soft dependency on the available IOMMU backend drivers. This makes
the dependency visible to userspace when picking modules for the
ram disk.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Use an explicit module softdep rather than a request module call such
that the dependency is exposed to userspace. This allows us to more
easily support modules loaded at initrd time.
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently the kconfig logic for VFIO_IOMMU_SPAPR_TCE and VFIO_SPAPR_EEH
is broken when SPAPR_TCE_IOMMU=n. Leading to:
warning: (VFIO) selects VFIO_IOMMU_SPAPR_TCE which has unmet direct dependencies (VFIO && SPAPR_TCE_IOMMU)
warning: (VFIO) selects VFIO_IOMMU_SPAPR_TCE which has unmet direct dependencies (VFIO && SPAPR_TCE_IOMMU)
drivers/vfio/vfio_iommu_spapr_tce.c:113:8: error: implicit declaration of function 'mm_iommu_find'
This stems from the fact that VFIO selects VFIO_IOMMU_SPAPR_TCE, and
although it has an if clause, the condition is not correct.
We could fix it by doing select VFIO_IOMMU_SPAPR_TCE if SPAPR_TCE_IOMMU,
but the cleaner fix is to drop the selects and tie VFIO_IOMMU_SPAPR_TCE
to the value of VFIO, and express the dependencies in only once place.
Do the same for VFIO_SPAPR_EEH.
The end result is that the values of VFIO_IOMMU_SPAPR_TCE and
VFIO_SPAPR_EEH follow the value of VFIO, except when SPAPR_TCE_IOMMU=n
and/or EEH=n. Which is exactly what we want to happen.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If a container already has a group attached, attaching a new group
should just program already created IOMMU tables to the hardware via
the iommu_table_group_ops::set_window() callback.
However commit 6f01cc692a ("vfio/spapr: Add a helper to create
default DMA window") did not just simplify the code but also removed
the set_window() calls in the case of attaching groups to a container
which already has tables so it broke VFIO PCI hotplug.
This reverts set_window() bits in tce_iommu_take_ownership_ddw().
Fixes: 6f01cc692a ("vfio/spapr: Add a helper to create default DMA window")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Commit d9c728949d ("vfio/spapr: Postpone default window creation")
added an additional exit to the VFIO_IOMMU_SPAPR_TCE_CREATE case and
made it possible to return from tce_iommu_ioctl() without unlocking
container->lock; this fixes the issue.
Fixes: d9c728949d ("vfio/spapr: Postpone default window creation")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In case the IOMMU translates MSI transactions (typical case
on ARM), we check MSI remapping capability at IRQ domain
level. Otherwise it is checked at IOMMU level.
At this stage the arm-smmu-(v3) still advertise the
IOMMU_CAP_INTR_REMAP capability at IOMMU level. This will be
removed in subsequent patches.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
When attaching a group to the container, check the group's
reserved regions and test whether the IOMMU translates MSI
transactions. If yes, we initialize an IOVA allocator through
the iommu_get_msi_cookie API. This will allow the MSI IOVAs
to be transparently allocated on MSI controller's compose().
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Using has_capability() rather than ns_capable(), we're no longer using
this header.
Cc: Jike Song <jike.song@intel.com>
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Before the mdev enhancement type1 iommu used capable() to test the
capability of current task; in the course of mdev development a
new requirement, testing for another task other than current, was
raised. ns_capable() was used for this purpose, however it still
tests current, the only difference is, in a specified namespace.
Fix it by using has_capability() instead, which tests the cap for
specified task in init_user_ns, the same namespace as capable().
Cc: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Jike Song <jike.song@intel.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Here, pci_iomap can fail, handle this case release selected
pci regions and return -ENOMEM.
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Using ancient compilers (gcc-4.5 or older) on ARM, we get a link
failure with the vfio-pci driver:
ERROR: "__aeabi_lcmp" [drivers/vfio/pci/vfio-pci.ko] undefined!
The reason is that the compiler tries to do a comparison of
a 64-bit range. This changes it to convert to a 32-bit number
explicitly first, as newer compilers do for themselves.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Abstract access to mdev_device so that we can define which interfaces
are public rather than relying on comments in the structure.
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Rather than hoping for good behavior by marking some elements
internal, enforce it by making the entire structure private and
creating an accessor function for the one useful external field.
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Add an mdev_ prefix so we're not poluting the namespace so much.
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Using the mtty mdev sample driver we can generate a remove race by
starting one shell that continuously creates mtty devices and several
other shells all attempting to remove devices, in my case four remove
shells. The fault occurs in mdev_remove_sysfs_files() where the
passed type arg is NULL, which suggests we've received a struct device
in mdev_device_remove() but it's in some sort of teardown state. The
solution here is to make use of the accidentally unused list_head on
the mdev_device such that the mdev core keeps a list of all the mdev
devices. This allows us to validate that we have a valid mdev before
we start removal, remove it from the list to prevent others from
working on it, and if the vendor driver refuses to remove, we can
re-add it to the list.
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As part of the mdev support, type1 now gets a task reference per
vfio_dma and uses that to get an mm reference for the task while
working on accounting. That's correct, but it's not fast. For some
paths, like vfio_pin_pages_remote(), we know we're only called from
user context, so we can restore the lighter weight calls. In other
cases, we're effectively already testing whether we're in the stored
task context elsewhere, extend this vfio_lock_acct() as well.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Highlights include:
- Support for the kexec_file_load() syscall, which is a prereq for secure and
trusted boot.
- Prevent kernel execution of userspace on P9 Radix (similar to SMEP/PXN).
- Sort the exception tables at build time, to save time at boot, and store
them as relative offsets to save space in the kernel image & memory.
- Allow building the kernel with thin archives, which should allow us to build
an allyesconfig once some other fixes land.
- Build fixes to allow us to correctly rebuild when changing the kernel endian
from big to little or vice versa.
- Plumbing so that we can avoid doing a full mm TLB flush on P9 Radix.
- Initial stack protector support (-fstack-protector).
- Support for dumping the radix (aka. Linux) and hash page tables via debugfs.
- Fix an oops in cxl coredump generation when cxl_get_fd() is used.
- Freescale updates from Scott: "Highlights include 8xx hugepage support,
qbman fixes/cleanup, device tree updates, and some misc cleanup."
- Many and varied fixes and minor enhancements as always.
Thanks to:
Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual,
Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz, Christophe Jaillet,
Christophe Leroy, Denis Kirjanov, Elimar Riesebieter, Frederic Barrat,
Gautham R. Shenoy, Geliang Tang, Geoff Levand, Jack Miller, Johan Hovold,
Lars-Peter Clausen, Libin, Madhavan Srinivasan, Michael Neuling, Nathan
Fontenot, Naveen N. Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin,
Rashmica Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJYU4YSAAoJEFHr6jzI4aWAC4gQALtIAqqPon0Cd5b/FVVcMbW7
mMqB2b/0FGEl5GoRTzGUDaQqElilm6AEVfHO86C7DFji/a6olneFfw87iz+mtWuZ
JvrNq68ZiSnoeszdUy4MgtXFLb5sTzNMev4skaHfjI9E5CepWBoR0zH4G+kNVnd5
WSgudv8Cq4Px+MEuTOigt3QYjHzZ3cw/XNOOm9c+oGj+PDW4O9UItVI+S1WLoey4
rAB2nRcLMDPuwfRQC9XsF3zEbkv4h1dEXo/EBRuRpcF+0lLTzFw1lv1WE8OxlUmS
kAXbty3dIytBfSbtJT0c0Ps6sfQ4HFhu6ZV2fjnxNTz2KDkBIN7LBYHmBYiqY9oZ
9zvbUWtfiTu5ocfRtTq7rC/Hcj4Kbr9S9F/FvXR0WyDsKgu4xxAovqC3gcn6YjYK
Rr1tcCI4nUzyhVJVmd+OEhUvc5JbFy9aGage+YeOyejfvvSbXIunaxWlPjoDkvim
Vjl+UKU8gw51XFssqY5ZBi/HNlMFKYedLpMFp/fItnLglhj50V0eFWkpDgdSCYom
vo9ifPLZx8n8m8De3H7TV4E0F4gCHcTeqZdu7tW9AAUVM6iLJcDLm3asGmtNh21t
snOHNOJ5QSIno6ezUUg29T6VBjbPh46fdJJSlIZrEe8OzLZ1haGyttf0tD00PQvY
Z2W/m3gxafnOeGgBqvyv
=xOzf
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights include:
- Support for the kexec_file_load() syscall, which is a prereq for
secure and trusted boot.
- Prevent kernel execution of userspace on P9 Radix (similar to
SMEP/PXN).
- Sort the exception tables at build time, to save time at boot, and
store them as relative offsets to save space in the kernel image &
memory.
- Allow building the kernel with thin archives, which should allow us
to build an allyesconfig once some other fixes land.
- Build fixes to allow us to correctly rebuild when changing the
kernel endian from big to little or vice versa.
- Plumbing so that we can avoid doing a full mm TLB flush on P9
Radix.
- Initial stack protector support (-fstack-protector).
- Support for dumping the radix (aka. Linux) and hash page tables via
debugfs.
- Fix an oops in cxl coredump generation when cxl_get_fd() is used.
- Freescale updates from Scott: "Highlights include 8xx hugepage
support, qbman fixes/cleanup, device tree updates, and some misc
cleanup."
- Many and varied fixes and minor enhancements as always.
Thanks to:
Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman
Khandual, Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz,
Christophe Jaillet, Christophe Leroy, Denis Kirjanov, Elimar
Riesebieter, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff
Levand, Jack Miller, Johan Hovold, Lars-Peter Clausen, Libin,
Madhavan Srinivasan, Michael Neuling, Nathan Fontenot, Naveen N.
Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin, Rashmica
Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain"
[ And thanks to Michael, who took time off from a new baby to get this
pull request done. - Linus ]
* tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (174 commits)
powerpc/fsl/dts: add FMan node for t1042d4rdb
powerpc/fsl/dts: add sg_2500_aqr105_phy4 alias on t1024rdb
powerpc/fsl/dts: add QMan and BMan nodes on t1024
powerpc/fsl/dts: add QMan and BMan nodes on t1023
soc/fsl/qman: test: use DEFINE_SPINLOCK()
powerpc/fsl-lbc: use DEFINE_SPINLOCK()
powerpc/8xx: Implement support of hugepages
powerpc: get hugetlbpage handling more generic
powerpc: port 64 bits pgtable_cache to 32 bits
powerpc/boot: Request no dynamic linker for boot wrapper
soc/fsl/bman: Use resource_size instead of computation
soc/fsl/qe: use builtin_platform_driver
powerpc/fsl_pmc: use builtin_platform_driver
powerpc/83xx/suspend: use builtin_platform_driver
powerpc/ftrace: Fix the comments for ftrace_modify_code
powerpc/perf: macros for power9 format encoding
powerpc/perf: power9 raw event format encoding
powerpc/perf: update attribute_group data structure
powerpc/perf: factor out the event format field
powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJYUt1vAAoJEFmIoMA60/r8abgP/3R+5Lsk5/kfAHk5/2Mtqbvg
mZ0eDUpY9GbUeMjSq84Nr2H8u7d+1AJCCu8KtDJYZCmjZpnSp2SuE2PS5JoGC7zC
fintD24jlIF4/J5+HeVXXmbfr3xATxvpTuiSLEi8sLBRJ3KRIswhMSwoPwOyeTQw
v/EclWKPGYcI5Zp0oigY9/Jd3q3lQ17KXppi/0dDoLh7PNOFvEHItXWzmf++u/NP
iYT9R1xmzEsy0/HRd6hiwPT2xA8YsAXxgobhHooUgh1FWmZ02Tg1WjgDemOW4lVh
kNIUcsLczh7wZCceogrrJ+pwb9+NyyIyKuHPv6OG3ieyz1IZdznaj1fAE5HJYiPo
eVS7cP1S6DyV3Y5qFj5F2dSRS7T4GXdXG5mNhmeCpUHs0vfzSCG36jLmhTy8UIxs
1rCf5oFa+uU9q0okfH8VtcGOXqWjGgyxTSGGfF71HUMLnPbsci2fxC2cO6svzIX7
wDY0uxOzpyMIYMuQR6iz7VqvAwEaZ+7pfMIrWWdDcQ9/5tCNJ49cLuKaThPL4bVu
juiGBQtnTLg8tjrhjDL9tQiJpuVIweVXyyQ1fvZoVXkMLlhVCF2ttirvwFUit2PB
84OlevQZ+9QdE/qalrWbv4qzhesuiwu0avkzjGoqg6tWTF0epu2AHI2vqy6UBYEG
tcfJPEcz1019PKZNSvWy
=ut0k
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
"PCI changes:
- add support for PCI on ARM64 boxes with ACPI. We already had this
for theoretical spec-compliant hardware; now we're adding quirks
for the actual hardware (Cavium, HiSilicon, Qualcomm, X-Gene)
- add runtime PM support for hotplug ports
- enable runtime suspend for Intel UHCI that uses platform-specific
wakeup signaling
- add yet another host bridge registration interface. We hope this is
extensible enough to subsume the others
- expose device revision in sysfs for DRM
- to avoid device conflicts, make sure any VF BAR updates are done
before enabling the VF
- avoid unnecessary link retrains for ASPM
- allow INTx masking on Mellanox devices that support it
- allow access to non-standard VPD for Chelsio devices
- update Broadcom iProc support for PAXB v2, PAXC v2, inbound DMA,
etc
- update Rockchip support for max-link-speed
- add NVIDIA Tegra210 support
- add Layerscape LS1046a support
- update R-Car compatibility strings
- add Qualcomm MSM8996 support
- remove some uninformative bootup messages"
* tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (115 commits)
PCI: Enable access to non-standard VPD for Chelsio devices (cxgb3)
PCI: Expand "VPD access disabled" quirk message
PCI: pciehp: Remove loading message
PCI: hotplug: Remove hotplug core message
PCI: Remove service driver load/unload messages
PCI/AER: Log AER IRQ when claiming Root Port
PCI/AER: Log errors with PCI device, not PCIe service device
PCI/AER: Remove unused version macros
PCI/PME: Log PME IRQ when claiming Root Port
PCI/PME: Drop unused support for PMEs from Root Complex Event Collectors
PCI: Move config space size macros to pci_regs.h
x86/platform/intel-mid: Constify mid_pci_platform_pm
PCI/ASPM: Don't retrain link if ASPM not possible
PCI: iproc: Skip check for legacy IRQ on PAXC buses
PCI: pciehp: Leave power indicator on when enabling already-enabled slot
PCI: pciehp: Prioritize data-link event over presence detect
PCI: rcar: Add gen3 fallback compatibility string for pcie-rcar
PCI: rcar: Use gen2 fallback compatibility last
PCI: rcar-gen2: Use gen2 fallback compatibility last
PCI: rockchip: Move the deassert of pm/aclk/pclk after phy_init()
..
Patch series "mm: unexport __get_user_pages_unlocked()".
This patch series continues the cleanup of get_user_pages*() functions
taking advantage of the fact we can now pass gup_flags as we please.
It firstly adds an additional 'locked' parameter to
get_user_pages_remote() to allow for its callers to utilise
VM_FAULT_RETRY functionality. This is necessary as the invocation of
__get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
this and no other existing higher level function would allow it to do
so.
Secondly existing callers of __get_user_pages_unlocked() are replaced
with the appropriate higher-level replacement -
get_user_pages_unlocked() if the current task and memory descriptor are
referenced, or get_user_pages_remote() if other task/memory descriptors
are referenced (having acquiring mmap_sem.)
This patch (of 2):
Add a int *locked parameter to get_user_pages_remote() to allow
VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().
Taking into account the previous adjustments to get_user_pages*()
functions allowing for the passing of gup_flags, we are now in a
position where __get_user_pages_unlocked() need only be exported for his
ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to
subsequently unexport __get_user_pages_unlocked() as well as allowing
for future flexibility in the use of get_user_pages_remote().
[sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move PCI configuration space size macros (PCI_CFG_SPACE_SIZE and
PCI_CFG_SPACE_EXP_SIZE) from drivers/pci/pci.h to
include/uapi/linux/pci_regs.h so they can be used by more drivers and
eliminate duplicate definitions.
[bhelgaas: Expand comment to include PCI-X details]
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Passing zero for the size to vfio_find_dma() isn't compatible with
matching the start address of an existing vfio_dma. Doing so triggers a
corner case. In vfio_find_dma(), when the start address is equal to
dma->iova and size is 0, check for the end of search range makes it to
take wrong side of RB-tree. That fails the search even though the address
is present in mapped dma ranges.
In functions pin_pages and unpin_pages, the iova which is being searched
is base address of page to be pinned or unpinned. So here size should be
set to PAGE_SIZE, as argument to vfio_find_dma().
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Passing zero for the size to vfio_find_dma() isn't compatible with
matching the start address of an existing vfio_dma. Doing so triggers a
corner case. In vfio_find_dma(), when the start address is equal to
dma->iova and size is 0, check for the end of search range makes it to
take wrong side of RB-tree. That fails the search even though the address
is present in mapped dma ranges. Due to this, in vfio_dma_do_unmap(),
while checking boundary conditions, size should be set to 1 for verifying
start address of unmap range.
vfio_find_dma() is also used to verify last address in unmap range with
size = 0, but in that case address to be searched is calculated with
start + size - 1 and so it works correctly.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
[aw: changelog tweak]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
mdev vendor driver should unregister the iommu notifier since the vfio
iommu can persist beyond the attachment of the mdev group. WARN_ON will
show warning if vendor driver doesn't unregister the notifier and is
forced to follow the implementations steps.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
At the moment the userspace tool is expected to request pinning of
the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
When the userspace process finishes, all the pinned pages need to
be put; this is done as a part of the userspace memory context (MM)
destruction which happens on the very last mmdrop().
This approach has a problem that a MM of the userspace process
may live longer than the userspace process itself as kernel threads
use userspace process MMs which was runnning on a CPU where
the kernel thread was scheduled to. If this happened, the MM remains
referenced until this exact kernel thread wakes up again
and releases the very last reference to the MM, on an idle system this
can take even hours.
This moves preregistered regions tracking from MM to VFIO; insteads of
using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
added so each container releases regions which it has pre-registered.
This changes the userspace interface to return EBUSY if a memory
region is already registered in a container. However it should not
have any practical effect as the only userspace tool available now
does register memory region once per container anyway.
As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In some situations the userspace memory context may live longer than
the userspace process itself so if we need to do proper memory context
cleanup, we better have tce_container take a reference to mm_struct and
use it later when the process is gone (@current or @current->mm is NULL).
This references mm and stores the pointer in the container; this is done
in a new helper - tce_iommu_mm_set() - when one of the following happens:
- a container is enabled (IOMMU v1);
- a first attempt to pre-register memory is made (IOMMU v2);
- a DMA window is created (IOMMU v2).
The @mm stays referenced till the container is destroyed.
This replaces current->mm with container->mm everywhere except debug
prints.
This adds a check that current->mm is the same as the one stored in
the container to prevent userspace from making changes to a memory
context of other processes.
DMA map/unmap ioctls() do not check for @mm as they already check
for @enabled which is set after tce_iommu_mm_set() is called.
This does not reference a task as multiple threads within the same mm
are allowed to ioctl() to vfio and supposedly they will have same limits
and capabilities and if they do not, we'll just fail with no harm made.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We are going to allow the userspace to configure container in
one memory context and pass container fd to another so
we are postponing memory allocations accounted against
the locked memory limit. One of previous patches took care of
it_userspace.
At the moment we create the default DMA window when the first group is
attached to a container; this is done for the userspace which is not
DDW-aware but familiar with the SPAPR TCE IOMMU v2 in the part of memory
pre-registration - such client expects the default DMA window to exist.
This postpones the default DMA window allocation till one of
the folliwing happens:
1. first map/unmap request arrives;
2. new window is requested;
This adds noop for the case when the userspace requested removal
of the default window which has not been created yet.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is already a helper to create a DMA window which does allocate
a table and programs it to the IOMMU group. However
tce_iommu_take_ownership_ddw() did not use it and did these 2 calls
itself to simplify error path.
Since we are going to delay the default window creation till
the default window is accessed/removed or new window is added,
we need a helper to create a default window from all these cases.
This adds tce_iommu_create_default_window(). Since it relies on
a VFIO container to have at least one IOMMU group (for future use),
this changes tce_iommu_attach_group() to add a group to the container
first and then call the new helper.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The iommu_table struct manages a hardware TCE table and a vmalloc'd
table with corresponding userspace addresses. Both are allocated when
the default DMA window is created and this happens when the very first
group is attached to a container.
As we are going to allow the userspace to configure container in one
memory context and pas container fd to another, we have to postpones
such allocations till a container fd is passed to the destination
user process so we would account locked memory limit against the actual
container user constrainsts.
This postpones the it_userspace array allocation till it is used first
time for mapping. The unmapping patch already checks if the array is
allocated.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This changes mm_iommu_xxx helpers to take mm_struct as a parameter
instead of getting it from @current which in some situations may
not have a valid reference to mm.
This changes helpers to receive @mm and moves all references to @current
to the caller, including checks for !current and !current->mm;
checks in mm_iommu_preregistered() are removed as there is no caller
yet.
This moves the mm_iommu_adjust_locked_vm() call to the caller as
it receives mm_iommu_table_group_mem_t but it needs mm.
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Beyond vfio_iommu events, users might also be interested in
vfio_group events. For example, if a vfio_group is used along
with Qemu/KVM, whenever kvm pointer is set to/cleared from the
vfio_group, users could be notified.
Currently only VFIO_GROUP_NOTIFY_SET_KVM supported.
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Jike Song <jike.song@intel.com>
[aw: remove use of new typedef]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently vfio_register_notifier assumes that there is only one
notifier chain, which is in vfio_iommu. However, the user might
also be interested in events other than vfio_iommu, for example,
vfio_group. Refactor vfio_{un}register_notifier implementation
to make it feasible.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Jike Song <jike.song@intel.com>
[aw: merge with commit 816ca69ea9c7 ("vfio: Fix handling of error returned by 'vfio_group_get_from_dev()'"), remove typedef]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
'vfio_group_get_from_dev()' seems to return only NULL on error, not an
error pointer.
Fixes: 2169037dc3 ("vfio iommu: Added pin and unpin callback functions to vfio_iommu_driver_ops")
Fixes: c086de818d ("vfio iommu: Add blocking notifier to notify DMA_UNMAP")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Capability header next field is an offset relative to the start of
the INFO buffer. tmp->next is assigned the proper value but iterations
implemented in vfio_info_cap_add and vfio_info_cap_shift use next
as an offset between the headers. When coping with multiple capabilities
this leads to an Oops.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As of commit d97ffe2368 ("PCI: Fix return value from
pci_user_{read,write}_config_*()") it's unnecessary to call
pcibios_err_to_errno() to fixup the return value from these functions.
pcibios_err_to_errno() already does simple passthrough of -errno values,
therefore no functional change is expected.
[aw: changelog]
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Vendor driver using mediated device framework would use same mechnism to
validate and prepare IRQs. Introducing this function to reduce code
replication in multiple drivers.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Update msix_sparse_mmap_cap() to use vfio_info_add_capability()
Update region type capability to use vfio_info_add_capability()
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Vendor driver using mediated device framework should use
vfio_info_add_capability() to add capabilities.
Introduced this function to reduce code duplication in vendor drivers.
vfio_info_cap_shift() manipulated a data buffer to add an offset to each
element in a chain. This data buffer is documented in a uapi header.
Changing vfio_info_cap_shift symbol to be available to all drivers.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Added blocking notifier to IOMMU TYPE1 driver to notify vendor drivers
about DMA_UNMAP.
Exported two APIs vfio_register_notifier() and vfio_unregister_notifier().
Notifier should be registered, if external user wants to use
vfio_pin_pages()/vfio_unpin_pages() APIs to pin/unpin pages.
Vendor driver should use VFIO_IOMMU_NOTIFY_DMA_UNMAP action to invalidate
mappings.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
Mediated device only uses IOMMU APIs, the underlying hardware can be
managed by an IOMMU domain.
Aim of this change is:
- To use most of the code of TYPE1 IOMMU driver for mediated devices
- To support direct assigned device and mediated device in single module
This change adds pin and unpin support for mediated device to TYPE1 IOMMU
backend module. More details:
- Domain for external user is tracked separately in vfio_iommu structure.
It is allocated when group for first mdev device is attached.
- Pages pinned for external domain are tracked in each vfio_dma structure
for that iova range.
- Page tracking rb-tree in vfio_dma keeps <iova, pfn, ref_count>. Key of
rb-tree is iova, but it actually aims to track pfns.
- On external pin request for an iova, page is pinned once, if iova is
already pinned and tracked, ref_count is incremented.
- External unpin request unpins pages only when ref_count is 0.
- Pinned pages list is used to find pfn from iova and then unpin it.
WARN_ON is added if there are entires in pfn_list while detaching the
group and releasing the domain.
- Page accounting is updated to account in its address space where the
pages are pinned/unpinned, i.e dma->task
- Accouting for mdev device is only done if there is no iommu capable
domain in the container. When there is a direct device assigned to the
container and that domain is iommu capable, all pages are already pinned
during DMA_MAP.
- Page accouting is updated on hot plug and unplug mdev device and pass
through device.
Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- Linux VM hot plug and unplug vGPU device while GPU pass through device
exist
- Linux VM hot plug and unplug GPU pass through device while vGPU device
exist
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add task structure to vfio_dma structure. Task structure is used for:
- During DMA_UNMAP, same task who mapped it or other task who shares same
address space is allowed to unmap, otherwise unmap fails.
QEMU maps few iova ranges initially, then fork threads and from the child
thread calls DMA_UNMAP on previously mapped iova. Since child shares same
address space, DMA_UNMAP is successful.
- Avoid accessing struct mm while process is exiting by acquiring
reference of task's mm during page accounting.
- It is also used to get task mlock capability and rlimit for mlock.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Update arguments of vaddr_get_pfn() to take struct mm_struct *mm as input
argument.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Added APIs for pining and unpining set of pages. These call back into
backend iommu module to actually pin and unpin pages.
Added two new callback functions to struct vfio_iommu_driver_ops. Backend
IOMMU module that supports pining and unpinning pages for mdev devices
should provide these functions.
Renamed static functions in vfio_type1_iommu.c to resolve conflicts
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This change rearrange functions to have common function to increment
container_users
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch rearranges functions to get vfio_group from device
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_mdev driver registers with mdev core driver.
mdev core driver creates mediated device and calls probe routine of
vfio_mdev driver for each device.
Probe routine of vfio_mdev driver adds mediated device to VFIO core module
This driver forms a shim layer that pass through VFIO devices operations
to vendor driver for mediated devices.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Design for Mediated Device Driver:
Main purpose of this driver is to provide a common interface for mediated
device management that can be used by different drivers of different
devices.
This module provides a generic interface to create the device, add it to
mediated bus, add device to IOMMU group and then add it to vfio group.
Below is the high Level block diagram, with Nvidia, Intel and IBM devices
as example, since these are the devices which are going to actively use
this module as of now.
+---------------+
| |
| +-----------+ | mdev_register_driver() +--------------+
| | | +<------------------------+ __init() |
| | mdev | | | |
| | bus | +------------------------>+ |<-> VFIO user
| | driver | | probe()/remove() | vfio_mdev.ko | APIs
| | | | | |
| +-----------+ | +--------------+
| |
| MDEV CORE |
| MODULE |
| mdev.ko |
| +-----------+ | mdev_register_device() +--------------+
| | | +<------------------------+ |
| | | | | nvidia.ko |<-> physical
| | | +------------------------>+ | device
| | | | callback +--------------+
| | Physical | |
| | device | | mdev_register_device() +--------------+
| | interface | |<------------------------+ |
| | | | | i915.ko |<-> physical
| | | +------------------------>+ | device
| | | | callback +--------------+
| | | |
| | | | mdev_register_device() +--------------+
| | | +<------------------------+ |
| | | | | ccw_device.ko|<-> physical
| | | +------------------------>+ | device
| | | | callback +--------------+
| +-----------+ |
+---------------+
Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:
/**
* struct mdev_driver - Mediated device's driver
* @name: driver name
* @probe: called when new device created
* @remove:called when device removed
* @driver:device driver structure
*
**/
struct mdev_driver {
const char *name;
int (*probe) (struct device *dev);
void (*remove) (struct device *dev);
struct device_driver driver;
};
Mediated bus driver for mdev device should use this interface to register
and unregister with core driver respectively:
int mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);
Mediated bus driver is responsible to add/delete mediated devices to/from
VFIO group when devices are bound and unbound to the driver.
2. Physical device driver interface
This interface provides vendor driver the set APIs to manage physical
device related work in its driver. APIs are :
* dev_attr_groups: attributes of the parent device.
* mdev_attr_groups: attributes of the mediated device.
* supported_type_groups: attributes to define supported type. This is
mandatory field.
* create: to allocate basic resources in vendor driver for a mediated
device. This is mandatory to be provided by vendor driver.
* remove: to free resources in vendor driver when mediated device is
destroyed. This is mandatory to be provided by vendor driver.
* open: open callback of mediated device
* release: release callback of mediated device
* read : read emulation callback.
* write: write emulation callback.
* ioctl: ioctl callback.
* mmap: mmap emulation callback.
Drivers should use these interfaces to register and unregister device to
mdev core driver respectively:
extern int mdev_register_device(struct device *dev,
const struct parent_ops *ops);
extern void mdev_unregister_device(struct device *dev);
There are no locks to serialize above callbacks in mdev driver and
vfio_mdev driver. If required, vendor driver can have locks to serialize
above APIs in their driver.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The VFIO_DEVICE_SET_IRQS ioctl did not sufficiently sanitize
user-supplied integers, potentially allowing memory corruption. This
patch adds appropriate integer overflow checks, checks the range bounds
for VFIO_IRQ_SET_DATA_NONE, and also verifies that only single element
in the VFIO_IRQ_SET_DATA_TYPE_MASK bitmask is set.
VFIO_IRQ_SET_ACTION_TYPE_MASK is already correctly checked later in
vfio_pci_set_irqs_ioctl().
Furthermore, a kzalloc is changed to a kcalloc because the use of a
kzalloc with an integer multiplication allowed an integer overflow
condition to be reached without this patch. kcalloc checks for overflow
and should prevent a similar occurrence.
Signed-off-by: Vlad Tsyrklevich <vlad@tsyrklevich.net>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Simplify the interrupt setup by using the new PCI layer helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The MSI/X shutdown path can gratuitously enable INTx, which is not
something we want to happen if we're dealing with broken INTx device.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We use a BAR restore trick to try to detect when a user has performed
a device reset, possibly through FLR or other backdoors, to put things
back into a working state. This is important for backdoor resets, but
we can actually just virtualize the "front door" resets provided via
PCIe and AF FLR. Set these bits as virtualized + writable, allowing
the default write to set them in vconfig, then we can simply check the
bit, perform an FLR of our own, and clear the bit. We don't actually
have the granularity in PCI to specify the type of reset we want to
do, but generally devices don't implement both PCIe and AF FLR and
we'll favor these over other types of reset, so we should generally
lineup. We do test whether the device provides the requested FLR type
to stay consistent with hardware capabilities though.
This seems to fix several instance of devices getting into bad states
with userspace drivers, like dpdk, running inside a VM.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Greg Rose <grose@lightfleet.com>
We get a few warnings when building kernel with W=1:
drivers/vfio/platform/vfio_platform_common.c:76:5: warning: no previous prototype for 'vfio_platform_acpi_call_reset' [-Wmissing-prototypes]
drivers/vfio/platform/vfio_platform_common.c:98:6: warning: no previous prototype for 'vfio_platform_acpi_has_reset' [-Wmissing-prototypes]
drivers/vfio/platform/vfio_platform_common.c:640:5: warning: no previous prototype for 'vfio_platform_of_probe' [-Wmissing-prototypes]
drivers/vfio/platform/reset/vfio_platform_amdxgbe.c:59:5: warning: no previous prototype for 'vfio_platform_amdxgbe_reset' [-Wmissing-prototypes]
drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c:60:5: warning: no previous prototype for 'vfio_platform_calxedaxgmac_reset' [-Wmissing-prototypes]
....
In fact, these functions are only used in the file in which they are
declared and don't need a declaration, but can be made static.
so this patch marks these functions with 'static'.
Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There are multiple cases in vfio_pci_set_ctx_trigger_single() where
we assume we can safely read from our data pointer without actually
checking whether the user has passed any data via the count field.
VFIO_IRQ_SET_DATA_NONE in particular is entirely broken since we
attempt to pull an int32_t file descriptor out before even checking
the data type. The other data types assume the data pointer contains
one element of their type as well.
In part this is good news because we were previously restricted from
doing much sanitization of parameters because it was missed in the
past and we didn't want to break existing users. Clearly DATA_NONE
is completely broken, so it must not have any users and we can fix
it up completely. For DATA_BOOL and DATA_EVENTFD, we'll just
protect ourselves, returning error when count is zero since we
previously would have oopsed.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reported-by: Chris Thompson <the_cartographer@hotmail.com>
Cc: stable@vger.kernel.org
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Release call is ignoring the return code from reset call and can
potentially continue even though reset call failed.
If reset_required module parameter is set, this patch is going
to validate the return code and will cause stack dump with
WARN_ON and warn the user of failure.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Open call is ignoring the return code from reset call and can
potentially continue even though reset call failed.
If reset_required module parameter is set, this patch is going
to validate the return code and will abort open if reset fails.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The code was allowing platform devices to be used without a supporting
VFIO reset driver. The hardware can be left in some inconsistent state
after a guest machine abort.
The reset driver will put the hardware back to safe state and disable
interrupts before returning the control back to the host machine.
Adding a new reset_required kernel module option to platform VFIO drivers.
The default value is true for the DT and ACPI based drivers.
The reset requirement value for AMBA drivers is set to false and is
unchangeable to maintain the existing functionality.
New requirements are:
1. A reset function needs to be implemented by the corresponding driver
via DT/ACPI.
2. The reset function needs to be discovered via DT/ACPI.
The probe of the driver will fail if any of the above conditions are
not satisfied.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The device tree code checks for the presence of a reset driver and calls
the of_reset function pointer by looking up the reset driver as a module.
ACPI defines _RST method to perform device level reset. After the _RST
method is executed, the OS can resume using the device. _RST method is
expected to stop DMA transfers and IRQs.
This patch introduces two functions as vfio_platform_acpi_has_reset and
vfio_platform_acpi_call_reset. The has reset method is used to declare
reset capability via the ioctl flag VFIO_DEVICE_FLAGS_RESET. The call
reset function is used to execute the _RST ACPI method.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Getting ready to bring out extra debug information to the caller
so that more verbose information can be printed when an error is
observed.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The code is using the compatible DT string to associate a reset driver
with the actual device itself. The compatible string does not exist on
ACPI based systems. HID is the unique identifier for a device driver
instead.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Creating a new function to determine if this driver supports reset
function or not. This is an attempt to abstract device tree calls
from the rest of the code.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The reset call sequence seems to replicate itself multiple times
across the file. Grouping them together for maintenance reasons.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Renaming the reset function to of_reset as it is only used
by the device tree based platforms.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio group should be released after
the vfio_group_try_dissolve_container call.
The code should not rely on someone else to hold
a reference on the group.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Current vfio-pci implementation disallows to mmap
sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio
page may be shared with other BARs. This will cause some
performance issues when we passthrough a PCI device with
this kind of BARs. Guest will be not able to handle the mmio
accesses to the BARs which leads to mmio emulations in host.
However, not all sub-page BARs will share page with other BARs.
We should allow to mmap the sub-page MMIO BARs which we can
make sure will not share page with other BARs.
This patch adds support for this case. And we try to add a
dummy resource to reserve the remainder of the page which
hot-add device's BAR might be assigned into. But it's not
necessary to handle the case when the BAR is not page aligned.
Because we can't expect the BAR will be assigned into the same
location in a page in guest when we passthrough the BAR. And
it's hard to access this BAR in userspace because we have
no way to get the BAR's location in a page.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio No-IOMMU mode was supported by this
'commit 03a76b60f8 ("vfio: Include No-IOMMU mode")',
but it only support vfio-pci.
Using vfio_iommu_group_get/put, but not iommu_group_get/put,
the platform devices can be exposed to userspace with
CONFIG_VFIO_NOIOMMU and the "enable_unsafe_noiommu_mode"
option enabled.
From 'commit 03a76b60f8 ("vfio: Include No-IOMMU mode")',
"This should make it very clear that this mode is not safe.
Additionally, CAP_SYS_RAWIO privileges are necessary to work
with groups and containers using this mode. Groups making
use of this support are named /dev/vfio/noiommu-$GROUP and
can only make use of the special VFIO_NOIOMMU_IOMMU for the
container. Use of this mode, specifically binding a device
without a native IOMMU group to a VFIO bus driver will taint
the kernel and should therefore not be considered supported."
Signed-off-by: Peng Fan <van.freenix@gmail.com>
Cc: Eric Auger <eric.auger@linaro.org>
Cc: Baptiste Reynal <b.reynal@virtualopensystems.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The size of the VPD area is not necessarily 4-byte aligned, so a
pci_vpd_read() might return less than 4 bytes. Zero our buffer and
accept anything other than an error. Intel X710 NICs exercise this.
Fixes: 4e1a635552 ("vfio/pci: Use kernel VPD access functions")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This function cannot actually be called with npage = 0, so in practice
this doesn't return an uninitialized value.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Both the INTx and MSI/X disable paths do an eventfd_ctx_put() for the
trigger eventfd before calling vfio_virqfd_disable() any potential
mask and unmask eventfds. This opens a use-after-free race where an
inopportune irqfd can reference the freed signalling eventfd. Reorder
to avoid this possibility.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
- Hide INTx on certain known broken devices (Alex Williamson)
- Additional backdoor reset detection (Alex Williamson)
- Remove unused iommudata reference (Alexey Kardashevskiy)
- Use cfg_size to avoid probing extended config space (Alexey Kardashevskiy)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJXRc8hAAoJECObm247sIsigWAP/R+q+UnOHXd7cvSKpI33p2IP
ur09xwV9GktQViWz4tclFdfk8cScq86UX8I5e/jkeFVCXT7e5FU7FhKqJi7dPbp1
Yh3g5fGUkk6a0B3gebIq3qY02aLwJF17WZg0R/KHkQT0B/yQuTQqdF37Xh9rGv0S
c7sgQn1X5nmO7jVDYYKO3SwEgmZWAcBz4Ht2XpqCdSrmhsse9OxlnmOkieM5BNUz
rQhnziaeJE/Ya+y/A74XicbGforyThvNyJs/anJnPEE89773SGVQy/Jdlz4Lwji7
XImPuj4AT9duTgX9HaD38xIpFOsAKDfZ6sClsICkIvhbs232UXuiMxcPszDA97c0
7MxSJcLVr9fKB6zatq2JWhGDp7C6ylUxapEI9PFCV6gE5OYZRM7KD+hm32oVnuPv
rSzPoqnm0Sudu8SO6n46QUZAifp+mX9MNhqzkXGR/YlBHhB1L3/QcczyekY0eBbj
vJ2htebrg0qNQn4G9n4ygMm19r53ew/Q+pO2y7y4TdOr+gZNW1Wj7uOezMpvDOOB
hiy+HkJ24MCTfAGNgjpjjCot/o608+QT5H+y8SR7vT1IK4shOSE4rYfvo6jlzRQp
9FRTolGhYqrtih+zB5R7eLghtvlDp4lN0gDCSgWGHM7e2rMLxaomzoF1VNQ8eKJZ
iUJZ8jE2QrdGDpKssRuF
=fGnT
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.7-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Hide INTx on certain known broken devices (Alex Williamson)
- Additional backdoor reset detection (Alex Williamson)
- Remove unused iommudata reference (Alexey Kardashevskiy)
- Use cfg_size to avoid probing extended config space (Alexey
Kardashevskiy)
* tag 'vfio-v4.7-rc1' of git://github.com/awilliam/linux-vfio:
vfio_pci: Test for extended capabilities if config space > 256 bytes
vfio_iommu_spapr_tce: Remove unneeded iommu_group_get_iommudata
vfio/pci: Add test for BAR restore
vfio/pci: Hide broken INTx support from user
Highlights:
- Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
- Live patching support for ppc64le (also merged via livepatching.git)
Various cleanups & minor fixes from:
- Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie, Lennart
Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael
Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras, Rashmica Gupta,
Russell Currey, Suraj Jitindar Singh, Thiago Jung Bauermann, Valentin
Rothberg, Vipin K Parashar.
General:
- Update LMB associativity index during DLPAR add/remove from Nathan Fontenot
- Fix branching to OOL handlers in relocatable kernel from Hari Bathini
- Add support for userspace Power9 copy/paste from Chris Smart
- Always use STRICT_MM_TYPECHECKS from Michael Ellerman
- Add mask of possible MMU features from Michael Ellerman
PCI:
- Enable pass through of NVLink to guests from Alexey Kardashevskiy
- Cleanups in preparation for powernv PCI hotplug from Gavin Shan
- Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
- Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
- Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell" from Guilherme G. Piccoli
- Remove the dependency on EEH struct in DDW mechanism from Guilherme G. Piccoli
selftests:
- Test cp_abort during context switch from Chris Smart
- Add several tests for transactional memory support from Rashmica Gupta
perf:
- Add support for sampling interrupt register state from Anju T
- Add support for unwinding perf-stackdump from Chandan Kumar
cxl:
- Configure the PSL for two CAPI ports on POWER8NVL from Philippe Bergheaud
- Allow initialization on timebase sync failures from Frederic Barrat
- Increase timeout for detection of AFU mmio hang from Frederic Barrat
- Handle num_of_processes larger than can fit in the SPA from Ian Munsie
- Ensure PSL interrupt is configured for contexts with no AFU IRQs from Ian Munsie
- Add kernel API to allow a context to operate with relocate disabled from Ian Munsie
- Check periodically the coherent platform function's state from Christophe Lombard
Freescale:
- Updates from Scott: "Contains 86xx fixes, minor device tree fixes, an erratum
workaround, and a kconfig dependency fix."
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXPsGzAAoJEFHr6jzI4aWAVoAP/iKdrDe0eYHlVAE9SqnbsiZs
lgDxdsC8P3fsmP1G9o/HkKhC82zHl/La8Ztz8dtqa+LkSzbfliWP1ztJsI7GsBFo
tyCKzWnX9Rwvd3meHu/o/SQ29TNLm/PbPyyRqpj5QPbJ8XCXkAXR7ZZZqjvcMsJW
/AgIr7Cgf53tl9oZzzl/c7CnNHhMq+NBdA71vhWtUx+T97wfJEGyKW6HhZyHDbEU
iAki7fu77ZpEqC/Fh9swf0dCGBJ+a132NoMVo0AdV7EQLznUYlQpQEqa+1PyHZOP
/ArOzf2mDg6m3PfCo1eiB07v8PnVZ3llEUbVAJNg3GUxbE4SHrqq/kwm0iElm3p/
DvFxerCwdX9vmskJX4wDs+pSZRabXYj9XVMptsgFzA4joWrqqb7mBHqaort88YcY
YSljEt1bHyXmiJ+dBya40qARsWUkCVN7ZgEzdxckq0KI3w7g2tqpqIbO2lClWT6t
B3GpqQ4jp34+d1M14FB91fIGK7tMvOhSInE0Mv9+tPvRsepXqiiU/SwdAtRlr3m2
zs/K+4FYcVjJ3Rmpgc+tI38PbZxHe212I35YN6L1LP+4ZfAtzz0NyKdooTIBtkbO
19pX4WbBjKq8zK+YutrySncBIrbnI6VjW51vtRhgVKZliPFO/6zKagyU6FbxM+E5
udQES+t3F/9gvtxgxtDe
=YvyQ
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights:
- Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
- Live patching support for ppc64le (also merged via livepatching.git)
Various cleanups & minor fixes from:
- Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie,
Lennart Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring,
Michael Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras,
Rashmica Gupta, Russell Currey, Suraj Jitindar Singh, Thiago Jung
Bauermann, Valentin Rothberg, Vipin K Parashar.
General:
- Update LMB associativity index during DLPAR add/remove from Nathan
Fontenot
- Fix branching to OOL handlers in relocatable kernel from Hari Bathini
- Add support for userspace Power9 copy/paste from Chris Smart
- Always use STRICT_MM_TYPECHECKS from Michael Ellerman
- Add mask of possible MMU features from Michael Ellerman
PCI:
- Enable pass through of NVLink to guests from Alexey Kardashevskiy
- Cleanups in preparation for powernv PCI hotplug from Gavin Shan
- Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
- Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
- Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
from Guilherme G Piccoli
- Remove the dependency on EEH struct in DDW mechanism from Guilherme
G Piccoli
selftests:
- Test cp_abort during context switch from Chris Smart
- Add several tests for transactional memory support from Rashmica
Gupta
perf:
- Add support for sampling interrupt register state from Anju T
- Add support for unwinding perf-stackdump from Chandan Kumar
cxl:
- Configure the PSL for two CAPI ports on POWER8NVL from Philippe
Bergheaud
- Allow initialization on timebase sync failures from Frederic Barrat
- Increase timeout for detection of AFU mmio hang from Frederic
Barrat
- Handle num_of_processes larger than can fit in the SPA from Ian
Munsie
- Ensure PSL interrupt is configured for contexts with no AFU IRQs
from Ian Munsie
- Add kernel API to allow a context to operate with relocate disabled
from Ian Munsie
- Check periodically the coherent platform function's state from
Christophe Lombard
Freescale:
- Updates from Scott: "Contains 86xx fixes, minor device tree fixes,
an erratum workaround, and a kconfig dependency fix."
* tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (192 commits)
powerpc/86xx: Fix PCI interrupt map definition
powerpc/86xx: Move pci1 definition to the include file
powerpc/fsl: Fix build of the dtb embedded kernel images
powerpc/fsl: Fix rcpm compatible string
powerpc/fsl: Remove FSL_SOC dependency from FSL_LBC
powerpc/fsl-pci: Add a workaround for PCI 5 errata
powerpc/fsl: Fix SPI compatible on t208xrdb and t1040rdb
powerpc/powernv/npu: Add PE to PHB's list
powerpc/powernv: Fix insufficient memory allocation
powerpc/iommu: Remove the dependency on EEH struct in DDW mechanism
Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
powerpc/eeh: Drop unnecessary label in eeh_pe_change_owner()
powerpc/eeh: Ignore handlers in eeh_pe_reset_and_recover()
powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover()
powerpc/eeh: Don't report error in eeh_pe_reset_and_recover()
Revert "powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()"
powerpc/powernv/npu: Enable NVLink pass through
powerpc/powernv/npu: Rework TCE Kill handling
powerpc/powernv/npu: Add set/unset window helpers
powerpc/powernv/ioda2: Export debug helper pe_level_printk()
...
PCI-Express spec says that reading 4 bytes at offset 100h should return
zero if there is no extended capability so VFIO reads this dword to
know if there are extended capabilities.
However it is not always possible to access the extended space so
generic PCI code in pci_cfg_space_size_ext() checks if
pci_read_config_dword() can read beyond 100h and if the check fails,
it sets the config space size to 100h.
VFIO does its own extended capabilities check by reading at offset 100h
which may produce 0xffffffff which VFIO treats as the extended config
space presense and calls vfio_ecap_init() which fails to parse
capabilities (which is expected) but right before the exit, it writes
zero at offset 100h which is beyond the buffer allocated for
vdev->vconfig (which is 256 bytes) which leads to random memory
corruption.
This makes VFIO only check for the extended capabilities if
the discovered config size is more than 256 bytes.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We are going to have multiple different types of PHB on the same system
with POWER8 + NVLink and PHBs will have different IOMMU ops. However
we only really care about one callback - create_table - so we can
relax the compatibility check here.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Many IOMMUs support multiple page table formats, meaning that any given
domain may only support a subset of the hardware page sizes presented in
iommu_ops->pgsize_bitmap. There are also certain use-cases where the
creator of a domain may want to control which page sizes are used, for
example to force the use of hugepage mappings to reduce pagetable walk
depth.
To this end, add a per-domain pgsize_bitmap to represent the subset of
page sizes actually in use, to make it possible for domains with
different requirements to coexist.
Signed-off-by: Will Deacon <will.deacon@arm.com>
[rm: hijacked and rebased original patch with new commit message]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This removes iommu_group_get_iommudata() as the result is never used.
As this is a minor cleanup, no change in behavior is expected.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If a device is reset without the memory or i/o bits enabled in the
command register we may not detect it, potentially leaving the device
without valid BAR programming. Add an additional test to check the
BARs on each write to the command register.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
INTx masking has two components, the first is that we need the ability
to prevent the device from continuing to assert INTx. This is
provided via the DisINTx bit in the command register and is the only
thing we can really probe for when testing if INTx masking is
supported. The second component is that the device needs to indicate
if INTx is asserted via the interrupt status bit in the device status
register. With these two features we can generically determine if one
of the devices we own is asserting INTx, signal the user, and mask the
interrupt while the user services the device.
Generally if one or both of these components is broken we resort to
APIC level interrupt masking, which requires an exclusive interrupt
since we have no way to determine the source of the interrupt in a
shared configuration. This often makes it difficult or impossible to
configure the system for userspace use of the device, for an interrupt
mode that the user may not need.
One possible configuration of broken INTx masking is that the DisINTx
support is fully functional, but the interrupt status bit never
signals interrupt assertion. In this case we do have the ability to
prevent the device from asserting INTx, but lack the ability to
identify the interrupt source. For this case we can simply pretend
that the device lacks INTx support entirely, keeping DisINTx set on
the physical device, virtualizing this bit for the user, and
virtualizing the interrupt pin register to indicate no INTx support.
We already support virtualization of the DisINTx bit and already
virtualize the interrupt pin for platforms without INTx support. By
tying these components together, setting DisINTx on open and reset,
and identifying devices broken in this particular way, we can provide
support for them w/o the handicap of APIC level INTx masking.
Intel i40e (XL710/X710) 10/20/40GbE NICs have been identified as being
broken in this specific way. We leave the vfio-pci.nointxmask option
as a mechanism to bypass this support, enabling INTx on the device
with all the requirements of APIC level masking.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Various enablers for assignment of Intel graphics devices and future
support of vGPU devices (Alex Williamson). This includes
- Handling the vfio type1 interface as an API rather than a specific
implementation, allowing multiple type1 providers.
- Capability chains, similar to PCI device capabilities, that allow
extending ioctls. Extensions here include device specific regions
and sparse mmap descriptions. The former is used to expose non-PCI
regions for IGD, including the OpRegion (particularly the Video
BIOS Table), and read only PCI config access to the host and LPC
bridge as drivers often depend on identifying those devices.
Sparse mmaps here are used to describe the MSIx vector table,
which vfio has always protected from mmap, but never had an API to
explicitly define that protection. In future vGPU support this is
expected to allow the description of PCI BARs that may mix direct
access and emulated access within a single region.
- The ability to expose the shadow ROM as an option ROM as IGD use
cases may rely on the ROM even though the physical device does not
make use of a PCI option ROM BAR.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJW6aT1AAoJECObm247sIsiiP4P/1xf7Z08/2QWVFQzex9CLcZk
+/iJlyb/fTpPVQE+NTKPz3Qh5h6ZhSd/57s85IUqq0T6tgVPkoGx8kkyCjBaw2y1
yMezXZlQqJdZqGzQNI4OiHWvO+/vGxYKjQMfUnMlDM6dJgz4lGncGFoSouFPa3Vp
mB12hGxrlk1cfIdb+C1KbfZcEdS0WhtigQtz8flBKgOfO+hYWmUO+CClJBhVw8Z4
RNcWNAxFfLuwUPVsPb6uOLG2g65SC2vmQ9k0Tnknf1znV3PFFVjITf0aM6uChLNP
S3SgqtPX+6yOFyCuSEs8UKhhmCbeQmAyKgt5BpxV3Rw3OMP4PsVAehr82vQmSj6g
2o96pR2s8MDPBr8eG7gdRe4DQe3PonpLkpDfaghcpYqhkGEqNVeW5/GjiOzGQqD3
xMshzxJ1Iz7DOHkQRUVqOfupDB0TusJmTVKwvXe6yIYL9pjkUS/sbN9U563HYSES
JTV68TMj0VKfKwD3XKYXvGH3km1sL4i5NMlAUrsDtsMkGlXEswoGbj82Mjc8+jUo
BvWQTJb+kouJQ88VhsO2abg1UrO9E6u82iHFHy9fEObxE8KH7pvROlS93ihMT1Wv
WQNuUcltdpHMRVX0BDknaPs3YtC3/TGgm3RcU5SZPbv/ys1471ZmJxMlAAKcfITr
SuvkMTYElF5b1pigv46c
=/lJn
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.6-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
"Various enablers for assignment of Intel graphics devices and future
support of vGPU devices (Alex Williamson). This includes
- Handling the vfio type1 interface as an API rather than a specific
implementation, allowing multiple type1 providers.
- Capability chains, similar to PCI device capabilities, that allow
extending ioctls. Extensions here include device specific regions
and sparse mmap descriptions. The former is used to expose non-PCI
regions for IGD, including the OpRegion (particularly the Video
BIOS Table), and read only PCI config access to the host and LPC
bridge as drivers often depend on identifying those devices.
Sparse mmaps here are used to describe the MSIx vector table, which
vfio has always protected from mmap, but never had an API to
explicitly define that protection. In future vGPU support this is
expected to allow the description of PCI BARs that may mix direct
access and emulated access within a single region.
- The ability to expose the shadow ROM as an option ROM as IGD use
cases may rely on the ROM even though the physical device does not
make use of a PCI option ROM BAR"
* tag 'vfio-v4.6-rc1' of git://github.com/awilliam/linux-vfio:
vfio/pci: return -EFAULT if copy_to_user fails
vfio/pci: Expose shadow ROM as PCI option ROM
vfio/pci: Intel IGD host and LCP bridge config space access
vfio/pci: Intel IGD OpRegion support
vfio/pci: Enable virtual register in PCI config space
vfio/pci: Add infrastructure for additional device specific regions
vfio: Define device specific region type capability
vfio/pci: Include sparse mmap capability for MSI-X table regions
vfio: Define sparse mmap capability for regions
vfio: Add capability chain helpers
vfio: Define capability chains
vfio: If an IOMMU backend fails, keep looking
vfio/pci: Fix unsigned comparison overflow
Calling return copy_to_user(...) in an ioctl will not
do the right thing if there's a pagefault:
copy_to_user returns the number of bytes not copied
in this case.
Fix up vfio to do
return copy_to_user(...)) ?
-EFAULT : 0;
everywhere.
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The copy_to_user() function returns the number of bytes that were not
copied but we want to return -EFAULT on error here.
Fixes: 188ad9d6cb ('vfio/pci: Include sparse mmap capability for MSI-X table regions')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Integrated graphics may have their ROM shadowed at 0xc0000 rather than
implement a PCI option ROM. Make this ROM appear to the user using
the ROM BAR.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Provide read-only access to PCI config space of the PCI host bridge
and LPC bridge through device specific regions. This may be used to
configure a VM with matching register contents to satisfy driver
requirements. Providing this through the vfio file descriptor removes
an additional userspace requirement for access through pci-sysfs and
removes the CAP_SYS_ADMIN requirement that doesn't appear to apply to
the specific devices we're accessing.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is the first consumer of vfio device specific resource support,
providing read-only access to the OpRegion for Intel graphics devices.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Typically config space for a device is mapped out into capability
specific handlers and unassigned space. The latter allows direct
read/write access to config space. Sometimes we know about registers
living in this void space and would like an easy way to virtualize
them, similar to how BAR registers are managed. To do this, create
one more pseudo (fake) PCI capability to be handled as purely virtual
space. Reads and writes are serviced entirely from virtual config
space.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add support for additional regions with indexes started after the
already defined fixed regions. Device specific code can register
these regions with the new vfio_pci_register_dev_region() function.
The ops structure per region currently only includes read/write
access and a release function, allowing automatic cleanup when the
device is closed. mmap support is only missing here because it's
not needed by the first user queued for this support.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio-pci has never allowed the user to directly mmap the MSI-X vector
table, but we've always relied on implicit knowledge of the user that
they cannot do this. Now that we have capability chains that we can
expose in the region info ioctl and a sparse mmap capability that
represents the sub-areas within the region that can be mmap'd, we can
make the mmap constraints more explicit.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Allow sub-modules to easily reallocate a buffer for managing
capability chains for info ioctls.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Consider an IOMMU to be an API rather than an implementation, we might
have multiple implementations supporting the same API, so try another
if one fails. The expectation here is that we'll really only have
one implementation per device type. For instance the existing type1
driver works with any PCI device where the IOMMU API is available. A
vGPU vendor may have a virtual PCI device which provides DMA isolation
and mapping through other mechanisms, but can re-use userspaces that
make use of the type1 VFIO IOMMU API. This allows that to work.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed versus unsigned comparisons are implicitly cast to unsigned,
which result in a couple possible overflows. For instance (start +
count) might overflow and wrap, getting through our validation test.
Also when unwinding setup, -1 being compared as unsigned doesn't
produce the intended stop condition. Fix both of these and also fix
vfio_msi_set_vector_signal() to validate parameters before using the
vector index, though none of the callers should pass bad indexes
anymore.
Reported-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Using iommu_present() to determine whether an IOMMU group is real or
fake has some problems. First, apparently Power systems don't
register an IOMMU on the device bus, so the groups and containers get
marked as noiommu and then won't bind to their actual IOMMU driver.
Second, I expect we'll run into the same issue as we try to support
vGPUs through vfio, since they're likely to emulate this behavior of
creating an IOMMU group on a virtual device and then providing a vfio
IOMMU backend tailored to the sort of isolation they provide, which
won't necessarily be fully compatible with the IOMMU API.
The solution here is to use the existing iommudata interface to IOMMU
groups, which allows us to easily identify the fake groups we've
created for noiommu purposes. The iommudata we set is purely
arbitrary since we're only comparing the address, so we use the
address of the noiommu switch itself.
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <sshukla@mvista.com>
Fixes: 03a76b60f8 ("vfio: Include No-IOMMU mode")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The flags entry is there to tell the user that some
optional information is available.
Since we report the iova_pgsizes signal it to the user
by setting the flags to VFIO_IOMMU_INFO_PGSIZES.
Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system. There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines. However, there are still those users
that want userspace drivers even under those conditions. The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has. In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.
This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver. This
should make it very clear that this mode is not safe. Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode. Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container. Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported. This patch includes no-iommu support for the vfio-pci bus
driver only.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
This loop ends with count set to -1 and not zero so the warning message
isn't printed when it should be. I've fixed this by change the postop
to a preop.
Fixes: 0990822c98 ('VFIO: platform: reset: AMD xgbe reset module')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Revert commit 033291eccb ("vfio: Include No-IOMMU mode") due to lack
of a user. This was originally intended to fill a need for the DPDK
driver, but uptake has been slow so rather than support an unproven
kernel interface revert it and revisit when userspace catches up.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The first argument to the WARN() macro has to be a condition. I'm sort
of disappointed that this code doesn't generate a compiler warning. I
guess -Wformat-extra-args doesn't work in the kernel.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
request_module already takes format strings, so no need to duplicate
the effort.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This pci_error_handlers structure is never modified, like all the other
pci_error_handlers structures, so declare it as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
platform_driver does not need to set an owner because
platform_driver_register() will set it.
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
- Use kernel interfaces for VPD emulation (Alex Williamson)
- Platform fix for releasing IRQs (Eric Auger)
- Type1 IOMMU always advertises PAGE_SIZE support when smaller
mapping sizes are available (Eric Auger)
- Platform fixes for incorrectly using copies of structures rather
than pointers to structures (James Morse)
- Rework platform reset modules, fix leak, and add AMD xgbe reset
module (Eric Auger)
- Fix vfio_device_get_from_name() return value (Joerg Roedel)
- No-IOMMU interface (Alex Williamson)
- Fix potential out of bounds array access in PCI config handling
(Dan Carpenter)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWRg8+AAoJECObm247sIsitkwP+wc6cRzBpeGxiufZl9Ci7JV9
G0qNHBm8tAYDwm1uJATcyZC303ad2B3gaYSO6msTFUXTg3d9ZDtUOWhoTNPs/px1
5dF38DREzqYHpyC9HT2Qj3i9G+ejvgg+SoyxBOTOIw/dmq3tGZhUz1Sj+wuvQTwr
XXBKsWltZ+8wwXmQXmOWI4L3m7Xhs8NwAl5iLJ3UiltpW9zZzuPtoKnQCfMYUcmh
hIJg52t0WPSLyn47UvecUQcqxaO+QYELa7UN84fnQAihk+ewMpzg5blTebayFdu3
f2WC3ivxbamebw74LaRfQFjx4mT+DI0aXYtraC600PVe7gdVXB66QMNNpPhBwAy5
wpfeFpTKU5gC+LHmrIMUS2/A4sdNfUBw44CS8+Lm2D6bQAblPv/C5xQV1rz9HADv
f4/D3Y0TUKSYArewtBHTC0mnXdkZwetttBoy6/zQBl8vkelhoJ3GPcVa8FEZCIuT
2MSS17I3ftJ1enfynicF+Wstn/H5lWcuRBdg5wTLHIuhFn6MiEVxfIuSEx9JfjIb
NGZO7y5JiJ0b5QRCG0tFznwceU/cql/3oRqOGXqaf1cQ1Ag3JOIAUzxknFoJQUj1
XYe+Im1eMaugjj39J3+m5EYNKT3nh/bBLD/V3iWYpgoZtQrmQQm5nu0JsQo88/JR
je0BuJioCuPlO/Wj/KYw
=bI62
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.4-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Use kernel interfaces for VPD emulation (Alex Williamson)
- Platform fix for releasing IRQs (Eric Auger)
- Type1 IOMMU always advertises PAGE_SIZE support when smaller mapping
sizes are available (Eric Auger)
- Platform fixes for incorrectly using copies of structures rather than
pointers to structures (James Morse)
- Rework platform reset modules, fix leak, and add AMD xgbe reset
module (Eric Auger)
- Fix vfio_device_get_from_name() return value (Joerg Roedel)
- No-IOMMU interface (Alex Williamson)
- Fix potential out of bounds array access in PCI config handling (Dan
Carpenter)
* tag 'vfio-v4.4-rc1' of git://github.com/awilliam/linux-vfio:
vfio/pci: make an array larger
vfio: Include No-IOMMU mode
vfio: Fix bug in vfio_device_get_from_name()
VFIO: platform: reset: AMD xgbe reset module
vfio: platform: reset: calxedaxgmac: fix ioaddr leak
vfio: platform: add dev_info on device reset
vfio: platform: use list of registered reset function
vfio: platform: add compat in vfio_platform_device
vfio: platform: reset: calxedaxgmac: add reset function registration
vfio: platform: introduce module_vfio_reset_handler macro
vfio: platform: add capability to register a reset function
vfio: platform: introduce vfio-platform-base module
vfio/platform: store mapped memory in region, instead of an on-stack copy
vfio/type1: handle case where IOMMU does not support PAGE_SIZE size
VFIO: platform: clear IRQ_NOAUTOEN when de-assigning the IRQ
vfio/pci: Use kernel VPD access functions
vfio: Whitelist PCI bridges
Smatch complains about a possible out of bounds error:
drivers/vfio/pci/vfio_pci_config.c:1241 vfio_cap_init()
error: buffer overflow 'pci_cap_length' 20 <= 20
The problem is that pci_cap_length[] was defined as large enough to
hold "PCI_CAP_ID_AF + 1" elements. The code in vfio_cap_init() assumes
it has PCI_CAP_ID_MAX + 1 elements. Originally, PCI_CAP_ID_AF and
PCI_CAP_ID_MAX were the same but then we introduced PCI_CAP_ID_EA in
commit f80b0ba959 ("PCI: Add Enhanced Allocation register entries")
so now the array is too small.
Let's fix this by making the array size PCI_CAP_ID_MAX + 1. And let's
make a similar change to pci_ext_cap_length[] for consistency. Also
both these arrays can be made const.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system. There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines. However, there are still those users
that want userspace drivers even under those conditions. The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has. In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.
This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver. This
should make it very clear that this mode is not safe. Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode. Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container. Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported. This patch includes no-iommu support for the vfio-pci bus
driver only.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
The vfio_device_get_from_name() function might return a
non-NULL pointer, when called with a device name that is not
found in the list. This causes undefined behavior, in my
case calling an invalid function pointer later on:
kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
BUG: unable to handle kernel paging request at ffff8800cb3ddc08
[...]
Call Trace:
[<ffffffffa03bd733>] ? vfio_group_fops_unl_ioctl+0x253/0x410 [vfio]
[<ffffffff811efc4d>] do_vfs_ioctl+0x2cd/0x4c0
[<ffffffff811f9657>] ? __fget+0x77/0xb0
[<ffffffff811efeb9>] SyS_ioctl+0x79/0x90
[<ffffffff81001bb0>] ? syscall_return_slowpath+0x50/0x130
[<ffffffff8167f776>] entry_SYSCALL_64_fastpath+0x16/0x75
Fix the issue by returning NULL when there is no device with
the requested name in the list.
Cc: stable@vger.kernel.org # v4.2+
Fixes: 4bc94d5dc9 ("vfio: Fix lockdep issue")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch introduces a module that registers and implements a low-level
reset function for the AMD XGBE device.
it performs the following actions:
- reset the PHY
- disable auto-negotiation
- disable & clear auto-negotiation IRQ
- soft-reset the MAC
Those tiny pieces of code are inherited from the native xgbe driver.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In the current code the vfio_platform_region is copied on the stack.
As a consequence the ioaddr address is not iounmapped in the vfio
platform driver (vfio_platform_regions_cleanup). The patch uses the
pointer to the region instead.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
It might be helpful for the end-user to check the device reset
function was found by the vfio platform reset framework.
Lets store a pointer to the struct device in vfio_platform_device
and trace when the reset function is called or not found.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Remove the static lookup table and use the dynamic list of registered
reset functions instead. Also load the reset module through its alias.
The reset struct module pointer is stored in vfio_platform_device.
We also remove the useless struct device pointer parameter in
vfio_platform_get_reset.
This patch fixes the issue related to the usage of __symbol_get, which
besides from being moot, prevented compilation with CONFIG_MODULES
disabled.
Also usage of MODULE_ALIAS makes possible to add a new reset module
without needing to update the framework. This was suggested by Arnd.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Let's retrieve the compatibility string on probe and store it
in the vfio_platform_device struct
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch adds the reset function registration/unregistration.
This is handled through the module_vfio_reset_handler macro. This
latter also defines a MODULE_ALIAS which simplifies the load from
vfio-platform.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The module_vfio_reset_handler macro
- define a module alias
- implement module init/exit function which respectively registers
and unregisters the reset function.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In preparation for subsequent changes in reset function lookup,
lets introduce a dynamic list of reset combos (compat string,
reset module, reset function). The list can be populated/voided with
vfio_platform_register/unregister_reset. Those are not yet used in
this patch.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
To prepare for vfio platform reset rework let's build
vfio_platform_common.c and vfio_platform_irq.c in a separate
module from vfio-platform and vfio-amba. This makes possible
to have separate module inits and works around a race between
platform driver init and vfio reset module init: that way we
make sure symbols exported by base are available when vfio-platform
driver gets probed.
The open/release being implemented in the base module, the ref
count is applied to the parent module instead.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_platform_{read,write}_mmio() call ioremap_nocache() to map
a region of io memory, which they store in struct vfio_platform_region to
be eventually re-used, or unmapped by vfio_platform_regions_cleanup().
These functions receive a copy of their struct vfio_platform_region
argument on the stack - so these mapped areas are always allocated, and
always leaked.
Pass this argument as a pointer instead.
Fixes: 6e3f264560 "vfio/platform: read and write support for the device fd"
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Current vfio_pgsize_bitmap code hides the supported IOMMU page
sizes smaller than PAGE_SIZE. As a result, in case the IOMMU
does not support PAGE_SIZE page, the alignment check on map/unmap
is done with larger page sizes, if any. This can fail although
mapping could be done with pages smaller than PAGE_SIZE.
This patch modifies vfio_pgsize_bitmap implementation so that,
in case the IOMMU supports page sizes smaller than PAGE_SIZE
we pretend PAGE_SIZE is supported and hide sub-PAGE_SIZE sizes.
That way the user will be able to map/unmap buffers whose size/
start address is aligned with PAGE_SIZE. Pinning code uses that
granularity while iommu driver can use the sub-PAGE_SIZE size
to map the buffer.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio platform driver currently sets the IRQ_NOAUTOEN before
doing the request_irq to properly handle the user masking. However
it does not clear it when de-assigning the IRQ. This brings issues
when loading the native driver again which may not explicitly enable
the IRQ. This problem was observed with xgbe driver.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The PCI VPD capability operates on a set of window registers in PCI
config space. Writing to the address register triggers either a read
or write, depending on the setting of the PCI_VPD_ADDR_F bit within
the address register. The data register provides either the source
for writes or the target for reads.
This model is susceptible to being broken by concurrent access, for
which the kernel has adopted a set of access functions to serialize
these registers. Additionally, commits like 932c435cab ("PCI: Add
dev_flags bit to access VPD through function 0") and 7aa6ca4d39
("PCI: Add VPD function 0 quirk for Intel Ethernet devices") indicate
that VPD registers can be shared between functions on multifunction
devices creating dependencies between otherwise independent devices.
Fortunately it's quite easy to emulate the VPD registers, simply
storing copies of the address and data registers in memory and
triggering a VPD read or write on writes to the address register.
This allows vfio users to avoid seeing spurious register changes from
accesses on other devices and enables the use of shared quirks in the
host kernel. We can theoretically still race with access through
sysfs, but the window of opportunity is much smaller.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Mark Rustad <mark.d.rustad@intel.com>
When determining whether a group is viable, we already allow devices
bound to pcieport. Generalize this to include any PCI bridge device.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch adds the registration/unregistration of an
irq_bypass_producer for MSI/MSIx on vfio pci devices.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When we open a device file descriptor, we currently have the
following:
vfio_group_get_device_fd()
mutex_lock(&group->device_lock);
open()
...
if (ret)
release()
If we hit that error case, we call the backend driver release path,
which for vfio-pci looks like this:
vfio_pci_release()
vfio_pci_disable()
vfio_pci_try_bus_reset()
vfio_pci_get_devs()
vfio_device_get_from_dev()
vfio_group_get_device()
mutex_lock(&group->device_lock);
Whoops, we've stumbled back onto group.device_lock and created a
deadlock. There's a low likelihood of ever seeing this play out, but
obviously it needs to be fixed. To do that we can use a reference to
the vfio_device for vfio_group_get_device_fd() rather than holding the
lock. There was a loop in this function, theoretically allowing
multiple devices with the same name, but in practice we don't expect
such a thing to happen and the code is already aborting from the loop
with break on any sort of error rather than continuing and only
parsing the first match anyway, so the loop was effectively unused
already.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Fixes: 20f300175a ("vfio/pci: Fix racy vfio_device_get_from_dev() call")
Reported-by: Joerg Roedel <joro@8bytes.org>
Tested-by: Joerg Roedel <jroedel@suse.de>
- Disable the 32-bit vdso when building LE, so we can build with a 64-bit only
toolchain.
- EEH fixes from Gavin & Richard.
- Enable the sys_kcmp syscall from Laurent.
- Sysfs control for fastsleep workaround from Shreyas.
- Expose OPAL events as an irq chip by Alistair.
- MSI ops moved to pci_controller_ops by Daniel.
- Fix for kernel to userspace backtraces for perf from Anton.
- Merge pseries and pseries_le defconfigs from Cyril.
- CXL in-kernel API from Mikey.
- OPAL prd driver from Jeremy.
- Fix for DSCR handling & tests from Anshuman.
- Powernv flash mtd driver from Cyril.
- Dynamic DMA Window support on powernv from Alexey.
- LLVM clang fixes & workarounds from Anton.
- Reworked version of the patch to abort syscalls when transactional.
- Fix the swap encoding to support 4TB, from Aneesh.
- Various fixes as usual.
- Freescale updates from Scott: Highlights include more 8xx optimizations, an
e6500 hugetlb optimization, QMan device tree nodes, t1024/t1023 support, and
various fixes and cleanup.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJViSZqAAoJEFHr6jzI4aWAA7kQAKq3+pejfo2rY7alpKJyeVao
vlaIEaDNOTh+ctcmu3MFF9Jy6fai8gNZziRXU5JRmE5RW4GVBN4KZiqXRbkVjdBK
uG9sCX7Y58VRsS2vnGBYLsamfTMgjaXeDvgunQHVLiechJnrDr0RHEK90F3LSi73
Axp6l8XIG63a3zFZmkhzANMCme2lm5+MWmGlSjUUNi5F+viQUgJc5iiO8xrVUgM5
RpNlV2NJSqFiU+gMQWJ226V85UIniouq4j+qtyUcu8/m9BberyolXVU0GPlPFdsx
r/Qh9uCJyZaUdSB5hzomQZj50IsSz6J6nEuJTeGRoVZOmeI8Dnc2xU9fxQF5fC8H
lUJw10WPoNOggQZTeSUKn7wTXw3i4p3KsWNUczaW68VJdhqZUVaSp0+I6mnDSqzs
9iGC+VffLYNa1OHq7mGRFrgDdLBCHes31aZ3CxlQsmyNpAPCwMzsD4TUfVnvOG6E
oJOeaQ4mZM9PvqxEYJfoIL+vgRxmQ8sdIBtNY4in+C7J6eFnZNFO9xmPnJZuVU31
PGtx60kjFCOVMXvqn34WkRNbgqGWI91IK0KcRwFO2LXVio1uY77TWL52kNK2IMsp
Az+VDDvqnT3+BoV1yz0P6SrXAkwTpvFk2y+IdmEiUUN7zZFL5ZSA2epej9AzHTAK
WID2bc5yVtIL6p6x5ICH
=d9Wh
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull powerpc updates from Michael Ellerman:
- disable the 32-bit vdso when building LE, so we can build with a
64-bit only toolchain.
- EEH fixes from Gavin & Richard.
- enable the sys_kcmp syscall from Laurent.
- sysfs control for fastsleep workaround from Shreyas.
- expose OPAL events as an irq chip by Alistair.
- MSI ops moved to pci_controller_ops by Daniel.
- fix for kernel to userspace backtraces for perf from Anton.
- merge pseries and pseries_le defconfigs from Cyril.
- CXL in-kernel API from Mikey.
- OPAL prd driver from Jeremy.
- fix for DSCR handling & tests from Anshuman.
- Powernv flash mtd driver from Cyril.
- dynamic DMA Window support on powernv from Alexey.
- LLVM clang fixes & workarounds from Anton.
- reworked version of the patch to abort syscalls when transactional.
- fix the swap encoding to support 4TB, from Aneesh.
- various fixes as usual.
- Freescale updates from Scott: Highlights include more 8xx
optimizations, an e6500 hugetlb optimization, QMan device tree nodes,
t1024/t1023 support, and various fixes and cleanup.
* tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (180 commits)
cxl: Fix typo in debug print
cxl: Add CXL_KERNEL_API config option
powerpc/powernv: Fix wrong IOMMU table in pnv_ioda_setup_bus_dma()
powerpc/mm: Change the swap encoding in pte.
powerpc/mm: PTE_RPN_MAX is not used, remove the same
powerpc/tm: Abort syscalls in active transactions
powerpc/iommu/ioda2: Enable compile with IOV=on and IOMMU_API=off
powerpc/include: Add opal-prd to installed uapi headers
powerpc/powernv: fix construction of opal PRD messages
powerpc/powernv: Increase opal-irqchip initcall priority
powerpc: Make doorbell check preemption safe
powerpc/powernv: pnv_init_idle_states() should only run on powernv
macintosh/nvram: Remove as unused
powerpc: Don't use gcc specific options on clang
powerpc: Don't use -mno-strict-align on clang
powerpc: Only use -mtraceback=no, -mno-string and -msoft-float if toolchain supports it
powerpc: Only use -mabi=altivec if toolchain supports it
powerpc: Fix duplicate const clang warning in user access code
vfio: powerpc/spapr: Support Dynamic DMA windows
vfio: powerpc/spapr: Register memory and define IOMMU v2
...
This patch enables building VFIO platform and derivatives on ARM64.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch introduces a module that registers and implements a basic
reset function for the Calxeda xgmac device. This latter basically disables
interrupts and stops DMA transfers.
The reset function code is inherited from the native calxeda xgmac driver.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The reset function lookup happens on vfio-platform probe. The reset
module load is requested and a reference to the function symbol is
hold. The reference is released on vfio-platform remove.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
A new reset callback is introduced. If this callback is populated,
the reset is invoked on device first open/last close or upon userspace
ioctl. The modality is exposed on VFIO_DEVICE_GET_INFO.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch introduces the vfio_platform_reset_combo struct that
stores all the information useful to handle the reset modality:
compat string, name of the reset function, name of the module that
implements the reset function. A lookup table of such structures
is added, currently void.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This adds create/remove window ioctls to create and remove DMA windows.
sPAPR defines a Dynamic DMA windows capability which allows
para-virtualized guests to create additional DMA windows on a PCI bus.
The existing linux kernels use this new window to map the entire guest
memory and switch to the direct DMA operations saving time on map/unmap
requests which would normally happen in a big amounts.
This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
Up to 2 windows are supported now by the hardware and by this driver.
This changes VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
information such as a number of supported windows and maximum number
levels of TCE tables.
DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature
as we still want to support v2 on platforms which cannot do DDW for
the sake of TCE acceleration in KVM (coming soon).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would requite
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and system uses 4K or 64K pages.
Another issue is that actual pages pinning/unpinning happens on every
DMA map/unmap request. This does not affect the performance much now as
we spend way too much time now on switching context between
guest/userspace/host but this will start to matter when we add in-kernel
DMA map/unmap acceleration.
This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical pages pinning and TCE table update
into 2 different operations. It requires:
1) guest pages to be registered first
2) consequent map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.
The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.
The accounting is done per the user process.
This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.
In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell from what
region the just cleared TCE was from.
This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).
As v2 IOMMU supports IODA2 and pre-IODA2 IOMMUs (which do not support
DDW API), this creates a default DMA window for IODA2 for consistency.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Before the IOMMU user (VFIO) would take control over the IOMMU table
belonging to a specific IOMMU group. This approach did not allow sharing
tables between IOMMU groups attached to the same container.
This introduces a new IOMMU ownership flavour when the user can not
just control the existing IOMMU table but remove/create tables on demand.
If an IOMMU implements take/release_ownership() callbacks, this lets
the user have full control over the IOMMU group. When the ownership
is taken, the platform code removes all the windows so the caller must
create them.
Before returning the ownership back to the platform code, VFIO
unprograms and removes all the tables it created.
This changes IODA2's onwership handler to remove the existing table
rather than manipulating with the existing one. From now on,
iommu_take_ownership() and iommu_release_ownership() are only called
from the vfio_iommu_spapr_tce driver.
Old-style ownership is still supported allowing VFIO to run on older
P5IOC2 and IODA IO controllers.
No change in userspace-visible behaviour is expected. Since it recreates
TCE tables on each ownership change, related kernel traces will appear
more often.
This adds a pnv_pci_ioda2_setup_default_config() which is called
when PE is being configured at boot time and when the ownership is
passed from VFIO to the platform code.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This extends iommu_table_group_ops by a set of callbacks to support
dynamic DMA windows management.
create_table() creates a TCE table with specific parameters.
it receives iommu_table_group to know nodeid in order to allocate
TCE table memory closer to the PHB. The exact format of allocated
multi-level table might be also specific to the PHB model (not
the case now though).
This callback calculated the DMA window offset on a PCI bus from @num
and stores it in a just created table.
set_window() sets the window at specified TVT index + @num on PHB.
unset_window() unsets the window from specified TVT.
This adds a free() callback to iommu_table_ops to free the memory
(potentially a tree of tables) allocated for the TCE table.
create_table() and free() are supposed to be called once per
VFIO container and set_window()/unset_window() are supposed to be
called for every group in a container.
This adds IOMMU capabilities to iommu_table_group such as default
32bit window parameters and others. This makes use of new values in
vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
advertise pagemasks to the userspace.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the moment writing new TCE value to the IOMMU table fails with EBUSY
if there is a valid entry already. However PAPR specification allows
the guest to write new TCE value without clearing it first.
Another problem this patch is addressing is the use of pool locks for
external IOMMU users such as VFIO. The pool locks are to protect
DMA page allocator rather than entries and since the host kernel does
not control what pages are in use, there is no point in pool locks and
exchange()+put_page(oldtce) is sufficient to avoid possible races.
This adds an exchange() callback to iommu_table_ops which does the same
thing as set() plus it returns replaced TCE and DMA direction so
the caller can release the pages afterwards. The exchange() receives
a physical address unlike set() which receives linear mapping address;
and returns a physical address as the clear() does.
This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
for a platform to have exchange() implemented in order to support VFIO.
This replaces iommu_tce_build() and iommu_clear_tce() with
a single iommu_tce_xchg().
This makes sure that TCE permission bits are not set in TCE passed to
IOMMU API as those are to be calculated by platform code from
DMA direction.
This moves SetPageDirty() to the IOMMU code to make it work for both
VFIO ioctl interface in in-kernel TCE acceleration (when it becomes
available later).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
which call in a loop iommu_take_ownership()/iommu_release_ownership()
for every table on the group. As there is just one now, no change in
behaviour is expected.
At the moment the iommu_table struct has a set_bypass() which enables/
disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
which calls this callback when external IOMMU users such as VFIO are
about to get over a PHB.
The set_bypass() callback is not really an iommu_table function but
IOMMU/PE function. This introduces a iommu_table_group_ops struct and
adds take_ownership()/release_ownership() callbacks to it which are
called when an external user takes/releases control over the IOMMU.
This replaces set_bypass() with ownership callbacks as it is not
necessarily just bypass enabling, it can be something else/more
so let's give it more generic name.
The callbacks is implemented for IODA2 only. Other platforms (P5IOC2,
IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
The following patches will replace iommu_take_ownership/
iommu_release_ownership calls in IODA2 with full IOMMU table release/
create.
As we here and touching bypass control, this removes
pnv_pci_ioda2_setup_bypass_pe() as it does not do much
more compared to pnv_pci_ioda2_set_bypass. This moves tce_bypass_base
initialization to pnv_pci_ioda2_setup_dma_pe.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
So far one TCE table could only be used by one IOMMU group. However
IODA2 hardware allows programming the same TCE table address to
multiple PE allowing sharing tables.
This replaces a single pointer to a group in a iommu_table struct
with a linked list of groups which provides the way of invalidating
TCE cache for every PE when an actual TCE table is updated. This adds
pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group()
helpers to manage the list. However without VFIO, it is still going
to be a single IOMMU group per iommu_table.
This changes iommu_add_device() to add a device to a first group
from the group list of a table as it is only called from the platform
init code or PCI bus notifier and at these moments there is only
one group per table.
This does not change TCE invalidation code to loop through all
attached groups in order to simplify this patch and because
it is not really needed in most cases. IODA2 is fixed in a later
patch.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Modern IBM POWERPC systems support multiple (currently two) TCE tables
per IOMMU group (a.k.a. PE). This adds a iommu_table_group container
for TCE tables. Right now just one table is supported.
This defines iommu_table_group struct which stores pointers to
iommu_group and iommu_table(s). This replaces iommu_table with
iommu_table_group where iommu_table was used to identify a group:
- iommu_register_group();
- iommudata of generic iommu_group;
This removes @data from iommu_table as it_table_group provides
same access to pnv_ioda_pe.
For IODA, instead of embedding iommu_table, the new iommu_table_group
keeps pointers to those. The iommu_table structs are allocated
dynamically.
For P5IOC2, both iommu_table_group and iommu_table are embedded into
PE struct. As there is no EEH and SRIOV support for P5IOC2,
iommu_free_table() should not be called on iommu_table struct pointers
so we can keep it embedded in pnv_phb::p5ioc2.
For pSeries, this replaces multiple calls of kzalloc_node() with a new
iommu_pseries_alloc_group() helper and stores the table group struct
pointer into the pci_dn struct. For release, a iommu_table_free_group()
helper is added.
This moves iommu_table struct allocation from SR-IOV code to
the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and
pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized.
This change is here because those lines had to be changed anyway.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is to make extended ownership and multiple groups support patches
simpler for review.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is a pretty mechanical patch to make next patches simpler.
New tce_iommu_unuse_page() helper does put_page() now but it might skip
that after the memory registering patch applied.
As we are here, this removes unnecessary checks for a value returned
by pfn_to_page() as it cannot possibly return NULL.
This moves tce_iommu_disable() later to let tce_iommu_clear() know if
the container has been enabled because if it has not been, then
put_page() must not be called on TCEs from the TCE table. This situation
is not yet possible but it will after KVM acceleration patchset is
applied.
This changes code to work with physical addresses rather than linear
mapping addresses for better code readability. Following patches will
add an xchg() callback for an IOMMU table which will accept/return
physical addresses (unlike current tce_build()) which will eliminate
redundant conversions.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the moment DMA map/unmap requests are handled irrespective to
the container's state. This allows the user space to pin memory which
it might not be allowed to pin.
This adds checks to MAP/UNMAP that the container is enabled, otherwise
-EPERM is returned.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>