If the IOMMU group setup fails, the reset module is not released.
Fixes: b5add544d6 ("vfio, platform: make reset driver a requirement by default")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Acked-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
MAP_DMA ioctls might be called from various threads within a process,
for example when using QEMU, the vCPU threads are often generating
these calls and we therefore take a reference to that vCPU task.
However, QEMU also supports vCPU hotplug on some machines and the task
that called MAP_DMA may have exited by the time UNMAP_DMA is called,
resulting in the mm_struct pointer being NULL and thus a failure to
match against the existing mapping.
To resolve this, we instead take a reference to the thread
group_leader, which has the same mm_struct and resource limits, but
is less likely exit, at least in the QEMU case. A difficulty here is
guaranteeing that the capabilities of the group_leader match that of
the calling thread, which we resolve by tracking CAP_IPC_LOCK at the
time of calling rather than at an indeterminate time in the future.
Potentially this also results in better efficiency as this is now
recorded once per MAP_DMA ioctl.
Reported-by: Xu Yandong <xuyandong2@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Pull aio updates from Al Viro:
"Majority of AIO stuff this cycle. aio-fsync and aio-poll, mostly.
The only thing I'm holding back for a day or so is Adam's aio ioprio -
his last-minute fixup is trivial (missing stub in !CONFIG_BLOCK case),
but let it sit in -next for decency sake..."
* 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
aio: sanitize the limit checking in io_submit(2)
aio: fold do_io_submit() into callers
aio: shift copyin of iocb into io_submit_one()
aio_read_events_ring(): make a bit more readable
aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way
aio: take list removal to (some) callers of aio_complete()
aio: add missing break for the IOCB_CMD_FDSYNC case
random: convert to ->poll_mask
timerfd: convert to ->poll_mask
eventfd: switch to ->poll_mask
pipe: convert to ->poll_mask
crypto: af_alg: convert to ->poll_mask
net/rxrpc: convert to ->poll_mask
net/iucv: convert to ->poll_mask
net/phonet: convert to ->poll_mask
net/nfc: convert to ->poll_mask
net/caif: convert to ->poll_mask
net/bluetooth: convert to ->poll_mask
net/sctp: convert to ->poll_mask
net/tipc: convert to ->poll_mask
...
Bisection by Amadeusz Sławiński implicates this commit leading to bad
page state issues after VM shutdown, likely due to unbalanced page
references. The original commit was intended only as a performance
improvement, therefore revert for offline rework.
Link: https://lkml.org/lkml/2018/6/2/97
Fixes: 356e88ebe4 ("vfio/type1: Improve memory pinning process for raw PFN mapping")
Cc: Jason Cai (Xiang Feng) <jason.cai@linux.alibaba.com>
Reported-by: Amadeusz Sławiński <amade@asmblr.net>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
These abstract out calls to the poll method in preparation for changes
in how we poll.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
The ioeventfd here is actually irqfd handling of an ioeventfd such as
supported in KVM. A user is able to pre-program a device write to
occur when the eventfd triggers. This is yet another instance of
eventfd-irqfd triggering between KVM and vfio. The impetus for this
is high frequency writes to pages which are virtualized in QEMU.
Enabling this near-direct write path for selected registers within
the virtualized page can improve performance and reduce overhead.
Specifically this is initially targeted at NVIDIA graphics cards where
the driver issues a write to an MMIO register within a virtualized
region in order to allow the MSI interrupt to re-trigger.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The iowriteXX/ioreadXX functions assume little endian hardware and
convert to little endian on a write and from little endian on a read.
We currently do our own explicit conversion to negate this. Instead,
add some endian dependent defines to avoid all byte swaps. There
should be no functional change other than big endian systems aren't
penalized with wasted swaps.
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This creates a common helper that we'll use for ioeventfd setup.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When using vfio to pass through a PCIe device (e.g. a GPU card) that
has a huge BAR (e.g. 16GB), a lot of cycles are wasted on memory
pinning because PFNs of PCI BAR are not backed by struct page, and
the corresponding VMA has flag VM_PFNMAP.
With this change, when pinning a region which is a raw PFN mapping,
it can skip unnecessary user memory pinning process, and thus, can
significantly improve VM's boot up time when passing through devices
via VFIO. In my test on a Xeon E5 2.6GHz, the time mapping a 16GB
BAR was reduced from about 0.4s to 1.5us.
Signed-off-by: Jason Cai (Xiang Feng) <jason.cai@linux.alibaba.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This reverts commit 2170dd0431
The intent of commit 2170dd0431 ("vfio-pci: Mask INTx if a device is
not capabable of enabling it") was to disallow the user from seeing
that the device supports INTx if the platform is incapable of enabling
it. The detection of this case however incorrectly includes devices
which natively do not support INTx, such as SR-IOV VFs, and further
discussions reveal gaps even for the target use case.
Reported-by: Arjun Vynipadath <arjun@chelsio.com>
Fixes: 2170dd0431 ("vfio-pci: Mask INTx if a device is not capabable of enabling it")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIO IOMMU type1 currently upmaps IOVA pages synchronously, which requires
IOTLB flushing for every unmapping. This results in large IOTLB flushing
overhead when handling pass-through devices has a large number of mapped
IOVAs. This can be avoided by using the new IOTLB flushing interface.
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
[aw - use LIST_HEAD]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Filesystem-DAX is incompatible with 'longterm' page pinning. Without
page cache indirection a DAX mapping maps filesystem blocks directly.
This means that the filesystem must not modify a file's block map while
any page in a mapping is pinned. In order to prevent the situation of
userspace holding of filesystem operations indefinitely, disallow
'longterm' Filesystem-DAX mappings.
RDMA has the same conflict and the plan there is to add a 'with lease'
mechanism to allow the kernel to notify userspace that the mapping is
being torn down for block-map maintenance. Perhaps something similar can
be put in place for vfio.
Note that xfs and ext4 still report:
"DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
...at mount time, and resolving the dax-dma-vs-truncate problem is one
of the last hurdles to remove that designation.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: kvm@vger.kernel.org
Cc: <stable@vger.kernel.org>
Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
Tested-by: Haozhong Zhang <haozhong.zhang@intel.com>
Fixes: d475c6346a ("dax,ext2: replace XIP read and write with DAX I/O")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:
for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done
with de-mangling cleanups yet to come.
NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.
The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Mask INTx from user if pdev->irq is zero (Alexey Kardashevskiy)
- Capability helper cleanup (Alex Williamson)
- Allow mmaps overlapping MSI-X vector table with region capability
exposing this feature (Alexey Kardashevskiy)
- mdev static cleanups (Xiongwei Song)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJac2tQAAoJECObm247sIsiUr8P/2zrK0G/mVrHWGEjlnlvgYjg
FSozgN0fmc9mJ+Hg8ntfXn4GXuTaCq0uz96eMl9Wy1tMpLs4hoQf20IRojIpqgpb
aVUHKg+7sJUWNsd+u1jSBb64SvQLdTbesfGgL8WLcJB90jWxzaZjEZ3mgKO4Lb88
rNEpiTVoOURDgJo+bOMEnJJ6okmpRLBgw5pqrdLT2BButZg3QfLtcuoY18pFYxc7
INy4YPWPe93aiDGloUrjj6xKRKfTaL7L8KGZBlk4FR5JENDRoOtGEguGcQRl0u8w
IYLIkIE9p172S5bkeCYqawyxAPgQQIk5Wd4buFArg9w7tdWJOkEiCMwSu/LCocR8
CiwZHKOWp7mJWv3XxJGY+rU3nHIuB+IaeKknE6rLXgkLVXED5Ta4pPzrpoUZezsT
yyI5U2BnWNL5ISaWY3i+YQKvFUgPe8Az9Zw4p3zGUJ+zu2QheN7x+BVXT0xENMfk
2sdNFeZwkxmB18pdsJdb+/DpL9yCrS7VxP3IDnAdfR8VIyLE3QFRTnsfNX7F1Uvr
zKBZChuah2osiiM3k3ncwkovsM6iN1ZLVnHUg7xRBmp6WQnBYN02wrTXM/e1UqsS
nKIBYA9fIcpLZ4hXMuiMmk7LFlScl/HngW+h+jWAyoP/j3X0gw/YM5hrPEJEV3fN
WuIPpBYN36+0V9cUDMnN
=Ww3n
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.16-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Mask INTx from user if pdev->irq is zero (Alexey Kardashevskiy)
- Capability helper cleanup (Alex Williamson)
- Allow mmaps overlapping MSI-X vector table with region capability
exposing this feature (Alexey Kardashevskiy)
- mdev static cleanups (Xiongwei Song)
* tag 'vfio-v4.16-rc1' of git://github.com/awilliam/linux-vfio:
vfio: mdev: make a couple of functions and structure vfio_mdev_driver static
vfio-pci: Allow mapping MSIX BAR
vfio: Simplify capability helper
vfio-pci: Mask INTx if a device is not capabable of enabling it
The functions vfio_mdev_probe, vfio_mdev_remove and the structure
vfio_mdev_driver are only used in this file, so make them static.
Clean up sparse warnings:
drivers/vfio/mdev/vfio_mdev.c:114:5: warning: no previous prototype
for 'vfio_mdev_probe' [-Wmissing-prototypes]
drivers/vfio/mdev/vfio_mdev.c:121:6: warning: no previous prototype
for 'vfio_mdev_remove' [-Wmissing-prototypes]
Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
Reviewed-by: Quan Xu <quan.xu0@gmail.com>
Reviewed-by: Liu, Yi L <yi.l.liu@intel.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
By default VFIO disables mapping of MSIX BAR to the userspace as
the userspace may program it in a way allowing spurious interrupts;
instead the userspace uses the VFIO_DEVICE_SET_IRQS ioctl.
In order to eliminate guessing from the userspace about what is
mmapable, VFIO also advertises a sparse list of regions allowed to mmap.
This works fine as long as the system page size equals to the MSIX
alignment requirement which is 4KB. However with a bigger page size
the existing code prohibits mapping non-MSIX parts of a page with MSIX
structures so these parts have to be emulated via slow reads/writes on
a VFIO device fd. If these emulated bits are accessed often, this has
serious impact on performance.
This allows mmap of the entire BAR containing MSIX vector table.
This removes the sparse capability for PCI devices as it becomes useless.
As the userspace needs to know for sure whether mmapping of the MSIX
vector containing data can succeed, this adds a new capability -
VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - which explicitly tells the userspace
that the entire BAR can be mmapped.
This does not touch the MSIX mangling in the BAR read/write handlers as
we are doing this just to enable direct access to non MSIX registers.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw - fixup whitespace, trim function name]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio_info_add_capability() helper requires the caller to pass a
capability ID, which it then uses to fill in header fields, assuming
hard coded versions. This makes for an awkward and rigid interface.
The only thing we want this helper to do is allocate sufficient
space in the caps buffer and chain this capability into the list.
Reduce it to that simple task.
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
At the moment VFIO rightfully assumes that INTx is supported if
the interrupt pin is not set to zero in the device config space.
However if that is not the case (the pin is not zero but pdev->irq is),
vfio_intx_enable() fails.
In order to prevent the userspace from trying to enable INTx when we know
that it cannot work, let's mask the PCI_INTERRUPT_PIN register.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
__poll_t is also used as wait key in some waitqueues.
Verify that wait_..._poll() gets __poll_t as key and
provide a helper for wakeup functions to get back to
that __poll_t value.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.
However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:
----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()
// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
virtual patch
@ depends on patch @
expression E1, E2;
@@
- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)
@ depends on patch @
expression E;
@@
- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds Broadcom FlexRM low-level reset for
VFIO platform.
It will do the following:
1. Disable/Deactivate each FlexRM ring
2. Flush each FlexRM ring
The cleanup sequence for FlexRM rings is adapted from
Broadcom FlexRM mailbox driver.
Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Oza Oza <oza.oza@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
I get a static checker warning about the potential integer overflow if
we add "unmap->iova + unmap->size". The integer overflow isn't really
harmful, but we may as well fix it. Also unmap->size gets truncated to
size_t when we pass it to vfio_find_dma() so we could check for too high
values of that as well.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Clearing very big IOMMU tables can trigger soft lockups. This adds
cond_resched() to allow the scheduler to do context switching when
it decides to.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
MRRS defines the maximum read request size a device is allowed to
make. Drivers will often increase this to allow more data transfer
with a single request. Completions to this request are bound by the
MPS setting for the bus. Aside from device quirks (none known), it
doesn't seem to make sense to set an MRRS value less than MPS, yet
this is a likely scenario given that user drivers do not have a
system-wide view of the PCI topology. Virtualize MRRS such that the
user can set MRRS >= MPS, but use MPS as the floor value that we'll
write to hardware.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With virtual PCI-Express chipsets, we now see userspace/guest drivers
trying to match the physical MPS setting to a virtual downstream port.
Of course a lone physical device surrounded by virtual interconnects
cannot make a correct decision for a proper MPS setting. Instead,
let's virtualize the MPS control register so that writes through to
hardware are disallowed. Userspace drivers like QEMU assume they can
write anything to the device and we'll filter out anything dangerous.
Since mismatched MPS can lead to AER and other faults, let's add it
to the kernel side rather than relying on userspace virtualization to
handle it.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
amba_id are not supposed to change at runtime. All functions
working with const amba_id. So mark the non-const structs as const.
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When the user unbinds the last device of a group from a vfio bus
driver, the devices within that group should be available for other
purposes. We currently have a race that makes this generally, but
not always true. The device can be unbound from the vfio bus driver,
but remaining IOMMU context of the group attached to the container
can result in errors as the next driver configures DMA for the device.
Wait for the group to be detached from the IOMMU backend before
allowing the bus driver remove callback to complete.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In vfio_iommu_group_get() we want to increase the reference
count of the iommu group.
In noiommu case, the group does not exist and is allocated.
iommu_group_add_device() increases the group ref count. However we
then call iommu_group_put() which decrements it.
This leads to a "refcount_t: underflow WARN_ON".
Only decrement the ref count in case of iommu_group_add_device
failure.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If the IOMMU driver advertises 'real' reserved regions for MSIs, but
still includes the software-managed region as well, we are currently
blind to the former and will configure the IOMMU domain to map MSIs into
the latter, which is unlikely to work as expected.
Since it would take a ridiculous hardware topology for both regions to
be valid (which would be rather difficult to support in general), we
should be safe to assume that the presence of any hardware regions makes
the software region irrelevant. However, the IOMMU driver might still
advertise the software region by default, particularly if the hardware
regions are filled in elsewhere by generic code, so it might not be fair
for VFIO to be super-strict about not mixing them. To that end, make
vfio_iommu_has_sw_msi() robust against the presence of both region types
at once, so that we end up doing what is almost certainly right, rather
than what is almost certainly wrong.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
For ARM-based systems with a GICv3 ITS to provide interrupt isolation,
but hardware limitations which are worked around by having MSIs bypass
SMMU translation (e.g. HiSilicon Hip06/Hip07), VFIO neglects to check
for the IRQ_DOMAIN_FLAG_MSI_REMAP capability, (and thus erroneously
demands unsafe_interrupts) if a software-managed MSI region is absent.
Fix this by always checking for isolation capability at both the IRQ
domain and IOMMU domain levels, rather than predicating that on whether
MSIs require an IOMMU mapping (which was always slightly tenuous logic).
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Root complex integrated endpoints do not have a link and therefore may
use a smaller PCIe capability in config space than we expect when
building our config map. Add a case for these to avoid reporting an
erroneous overlap.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Device lock bites again; if a device .remove() callback races a user
calling ioctl(VFIO_GROUP_GET_DEVICE_FD), the unbind request will hold
the device lock, but the user ioctl may have already taken a vfio_device
reference. In the case of a PCI device, the initial open will attempt
to reset the device, which again attempts to get the device lock,
resulting in deadlock. Use the trylock PCI reset interface and return
error on the open path if reset fails due to lock contention.
Link: https://lkml.org/lkml/2017/7/25/381
Reported-by: Wen Congyang <wencongyang2@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
- Include Intel XXV710 in INTx workaround (Alex Williamson)
- Make use of ERR_CAST() for error return (Dan Carpenter)
- Fix vfio_group release deadlock from iommu notifier (Alex Williamson)
- Unset KVM-VFIO attributes only on group match (Alex Williamson)
- Fix release path group/file matching with KVM-VFIO (Alex Williamson)
- Remove unnecessary lock uses triggering lockdep splat (Alex Williamson)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJZZ71TAAoJECObm247sIsi1mUP/1/5kubhHRE/Y11SnX4Bpie0
X6YVbV3WLGuV9jaFMI7EgYJLZ6pBvCRX0CLUzEdrbS4LAlTLQu2GSY5tqhcLJumE
0mV1Wj1jOO9BAur13/sJ+IohbZeK10dtbTkWv+YhrVSpRaLP+ituaKsajEV12WRH
6dxho2bqfvNcTC8yjE+vqe9mU3jXYPGuCx8oYGaDXEsCjzrdbOFnAG0/s/2cWpb+
D1ADHd3020VAHZRnHRBLFfMczza1jqllhSAUfdMw1gRGCQDq3k1XenzVNLLTbtYy
VmEWHa+R/OCfbKVxaDqPsgOTK7x7DKn+Pzb3lWCdQ8v5X+2ubHpVIYjxTDSSTbt3
YJ7a+hNk8AHkFgwS7x8BdOT8mmNGb1NZldjS4dv2VWkfcTnMQnubnYSCzGztto9h
P2THKBil6djPb9S3pCvtKUiHSIZedYZlKofUldrOdGDAZmzLLlf8lTzijGjDYKFM
pQeZC+xeEhZXURipgkH+a+paYgDtKEfwSlABODghjCcJf7S/GbyVPLOKLXzoVb2y
Ml8eGlo4O/cNniQK5faH447ilM7hzS1aG83uGnHTfe8VgKjI7Z5ZSxKOtoEq5bDz
bb91E6GVLKHqT0LVS1YZfrnqK0hX/QAd/sK1REM5nN8JNmPLyLpjv8FaJEhpk1vC
z4At5+pfKM8DYrW3EGmc
=3A4K
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.13-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Include Intel XXV710 in INTx workaround (Alex Williamson)
- Make use of ERR_CAST() for error return (Dan Carpenter)
- Fix vfio_group release deadlock from iommu notifier (Alex Williamson)
- Unset KVM-VFIO attributes only on group match (Alex Williamson)
- Fix release path group/file matching with KVM-VFIO (Alex Williamson)
- Remove unnecessary lock uses triggering lockdep splat (Alex Williamson)
* tag 'vfio-v4.13-rc1' of git://github.com/awilliam/linux-vfio:
vfio: Remove unnecessary uses of vfio_container.group_lock
vfio: New external user group/file match
kvm-vfio: Decouple only when we match a group
vfio: Fix group release deadlock
vfio: Use ERR_CAST() instead of open coding it
vfio/pci: Add Intel XXV710 to hidden INTx devices
The original intent of vfio_container.group_lock is to protect
vfio_container.group_list, however over time it's become a crutch to
prevent changes in container composition any time we call into the
iommu driver backend. This introduces problems when we start to have
more complex interactions, for example when a user's DMA unmap request
triggers a notification to an mdev vendor driver, who responds by
attempting to unpin mappings within that request, re-entering the
iommu backend. We incorrectly assume that the use of read-locks here
allow for this nested locking behavior, but a poorly timed write-lock
could in fact trigger a deadlock.
The current use of group_lock seems to fall into the trap of locking
code, not data. Correct that by removing uses of group_lock that are
not directly related to group_list. Note that the vfio type1 iommu
backend has its own mutex, vfio_iommu.lock, which it uses to protect
itself for each of these interfaces anyway. The group_lock appears to
be a redundancy for these interfaces and type1 even goes so far as to
release its mutex to allow for exactly the re-entrant code path above.
Reported-by: Chuanxiao Dong <chuanxiao.dong@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: stable@vger.kernel.org # v4.10+
At the point where the kvm-vfio pseudo device wants to release its
vfio group reference, we can't always acquire a new reference to make
that happen. The group can be in a state where we wouldn't allow a
new reference to be added. This new helper function allows a caller
to match a file to a group to facilitate this. Given a file and
group, report if they match. Thus the caller needs to already have a
group reference to match to the file. This allows the deletion of a
group without acquiring a new reference.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Cc: stable@vger.kernel.org
If vfio_iommu_group_notifier() acquires a group reference and that
reference becomes the last reference to the group, then vfio_group_put
introduces a deadlock code path where we're trying to unregister from
the iommu notifier chain from within a callout of that chain. Use a
work_struct to release this reference asynchronously.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Cc: stable@vger.kernel.org
Rename:
wait_queue_t => wait_queue_entry_t
'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.
Start sorting this out by renaming it to 'wait_queue_entry_t'.
This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's a small cleanup to use ERR_CAST() here.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
XXV710 has the same broken INTx behavior as the rest of the X/XL710
series, the interrupt status register is not wired to report pending
INTx interrupts, thus we never associate the interrupt to the device.
Extend the device IDs to include these so that we hide that the
device supports INTx at all to the user.
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Highlights include:
- Larger virtual address space on 64-bit server CPUs. By default we use a 128TB
virtual address space, but a process can request access to the full 512TB by
passing a hint to mmap().
- Support for the new Power9 "XIVE" interrupt controller.
- TLB flushing optimisations for the radix MMU on Power9.
- Support for CAPI cards on Power9, using the "Coherent Accelerator Interface
Architecture 2.0".
- The ability to configure the mmap randomisation limits at build and runtime.
- Several small fixes and cleanups to the kprobes code, as well as support for
KPROBES_ON_FTRACE.
- Major improvements to handling of system reset interrupts, correctly treating
them as NMIs, giving them a dedicated stack and using a new hypervisor call
to trigger them, all of which should aid debugging and robustness.
Many fixes and other minor enhancements.
Thanks to:
Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan,
Aneesh Kumar K.V, Anshuman Khandual, Anton Blanchard, Balbir Singh, Ben
Hutchings, Benjamin Herrenschmidt, Bhupesh Sharma, Chris Packham, Christian
Zigotzky, Christophe Leroy, Christophe Lombard, Daniel Axtens, David Gibson,
Gautham R. Shenoy, Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli,
Hamish Martin, Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan,
Mahesh J Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran,
Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell Currey, Sukadev
Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C. Harding, Tyrel Datwyler,
Uma Krishnan, Vaibhav Jain, Vipin K Parashar, Yang Shi.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZDHUMAAoJEFHr6jzI4aWAT7oQALkE2Nj3gjcn1z0SkFhq/1iO
Py9Elmqm4E+L6NKYtBY5dS8xVAJ088ffzERyqJ1FY1LHkB8tn8bWRcMQmbjAFzTI
V4TAzDNI890BN/F4ptrYRwNFxRBHAvZ4NDunTzagwYnwmTzW9PYHmOi4pvWTo3Tw
KFUQ0joLSEgHzyfXxYB3fyj41u8N0FZvhfazdNSqia2Y5Vwwv/ION5jKplDM+09Y
EtVEXFvaKAS1sjbM/d/Jo5rblHfR0D9/lYV10+jjyIokjzslIpyTbnj3izeYoM5V
I4h99372zfsEjBGPPXyM3khL3zizGMSDYRmJHQSaKxjtecS9SPywPTZ8ufO/aSzV
Ngq6nlND+f1zep29VQ0cxd3Jh40skWOXzxJaFjfDT25xa6FbfsWP2NCtk8PGylZ7
EyqTuCWkMgIP02KlX3oHvEB2LRRPCDmRU2zECecRGNJrIQwYC2xjoiVi7Q8Qe8rY
gr7Ib5Jj/a+uiTcCIy37+5nXq2s14/JBOKqxuYZIxeuZFvKYuRUipbKWO05WDOAz
m/pSzeC3J8AAoYiqR0gcSOuJTOnJpGhs7zrQFqnEISbXIwLW+ICumzOmTAiBqOEY
Rt8uW2gYkPwKLrE05445RfVUoERaAjaE06eRMOWS6slnngHmmnRJbf3PcoALiJkT
ediqGEj0/N1HMB31V5tS
=vSF3
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights include:
- Larger virtual address space on 64-bit server CPUs. By default we
use a 128TB virtual address space, but a process can request access
to the full 512TB by passing a hint to mmap().
- Support for the new Power9 "XIVE" interrupt controller.
- TLB flushing optimisations for the radix MMU on Power9.
- Support for CAPI cards on Power9, using the "Coherent Accelerator
Interface Architecture 2.0".
- The ability to configure the mmap randomisation limits at build and
runtime.
- Several small fixes and cleanups to the kprobes code, as well as
support for KPROBES_ON_FTRACE.
- Major improvements to handling of system reset interrupts,
correctly treating them as NMIs, giving them a dedicated stack and
using a new hypervisor call to trigger them, all of which should
aid debugging and robustness.
- Many fixes and other minor enhancements.
Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple,
Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton
Blanchard, Balbir Singh, Ben Hutchings, Benjamin Herrenschmidt,
Bhupesh Sharma, Chris Packham, Christian Zigotzky, Christophe Leroy,
Christophe Lombard, Daniel Axtens, David Gibson, Gautham R. Shenoy,
Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli, Hamish Martin,
Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mahesh J
Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver
O'Halloran, Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell
Currey, Sukadev Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C.
Harding, Tyrel Datwyler, Uma Krishnan, Vaibhav Jain, Vipin K Parashar,
Yang Shi"
* tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits)
powerpc/64s: Power9 has no LPCR[VRMASD] field so don't set it
powerpc/powernv: Fix TCE kill on NVLink2
powerpc/mm/radix: Drop support for CPUs without lockless tlbie
powerpc/book3s/mce: Move add_taint() later in virtual mode
powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body
powerpc/smp: Document irq enable/disable after migrating IRQs
powerpc/mpc52xx: Don't select user-visible RTAS_PROC
powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset()
powerpc/eeh: Clean up and document event handling functions
powerpc/eeh: Avoid use after free in eeh_handle_special_event()
cxl: Mask slice error interrupts after first occurrence
cxl: Route eeh events to all drivers in cxl_pci_error_detected()
cxl: Force context lock during EEH flow
powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TEST
powerpc/xmon: Teach xmon oops about radix vectors
powerpc/mm/hash: Fix off-by-one in comment about kernel contexts ids
powerpc/pseries: Enable VFIO
powerpc/powernv: Fix iommu table size calculation hook for small tables
powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc
powerpc: Add arch/powerpc/tools directory
...
vfio_pin_pages_remote() is typically called to iterate over a range
of memory. Testing CAP_IPC_LOCK is relatively expensive, so it makes
sense to push it up to the caller, which can then repeatedly call
vfio_pin_pages_remote() using that value. This can show nearly a 20%
improvement on the worst case path through VFIO_IOMMU_MAP_DMA with
contiguous page mapping disabled. Testing RLIMIT_MEMLOCK is much more
lightweight, but we bring it along on the same principle and it does
seem to show a marginal improvement.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With vfio_lock_acct() testing the locked memory limit under mmap_sem,
it's redundant to do it here for a single page. We can also reorder
our tests such that we can avoid testing for reserved pages if we're
not doing accounting and let vfio_lock_acct() test the process
CAP_IPC_LOCK. Finally, this function oddly returns 1 on success.
Update to return zero on success, -errno on error. Since the function
only pins a single page, there's no need to return the number of pages
pinned.
N.B. vfio_pin_pages_remote() can pin a large contiguous range of pages
before calling vfio_lock_acct(). If we were to similarly remove the
extra test there, a user could temporarily pin far more pages than
they're allowed.
Suggested-by: Kirti Wankhede <kwankhede@nvidia.com>
Suggested-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If the mmap_sem is contented then the vfio type1 IOMMU backend will
defer locked page accounting updates to a workqueue task. This has a
few problems and depending on which side the user tries to play, they
might be over-penalized for unmaps that haven't yet been accounted or
race the workqueue to enter more mappings than they're allowed. The
original intent of this workqueue mechanism seems to be focused on
reducing latency through the ioctl, but we cannot do so at the cost
of correctness. Remove this workqueue mechanism and update the
callers to allow for failure. We can also now recheck the limit under
write lock to make sure we don't exceed it.
vfio_pin_pages_remote() also now necessarily includes an unwind path
which we can jump to directly if the consecutive page pinning finds
that we're exceeding the user's memory limits. This avoids the
current lazy approach which does accounting and mapping up to the
fault, only to return an error on the next iteration to unwind the
entire vfio_dma.
Cc: stable@vger.kernel.org
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This adds missing checking for kzalloc() return value.
Fixes: 4b6fad7097 ("powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The existing SPAPR TCE driver advertises both VFIO_SPAPR_TCE_IOMMU and
VFIO_SPAPR_TCE_v2_IOMMU types to the userspace and the userspace usually
picks the v2.
Normally the userspace would create a container, attach an IOMMU group
to it and only then set the IOMMU type (which would normally be v2).
However a specific IOMMU group may not support v2, in other words
it may not implement set_window/unset_window/take_ownership/
release_ownership and such a group should not be attached to
a v2 container.
This adds extra checks that a new group can do what the selected IOMMU
type suggests. The userspace can then test the return value from
ioctl(VFIO_SET_IOMMU, VFIO_SPAPR_TCE_v2_IOMMU) and try
VFIO_SPAPR_TCE_IOMMU.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
So far iommu_table obejcts were only used in virtual mode and had
a single owner. We are going to change this by implementing in-kernel
acceleration of DMA mapping requests. The proposed acceleration
will handle requests in real mode and KVM will keep references to tables.
This adds a kref to iommu_table and defines new helpers to update it.
This replaces iommu_free_table() with iommu_tce_table_put() and makes
iommu_free_table() static. iommu_tce_table_get() is not used in this patch
but it will be in the following patch.
Since this touches prototypes, this also removes @node_name parameter as
it has never been really useful on powernv and carrying it for
the pseries platform code to iommu_free_table() seems to be quite
useless as well.
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the moment iommu_table can be disposed by either calling
iommu_table_free() directly or it_ops::free(); the only implementation
of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
iommu_table_free() anyway.
As we are going to have reference counting on tables, we need an unified
way of disposing tables.
This moves it_ops::free() call into iommu_free_table() and makes use
of the latter. The free() callback now handles only platform-specific
data.
As from now on the iommu_free_table() calls it_ops->free(), we need
to have it_ops initialized before calling iommu_free_table() so this
moves this initialization in pnv_pci_ioda2_create_table().
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The introduction of reserved regions has left a couple of rough edges
which we could do with sorting out sooner rather than later. Since we
are not yet addressing the potential dynamic aspect of software-managed
reservations and presenting them at arbitrary fixed addresses, it is
incongruous that we end up displaying hardware vs. software-managed MSI
regions to userspace differently, especially since ARM-based systems may
actually require one or the other, or even potentially both at once,
(which iommu-dma currently has no hope of dealing with at all). Let's
resolve the former user-visible inconsistency ASAP before the ABI has
been baked into a kernel release, in a way that also lays the groundwork
for the latter shortcoming to be addressed by follow-up patches.
For clarity, rename the software-managed type to IOMMU_RESV_SW_MSI, use
IOMMU_RESV_MSI to describe the hardware type, and document everything a
little bit. Since the x86 MSI remapping hardware falls squarely under
this meaning of IOMMU_RESV_MSI, apply that type to their regions as well,
so that we tell the same story to userspace across all platforms.
Secondly, as the various region types require quite different handling,
and it really makes little sense to ever try combining them, convert the
bitfield-esque #defines to a plain enum in the process before anyone
gets the wrong impression.
Fixes: d30ddcaa7b ("iommu: Add a new type field in iommu_resv_region")
Reviewed-by: Eric Auger <eric.auger@redhat.com>
CC: Alex Williamson <alex.williamson@redhat.com>
CC: David Woodhouse <dwmw2@infradead.org>
CC: kvm@vger.kernel.org
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The intent of the original warning is make sure that the mdev vendor
driver has removed any group notifiers at the point where the group
is closed by the user. Theoretically this would be through an
orderly shutdown where any devices are release prior to the group
release. We can't always count on an orderly shutdown, the user can
close the group before the notifier can be removed or the user task
might be killed. We'd like to add this sanity test when the group is
idle and the only references are from the devices within the group
themselves, but we don't have a good way to do that. Instead check
both when the group itself is removed and when the group is opened.
A bit later than we'd prefer, but better than the current over
aggressive approach.
Fixes: ccd46dbae7 ("vfio: support notifier chain in vfio_group")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: <stable@vger.kernel.org> # v4.10
Cc: Jike Song <jike.song@intel.com>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
The APIs that are going to be moved first are:
mm_alloc()
__mmdrop()
mmdrop()
mmdrop_async_fn()
mmdrop_async()
mmget_not_zero()
mmput()
mmput_async()
get_task_mm()
mm_access()
mm_release()
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Kconfig fixes for SPAPR_TCE_IOMMU=n (Michael Ellerman)
- Module softdep rather than request_module to simplify usage from
initrd (Alex Williamson)
- Comment typo fix (Changbin Du)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJYry1kAAoJECObm247sIsieQ8P/0iydvBdW3AACPADQIvfV3+K
PfAj3RwO1AvKB2Uiiql8t+S+qr5WO2Z4oeyOsaapmaOq8fxN76nTtIPUPg9SzkUx
DDFNT+Zl/wFGA+vDfPcDFaFAf35mNTLN2hmqcrkTwLL7flA0ICYZzpT0U3nP6aEt
go10DkQXQ1aGJkTfBluIqe0FTBd2EZ3+/mJbjiaHRlV5CcXGEtAAe8IC+XJoJJds
AZ1RQiLauE6Rj4t2JQegZJH+ntX7zPk329jD0DjxZq9y08uEqdyeXWj31O8SRucB
hD4QGN6NS8Qv+TlpdMVnsv6E3oZhUZIObH2p0nhUJ8ZpdEkTZoR/73rJKipz6svX
9TuuiIJL2ibZ40rEpABIx1N0ykHiUmXgtCuVfjiKSXmsdZYSu7bK/CbOVBGvS+WE
dKbgob44QozgD34wt5mv9by1jnxxeDodbIfkXE3p4b2TKaaFFnpCG2Yex6XOjjCU
T4peLwxTyszqVDdghJ5l8S2661+UqS+VZQXSXMQRg4fLLx+HjB0JPo/NcfjNFhUE
/RwXvOoVRRn0EKTX5v5Ty5YVx+/KWGm3wAsa1Ar+YGATKn3tS91Vz4cKYZA/bdUc
GKkGI5FrARjtWF860cBtq/KWk2qYy7EzCdVfkk56e5s4clrZwjbet6W/RmZGSvVn
R+XAN7GucRg/aMJyrwJM
=6Gze
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.11-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Kconfig fixes for SPAPR_TCE_IOMMU=n (Michael Ellerman)
- Module softdep rather than request_module to simplify usage from
initrd (Alex Williamson)
- Comment typo fix (Changbin Du)
* tag 'vfio-v4.11-rc1' of git://github.com/awilliam/linux-vfio:
vfio: fix a typo in comment of function vfio_pin_pages
vfio: Replace module request with softdep
vfio/mdev: Use a module softdep for vfio_mdev
vfio: Fix build break when SPAPR_TCE_IOMMU=n
Correct the description that 'unpinned' -> 'pinned'.
Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The changes include:
* KVM PCIe/MSI passthrough support on ARM/ARM64
* Introduction of a core representation for individual hardware
iommus
* Support for IOMMU privileged mappings as supported by some
ARM IOMMUS
* 16-bit SID support for ARM-SMMUv2
* Stream table optimization for ARM-SMMUv3
* Various fixes and other small improvements
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJYqw3hAAoJECvwRC2XARrjPy0P/35ykfHIAESJuF+72ziaoAYA
ZvMrli8rGq7n+ntaIGPx9rV+hZTUSF8V2bfsYV7SAn5iYuViXZqvOtC3BAEp6GNC
cdMeQfqXoHiWVMdXdOihzk+6YCQvBxqPOvUtYFqVhOo3Yrz8Dc71KsKvrTndEUVY
f7bXHKssVONkWMga9lIVDgEefG5VyJPEQaxJXB9ymLHXbwWOcISe1lgtkrzFSxSH
H9YNI07Tfcxfn6rN8jGmcYFYM58xwBicpB4HBw5uytMBYAsxqTEsx4X5dGpOF6RH
cFW9nby+9ZlcTMyuWXKAck3o8df2ZC1xiSjnz+DHQdBPFiFNqIL3PVUcaz9PnF2e
e6Y+DA3s+jykeiCvi2K0Z9RwTg7t8S5spel+UCeNVSnIjE9pqZNLF8vsDjF17zuR
+zcFm7RVI397QVQGp0dbqhtxnwqt/3CX/wlzpvuNdEZa4vwujpcnM9tfl6gyFrF8
awK9Fj5ryAn4DEiM+8yiRHwLrU5ij1cfc8jQdqleEB2ca7Wv3g1uhhS0QTXOFY9u
A7ygOna25U1EcOwjC6ebjiEL115ZEOrXo+eChhzCHoUEHCVxL+L/NAMEsUcMqPIw
3XsHhru0HbXgd5O5wHX39s2je8G3+ElqQwy8Ja3DimV6tvon7yaKCXy9QU+2aa1u
3r53R/0mW1ijtOfK+I0b
=5b3I
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU UPDATES from Joerg Roedel:
- KVM PCIe/MSI passthrough support on ARM/ARM64
- introduction of a core representation for individual hardware iommus
- support for IOMMU privileged mappings as supported by some ARM IOMMUS
- 16-bit SID support for ARM-SMMUv2
- stream table optimization for ARM-SMMUv3
- various fixes and other small improvements
* tag 'iommu-updates-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (61 commits)
vfio/type1: Fix error return code in vfio_iommu_type1_attach_group()
iommu: Remove iommu_register_instance interface
iommu/exynos: Make use of iommu_device_register interface
iommu/mediatek: Make use of iommu_device_register interface
iommu/msm: Make use of iommu_device_register interface
iommu/arm-smmu: Make use of the iommu_register interface
iommu: Add iommu_device_set_fwnode() interface
iommu: Make iommu_device_link/unlink take a struct iommu_device
iommu: Add sysfs bindings for struct iommu_device
iommu: Introduce new 'struct iommu_device'
iommu: Rename struct iommu_device
iommu: Rename iommu_get_instance()
iommu: Fix static checker warning in iommu_insert_device_resv_regions
iommu: Avoid unnecessary assignment of dev->iommu_fwspec
iommu/mediatek: Remove bogus 'select' statements
iommu/dma: Remove bogus dma_supported() implementation
iommu/ipmmu-vmsa: Restrict IOMMU Domain Geometry to 32-bit address space
iommu/vt-d: Don't over-free page table directories
iommu/vt-d: Tylersburg isoch identity map check is done too late.
iommu/vt-d: Fix some macros that are incorrectly specified in intel-iommu
...
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.
Fixes: 5d70499218 ("vfio/type1: Allow transparent MSI IOVA allocation")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Rather than doing a module request from within the init function, add
a soft dependency on the available IOMMU backend drivers. This makes
the dependency visible to userspace when picking modules for the
ram disk.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Use an explicit module softdep rather than a request module call such
that the dependency is exposed to userspace. This allows us to more
easily support modules loaded at initrd time.
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently the kconfig logic for VFIO_IOMMU_SPAPR_TCE and VFIO_SPAPR_EEH
is broken when SPAPR_TCE_IOMMU=n. Leading to:
warning: (VFIO) selects VFIO_IOMMU_SPAPR_TCE which has unmet direct dependencies (VFIO && SPAPR_TCE_IOMMU)
warning: (VFIO) selects VFIO_IOMMU_SPAPR_TCE which has unmet direct dependencies (VFIO && SPAPR_TCE_IOMMU)
drivers/vfio/vfio_iommu_spapr_tce.c:113:8: error: implicit declaration of function 'mm_iommu_find'
This stems from the fact that VFIO selects VFIO_IOMMU_SPAPR_TCE, and
although it has an if clause, the condition is not correct.
We could fix it by doing select VFIO_IOMMU_SPAPR_TCE if SPAPR_TCE_IOMMU,
but the cleaner fix is to drop the selects and tie VFIO_IOMMU_SPAPR_TCE
to the value of VFIO, and express the dependencies in only once place.
Do the same for VFIO_SPAPR_EEH.
The end result is that the values of VFIO_IOMMU_SPAPR_TCE and
VFIO_SPAPR_EEH follow the value of VFIO, except when SPAPR_TCE_IOMMU=n
and/or EEH=n. Which is exactly what we want to happen.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If a container already has a group attached, attaching a new group
should just program already created IOMMU tables to the hardware via
the iommu_table_group_ops::set_window() callback.
However commit 6f01cc692a ("vfio/spapr: Add a helper to create
default DMA window") did not just simplify the code but also removed
the set_window() calls in the case of attaching groups to a container
which already has tables so it broke VFIO PCI hotplug.
This reverts set_window() bits in tce_iommu_take_ownership_ddw().
Fixes: 6f01cc692a ("vfio/spapr: Add a helper to create default DMA window")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Commit d9c728949d ("vfio/spapr: Postpone default window creation")
added an additional exit to the VFIO_IOMMU_SPAPR_TCE_CREATE case and
made it possible to return from tce_iommu_ioctl() without unlocking
container->lock; this fixes the issue.
Fixes: d9c728949d ("vfio/spapr: Postpone default window creation")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In case the IOMMU translates MSI transactions (typical case
on ARM), we check MSI remapping capability at IRQ domain
level. Otherwise it is checked at IOMMU level.
At this stage the arm-smmu-(v3) still advertise the
IOMMU_CAP_INTR_REMAP capability at IOMMU level. This will be
removed in subsequent patches.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
When attaching a group to the container, check the group's
reserved regions and test whether the IOMMU translates MSI
transactions. If yes, we initialize an IOVA allocator through
the iommu_get_msi_cookie API. This will allow the MSI IOVAs
to be transparently allocated on MSI controller's compose().
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Tomasz Nowicki <tomasz.nowicki@caviumnetworks.com>
Tested-by: Bharat Bhushan <bharat.bhushan@nxp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Using has_capability() rather than ns_capable(), we're no longer using
this header.
Cc: Jike Song <jike.song@intel.com>
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Before the mdev enhancement type1 iommu used capable() to test the
capability of current task; in the course of mdev development a
new requirement, testing for another task other than current, was
raised. ns_capable() was used for this purpose, however it still
tests current, the only difference is, in a specified namespace.
Fix it by using has_capability() instead, which tests the cap for
specified task in init_user_ns, the same namespace as capable().
Cc: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Jike Song <jike.song@intel.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Here, pci_iomap can fail, handle this case release selected
pci regions and return -ENOMEM.
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Using ancient compilers (gcc-4.5 or older) on ARM, we get a link
failure with the vfio-pci driver:
ERROR: "__aeabi_lcmp" [drivers/vfio/pci/vfio-pci.ko] undefined!
The reason is that the compiler tries to do a comparison of
a 64-bit range. This changes it to convert to a 32-bit number
explicitly first, as newer compilers do for themselves.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Abstract access to mdev_device so that we can define which interfaces
are public rather than relying on comments in the structure.
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Rather than hoping for good behavior by marking some elements
internal, enforce it by making the entire structure private and
creating an accessor function for the one useful external field.
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Add an mdev_ prefix so we're not poluting the namespace so much.
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Zhi Wang <zhi.a.wang@intel.com>
Cc: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Using the mtty mdev sample driver we can generate a remove race by
starting one shell that continuously creates mtty devices and several
other shells all attempting to remove devices, in my case four remove
shells. The fault occurs in mdev_remove_sysfs_files() where the
passed type arg is NULL, which suggests we've received a struct device
in mdev_device_remove() but it's in some sort of teardown state. The
solution here is to make use of the accidentally unused list_head on
the mdev_device such that the mdev core keeps a list of all the mdev
devices. This allows us to validate that we have a valid mdev before
we start removal, remove it from the list to prevent others from
working on it, and if the vendor driver refuses to remove, we can
re-add it to the list.
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As part of the mdev support, type1 now gets a task reference per
vfio_dma and uses that to get an mm reference for the task while
working on accounting. That's correct, but it's not fast. For some
paths, like vfio_pin_pages_remote(), we know we're only called from
user context, so we can restore the lighter weight calls. In other
cases, we're effectively already testing whether we're in the stored
task context elsewhere, extend this vfio_lock_acct() as well.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed by: Kirti Wankhede <kwankhede@nvidia.com>
Highlights include:
- Support for the kexec_file_load() syscall, which is a prereq for secure and
trusted boot.
- Prevent kernel execution of userspace on P9 Radix (similar to SMEP/PXN).
- Sort the exception tables at build time, to save time at boot, and store
them as relative offsets to save space in the kernel image & memory.
- Allow building the kernel with thin archives, which should allow us to build
an allyesconfig once some other fixes land.
- Build fixes to allow us to correctly rebuild when changing the kernel endian
from big to little or vice versa.
- Plumbing so that we can avoid doing a full mm TLB flush on P9 Radix.
- Initial stack protector support (-fstack-protector).
- Support for dumping the radix (aka. Linux) and hash page tables via debugfs.
- Fix an oops in cxl coredump generation when cxl_get_fd() is used.
- Freescale updates from Scott: "Highlights include 8xx hugepage support,
qbman fixes/cleanup, device tree updates, and some misc cleanup."
- Many and varied fixes and minor enhancements as always.
Thanks to:
Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual,
Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz, Christophe Jaillet,
Christophe Leroy, Denis Kirjanov, Elimar Riesebieter, Frederic Barrat,
Gautham R. Shenoy, Geliang Tang, Geoff Levand, Jack Miller, Johan Hovold,
Lars-Peter Clausen, Libin, Madhavan Srinivasan, Michael Neuling, Nathan
Fontenot, Naveen N. Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin,
Rashmica Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJYU4YSAAoJEFHr6jzI4aWAC4gQALtIAqqPon0Cd5b/FVVcMbW7
mMqB2b/0FGEl5GoRTzGUDaQqElilm6AEVfHO86C7DFji/a6olneFfw87iz+mtWuZ
JvrNq68ZiSnoeszdUy4MgtXFLb5sTzNMev4skaHfjI9E5CepWBoR0zH4G+kNVnd5
WSgudv8Cq4Px+MEuTOigt3QYjHzZ3cw/XNOOm9c+oGj+PDW4O9UItVI+S1WLoey4
rAB2nRcLMDPuwfRQC9XsF3zEbkv4h1dEXo/EBRuRpcF+0lLTzFw1lv1WE8OxlUmS
kAXbty3dIytBfSbtJT0c0Ps6sfQ4HFhu6ZV2fjnxNTz2KDkBIN7LBYHmBYiqY9oZ
9zvbUWtfiTu5ocfRtTq7rC/Hcj4Kbr9S9F/FvXR0WyDsKgu4xxAovqC3gcn6YjYK
Rr1tcCI4nUzyhVJVmd+OEhUvc5JbFy9aGage+YeOyejfvvSbXIunaxWlPjoDkvim
Vjl+UKU8gw51XFssqY5ZBi/HNlMFKYedLpMFp/fItnLglhj50V0eFWkpDgdSCYom
vo9ifPLZx8n8m8De3H7TV4E0F4gCHcTeqZdu7tW9AAUVM6iLJcDLm3asGmtNh21t
snOHNOJ5QSIno6ezUUg29T6VBjbPh46fdJJSlIZrEe8OzLZ1haGyttf0tD00PQvY
Z2W/m3gxafnOeGgBqvyv
=xOzf
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights include:
- Support for the kexec_file_load() syscall, which is a prereq for
secure and trusted boot.
- Prevent kernel execution of userspace on P9 Radix (similar to
SMEP/PXN).
- Sort the exception tables at build time, to save time at boot, and
store them as relative offsets to save space in the kernel image &
memory.
- Allow building the kernel with thin archives, which should allow us
to build an allyesconfig once some other fixes land.
- Build fixes to allow us to correctly rebuild when changing the
kernel endian from big to little or vice versa.
- Plumbing so that we can avoid doing a full mm TLB flush on P9
Radix.
- Initial stack protector support (-fstack-protector).
- Support for dumping the radix (aka. Linux) and hash page tables via
debugfs.
- Fix an oops in cxl coredump generation when cxl_get_fd() is used.
- Freescale updates from Scott: "Highlights include 8xx hugepage
support, qbman fixes/cleanup, device tree updates, and some misc
cleanup."
- Many and varied fixes and minor enhancements as always.
Thanks to:
Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V, Anshuman
Khandual, Anton Blanchard, Balbir Singh, Bartlomiej Zolnierkiewicz,
Christophe Jaillet, Christophe Leroy, Denis Kirjanov, Elimar
Riesebieter, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff
Levand, Jack Miller, Johan Hovold, Lars-Peter Clausen, Libin,
Madhavan Srinivasan, Michael Neuling, Nathan Fontenot, Naveen N.
Rao, Nicholas Piggin, Pan Xinhui, Peter Senna Tschudin, Rashmica
Gupta, Rui Teng, Russell Currey, Scott Wood, Simon Guo, Suraj
Jitindar Singh, Thiago Jung Bauermann, Tobias Klauser, Vaibhav Jain"
[ And thanks to Michael, who took time off from a new baby to get this
pull request done. - Linus ]
* tag 'powerpc-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (174 commits)
powerpc/fsl/dts: add FMan node for t1042d4rdb
powerpc/fsl/dts: add sg_2500_aqr105_phy4 alias on t1024rdb
powerpc/fsl/dts: add QMan and BMan nodes on t1024
powerpc/fsl/dts: add QMan and BMan nodes on t1023
soc/fsl/qman: test: use DEFINE_SPINLOCK()
powerpc/fsl-lbc: use DEFINE_SPINLOCK()
powerpc/8xx: Implement support of hugepages
powerpc: get hugetlbpage handling more generic
powerpc: port 64 bits pgtable_cache to 32 bits
powerpc/boot: Request no dynamic linker for boot wrapper
soc/fsl/bman: Use resource_size instead of computation
soc/fsl/qe: use builtin_platform_driver
powerpc/fsl_pmc: use builtin_platform_driver
powerpc/83xx/suspend: use builtin_platform_driver
powerpc/ftrace: Fix the comments for ftrace_modify_code
powerpc/perf: macros for power9 format encoding
powerpc/perf: power9 raw event format encoding
powerpc/perf: update attribute_group data structure
powerpc/perf: factor out the event format field
powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJYUt1vAAoJEFmIoMA60/r8abgP/3R+5Lsk5/kfAHk5/2Mtqbvg
mZ0eDUpY9GbUeMjSq84Nr2H8u7d+1AJCCu8KtDJYZCmjZpnSp2SuE2PS5JoGC7zC
fintD24jlIF4/J5+HeVXXmbfr3xATxvpTuiSLEi8sLBRJ3KRIswhMSwoPwOyeTQw
v/EclWKPGYcI5Zp0oigY9/Jd3q3lQ17KXppi/0dDoLh7PNOFvEHItXWzmf++u/NP
iYT9R1xmzEsy0/HRd6hiwPT2xA8YsAXxgobhHooUgh1FWmZ02Tg1WjgDemOW4lVh
kNIUcsLczh7wZCceogrrJ+pwb9+NyyIyKuHPv6OG3ieyz1IZdznaj1fAE5HJYiPo
eVS7cP1S6DyV3Y5qFj5F2dSRS7T4GXdXG5mNhmeCpUHs0vfzSCG36jLmhTy8UIxs
1rCf5oFa+uU9q0okfH8VtcGOXqWjGgyxTSGGfF71HUMLnPbsci2fxC2cO6svzIX7
wDY0uxOzpyMIYMuQR6iz7VqvAwEaZ+7pfMIrWWdDcQ9/5tCNJ49cLuKaThPL4bVu
juiGBQtnTLg8tjrhjDL9tQiJpuVIweVXyyQ1fvZoVXkMLlhVCF2ttirvwFUit2PB
84OlevQZ+9QdE/qalrWbv4qzhesuiwu0avkzjGoqg6tWTF0epu2AHI2vqy6UBYEG
tcfJPEcz1019PKZNSvWy
=ut0k
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
"PCI changes:
- add support for PCI on ARM64 boxes with ACPI. We already had this
for theoretical spec-compliant hardware; now we're adding quirks
for the actual hardware (Cavium, HiSilicon, Qualcomm, X-Gene)
- add runtime PM support for hotplug ports
- enable runtime suspend for Intel UHCI that uses platform-specific
wakeup signaling
- add yet another host bridge registration interface. We hope this is
extensible enough to subsume the others
- expose device revision in sysfs for DRM
- to avoid device conflicts, make sure any VF BAR updates are done
before enabling the VF
- avoid unnecessary link retrains for ASPM
- allow INTx masking on Mellanox devices that support it
- allow access to non-standard VPD for Chelsio devices
- update Broadcom iProc support for PAXB v2, PAXC v2, inbound DMA,
etc
- update Rockchip support for max-link-speed
- add NVIDIA Tegra210 support
- add Layerscape LS1046a support
- update R-Car compatibility strings
- add Qualcomm MSM8996 support
- remove some uninformative bootup messages"
* tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (115 commits)
PCI: Enable access to non-standard VPD for Chelsio devices (cxgb3)
PCI: Expand "VPD access disabled" quirk message
PCI: pciehp: Remove loading message
PCI: hotplug: Remove hotplug core message
PCI: Remove service driver load/unload messages
PCI/AER: Log AER IRQ when claiming Root Port
PCI/AER: Log errors with PCI device, not PCIe service device
PCI/AER: Remove unused version macros
PCI/PME: Log PME IRQ when claiming Root Port
PCI/PME: Drop unused support for PMEs from Root Complex Event Collectors
PCI: Move config space size macros to pci_regs.h
x86/platform/intel-mid: Constify mid_pci_platform_pm
PCI/ASPM: Don't retrain link if ASPM not possible
PCI: iproc: Skip check for legacy IRQ on PAXC buses
PCI: pciehp: Leave power indicator on when enabling already-enabled slot
PCI: pciehp: Prioritize data-link event over presence detect
PCI: rcar: Add gen3 fallback compatibility string for pcie-rcar
PCI: rcar: Use gen2 fallback compatibility last
PCI: rcar-gen2: Use gen2 fallback compatibility last
PCI: rockchip: Move the deassert of pm/aclk/pclk after phy_init()
..
Patch series "mm: unexport __get_user_pages_unlocked()".
This patch series continues the cleanup of get_user_pages*() functions
taking advantage of the fact we can now pass gup_flags as we please.
It firstly adds an additional 'locked' parameter to
get_user_pages_remote() to allow for its callers to utilise
VM_FAULT_RETRY functionality. This is necessary as the invocation of
__get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
this and no other existing higher level function would allow it to do
so.
Secondly existing callers of __get_user_pages_unlocked() are replaced
with the appropriate higher-level replacement -
get_user_pages_unlocked() if the current task and memory descriptor are
referenced, or get_user_pages_remote() if other task/memory descriptors
are referenced (having acquiring mmap_sem.)
This patch (of 2):
Add a int *locked parameter to get_user_pages_remote() to allow
VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().
Taking into account the previous adjustments to get_user_pages*()
functions allowing for the passing of gup_flags, we are now in a
position where __get_user_pages_unlocked() need only be exported for his
ability to allow VM_FAULT_RETRY behaviour, this adjustment allows us to
subsequently unexport __get_user_pages_unlocked() as well as allowing
for future flexibility in the use of get_user_pages_remote().
[sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move PCI configuration space size macros (PCI_CFG_SPACE_SIZE and
PCI_CFG_SPACE_EXP_SIZE) from drivers/pci/pci.h to
include/uapi/linux/pci_regs.h so they can be used by more drivers and
eliminate duplicate definitions.
[bhelgaas: Expand comment to include PCI-X details]
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Passing zero for the size to vfio_find_dma() isn't compatible with
matching the start address of an existing vfio_dma. Doing so triggers a
corner case. In vfio_find_dma(), when the start address is equal to
dma->iova and size is 0, check for the end of search range makes it to
take wrong side of RB-tree. That fails the search even though the address
is present in mapped dma ranges.
In functions pin_pages and unpin_pages, the iova which is being searched
is base address of page to be pinned or unpinned. So here size should be
set to PAGE_SIZE, as argument to vfio_find_dma().
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Passing zero for the size to vfio_find_dma() isn't compatible with
matching the start address of an existing vfio_dma. Doing so triggers a
corner case. In vfio_find_dma(), when the start address is equal to
dma->iova and size is 0, check for the end of search range makes it to
take wrong side of RB-tree. That fails the search even though the address
is present in mapped dma ranges. Due to this, in vfio_dma_do_unmap(),
while checking boundary conditions, size should be set to 1 for verifying
start address of unmap range.
vfio_find_dma() is also used to verify last address in unmap range with
size = 0, but in that case address to be searched is calculated with
start + size - 1 and so it works correctly.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
[aw: changelog tweak]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
mdev vendor driver should unregister the iommu notifier since the vfio
iommu can persist beyond the attachment of the mdev group. WARN_ON will
show warning if vendor driver doesn't unregister the notifier and is
forced to follow the implementations steps.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
At the moment the userspace tool is expected to request pinning of
the entire guest RAM when VFIO IOMMU SPAPR v2 driver is present.
When the userspace process finishes, all the pinned pages need to
be put; this is done as a part of the userspace memory context (MM)
destruction which happens on the very last mmdrop().
This approach has a problem that a MM of the userspace process
may live longer than the userspace process itself as kernel threads
use userspace process MMs which was runnning on a CPU where
the kernel thread was scheduled to. If this happened, the MM remains
referenced until this exact kernel thread wakes up again
and releases the very last reference to the MM, on an idle system this
can take even hours.
This moves preregistered regions tracking from MM to VFIO; insteads of
using mm_iommu_table_group_mem_t::used, tce_container::prereg_list is
added so each container releases regions which it has pre-registered.
This changes the userspace interface to return EBUSY if a memory
region is already registered in a container. However it should not
have any practical effect as the only userspace tool available now
does register memory region once per container anyway.
As tce_iommu_register_pages/tce_iommu_unregister_pages are called
under container->lock, this does not need additional locking.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In some situations the userspace memory context may live longer than
the userspace process itself so if we need to do proper memory context
cleanup, we better have tce_container take a reference to mm_struct and
use it later when the process is gone (@current or @current->mm is NULL).
This references mm and stores the pointer in the container; this is done
in a new helper - tce_iommu_mm_set() - when one of the following happens:
- a container is enabled (IOMMU v1);
- a first attempt to pre-register memory is made (IOMMU v2);
- a DMA window is created (IOMMU v2).
The @mm stays referenced till the container is destroyed.
This replaces current->mm with container->mm everywhere except debug
prints.
This adds a check that current->mm is the same as the one stored in
the container to prevent userspace from making changes to a memory
context of other processes.
DMA map/unmap ioctls() do not check for @mm as they already check
for @enabled which is set after tce_iommu_mm_set() is called.
This does not reference a task as multiple threads within the same mm
are allowed to ioctl() to vfio and supposedly they will have same limits
and capabilities and if they do not, we'll just fail with no harm made.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We are going to allow the userspace to configure container in
one memory context and pass container fd to another so
we are postponing memory allocations accounted against
the locked memory limit. One of previous patches took care of
it_userspace.
At the moment we create the default DMA window when the first group is
attached to a container; this is done for the userspace which is not
DDW-aware but familiar with the SPAPR TCE IOMMU v2 in the part of memory
pre-registration - such client expects the default DMA window to exist.
This postpones the default DMA window allocation till one of
the folliwing happens:
1. first map/unmap request arrives;
2. new window is requested;
This adds noop for the case when the userspace requested removal
of the default window which has not been created yet.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There is already a helper to create a DMA window which does allocate
a table and programs it to the IOMMU group. However
tce_iommu_take_ownership_ddw() did not use it and did these 2 calls
itself to simplify error path.
Since we are going to delay the default window creation till
the default window is accessed/removed or new window is added,
we need a helper to create a default window from all these cases.
This adds tce_iommu_create_default_window(). Since it relies on
a VFIO container to have at least one IOMMU group (for future use),
this changes tce_iommu_attach_group() to add a group to the container
first and then call the new helper.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The iommu_table struct manages a hardware TCE table and a vmalloc'd
table with corresponding userspace addresses. Both are allocated when
the default DMA window is created and this happens when the very first
group is attached to a container.
As we are going to allow the userspace to configure container in one
memory context and pas container fd to another, we have to postpones
such allocations till a container fd is passed to the destination
user process so we would account locked memory limit against the actual
container user constrainsts.
This postpones the it_userspace array allocation till it is used first
time for mapping. The unmapping patch already checks if the array is
allocated.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This changes mm_iommu_xxx helpers to take mm_struct as a parameter
instead of getting it from @current which in some situations may
not have a valid reference to mm.
This changes helpers to receive @mm and moves all references to @current
to the caller, including checks for !current and !current->mm;
checks in mm_iommu_preregistered() are removed as there is no caller
yet.
This moves the mm_iommu_adjust_locked_vm() call to the caller as
it receives mm_iommu_table_group_mem_t but it needs mm.
This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Beyond vfio_iommu events, users might also be interested in
vfio_group events. For example, if a vfio_group is used along
with Qemu/KVM, whenever kvm pointer is set to/cleared from the
vfio_group, users could be notified.
Currently only VFIO_GROUP_NOTIFY_SET_KVM supported.
Cc: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Jike Song <jike.song@intel.com>
[aw: remove use of new typedef]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Currently vfio_register_notifier assumes that there is only one
notifier chain, which is in vfio_iommu. However, the user might
also be interested in events other than vfio_iommu, for example,
vfio_group. Refactor vfio_{un}register_notifier implementation
to make it feasible.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Jike Song <jike.song@intel.com>
[aw: merge with commit 816ca69ea9c7 ("vfio: Fix handling of error returned by 'vfio_group_get_from_dev()'"), remove typedef]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
'vfio_group_get_from_dev()' seems to return only NULL on error, not an
error pointer.
Fixes: 2169037dc3 ("vfio iommu: Added pin and unpin callback functions to vfio_iommu_driver_ops")
Fixes: c086de818d ("vfio iommu: Add blocking notifier to notify DMA_UNMAP")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Capability header next field is an offset relative to the start of
the INFO buffer. tmp->next is assigned the proper value but iterations
implemented in vfio_info_cap_add and vfio_info_cap_shift use next
as an offset between the headers. When coping with multiple capabilities
this leads to an Oops.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As of commit d97ffe2368 ("PCI: Fix return value from
pci_user_{read,write}_config_*()") it's unnecessary to call
pcibios_err_to_errno() to fixup the return value from these functions.
pcibios_err_to_errno() already does simple passthrough of -errno values,
therefore no functional change is expected.
[aw: changelog]
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Vendor driver using mediated device framework would use same mechnism to
validate and prepare IRQs. Introducing this function to reduce code
replication in multiple drivers.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Update msix_sparse_mmap_cap() to use vfio_info_add_capability()
Update region type capability to use vfio_info_add_capability()
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Vendor driver using mediated device framework should use
vfio_info_add_capability() to add capabilities.
Introduced this function to reduce code duplication in vendor drivers.
vfio_info_cap_shift() manipulated a data buffer to add an offset to each
element in a chain. This data buffer is documented in a uapi header.
Changing vfio_info_cap_shift symbol to be available to all drivers.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Added blocking notifier to IOMMU TYPE1 driver to notify vendor drivers
about DMA_UNMAP.
Exported two APIs vfio_register_notifier() and vfio_unregister_notifier().
Notifier should be registered, if external user wants to use
vfio_pin_pages()/vfio_unpin_pages() APIs to pin/unpin pages.
Vendor driver should use VFIO_IOMMU_NOTIFY_DMA_UNMAP action to invalidate
mappings.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
Mediated device only uses IOMMU APIs, the underlying hardware can be
managed by an IOMMU domain.
Aim of this change is:
- To use most of the code of TYPE1 IOMMU driver for mediated devices
- To support direct assigned device and mediated device in single module
This change adds pin and unpin support for mediated device to TYPE1 IOMMU
backend module. More details:
- Domain for external user is tracked separately in vfio_iommu structure.
It is allocated when group for first mdev device is attached.
- Pages pinned for external domain are tracked in each vfio_dma structure
for that iova range.
- Page tracking rb-tree in vfio_dma keeps <iova, pfn, ref_count>. Key of
rb-tree is iova, but it actually aims to track pfns.
- On external pin request for an iova, page is pinned once, if iova is
already pinned and tracked, ref_count is incremented.
- External unpin request unpins pages only when ref_count is 0.
- Pinned pages list is used to find pfn from iova and then unpin it.
WARN_ON is added if there are entires in pfn_list while detaching the
group and releasing the domain.
- Page accounting is updated to account in its address space where the
pages are pinned/unpinned, i.e dma->task
- Accouting for mdev device is only done if there is no iommu capable
domain in the container. When there is a direct device assigned to the
container and that domain is iommu capable, all pages are already pinned
during DMA_MAP.
- Page accouting is updated on hot plug and unplug mdev device and pass
through device.
Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- Linux VM hot plug and unplug vGPU device while GPU pass through device
exist
- Linux VM hot plug and unplug GPU pass through device while vGPU device
exist
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add task structure to vfio_dma structure. Task structure is used for:
- During DMA_UNMAP, same task who mapped it or other task who shares same
address space is allowed to unmap, otherwise unmap fails.
QEMU maps few iova ranges initially, then fork threads and from the child
thread calls DMA_UNMAP on previously mapped iova. Since child shares same
address space, DMA_UNMAP is successful.
- Avoid accessing struct mm while process is exiting by acquiring
reference of task's mm during page accounting.
- It is also used to get task mlock capability and rlimit for mlock.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Update arguments of vaddr_get_pfn() to take struct mm_struct *mm as input
argument.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Added APIs for pining and unpining set of pages. These call back into
backend iommu module to actually pin and unpin pages.
Added two new callback functions to struct vfio_iommu_driver_ops. Backend
IOMMU module that supports pining and unpinning pages for mdev devices
should provide these functions.
Renamed static functions in vfio_type1_iommu.c to resolve conflicts
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This change rearrange functions to have common function to increment
container_users
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch rearranges functions to get vfio_group from device
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_mdev driver registers with mdev core driver.
mdev core driver creates mediated device and calls probe routine of
vfio_mdev driver for each device.
Probe routine of vfio_mdev driver adds mediated device to VFIO core module
This driver forms a shim layer that pass through VFIO devices operations
to vendor driver for mediated devices.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Design for Mediated Device Driver:
Main purpose of this driver is to provide a common interface for mediated
device management that can be used by different drivers of different
devices.
This module provides a generic interface to create the device, add it to
mediated bus, add device to IOMMU group and then add it to vfio group.
Below is the high Level block diagram, with Nvidia, Intel and IBM devices
as example, since these are the devices which are going to actively use
this module as of now.
+---------------+
| |
| +-----------+ | mdev_register_driver() +--------------+
| | | +<------------------------+ __init() |
| | mdev | | | |
| | bus | +------------------------>+ |<-> VFIO user
| | driver | | probe()/remove() | vfio_mdev.ko | APIs
| | | | | |
| +-----------+ | +--------------+
| |
| MDEV CORE |
| MODULE |
| mdev.ko |
| +-----------+ | mdev_register_device() +--------------+
| | | +<------------------------+ |
| | | | | nvidia.ko |<-> physical
| | | +------------------------>+ | device
| | | | callback +--------------+
| | Physical | |
| | device | | mdev_register_device() +--------------+
| | interface | |<------------------------+ |
| | | | | i915.ko |<-> physical
| | | +------------------------>+ | device
| | | | callback +--------------+
| | | |
| | | | mdev_register_device() +--------------+
| | | +<------------------------+ |
| | | | | ccw_device.ko|<-> physical
| | | +------------------------>+ | device
| | | | callback +--------------+
| +-----------+ |
+---------------+
Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:
/**
* struct mdev_driver - Mediated device's driver
* @name: driver name
* @probe: called when new device created
* @remove:called when device removed
* @driver:device driver structure
*
**/
struct mdev_driver {
const char *name;
int (*probe) (struct device *dev);
void (*remove) (struct device *dev);
struct device_driver driver;
};
Mediated bus driver for mdev device should use this interface to register
and unregister with core driver respectively:
int mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);
Mediated bus driver is responsible to add/delete mediated devices to/from
VFIO group when devices are bound and unbound to the driver.
2. Physical device driver interface
This interface provides vendor driver the set APIs to manage physical
device related work in its driver. APIs are :
* dev_attr_groups: attributes of the parent device.
* mdev_attr_groups: attributes of the mediated device.
* supported_type_groups: attributes to define supported type. This is
mandatory field.
* create: to allocate basic resources in vendor driver for a mediated
device. This is mandatory to be provided by vendor driver.
* remove: to free resources in vendor driver when mediated device is
destroyed. This is mandatory to be provided by vendor driver.
* open: open callback of mediated device
* release: release callback of mediated device
* read : read emulation callback.
* write: write emulation callback.
* ioctl: ioctl callback.
* mmap: mmap emulation callback.
Drivers should use these interfaces to register and unregister device to
mdev core driver respectively:
extern int mdev_register_device(struct device *dev,
const struct parent_ops *ops);
extern void mdev_unregister_device(struct device *dev);
There are no locks to serialize above callbacks in mdev driver and
vfio_mdev driver. If required, vendor driver can have locks to serialize
above APIs in their driver.
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Reviewed-by: Jike Song <jike.song@intel.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The VFIO_DEVICE_SET_IRQS ioctl did not sufficiently sanitize
user-supplied integers, potentially allowing memory corruption. This
patch adds appropriate integer overflow checks, checks the range bounds
for VFIO_IRQ_SET_DATA_NONE, and also verifies that only single element
in the VFIO_IRQ_SET_DATA_TYPE_MASK bitmask is set.
VFIO_IRQ_SET_ACTION_TYPE_MASK is already correctly checked later in
vfio_pci_set_irqs_ioctl().
Furthermore, a kzalloc is changed to a kcalloc because the use of a
kzalloc with an integer multiplication allowed an integer overflow
condition to be reached without this patch. kcalloc checks for overflow
and should prevent a similar occurrence.
Signed-off-by: Vlad Tsyrklevich <vlad@tsyrklevich.net>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Simplify the interrupt setup by using the new PCI layer helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The MSI/X shutdown path can gratuitously enable INTx, which is not
something we want to happen if we're dealing with broken INTx device.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We use a BAR restore trick to try to detect when a user has performed
a device reset, possibly through FLR or other backdoors, to put things
back into a working state. This is important for backdoor resets, but
we can actually just virtualize the "front door" resets provided via
PCIe and AF FLR. Set these bits as virtualized + writable, allowing
the default write to set them in vconfig, then we can simply check the
bit, perform an FLR of our own, and clear the bit. We don't actually
have the granularity in PCI to specify the type of reset we want to
do, but generally devices don't implement both PCIe and AF FLR and
we'll favor these over other types of reset, so we should generally
lineup. We do test whether the device provides the requested FLR type
to stay consistent with hardware capabilities though.
This seems to fix several instance of devices getting into bad states
with userspace drivers, like dpdk, running inside a VM.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Greg Rose <grose@lightfleet.com>
We get a few warnings when building kernel with W=1:
drivers/vfio/platform/vfio_platform_common.c:76:5: warning: no previous prototype for 'vfio_platform_acpi_call_reset' [-Wmissing-prototypes]
drivers/vfio/platform/vfio_platform_common.c:98:6: warning: no previous prototype for 'vfio_platform_acpi_has_reset' [-Wmissing-prototypes]
drivers/vfio/platform/vfio_platform_common.c:640:5: warning: no previous prototype for 'vfio_platform_of_probe' [-Wmissing-prototypes]
drivers/vfio/platform/reset/vfio_platform_amdxgbe.c:59:5: warning: no previous prototype for 'vfio_platform_amdxgbe_reset' [-Wmissing-prototypes]
drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c:60:5: warning: no previous prototype for 'vfio_platform_calxedaxgmac_reset' [-Wmissing-prototypes]
....
In fact, these functions are only used in the file in which they are
declared and don't need a declaration, but can be made static.
so this patch marks these functions with 'static'.
Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There are multiple cases in vfio_pci_set_ctx_trigger_single() where
we assume we can safely read from our data pointer without actually
checking whether the user has passed any data via the count field.
VFIO_IRQ_SET_DATA_NONE in particular is entirely broken since we
attempt to pull an int32_t file descriptor out before even checking
the data type. The other data types assume the data pointer contains
one element of their type as well.
In part this is good news because we were previously restricted from
doing much sanitization of parameters because it was missed in the
past and we didn't want to break existing users. Clearly DATA_NONE
is completely broken, so it must not have any users and we can fix
it up completely. For DATA_BOOL and DATA_EVENTFD, we'll just
protect ourselves, returning error when count is zero since we
previously would have oopsed.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reported-by: Chris Thompson <the_cartographer@hotmail.com>
Cc: stable@vger.kernel.org
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Release call is ignoring the return code from reset call and can
potentially continue even though reset call failed.
If reset_required module parameter is set, this patch is going
to validate the return code and will cause stack dump with
WARN_ON and warn the user of failure.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Open call is ignoring the return code from reset call and can
potentially continue even though reset call failed.
If reset_required module parameter is set, this patch is going
to validate the return code and will abort open if reset fails.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The code was allowing platform devices to be used without a supporting
VFIO reset driver. The hardware can be left in some inconsistent state
after a guest machine abort.
The reset driver will put the hardware back to safe state and disable
interrupts before returning the control back to the host machine.
Adding a new reset_required kernel module option to platform VFIO drivers.
The default value is true for the DT and ACPI based drivers.
The reset requirement value for AMBA drivers is set to false and is
unchangeable to maintain the existing functionality.
New requirements are:
1. A reset function needs to be implemented by the corresponding driver
via DT/ACPI.
2. The reset function needs to be discovered via DT/ACPI.
The probe of the driver will fail if any of the above conditions are
not satisfied.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The device tree code checks for the presence of a reset driver and calls
the of_reset function pointer by looking up the reset driver as a module.
ACPI defines _RST method to perform device level reset. After the _RST
method is executed, the OS can resume using the device. _RST method is
expected to stop DMA transfers and IRQs.
This patch introduces two functions as vfio_platform_acpi_has_reset and
vfio_platform_acpi_call_reset. The has reset method is used to declare
reset capability via the ioctl flag VFIO_DEVICE_FLAGS_RESET. The call
reset function is used to execute the _RST ACPI method.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Getting ready to bring out extra debug information to the caller
so that more verbose information can be printed when an error is
observed.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The code is using the compatible DT string to associate a reset driver
with the actual device itself. The compatible string does not exist on
ACPI based systems. HID is the unique identifier for a device driver
instead.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Creating a new function to determine if this driver supports reset
function or not. This is an attempt to abstract device tree calls
from the rest of the code.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The reset call sequence seems to replicate itself multiple times
across the file. Grouping them together for maintenance reasons.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Renaming the reset function to of_reset as it is only used
by the device tree based platforms.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio group should be released after
the vfio_group_try_dissolve_container call.
The code should not rely on someone else to hold
a reference on the group.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Current vfio-pci implementation disallows to mmap
sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio
page may be shared with other BARs. This will cause some
performance issues when we passthrough a PCI device with
this kind of BARs. Guest will be not able to handle the mmio
accesses to the BARs which leads to mmio emulations in host.
However, not all sub-page BARs will share page with other BARs.
We should allow to mmap the sub-page MMIO BARs which we can
make sure will not share page with other BARs.
This patch adds support for this case. And we try to add a
dummy resource to reserve the remainder of the page which
hot-add device's BAR might be assigned into. But it's not
necessary to handle the case when the BAR is not page aligned.
Because we can't expect the BAR will be assigned into the same
location in a page in guest when we passthrough the BAR. And
it's hard to access this BAR in userspace because we have
no way to get the BAR's location in a page.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio No-IOMMU mode was supported by this
'commit 03a76b60f8 ("vfio: Include No-IOMMU mode")',
but it only support vfio-pci.
Using vfio_iommu_group_get/put, but not iommu_group_get/put,
the platform devices can be exposed to userspace with
CONFIG_VFIO_NOIOMMU and the "enable_unsafe_noiommu_mode"
option enabled.
From 'commit 03a76b60f8 ("vfio: Include No-IOMMU mode")',
"This should make it very clear that this mode is not safe.
Additionally, CAP_SYS_RAWIO privileges are necessary to work
with groups and containers using this mode. Groups making
use of this support are named /dev/vfio/noiommu-$GROUP and
can only make use of the special VFIO_NOIOMMU_IOMMU for the
container. Use of this mode, specifically binding a device
without a native IOMMU group to a VFIO bus driver will taint
the kernel and should therefore not be considered supported."
Signed-off-by: Peng Fan <van.freenix@gmail.com>
Cc: Eric Auger <eric.auger@linaro.org>
Cc: Baptiste Reynal <b.reynal@virtualopensystems.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The size of the VPD area is not necessarily 4-byte aligned, so a
pci_vpd_read() might return less than 4 bytes. Zero our buffer and
accept anything other than an error. Intel X710 NICs exercise this.
Fixes: 4e1a635552 ("vfio/pci: Use kernel VPD access functions")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This function cannot actually be called with npage = 0, so in practice
this doesn't return an uninitialized value.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Both the INTx and MSI/X disable paths do an eventfd_ctx_put() for the
trigger eventfd before calling vfio_virqfd_disable() any potential
mask and unmask eventfds. This opens a use-after-free race where an
inopportune irqfd can reference the freed signalling eventfd. Reorder
to avoid this possibility.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
- Hide INTx on certain known broken devices (Alex Williamson)
- Additional backdoor reset detection (Alex Williamson)
- Remove unused iommudata reference (Alexey Kardashevskiy)
- Use cfg_size to avoid probing extended config space (Alexey Kardashevskiy)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJXRc8hAAoJECObm247sIsigWAP/R+q+UnOHXd7cvSKpI33p2IP
ur09xwV9GktQViWz4tclFdfk8cScq86UX8I5e/jkeFVCXT7e5FU7FhKqJi7dPbp1
Yh3g5fGUkk6a0B3gebIq3qY02aLwJF17WZg0R/KHkQT0B/yQuTQqdF37Xh9rGv0S
c7sgQn1X5nmO7jVDYYKO3SwEgmZWAcBz4Ht2XpqCdSrmhsse9OxlnmOkieM5BNUz
rQhnziaeJE/Ya+y/A74XicbGforyThvNyJs/anJnPEE89773SGVQy/Jdlz4Lwji7
XImPuj4AT9duTgX9HaD38xIpFOsAKDfZ6sClsICkIvhbs232UXuiMxcPszDA97c0
7MxSJcLVr9fKB6zatq2JWhGDp7C6ylUxapEI9PFCV6gE5OYZRM7KD+hm32oVnuPv
rSzPoqnm0Sudu8SO6n46QUZAifp+mX9MNhqzkXGR/YlBHhB1L3/QcczyekY0eBbj
vJ2htebrg0qNQn4G9n4ygMm19r53ew/Q+pO2y7y4TdOr+gZNW1Wj7uOezMpvDOOB
hiy+HkJ24MCTfAGNgjpjjCot/o608+QT5H+y8SR7vT1IK4shOSE4rYfvo6jlzRQp
9FRTolGhYqrtih+zB5R7eLghtvlDp4lN0gDCSgWGHM7e2rMLxaomzoF1VNQ8eKJZ
iUJZ8jE2QrdGDpKssRuF
=fGnT
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.7-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Hide INTx on certain known broken devices (Alex Williamson)
- Additional backdoor reset detection (Alex Williamson)
- Remove unused iommudata reference (Alexey Kardashevskiy)
- Use cfg_size to avoid probing extended config space (Alexey
Kardashevskiy)
* tag 'vfio-v4.7-rc1' of git://github.com/awilliam/linux-vfio:
vfio_pci: Test for extended capabilities if config space > 256 bytes
vfio_iommu_spapr_tce: Remove unneeded iommu_group_get_iommudata
vfio/pci: Add test for BAR restore
vfio/pci: Hide broken INTx support from user
Highlights:
- Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
- Live patching support for ppc64le (also merged via livepatching.git)
Various cleanups & minor fixes from:
- Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie, Lennart
Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael
Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras, Rashmica Gupta,
Russell Currey, Suraj Jitindar Singh, Thiago Jung Bauermann, Valentin
Rothberg, Vipin K Parashar.
General:
- Update LMB associativity index during DLPAR add/remove from Nathan Fontenot
- Fix branching to OOL handlers in relocatable kernel from Hari Bathini
- Add support for userspace Power9 copy/paste from Chris Smart
- Always use STRICT_MM_TYPECHECKS from Michael Ellerman
- Add mask of possible MMU features from Michael Ellerman
PCI:
- Enable pass through of NVLink to guests from Alexey Kardashevskiy
- Cleanups in preparation for powernv PCI hotplug from Gavin Shan
- Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
- Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
- Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell" from Guilherme G. Piccoli
- Remove the dependency on EEH struct in DDW mechanism from Guilherme G. Piccoli
selftests:
- Test cp_abort during context switch from Chris Smart
- Add several tests for transactional memory support from Rashmica Gupta
perf:
- Add support for sampling interrupt register state from Anju T
- Add support for unwinding perf-stackdump from Chandan Kumar
cxl:
- Configure the PSL for two CAPI ports on POWER8NVL from Philippe Bergheaud
- Allow initialization on timebase sync failures from Frederic Barrat
- Increase timeout for detection of AFU mmio hang from Frederic Barrat
- Handle num_of_processes larger than can fit in the SPA from Ian Munsie
- Ensure PSL interrupt is configured for contexts with no AFU IRQs from Ian Munsie
- Add kernel API to allow a context to operate with relocate disabled from Ian Munsie
- Check periodically the coherent platform function's state from Christophe Lombard
Freescale:
- Updates from Scott: "Contains 86xx fixes, minor device tree fixes, an erratum
workaround, and a kconfig dependency fix."
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXPsGzAAoJEFHr6jzI4aWAVoAP/iKdrDe0eYHlVAE9SqnbsiZs
lgDxdsC8P3fsmP1G9o/HkKhC82zHl/La8Ztz8dtqa+LkSzbfliWP1ztJsI7GsBFo
tyCKzWnX9Rwvd3meHu/o/SQ29TNLm/PbPyyRqpj5QPbJ8XCXkAXR7ZZZqjvcMsJW
/AgIr7Cgf53tl9oZzzl/c7CnNHhMq+NBdA71vhWtUx+T97wfJEGyKW6HhZyHDbEU
iAki7fu77ZpEqC/Fh9swf0dCGBJ+a132NoMVo0AdV7EQLznUYlQpQEqa+1PyHZOP
/ArOzf2mDg6m3PfCo1eiB07v8PnVZ3llEUbVAJNg3GUxbE4SHrqq/kwm0iElm3p/
DvFxerCwdX9vmskJX4wDs+pSZRabXYj9XVMptsgFzA4joWrqqb7mBHqaort88YcY
YSljEt1bHyXmiJ+dBya40qARsWUkCVN7ZgEzdxckq0KI3w7g2tqpqIbO2lClWT6t
B3GpqQ4jp34+d1M14FB91fIGK7tMvOhSInE0Mv9+tPvRsepXqiiU/SwdAtRlr3m2
zs/K+4FYcVjJ3Rmpgc+tI38PbZxHe212I35YN6L1LP+4ZfAtzz0NyKdooTIBtkbO
19pX4WbBjKq8zK+YutrySncBIrbnI6VjW51vtRhgVKZliPFO/6zKagyU6FbxM+E5
udQES+t3F/9gvtxgxtDe
=YvyQ
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Highlights:
- Support for Power ISA 3.0 (Power9) Radix Tree MMU from Aneesh Kumar K.V
- Live patching support for ppc64le (also merged via livepatching.git)
Various cleanups & minor fixes from:
- Aaro Koskinen, Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Chris Smart, Daniel Axtens, Frederic Barrat, Gavin Shan, Ian Munsie,
Lennart Sorensen, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring,
Michael Ellerman, Oliver O'Halloran, Paul Gortmaker, Paul Mackerras,
Rashmica Gupta, Russell Currey, Suraj Jitindar Singh, Thiago Jung
Bauermann, Valentin Rothberg, Vipin K Parashar.
General:
- Update LMB associativity index during DLPAR add/remove from Nathan
Fontenot
- Fix branching to OOL handlers in relocatable kernel from Hari Bathini
- Add support for userspace Power9 copy/paste from Chris Smart
- Always use STRICT_MM_TYPECHECKS from Michael Ellerman
- Add mask of possible MMU features from Michael Ellerman
PCI:
- Enable pass through of NVLink to guests from Alexey Kardashevskiy
- Cleanups in preparation for powernv PCI hotplug from Gavin Shan
- Don't report error in eeh_pe_reset_and_recover() from Gavin Shan
- Restore initial state in eeh_pe_reset_and_recover() from Gavin Shan
- Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
from Guilherme G Piccoli
- Remove the dependency on EEH struct in DDW mechanism from Guilherme
G Piccoli
selftests:
- Test cp_abort during context switch from Chris Smart
- Add several tests for transactional memory support from Rashmica
Gupta
perf:
- Add support for sampling interrupt register state from Anju T
- Add support for unwinding perf-stackdump from Chandan Kumar
cxl:
- Configure the PSL for two CAPI ports on POWER8NVL from Philippe
Bergheaud
- Allow initialization on timebase sync failures from Frederic Barrat
- Increase timeout for detection of AFU mmio hang from Frederic
Barrat
- Handle num_of_processes larger than can fit in the SPA from Ian
Munsie
- Ensure PSL interrupt is configured for contexts with no AFU IRQs
from Ian Munsie
- Add kernel API to allow a context to operate with relocate disabled
from Ian Munsie
- Check periodically the coherent platform function's state from
Christophe Lombard
Freescale:
- Updates from Scott: "Contains 86xx fixes, minor device tree fixes,
an erratum workaround, and a kconfig dependency fix."
* tag 'powerpc-4.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (192 commits)
powerpc/86xx: Fix PCI interrupt map definition
powerpc/86xx: Move pci1 definition to the include file
powerpc/fsl: Fix build of the dtb embedded kernel images
powerpc/fsl: Fix rcpm compatible string
powerpc/fsl: Remove FSL_SOC dependency from FSL_LBC
powerpc/fsl-pci: Add a workaround for PCI 5 errata
powerpc/fsl: Fix SPI compatible on t208xrdb and t1040rdb
powerpc/powernv/npu: Add PE to PHB's list
powerpc/powernv: Fix insufficient memory allocation
powerpc/iommu: Remove the dependency on EEH struct in DDW mechanism
Revert "powerpc/eeh: Fix crash in eeh_add_device_early() on Cell"
powerpc/eeh: Drop unnecessary label in eeh_pe_change_owner()
powerpc/eeh: Ignore handlers in eeh_pe_reset_and_recover()
powerpc/eeh: Restore initial state in eeh_pe_reset_and_recover()
powerpc/eeh: Don't report error in eeh_pe_reset_and_recover()
Revert "powerpc/powernv: Exclude root bus in pnv_pci_reset_secondary_bus()"
powerpc/powernv/npu: Enable NVLink pass through
powerpc/powernv/npu: Rework TCE Kill handling
powerpc/powernv/npu: Add set/unset window helpers
powerpc/powernv/ioda2: Export debug helper pe_level_printk()
...
PCI-Express spec says that reading 4 bytes at offset 100h should return
zero if there is no extended capability so VFIO reads this dword to
know if there are extended capabilities.
However it is not always possible to access the extended space so
generic PCI code in pci_cfg_space_size_ext() checks if
pci_read_config_dword() can read beyond 100h and if the check fails,
it sets the config space size to 100h.
VFIO does its own extended capabilities check by reading at offset 100h
which may produce 0xffffffff which VFIO treats as the extended config
space presense and calls vfio_ecap_init() which fails to parse
capabilities (which is expected) but right before the exit, it writes
zero at offset 100h which is beyond the buffer allocated for
vdev->vconfig (which is 256 bytes) which leads to random memory
corruption.
This makes VFIO only check for the extended capabilities if
the discovered config size is more than 256 bytes.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We are going to have multiple different types of PHB on the same system
with POWER8 + NVLink and PHBs will have different IOMMU ops. However
we only really care about one callback - create_table - so we can
relax the compatibility check here.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Many IOMMUs support multiple page table formats, meaning that any given
domain may only support a subset of the hardware page sizes presented in
iommu_ops->pgsize_bitmap. There are also certain use-cases where the
creator of a domain may want to control which page sizes are used, for
example to force the use of hugepage mappings to reduce pagetable walk
depth.
To this end, add a per-domain pgsize_bitmap to represent the subset of
page sizes actually in use, to make it possible for domains with
different requirements to coexist.
Signed-off-by: Will Deacon <will.deacon@arm.com>
[rm: hijacked and rebased original patch with new commit message]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This removes iommu_group_get_iommudata() as the result is never used.
As this is a minor cleanup, no change in behavior is expected.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If a device is reset without the memory or i/o bits enabled in the
command register we may not detect it, potentially leaving the device
without valid BAR programming. Add an additional test to check the
BARs on each write to the command register.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
INTx masking has two components, the first is that we need the ability
to prevent the device from continuing to assert INTx. This is
provided via the DisINTx bit in the command register and is the only
thing we can really probe for when testing if INTx masking is
supported. The second component is that the device needs to indicate
if INTx is asserted via the interrupt status bit in the device status
register. With these two features we can generically determine if one
of the devices we own is asserting INTx, signal the user, and mask the
interrupt while the user services the device.
Generally if one or both of these components is broken we resort to
APIC level interrupt masking, which requires an exclusive interrupt
since we have no way to determine the source of the interrupt in a
shared configuration. This often makes it difficult or impossible to
configure the system for userspace use of the device, for an interrupt
mode that the user may not need.
One possible configuration of broken INTx masking is that the DisINTx
support is fully functional, but the interrupt status bit never
signals interrupt assertion. In this case we do have the ability to
prevent the device from asserting INTx, but lack the ability to
identify the interrupt source. For this case we can simply pretend
that the device lacks INTx support entirely, keeping DisINTx set on
the physical device, virtualizing this bit for the user, and
virtualizing the interrupt pin register to indicate no INTx support.
We already support virtualization of the DisINTx bit and already
virtualize the interrupt pin for platforms without INTx support. By
tying these components together, setting DisINTx on open and reset,
and identifying devices broken in this particular way, we can provide
support for them w/o the handicap of APIC level INTx masking.
Intel i40e (XL710/X710) 10/20/40GbE NICs have been identified as being
broken in this specific way. We leave the vfio-pci.nointxmask option
as a mechanism to bypass this support, enabling INTx on the device
with all the requirements of APIC level masking.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Various enablers for assignment of Intel graphics devices and future
support of vGPU devices (Alex Williamson). This includes
- Handling the vfio type1 interface as an API rather than a specific
implementation, allowing multiple type1 providers.
- Capability chains, similar to PCI device capabilities, that allow
extending ioctls. Extensions here include device specific regions
and sparse mmap descriptions. The former is used to expose non-PCI
regions for IGD, including the OpRegion (particularly the Video
BIOS Table), and read only PCI config access to the host and LPC
bridge as drivers often depend on identifying those devices.
Sparse mmaps here are used to describe the MSIx vector table,
which vfio has always protected from mmap, but never had an API to
explicitly define that protection. In future vGPU support this is
expected to allow the description of PCI BARs that may mix direct
access and emulated access within a single region.
- The ability to expose the shadow ROM as an option ROM as IGD use
cases may rely on the ROM even though the physical device does not
make use of a PCI option ROM BAR.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJW6aT1AAoJECObm247sIsiiP4P/1xf7Z08/2QWVFQzex9CLcZk
+/iJlyb/fTpPVQE+NTKPz3Qh5h6ZhSd/57s85IUqq0T6tgVPkoGx8kkyCjBaw2y1
yMezXZlQqJdZqGzQNI4OiHWvO+/vGxYKjQMfUnMlDM6dJgz4lGncGFoSouFPa3Vp
mB12hGxrlk1cfIdb+C1KbfZcEdS0WhtigQtz8flBKgOfO+hYWmUO+CClJBhVw8Z4
RNcWNAxFfLuwUPVsPb6uOLG2g65SC2vmQ9k0Tnknf1znV3PFFVjITf0aM6uChLNP
S3SgqtPX+6yOFyCuSEs8UKhhmCbeQmAyKgt5BpxV3Rw3OMP4PsVAehr82vQmSj6g
2o96pR2s8MDPBr8eG7gdRe4DQe3PonpLkpDfaghcpYqhkGEqNVeW5/GjiOzGQqD3
xMshzxJ1Iz7DOHkQRUVqOfupDB0TusJmTVKwvXe6yIYL9pjkUS/sbN9U563HYSES
JTV68TMj0VKfKwD3XKYXvGH3km1sL4i5NMlAUrsDtsMkGlXEswoGbj82Mjc8+jUo
BvWQTJb+kouJQ88VhsO2abg1UrO9E6u82iHFHy9fEObxE8KH7pvROlS93ihMT1Wv
WQNuUcltdpHMRVX0BDknaPs3YtC3/TGgm3RcU5SZPbv/ys1471ZmJxMlAAKcfITr
SuvkMTYElF5b1pigv46c
=/lJn
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.6-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
"Various enablers for assignment of Intel graphics devices and future
support of vGPU devices (Alex Williamson). This includes
- Handling the vfio type1 interface as an API rather than a specific
implementation, allowing multiple type1 providers.
- Capability chains, similar to PCI device capabilities, that allow
extending ioctls. Extensions here include device specific regions
and sparse mmap descriptions. The former is used to expose non-PCI
regions for IGD, including the OpRegion (particularly the Video
BIOS Table), and read only PCI config access to the host and LPC
bridge as drivers often depend on identifying those devices.
Sparse mmaps here are used to describe the MSIx vector table, which
vfio has always protected from mmap, but never had an API to
explicitly define that protection. In future vGPU support this is
expected to allow the description of PCI BARs that may mix direct
access and emulated access within a single region.
- The ability to expose the shadow ROM as an option ROM as IGD use
cases may rely on the ROM even though the physical device does not
make use of a PCI option ROM BAR"
* tag 'vfio-v4.6-rc1' of git://github.com/awilliam/linux-vfio:
vfio/pci: return -EFAULT if copy_to_user fails
vfio/pci: Expose shadow ROM as PCI option ROM
vfio/pci: Intel IGD host and LCP bridge config space access
vfio/pci: Intel IGD OpRegion support
vfio/pci: Enable virtual register in PCI config space
vfio/pci: Add infrastructure for additional device specific regions
vfio: Define device specific region type capability
vfio/pci: Include sparse mmap capability for MSI-X table regions
vfio: Define sparse mmap capability for regions
vfio: Add capability chain helpers
vfio: Define capability chains
vfio: If an IOMMU backend fails, keep looking
vfio/pci: Fix unsigned comparison overflow
Calling return copy_to_user(...) in an ioctl will not
do the right thing if there's a pagefault:
copy_to_user returns the number of bytes not copied
in this case.
Fix up vfio to do
return copy_to_user(...)) ?
-EFAULT : 0;
everywhere.
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The copy_to_user() function returns the number of bytes that were not
copied but we want to return -EFAULT on error here.
Fixes: 188ad9d6cb ('vfio/pci: Include sparse mmap capability for MSI-X table regions')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Integrated graphics may have their ROM shadowed at 0xc0000 rather than
implement a PCI option ROM. Make this ROM appear to the user using
the ROM BAR.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Provide read-only access to PCI config space of the PCI host bridge
and LPC bridge through device specific regions. This may be used to
configure a VM with matching register contents to satisfy driver
requirements. Providing this through the vfio file descriptor removes
an additional userspace requirement for access through pci-sysfs and
removes the CAP_SYS_ADMIN requirement that doesn't appear to apply to
the specific devices we're accessing.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is the first consumer of vfio device specific resource support,
providing read-only access to the OpRegion for Intel graphics devices.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Typically config space for a device is mapped out into capability
specific handlers and unassigned space. The latter allows direct
read/write access to config space. Sometimes we know about registers
living in this void space and would like an easy way to virtualize
them, similar to how BAR registers are managed. To do this, create
one more pseudo (fake) PCI capability to be handled as purely virtual
space. Reads and writes are serviced entirely from virtual config
space.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add support for additional regions with indexes started after the
already defined fixed regions. Device specific code can register
these regions with the new vfio_pci_register_dev_region() function.
The ops structure per region currently only includes read/write
access and a release function, allowing automatic cleanup when the
device is closed. mmap support is only missing here because it's
not needed by the first user queued for this support.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio-pci has never allowed the user to directly mmap the MSI-X vector
table, but we've always relied on implicit knowledge of the user that
they cannot do this. Now that we have capability chains that we can
expose in the region info ioctl and a sparse mmap capability that
represents the sub-areas within the region that can be mmap'd, we can
make the mmap constraints more explicit.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Allow sub-modules to easily reallocate a buffer for managing
capability chains for info ioctls.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Consider an IOMMU to be an API rather than an implementation, we might
have multiple implementations supporting the same API, so try another
if one fails. The expectation here is that we'll really only have
one implementation per device type. For instance the existing type1
driver works with any PCI device where the IOMMU API is available. A
vGPU vendor may have a virtual PCI device which provides DMA isolation
and mapping through other mechanisms, but can re-use userspaces that
make use of the type1 VFIO IOMMU API. This allows that to work.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed versus unsigned comparisons are implicitly cast to unsigned,
which result in a couple possible overflows. For instance (start +
count) might overflow and wrap, getting through our validation test.
Also when unwinding setup, -1 being compared as unsigned doesn't
produce the intended stop condition. Fix both of these and also fix
vfio_msi_set_vector_signal() to validate parameters before using the
vector index, though none of the callers should pass bad indexes
anymore.
Reported-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Using iommu_present() to determine whether an IOMMU group is real or
fake has some problems. First, apparently Power systems don't
register an IOMMU on the device bus, so the groups and containers get
marked as noiommu and then won't bind to their actual IOMMU driver.
Second, I expect we'll run into the same issue as we try to support
vGPUs through vfio, since they're likely to emulate this behavior of
creating an IOMMU group on a virtual device and then providing a vfio
IOMMU backend tailored to the sort of isolation they provide, which
won't necessarily be fully compatible with the IOMMU API.
The solution here is to use the existing iommudata interface to IOMMU
groups, which allows us to easily identify the fake groups we've
created for noiommu purposes. The iommudata we set is purely
arbitrary since we're only comparing the address, so we use the
address of the noiommu switch itself.
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Anatoly Burakov <anatoly.burakov@intel.com>
Tested-by: Santosh Shukla <sshukla@mvista.com>
Fixes: 03a76b60f8 ("vfio: Include No-IOMMU mode")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The flags entry is there to tell the user that some
optional information is available.
Since we report the iova_pgsizes signal it to the user
by setting the flags to VFIO_IOMMU_INFO_PGSIZES.
Signed-off-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system. There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines. However, there are still those users
that want userspace drivers even under those conditions. The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has. In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.
This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver. This
should make it very clear that this mode is not safe. Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode. Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container. Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported. This patch includes no-iommu support for the vfio-pci bus
driver only.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
This loop ends with count set to -1 and not zero so the warning message
isn't printed when it should be. I've fixed this by change the postop
to a preop.
Fixes: 0990822c98 ('VFIO: platform: reset: AMD xgbe reset module')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Revert commit 033291eccb ("vfio: Include No-IOMMU mode") due to lack
of a user. This was originally intended to fill a need for the DPDK
driver, but uptake has been slow so rather than support an unproven
kernel interface revert it and revisit when userspace catches up.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The first argument to the WARN() macro has to be a condition. I'm sort
of disappointed that this code doesn't generate a compiler warning. I
guess -Wformat-extra-args doesn't work in the kernel.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
request_module already takes format strings, so no need to duplicate
the effort.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This pci_error_handlers structure is never modified, like all the other
pci_error_handlers structures, so declare it as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
platform_driver does not need to set an owner because
platform_driver_register() will set it.
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
- Use kernel interfaces for VPD emulation (Alex Williamson)
- Platform fix for releasing IRQs (Eric Auger)
- Type1 IOMMU always advertises PAGE_SIZE support when smaller
mapping sizes are available (Eric Auger)
- Platform fixes for incorrectly using copies of structures rather
than pointers to structures (James Morse)
- Rework platform reset modules, fix leak, and add AMD xgbe reset
module (Eric Auger)
- Fix vfio_device_get_from_name() return value (Joerg Roedel)
- No-IOMMU interface (Alex Williamson)
- Fix potential out of bounds array access in PCI config handling
(Dan Carpenter)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWRg8+AAoJECObm247sIsitkwP+wc6cRzBpeGxiufZl9Ci7JV9
G0qNHBm8tAYDwm1uJATcyZC303ad2B3gaYSO6msTFUXTg3d9ZDtUOWhoTNPs/px1
5dF38DREzqYHpyC9HT2Qj3i9G+ejvgg+SoyxBOTOIw/dmq3tGZhUz1Sj+wuvQTwr
XXBKsWltZ+8wwXmQXmOWI4L3m7Xhs8NwAl5iLJ3UiltpW9zZzuPtoKnQCfMYUcmh
hIJg52t0WPSLyn47UvecUQcqxaO+QYELa7UN84fnQAihk+ewMpzg5blTebayFdu3
f2WC3ivxbamebw74LaRfQFjx4mT+DI0aXYtraC600PVe7gdVXB66QMNNpPhBwAy5
wpfeFpTKU5gC+LHmrIMUS2/A4sdNfUBw44CS8+Lm2D6bQAblPv/C5xQV1rz9HADv
f4/D3Y0TUKSYArewtBHTC0mnXdkZwetttBoy6/zQBl8vkelhoJ3GPcVa8FEZCIuT
2MSS17I3ftJ1enfynicF+Wstn/H5lWcuRBdg5wTLHIuhFn6MiEVxfIuSEx9JfjIb
NGZO7y5JiJ0b5QRCG0tFznwceU/cql/3oRqOGXqaf1cQ1Ag3JOIAUzxknFoJQUj1
XYe+Im1eMaugjj39J3+m5EYNKT3nh/bBLD/V3iWYpgoZtQrmQQm5nu0JsQo88/JR
je0BuJioCuPlO/Wj/KYw
=bI62
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.4-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Use kernel interfaces for VPD emulation (Alex Williamson)
- Platform fix for releasing IRQs (Eric Auger)
- Type1 IOMMU always advertises PAGE_SIZE support when smaller mapping
sizes are available (Eric Auger)
- Platform fixes for incorrectly using copies of structures rather than
pointers to structures (James Morse)
- Rework platform reset modules, fix leak, and add AMD xgbe reset
module (Eric Auger)
- Fix vfio_device_get_from_name() return value (Joerg Roedel)
- No-IOMMU interface (Alex Williamson)
- Fix potential out of bounds array access in PCI config handling (Dan
Carpenter)
* tag 'vfio-v4.4-rc1' of git://github.com/awilliam/linux-vfio:
vfio/pci: make an array larger
vfio: Include No-IOMMU mode
vfio: Fix bug in vfio_device_get_from_name()
VFIO: platform: reset: AMD xgbe reset module
vfio: platform: reset: calxedaxgmac: fix ioaddr leak
vfio: platform: add dev_info on device reset
vfio: platform: use list of registered reset function
vfio: platform: add compat in vfio_platform_device
vfio: platform: reset: calxedaxgmac: add reset function registration
vfio: platform: introduce module_vfio_reset_handler macro
vfio: platform: add capability to register a reset function
vfio: platform: introduce vfio-platform-base module
vfio/platform: store mapped memory in region, instead of an on-stack copy
vfio/type1: handle case where IOMMU does not support PAGE_SIZE size
VFIO: platform: clear IRQ_NOAUTOEN when de-assigning the IRQ
vfio/pci: Use kernel VPD access functions
vfio: Whitelist PCI bridges
Smatch complains about a possible out of bounds error:
drivers/vfio/pci/vfio_pci_config.c:1241 vfio_cap_init()
error: buffer overflow 'pci_cap_length' 20 <= 20
The problem is that pci_cap_length[] was defined as large enough to
hold "PCI_CAP_ID_AF + 1" elements. The code in vfio_cap_init() assumes
it has PCI_CAP_ID_MAX + 1 elements. Originally, PCI_CAP_ID_AF and
PCI_CAP_ID_MAX were the same but then we introduced PCI_CAP_ID_EA in
commit f80b0ba959 ("PCI: Add Enhanced Allocation register entries")
so now the array is too small.
Let's fix this by making the array size PCI_CAP_ID_MAX + 1. And let's
make a similar change to pci_ext_cap_length[] for consistency. Also
both these arrays can be made const.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system. There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines. However, there are still those users
that want userspace drivers even under those conditions. The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has. In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.
This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver. This
should make it very clear that this mode is not safe. Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode. Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container. Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported. This patch includes no-iommu support for the vfio-pci bus
driver only.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
The vfio_device_get_from_name() function might return a
non-NULL pointer, when called with a device name that is not
found in the list. This causes undefined behavior, in my
case calling an invalid function pointer later on:
kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
BUG: unable to handle kernel paging request at ffff8800cb3ddc08
[...]
Call Trace:
[<ffffffffa03bd733>] ? vfio_group_fops_unl_ioctl+0x253/0x410 [vfio]
[<ffffffff811efc4d>] do_vfs_ioctl+0x2cd/0x4c0
[<ffffffff811f9657>] ? __fget+0x77/0xb0
[<ffffffff811efeb9>] SyS_ioctl+0x79/0x90
[<ffffffff81001bb0>] ? syscall_return_slowpath+0x50/0x130
[<ffffffff8167f776>] entry_SYSCALL_64_fastpath+0x16/0x75
Fix the issue by returning NULL when there is no device with
the requested name in the list.
Cc: stable@vger.kernel.org # v4.2+
Fixes: 4bc94d5dc9 ("vfio: Fix lockdep issue")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch introduces a module that registers and implements a low-level
reset function for the AMD XGBE device.
it performs the following actions:
- reset the PHY
- disable auto-negotiation
- disable & clear auto-negotiation IRQ
- soft-reset the MAC
Those tiny pieces of code are inherited from the native xgbe driver.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In the current code the vfio_platform_region is copied on the stack.
As a consequence the ioaddr address is not iounmapped in the vfio
platform driver (vfio_platform_regions_cleanup). The patch uses the
pointer to the region instead.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
It might be helpful for the end-user to check the device reset
function was found by the vfio platform reset framework.
Lets store a pointer to the struct device in vfio_platform_device
and trace when the reset function is called or not found.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Remove the static lookup table and use the dynamic list of registered
reset functions instead. Also load the reset module through its alias.
The reset struct module pointer is stored in vfio_platform_device.
We also remove the useless struct device pointer parameter in
vfio_platform_get_reset.
This patch fixes the issue related to the usage of __symbol_get, which
besides from being moot, prevented compilation with CONFIG_MODULES
disabled.
Also usage of MODULE_ALIAS makes possible to add a new reset module
without needing to update the framework. This was suggested by Arnd.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Let's retrieve the compatibility string on probe and store it
in the vfio_platform_device struct
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch adds the reset function registration/unregistration.
This is handled through the module_vfio_reset_handler macro. This
latter also defines a MODULE_ALIAS which simplifies the load from
vfio-platform.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The module_vfio_reset_handler macro
- define a module alias
- implement module init/exit function which respectively registers
and unregisters the reset function.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In preparation for subsequent changes in reset function lookup,
lets introduce a dynamic list of reset combos (compat string,
reset module, reset function). The list can be populated/voided with
vfio_platform_register/unregister_reset. Those are not yet used in
this patch.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
To prepare for vfio platform reset rework let's build
vfio_platform_common.c and vfio_platform_irq.c in a separate
module from vfio-platform and vfio-amba. This makes possible
to have separate module inits and works around a race between
platform driver init and vfio reset module init: that way we
make sure symbols exported by base are available when vfio-platform
driver gets probed.
The open/release being implemented in the base module, the ref
count is applied to the parent module instead.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_platform_{read,write}_mmio() call ioremap_nocache() to map
a region of io memory, which they store in struct vfio_platform_region to
be eventually re-used, or unmapped by vfio_platform_regions_cleanup().
These functions receive a copy of their struct vfio_platform_region
argument on the stack - so these mapped areas are always allocated, and
always leaked.
Pass this argument as a pointer instead.
Fixes: 6e3f264560 "vfio/platform: read and write support for the device fd"
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Current vfio_pgsize_bitmap code hides the supported IOMMU page
sizes smaller than PAGE_SIZE. As a result, in case the IOMMU
does not support PAGE_SIZE page, the alignment check on map/unmap
is done with larger page sizes, if any. This can fail although
mapping could be done with pages smaller than PAGE_SIZE.
This patch modifies vfio_pgsize_bitmap implementation so that,
in case the IOMMU supports page sizes smaller than PAGE_SIZE
we pretend PAGE_SIZE is supported and hide sub-PAGE_SIZE sizes.
That way the user will be able to map/unmap buffers whose size/
start address is aligned with PAGE_SIZE. Pinning code uses that
granularity while iommu driver can use the sub-PAGE_SIZE size
to map the buffer.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio platform driver currently sets the IRQ_NOAUTOEN before
doing the request_irq to properly handle the user masking. However
it does not clear it when de-assigning the IRQ. This brings issues
when loading the native driver again which may not explicitly enable
the IRQ. This problem was observed with xgbe driver.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The PCI VPD capability operates on a set of window registers in PCI
config space. Writing to the address register triggers either a read
or write, depending on the setting of the PCI_VPD_ADDR_F bit within
the address register. The data register provides either the source
for writes or the target for reads.
This model is susceptible to being broken by concurrent access, for
which the kernel has adopted a set of access functions to serialize
these registers. Additionally, commits like 932c435cab ("PCI: Add
dev_flags bit to access VPD through function 0") and 7aa6ca4d39
("PCI: Add VPD function 0 quirk for Intel Ethernet devices") indicate
that VPD registers can be shared between functions on multifunction
devices creating dependencies between otherwise independent devices.
Fortunately it's quite easy to emulate the VPD registers, simply
storing copies of the address and data registers in memory and
triggering a VPD read or write on writes to the address register.
This allows vfio users to avoid seeing spurious register changes from
accesses on other devices and enables the use of shared quirks in the
host kernel. We can theoretically still race with access through
sysfs, but the window of opportunity is much smaller.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Mark Rustad <mark.d.rustad@intel.com>
When determining whether a group is viable, we already allow devices
bound to pcieport. Generalize this to include any PCI bridge device.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch adds the registration/unregistration of an
irq_bypass_producer for MSI/MSIx on vfio pci devices.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When we open a device file descriptor, we currently have the
following:
vfio_group_get_device_fd()
mutex_lock(&group->device_lock);
open()
...
if (ret)
release()
If we hit that error case, we call the backend driver release path,
which for vfio-pci looks like this:
vfio_pci_release()
vfio_pci_disable()
vfio_pci_try_bus_reset()
vfio_pci_get_devs()
vfio_device_get_from_dev()
vfio_group_get_device()
mutex_lock(&group->device_lock);
Whoops, we've stumbled back onto group.device_lock and created a
deadlock. There's a low likelihood of ever seeing this play out, but
obviously it needs to be fixed. To do that we can use a reference to
the vfio_device for vfio_group_get_device_fd() rather than holding the
lock. There was a loop in this function, theoretically allowing
multiple devices with the same name, but in practice we don't expect
such a thing to happen and the code is already aborting from the loop
with break on any sort of error rather than continuing and only
parsing the first match anyway, so the loop was effectively unused
already.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Fixes: 20f300175a ("vfio/pci: Fix racy vfio_device_get_from_dev() call")
Reported-by: Joerg Roedel <joro@8bytes.org>
Tested-by: Joerg Roedel <jroedel@suse.de>
- Disable the 32-bit vdso when building LE, so we can build with a 64-bit only
toolchain.
- EEH fixes from Gavin & Richard.
- Enable the sys_kcmp syscall from Laurent.
- Sysfs control for fastsleep workaround from Shreyas.
- Expose OPAL events as an irq chip by Alistair.
- MSI ops moved to pci_controller_ops by Daniel.
- Fix for kernel to userspace backtraces for perf from Anton.
- Merge pseries and pseries_le defconfigs from Cyril.
- CXL in-kernel API from Mikey.
- OPAL prd driver from Jeremy.
- Fix for DSCR handling & tests from Anshuman.
- Powernv flash mtd driver from Cyril.
- Dynamic DMA Window support on powernv from Alexey.
- LLVM clang fixes & workarounds from Anton.
- Reworked version of the patch to abort syscalls when transactional.
- Fix the swap encoding to support 4TB, from Aneesh.
- Various fixes as usual.
- Freescale updates from Scott: Highlights include more 8xx optimizations, an
e6500 hugetlb optimization, QMan device tree nodes, t1024/t1023 support, and
various fixes and cleanup.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJViSZqAAoJEFHr6jzI4aWAA7kQAKq3+pejfo2rY7alpKJyeVao
vlaIEaDNOTh+ctcmu3MFF9Jy6fai8gNZziRXU5JRmE5RW4GVBN4KZiqXRbkVjdBK
uG9sCX7Y58VRsS2vnGBYLsamfTMgjaXeDvgunQHVLiechJnrDr0RHEK90F3LSi73
Axp6l8XIG63a3zFZmkhzANMCme2lm5+MWmGlSjUUNi5F+viQUgJc5iiO8xrVUgM5
RpNlV2NJSqFiU+gMQWJ226V85UIniouq4j+qtyUcu8/m9BberyolXVU0GPlPFdsx
r/Qh9uCJyZaUdSB5hzomQZj50IsSz6J6nEuJTeGRoVZOmeI8Dnc2xU9fxQF5fC8H
lUJw10WPoNOggQZTeSUKn7wTXw3i4p3KsWNUczaW68VJdhqZUVaSp0+I6mnDSqzs
9iGC+VffLYNa1OHq7mGRFrgDdLBCHes31aZ3CxlQsmyNpAPCwMzsD4TUfVnvOG6E
oJOeaQ4mZM9PvqxEYJfoIL+vgRxmQ8sdIBtNY4in+C7J6eFnZNFO9xmPnJZuVU31
PGtx60kjFCOVMXvqn34WkRNbgqGWI91IK0KcRwFO2LXVio1uY77TWL52kNK2IMsp
Az+VDDvqnT3+BoV1yz0P6SrXAkwTpvFk2y+IdmEiUUN7zZFL5ZSA2epej9AzHTAK
WID2bc5yVtIL6p6x5ICH
=d9Wh
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull powerpc updates from Michael Ellerman:
- disable the 32-bit vdso when building LE, so we can build with a
64-bit only toolchain.
- EEH fixes from Gavin & Richard.
- enable the sys_kcmp syscall from Laurent.
- sysfs control for fastsleep workaround from Shreyas.
- expose OPAL events as an irq chip by Alistair.
- MSI ops moved to pci_controller_ops by Daniel.
- fix for kernel to userspace backtraces for perf from Anton.
- merge pseries and pseries_le defconfigs from Cyril.
- CXL in-kernel API from Mikey.
- OPAL prd driver from Jeremy.
- fix for DSCR handling & tests from Anshuman.
- Powernv flash mtd driver from Cyril.
- dynamic DMA Window support on powernv from Alexey.
- LLVM clang fixes & workarounds from Anton.
- reworked version of the patch to abort syscalls when transactional.
- fix the swap encoding to support 4TB, from Aneesh.
- various fixes as usual.
- Freescale updates from Scott: Highlights include more 8xx
optimizations, an e6500 hugetlb optimization, QMan device tree nodes,
t1024/t1023 support, and various fixes and cleanup.
* tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (180 commits)
cxl: Fix typo in debug print
cxl: Add CXL_KERNEL_API config option
powerpc/powernv: Fix wrong IOMMU table in pnv_ioda_setup_bus_dma()
powerpc/mm: Change the swap encoding in pte.
powerpc/mm: PTE_RPN_MAX is not used, remove the same
powerpc/tm: Abort syscalls in active transactions
powerpc/iommu/ioda2: Enable compile with IOV=on and IOMMU_API=off
powerpc/include: Add opal-prd to installed uapi headers
powerpc/powernv: fix construction of opal PRD messages
powerpc/powernv: Increase opal-irqchip initcall priority
powerpc: Make doorbell check preemption safe
powerpc/powernv: pnv_init_idle_states() should only run on powernv
macintosh/nvram: Remove as unused
powerpc: Don't use gcc specific options on clang
powerpc: Don't use -mno-strict-align on clang
powerpc: Only use -mtraceback=no, -mno-string and -msoft-float if toolchain supports it
powerpc: Only use -mabi=altivec if toolchain supports it
powerpc: Fix duplicate const clang warning in user access code
vfio: powerpc/spapr: Support Dynamic DMA windows
vfio: powerpc/spapr: Register memory and define IOMMU v2
...
This patch enables building VFIO platform and derivatives on ARM64.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch introduces a module that registers and implements a basic
reset function for the Calxeda xgmac device. This latter basically disables
interrupts and stops DMA transfers.
The reset function code is inherited from the native calxeda xgmac driver.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The reset function lookup happens on vfio-platform probe. The reset
module load is requested and a reference to the function symbol is
hold. The reference is released on vfio-platform remove.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
A new reset callback is introduced. If this callback is populated,
the reset is invoked on device first open/last close or upon userspace
ioctl. The modality is exposed on VFIO_DEVICE_GET_INFO.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch introduces the vfio_platform_reset_combo struct that
stores all the information useful to handle the reset modality:
compat string, name of the reset function, name of the module that
implements the reset function. A lookup table of such structures
is added, currently void.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Acked-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Tested-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This adds create/remove window ioctls to create and remove DMA windows.
sPAPR defines a Dynamic DMA windows capability which allows
para-virtualized guests to create additional DMA windows on a PCI bus.
The existing linux kernels use this new window to map the entire guest
memory and switch to the direct DMA operations saving time on map/unmap
requests which would normally happen in a big amounts.
This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and
VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows.
Up to 2 windows are supported now by the hardware and by this driver.
This changes VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional
information such as a number of supported windows and maximum number
levels of TCE tables.
DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature
as we still want to support v2 on platforms which cannot do DDW for
the sake of TCE acceleration in KVM (coming soon).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The existing implementation accounts the whole DMA window in
the locked_vm counter. This is going to be worse with multiple
containers and huge DMA windows. Also, real-time accounting would requite
additional tracking of accounted pages due to the page size difference -
IOMMU uses 4K pages and system uses 4K or 64K pages.
Another issue is that actual pages pinning/unpinning happens on every
DMA map/unmap request. This does not affect the performance much now as
we spend way too much time now on switching context between
guest/userspace/host but this will start to matter when we add in-kernel
DMA map/unmap acceleration.
This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU.
New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces
2 new ioctls to register/unregister DMA memory -
VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY -
which receive user space address and size of a memory region which
needs to be pinned/unpinned and counted in locked_vm.
New IOMMU splits physical pages pinning and TCE table update
into 2 different operations. It requires:
1) guest pages to be registered first
2) consequent map/unmap requests to work only with pre-registered memory.
For the default single window case this means that the entire guest
(instead of 2GB) needs to be pinned before using VFIO.
When a huge DMA window is added, no additional pinning will be
required, otherwise it would be guest RAM + 2GB.
The new memory registration ioctls are not supported by
VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration
will require memory to be preregistered in order to work.
The accounting is done per the user process.
This advertises v2 SPAPR TCE IOMMU and restricts what the userspace
can do with v1 or v2 IOMMUs.
In order to support memory pre-registration, we need a way to track
the use of every registered memory region and only allow unregistration
if a region is not in use anymore. So we need a way to tell from what
region the just cleared TCE was from.
This adds a userspace view of the TCE table into iommu_table struct.
It contains userspace address, one per TCE entry. The table is only
allocated when the ownership over an IOMMU group is taken which means
it is only used from outside of the powernv code (such as VFIO).
As v2 IOMMU supports IODA2 and pre-IODA2 IOMMUs (which do not support
DDW API), this creates a default DMA window for IODA2 for consistency.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Before the IOMMU user (VFIO) would take control over the IOMMU table
belonging to a specific IOMMU group. This approach did not allow sharing
tables between IOMMU groups attached to the same container.
This introduces a new IOMMU ownership flavour when the user can not
just control the existing IOMMU table but remove/create tables on demand.
If an IOMMU implements take/release_ownership() callbacks, this lets
the user have full control over the IOMMU group. When the ownership
is taken, the platform code removes all the windows so the caller must
create them.
Before returning the ownership back to the platform code, VFIO
unprograms and removes all the tables it created.
This changes IODA2's onwership handler to remove the existing table
rather than manipulating with the existing one. From now on,
iommu_take_ownership() and iommu_release_ownership() are only called
from the vfio_iommu_spapr_tce driver.
Old-style ownership is still supported allowing VFIO to run on older
P5IOC2 and IODA IO controllers.
No change in userspace-visible behaviour is expected. Since it recreates
TCE tables on each ownership change, related kernel traces will appear
more often.
This adds a pnv_pci_ioda2_setup_default_config() which is called
when PE is being configured at boot time and when the ownership is
passed from VFIO to the platform code.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This extends iommu_table_group_ops by a set of callbacks to support
dynamic DMA windows management.
create_table() creates a TCE table with specific parameters.
it receives iommu_table_group to know nodeid in order to allocate
TCE table memory closer to the PHB. The exact format of allocated
multi-level table might be also specific to the PHB model (not
the case now though).
This callback calculated the DMA window offset on a PCI bus from @num
and stores it in a just created table.
set_window() sets the window at specified TVT index + @num on PHB.
unset_window() unsets the window from specified TVT.
This adds a free() callback to iommu_table_ops to free the memory
(potentially a tree of tables) allocated for the TCE table.
create_table() and free() are supposed to be called once per
VFIO container and set_window()/unset_window() are supposed to be
called for every group in a container.
This adds IOMMU capabilities to iommu_table_group such as default
32bit window parameters and others. This makes use of new values in
vfio_iommu_spapr_tce. IODA1/P5IOC2 do not support DDW so they do not
advertise pagemasks to the userspace.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the moment writing new TCE value to the IOMMU table fails with EBUSY
if there is a valid entry already. However PAPR specification allows
the guest to write new TCE value without clearing it first.
Another problem this patch is addressing is the use of pool locks for
external IOMMU users such as VFIO. The pool locks are to protect
DMA page allocator rather than entries and since the host kernel does
not control what pages are in use, there is no point in pool locks and
exchange()+put_page(oldtce) is sufficient to avoid possible races.
This adds an exchange() callback to iommu_table_ops which does the same
thing as set() plus it returns replaced TCE and DMA direction so
the caller can release the pages afterwards. The exchange() receives
a physical address unlike set() which receives linear mapping address;
and returns a physical address as the clear() does.
This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement
for a platform to have exchange() implemented in order to support VFIO.
This replaces iommu_tce_build() and iommu_clear_tce() with
a single iommu_tce_xchg().
This makes sure that TCE permission bits are not set in TCE passed to
IOMMU API as those are to be calculated by platform code from
DMA direction.
This moves SetPageDirty() to the IOMMU code to make it work for both
VFIO ioctl interface in in-kernel TCE acceleration (when it becomes
available later).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds tce_iommu_take_ownership() and tce_iommu_release_ownership
which call in a loop iommu_take_ownership()/iommu_release_ownership()
for every table on the group. As there is just one now, no change in
behaviour is expected.
At the moment the iommu_table struct has a set_bypass() which enables/
disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
which calls this callback when external IOMMU users such as VFIO are
about to get over a PHB.
The set_bypass() callback is not really an iommu_table function but
IOMMU/PE function. This introduces a iommu_table_group_ops struct and
adds take_ownership()/release_ownership() callbacks to it which are
called when an external user takes/releases control over the IOMMU.
This replaces set_bypass() with ownership callbacks as it is not
necessarily just bypass enabling, it can be something else/more
so let's give it more generic name.
The callbacks is implemented for IODA2 only. Other platforms (P5IOC2,
IODA1) will use the old iommu_take_ownership/iommu_release_ownership API.
The following patches will replace iommu_take_ownership/
iommu_release_ownership calls in IODA2 with full IOMMU table release/
create.
As we here and touching bypass control, this removes
pnv_pci_ioda2_setup_bypass_pe() as it does not do much
more compared to pnv_pci_ioda2_set_bypass. This moves tce_bypass_base
initialization to pnv_pci_ioda2_setup_dma_pe.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
So far one TCE table could only be used by one IOMMU group. However
IODA2 hardware allows programming the same TCE table address to
multiple PE allowing sharing tables.
This replaces a single pointer to a group in a iommu_table struct
with a linked list of groups which provides the way of invalidating
TCE cache for every PE when an actual TCE table is updated. This adds
pnv_pci_link_table_and_group() and pnv_pci_unlink_table_and_group()
helpers to manage the list. However without VFIO, it is still going
to be a single IOMMU group per iommu_table.
This changes iommu_add_device() to add a device to a first group
from the group list of a table as it is only called from the platform
init code or PCI bus notifier and at these moments there is only
one group per table.
This does not change TCE invalidation code to loop through all
attached groups in order to simplify this patch and because
it is not really needed in most cases. IODA2 is fixed in a later
patch.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Modern IBM POWERPC systems support multiple (currently two) TCE tables
per IOMMU group (a.k.a. PE). This adds a iommu_table_group container
for TCE tables. Right now just one table is supported.
This defines iommu_table_group struct which stores pointers to
iommu_group and iommu_table(s). This replaces iommu_table with
iommu_table_group where iommu_table was used to identify a group:
- iommu_register_group();
- iommudata of generic iommu_group;
This removes @data from iommu_table as it_table_group provides
same access to pnv_ioda_pe.
For IODA, instead of embedding iommu_table, the new iommu_table_group
keeps pointers to those. The iommu_table structs are allocated
dynamically.
For P5IOC2, both iommu_table_group and iommu_table are embedded into
PE struct. As there is no EEH and SRIOV support for P5IOC2,
iommu_free_table() should not be called on iommu_table struct pointers
so we can keep it embedded in pnv_phb::p5ioc2.
For pSeries, this replaces multiple calls of kzalloc_node() with a new
iommu_pseries_alloc_group() helper and stores the table group struct
pointer into the pci_dn struct. For release, a iommu_table_free_group()
helper is added.
This moves iommu_table struct allocation from SR-IOV code to
the generic DMA initialization code in pnv_pci_ioda_setup_dma_pe and
pnv_pci_ioda2_setup_dma_pe as this is where DMA is actually initialized.
This change is here because those lines had to be changed anyway.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is to make extended ownership and multiple groups support patches
simpler for review.
This should cause no behavioural change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is a pretty mechanical patch to make next patches simpler.
New tce_iommu_unuse_page() helper does put_page() now but it might skip
that after the memory registering patch applied.
As we are here, this removes unnecessary checks for a value returned
by pfn_to_page() as it cannot possibly return NULL.
This moves tce_iommu_disable() later to let tce_iommu_clear() know if
the container has been enabled because if it has not been, then
put_page() must not be called on TCEs from the TCE table. This situation
is not yet possible but it will after KVM acceleration patchset is
applied.
This changes code to work with physical addresses rather than linear
mapping addresses for better code readability. Following patches will
add an xchg() callback for an IOMMU table which will accept/return
physical addresses (unlike current tce_build()) which will eliminate
redundant conversions.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the moment DMA map/unmap requests are handled irrespective to
the container's state. This allows the user space to pin memory which
it might not be allowed to pin.
This adds checks to MAP/UNMAP that the container is enabled, otherwise
-EPERM is returned.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There moves locked pages accounting to helpers.
Later they will be reused for Dynamic DMA windows (DDW).
This reworks debug messages to show the current value and the limit.
This stores the locked pages number in the container so when unlocking
the iommu table pointer won't be needed. This does not have an effect
now but it will with the multiple tables per container as then we will
allow attaching/detaching groups on fly and we may end up having
a container with no group attached but with the counter incremented.
While we are here, update the comment explaining why RLIMIT_MEMLOCK
might be required to be bigger than the guest RAM. This also prints
pid of the current process in pr_warn/pr_debug.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This makes use of the it_page_size from the iommu_table struct
as page size can differ.
This replaces missing IOMMU_PAGE_SHIFT macro in commented debug code
as recently introduced IOMMU_PAGE_XXX macros do not include
IOMMU_PAGE_SHIFT.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This checks that the TCE table page size is not bigger that the size of
a page we just pinned and going to put its physical address to the table.
Otherwise the hardware gets unwanted access to physical memory between
the end of the actual page and the end of the aligned up TCE page.
Since compound_order() and compound_head() work correctly on non-huge
pages, there is no need for additional check whether the page is huge.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This moves page pinning (get_user_pages_fast()/put_page()) code out of
the platform IOMMU code and puts it to VFIO IOMMU driver where it belongs
to as the platform code does not deal with page pinning.
This makes iommu_take_ownership()/iommu_release_ownership() deal with
the IOMMU table bitmap only.
This removes page unpinning from iommu_take_ownership() as the actual
TCE table might contain garbage and doing put_page() on it is undefined
behaviour.
Besides the last part, the rest of the patch is mechanical.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for the vfio related changes]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Testing the driver for a PCI device is racy, it can be all but
complete in the release path and still report the driver as ours.
Therefore we can't trust drvdata to be valid. This race can sometimes
be seen when one port of a multifunction device is being unbound from
the vfio-pci driver while another function is being released by the
user and attempting a bus reset. The device in the remove path is
found as a dependent device for the bus reset of the release path
device, the driver is still set to vfio-pci, but the drvdata has
already been cleared, resulting in a null pointer dereference.
To resolve this, fix vfio_device_get_from_dev() to not take the
dev_get_drvdata() shortcut and instead traverse through the
iommu_group, vfio_group, vfio_device path to get a reference we
can trust. Once we have that reference, we know the device isn't
in transition and we can test to make sure the driver is still what
we expect, so that we don't interfere with devices we don't own.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The ARM SMMUv3 driver is compatible with the notion of a type-1 IOMMU in
VFIO.
This patch allows VFIO_IOMMU_TYPE1 to be selected if ARM_SMMU_V3=y.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The patch adds one more EEH sub-command (VFIO_EEH_PE_INJECT_ERR)
to inject the specified EEH error, which is represented by
(struct vfio_eeh_pe_err), to the indicated PE for testing purpose.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 13060b64b8 ("vfio: Add and use device request op for vfio
bus drivers") incorrectly makes use of an interruptible timeout.
When interrupted, the signal remains pending resulting in subsequent
timeouts occurring instantly. This makes the loop spin at a much
higher rate than intended.
Instead of making this completely non-interruptible, we can change
this into a sort of interruptible-once behavior and use the "once"
to log debug information. The driver API doesn't allow us to abort
and return an error code.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Fixes: 13060b64b8
Cc: stable@vger.kernel.org # v4.0
Log some clues indicating whether the user is receiving device
request interfaces or not listening. This can help indicate why a
driver unbind is blocked or explain why QEMU automatically unplugged
a device from the VM.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We can save some power by putting devices that are bound to vfio-pci
but not in use by the user in the D3hot power state. Devices get
woken into D0 when opened by the user. Resets return the device to
D0, so we need to re-apply the low power state after a bus reset.
It's tempting to try to use D3cold, but we have no reason to inhibit
hotplug of idle devices and we might get into a loop of having the
device disappear before we have a chance to try to use it.
A new module parameter allows this feature to be disabled if there are
devices that misbehave as a result of this change.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As indicated in the comment, this is not entirely uncommon and
causes user concern for no reason.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This copies the same support from pci-stub for exactly the same
purpose, enabling a set of PCI IDs to be automatically added to the
driver's dynamic ID table at module load time. The code here is
pretty simple and both vfio-pci and pci-stub are fairly unique in
being meta drivers, capable of attaching to any device, so there's no
attempt made to generalize the code into pci-core.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If VFIO VGA access is disabled for the user, either by CONFIG option
or module parameter, we can often opt-out of VGA arbitration. We can
do this when PCI bridge control of VGA routing is possible. This
means that we must have a parent bridge and there must only be a
single VGA device below that bridge. Fortunately this is the typical
case for discrete GPUs.
Doing this allows us to minimize the impact of additional GPUs, in
terms of VGA arbitration, when they are only used via vfio-pci for
non-VGA applications.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add a module option so that we don't require a CONFIG change and
kernel rebuild to disable VGA support. Not only can VGA support be
troublesome in itself, but by disabling it we can reduce the impact
to host devices by doing a VGA arbitration opt-out.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
An unintended consequence of commit 42ac9bd18d ("vfio: initialize
the virqfd workqueue in VFIO generic code") is that the vfio module
is renamed to vfio_core so that it can include both vfio and virqfd.
That's a user visible change that may break module loading scritps
and it imposes eventfd support as a dependency on the core vfio code,
which it's really not. virqfd is intended to be provided as a service
to vfio bus drivers, so instead of wrapping it into vfio.ko, we can
make it a stand-alone module toggled by vfio bus drivers. This has
the additional benefit of removing initialization and exit from the
core vfio code.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The next code fragment "list_for_each_entry" is not depend on "minor". With this
patch, the free of "minor" in "list_for_each_entry" can be reduced, and there is
no functional change.
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With this patch the VFIO user will be able to set an eventfd that can be
used in order to mask and unmask IRQs of platform devices.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Now we have finally completely decoupled virqfd from VFIO_PCI. We can
initialize it from the VFIO generic code, in order to safely use it from
multiple independent VFIO bus drivers.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The virqfd functionality that is used by VFIO_PCI to implement interrupt
masking and unmasking via an eventfd, is generic enough and can be reused
by another driver. Move it to a separate file in order to allow the code
to be shared.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIO_PCI passes the VFIO device structure *vdev via eventfd to the handler
that implements masking/unmasking of IRQs via an eventfd. We can replace
it in the virqfd infrastructure with an opaque type so we can make use
of the mechanism from other VFIO bus drivers.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The Virqfd code needs to keep accesses to any struct *virqfd safe, but
this comes into play only when creating or destroying eventfds, so sharing
the same spinlock with the VFIO bus driver is not necessary.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The functions vfio_pci_virqfd_init and vfio_pci_virqfd_exit are not really
PCI specific, since we plan to reuse the virqfd code with more VFIO drivers
in addition to VFIO_PCI.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
[Baptiste Reynal: Move rename vfio_pci_virqfd_init and vfio_pci_virqfd_exit
from "vfio: add a vfio_ prefix to virqfd_enable and virqfd_disable and export"]
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We want to reuse virqfd functionality in multiple VFIO drivers; before
moving these functions to core VFIO, add the vfio_ prefix to the
virqfd_enable and virqfd_disable functions, and export them so they can
be used from other modules.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Level sensitive interrupts are exposed as maskable and automasked
interrupts and are masked and disabled automatically when they fire.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
[Baptiste Reynal: Move masked interrupt initialization from "vfio/platform:
trigger an interrupt via eventfd"]
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch allows to set an eventfd for a platform device's interrupt,
and also to trigger the interrupt eventfd from userspace for testing.
Level sensitive interrupts are marked as maskable and are handled in
a later patch. Edge triggered interrupts are not advertised as maskable
and are implemented here using a simple and efficient IRQ handler.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
[Baptiste Reynal: fix masked interrupt initialization]
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch is a skeleton for the VFIO_DEVICE_SET_IRQS IOCTL, around which
most IRQ functionality is implemented in VFIO.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Return information for the interrupts exposed by the device.
This patch extends VFIO_DEVICE_GET_INFO with the number of IRQs
and enables VFIO_DEVICE_GET_IRQ_INFO.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Allow to memory map the MMIO regions of the device so userspace can
directly access them. PIO regions are not being handled at this point.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIO returns a file descriptor which we can use to manipulate the memory
regions of the device. Usually, the user will mmap memory regions that are
addressable on page boundaries, however for memory regions where this is
not the case we cannot provide mmap functionality due to security concerns.
For this reason we also allow to use read and write functions to the file
descriptor pointing to the memory regions.
We implement this functionality only for MMIO regions of platform devices;
PIO regions are not being handled at this point.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch enables the IOCTLs VFIO_DEVICE_GET_REGION_INFO ioctl call,
which allows the user to learn about the available MMIO resources of
a device.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
A VFIO userspace driver will start by opening the VFIO device
that corresponds to an IOMMU group, and will use the ioctl interface
to get the basic device info, such as number of memory regions and
interrupts, and their properties. This patch enables the
VFIO_DEVICE_GET_INFO ioctl call.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
[Baptiste Reynal: added include in vfio_platform_common.c]
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Enable building the VFIO AMBA driver. VFIO_AMBA depends on VFIO_PLATFORM,
since it is sharing a portion of the code, and it is essentially implemented
as a platform device whose resources are discovered via AMBA specific APIs
in the kernel.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add support for discovering AMBA devices with VFIO and handle them
similarly to Linux platform devices.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Enable building the VFIO PLATFORM driver that allows to use Linux platform
devices with VFIO.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Driver to bind to Linux platform devices, and callbacks to discover their
resources to be used by the main VFIO PLATFORM code.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch forms the common skeleton code for platform devices support
with VFIO. This will include the core functionality of VFIO_PLATFORM,
however binding to the device and discovering the device resources will
be done with the help of a separate file where any Linux platform bus
specific code will reside.
This will allow us to implement support for also discovering AMBA devices
and their resources, but still reuse a large part of the VFIO_PLATFORM
implementation.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
[Baptiste Reynal: added includes in vfio_platform_private.h]
Signed-off-by: Baptiste Reynal <b.reynal@virtualopensystems.com>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This adds a missing break statement to VFIO_DEVICE_SET_IRQS handler
without which vfio_pci_set_err_trigger() would never be called.
While we are here, add another "break" to VFIO_PCI_REQ_IRQ_INDEX case
so if we add more indexes later, we won't miss it.
Fixes: 6140a8f562 ("vfio-pci: Add device request interface")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Userspace can opt to receive a device request notification,
indicating that the device should be released. This is setup
the same way as the error IRQ and also supports eventfd signaling.
Future support may forcefully remove the device from the user if
the request is ignored.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We want another single vector IRQ index to support signaling of
the device request to userspace. Generalize the error reporting
IRQ index to avoid code duplication.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When a request is made to unbind a device from a vfio bus driver,
we need to wait for the device to become unused, ie. for userspace
to release the device. However, we have a long standing TODO in
the code to do something proactive to make that happen. To enable
this, we add a request callback on the vfio bus driver struct,
which is intended to signal the user through the vfio device
interface to release the device. Instead of passively waiting for
the device to become unused, we can now pester the user to give
it up.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Move the iommu_group reference from the device to the vfio_group.
This ensures that the iommu_group persists as long as the vfio_group
remains. This can be important if all of the device from an
iommu_group are removed, but we still have an outstanding vfio_group
reference; we can still walk the empty list of devices.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There's a small window between the vfio bus driver calling
vfio_del_group_dev() and the device being completely unbound where
the vfio group appears to be non-viable. This creates a race for
users like QEMU/KVM where the kvm-vfio module tries to get an
external reference to the group in order to match and release an
existing reference, while the device is potentially being removed
from the vfio bus driver. If the group is momentarily non-viable,
kvm-vfio may not be able to release the group reference until VM
shutdown, making the group unusable until that point.
Bridge the gap between device removal from the group and completion
of the driver unbind by tracking it in a list. The device is added
to the list before the bus driver reference is released and removed
using the existing unbind notifier.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
IOMMU operations can be expensive and it's not very difficult for a
user to give us a lot of work to do for a map or unmap operation.
Killing a large VM will vfio assigned devices can result in soft
lockups and IOMMU tracing shows that we can easily spend 80% of our
time with need-resched set. A sprinkling of conf_resched() calls
after map and unmap calls has a very tiny affect on performance
while resulting in traces with <1% of calls overflowing into needs-
resched.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We currently map invalid and reserved pages, such as often occur from
mapping MMIO regions of a VM through the IOMMU, using single pages.
There's really no reason we can't instead follow the methodology we
use for normal pages and find the largest possible physically
contiguous chunk for mapping. The only difference is that we don't
do locked memory accounting for these since they're not back by RAM.
In most applications this will be a very minor improvement, but when
graphics and GPGPU devices are in play, MMIO BARs become non-trivial.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When unmapping DMA entries we try to rely on the IOMMU API behavior
that allows the IOMMU to unmap a larger area than requested, up to
the size of the original mapping. This works great when the IOMMU
supports superpages *and* they're in use. Otherwise, each PAGE_SIZE
increment is unmapped separately, resulting in poor performance.
Instead we can use the IOVA-to-physical-address translation provided
by the IOMMU API and unmap using the largest contiguous physical
memory chunk available, which is also how vfio/type1 would have
mapped the region. For a synthetic 1TB guest VM mapping and shutdown
test on Intel VT-d (2M IOMMU pagesize support), this achieves about
a 30% overall improvement mapping standard 4K pages, regardless of
IOMMU superpage enabling, and about a 40% improvement mapping 2M
hugetlbfs pages when IOMMU superpages are not available. Hugetlbfs
with IOMMU superpages enabled is effectively unchanged.
Unfortunately the same algorithm does not work well on IOMMUs with
fine-grained superpages, like AMD-Vi, costing about 25% extra since
the IOMMU will automatically unmap any power-of-two contiguous
mapping we've provided it. We add a routine and a domain flag to
detect this feature, leaving AMD-Vi unaffected by this unmap
optimization.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Current vfio-pci just supports normal pci device, so vfio_pci_probe() will
return if the pci device is not a normal device. While current code makes a
mistake. PCI_HEADER_TYPE is the offset in configuration space of the device
type, but we use this value to mask the type value.
This patch fixs this by do the check directly on the pci_dev->hdr_type.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: stable@vger.kernel.org # v3.6+
- s390 support (Frank Blaschka)
- Enable iommu-type1 for ARM SMMU (Will Deacon)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUka+sAAoJECObm247sIsiiy0P/iqrQpv94Z7rUKRlV3K8YtJj
8Oi5fLLnT2by9v5mS+KMElnQ5gLU/C5B/QGLMNrF2uQl8lguWSXJw37r7MkbIkpN
RDLx1NhetLqbJ4CYLjyv/Jx3vl+Wr/2nNWRVIS5ajBmMjEgKVLvjYs4SaXELc3a8
a3YzcGW10BrVFlCJgUYqYIFGS1BmKjf7fbD5YBocj8tPv6NAlCiNNYYr+0pzW8Lf
GTi39JlZ2t06hDq33eiUkrySWNjrIBn4g4PfAl7HBAscsZKS1w18MD1qVw4UXXa2
15+CBbsHU7ZLVo6G7vuZeJNCX9tdQ0WIZWQzHstQa914l86WYImTJ2tyH7Rn0ZcQ
3Mu9fzef9JgjkI56ol2zDwuOs+qttOYaLWjhhHiW4jkIxdnljnesIFlvmM3XeDGz
3Zowg09HzE3K+dt8265jVKkcNJbPLzspLvF27nPMudZNHBozcoPE0jKxU7QC3eMT
Ij36+puQq+jccUic3Np6rxk5tzTHEat1a7w3IUwXCCUP5P5QW+kuuIjbd4hqQkHn
VDRjnT6MWC3GguUCXR5VyO0zezpI20pTbWwE8u2qwnE349m0Eq/vxytj2lCLYLPR
Jjtdduf1/Ppam7tATd3PwTu6KljY3dJiUUikyOc1J0KmkgkSMw+BtR6G7qytyW4q
/fhClcsaNtxSheVh0N+b
=1L2V
-----END PGP SIGNATURE-----
Merge tag 'vfio-v3.19-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- s390 support (Frank Blaschka)
- Enable iommu-type1 for ARM SMMU (Will Deacon)
* tag 'vfio-v3.19-rc1' of git://github.com/awilliam/linux-vfio:
drivers/vfio: allow type-1 IOMMU instantiation on top of an ARM SMMU
vfio: make vfio run on s390
Rename write_msi_msg() to pci_write_msi_msg() to mark it as PCI
specific.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Yingjoe Chen <yingjoe.chen@mediatek.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The ARM SMMU driver is compatible with the notion of a type-1 IOMMU in
VFIO.
This patch allows VFIO_IOMMU_TYPE1 to be selected if ARM_SMMU=y.
Signed-off-by: Will Deacon <will.deacon@arm.com>
[aw: update for existing S390 patch]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
add Kconfig switch to hide INTx
add Kconfig switch to let vfio announce PCI BARs are not mapable
Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This pull-request includes:
* Change in the IOMMU-API to convert the former iommu_domain_capable
function to just iommu_capable
* Various fixes in handling RMRR ranges for the VT-d driver (one fix
requires a device driver core change which was acked
by Greg KH)
* The AMD IOMMU driver now assigns and deassigns complete alias groups
to fix issues with devices using the wrong PCI request-id
* MMU-401 support for the ARM SMMU driver
* Multi-master IOMMU group support for the ARM SMMU driver
* Various other small fixes all over the place
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJUPNxYAAoJECvwRC2XARrjMwMP/RLSr+oA31rGVjLXcmcCHl7Q
Uj7xpcnG19qB0aqNR1JeJuZNkK/tw44pE353MQPbz4N9UVUiogklGIVD1iJvFV53
0qm84bvpDJIof4aP35B3H3Umft2USTn/lmsQg/RklQcNTW8DzNj63b8BTNR7k/GL
G7bLg7F1BUCl0shZCCsFspOIulQPAJYN2OvHlfYBav/bfDvfouQ3lrV+loGrK44r
F2Hmp+imXlIhUCjfbiWz6wKFxvPrxZx482vm2pXBCSnXEdW4/fz6nf9VHUK/Cfsq
JAimY1CfiDo1aqH9/yVHUOw5SD/NYOXq6E5bFPg/WENbipbbae5cK2u6PX5MMBAn
CG4BM8l9xicfGPqgn5YFSRY/6qC6K7NlxMnt9U8l18QIkDVDqEtUgJQISJuce7wx
FWx6eSWaxpIe5yhq19/h2ELalUUyR/fPq+UXXjYDL1kLV/vcvC/lC3mbNAQU93zU
WK0bG2tDg88JHavc25Ewa2aOn4BVM2BpwuLbYlgQReaEmsQRnEPgtmRNyLJHqbFE
wwpCj8pBWdufsJWRyvpnXQ+CfA7oSz4e7hz1G+0/5uiDmagfvg16Ql5JtPmmuLUm
Kc3dVIiG0s1ewohZIIJETGCqprQbCSqs8CCQqB6p2zDBWFKpNT7F38lm/KlehkCz
JpAiI7Y2K9Jejp0VIPrt
=OMOt
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
"This pull-request includes:
- change in the IOMMU-API to convert the former iommu_domain_capable
function to just iommu_capable
- various fixes in handling RMRR ranges for the VT-d driver (one fix
requires a device driver core change which was acked by Greg KH)
- the AMD IOMMU driver now assigns and deassigns complete alias
groups to fix issues with devices using the wrong PCI request-id
- MMU-401 support for the ARM SMMU driver
- multi-master IOMMU group support for the ARM SMMU driver
- various other small fixes all over the place"
* tag 'iommu-updates-v3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (41 commits)
iommu/vt-d: Work around broken RMRR firmware entries
iommu/vt-d: Store bus information in RMRR PCI device path
iommu/vt-d: Only remove domain when device is removed
driver core: Add BUS_NOTIFY_REMOVED_DEVICE event
iommu/amd: Fix devid mapping for ivrs_ioapic override
iommu/irq_remapping: Fix the regression of hpet irq remapping
iommu: Fix bus notifier breakage
iommu/amd: Split init_iommu_group() from iommu_init_device()
iommu: Rework iommu_group_get_for_pci_dev()
iommu: Make of_device_id array const
amd_iommu: do not dereference a NULL pointer address.
iommu/omap: Remove omap_iommu unused owner field
iommu: Remove iommu_domain_has_cap() API function
IB/usnic: Convert to use new iommu_capable() API function
vfio: Convert to use new iommu_capable() API function
kvm: iommu: Convert to use new iommu_capable() API function
iommu/tegra: Convert to iommu_capable() API function
iommu/msm: Convert to iommu_capable() API function
iommu/vt-d: Convert to iommu_capable() API function
iommu/fsl: Convert to iommu_capable() API function
...
Locking both the remove() and release() path results in a deadlock
that should have been obvious. To fix this we can get and hold the
vfio_device reference as we evaluate whether to do a bus/slot reset.
This will automatically block any remove() calls, allowing us to
remove the explict lock. Fixes 61d792562b.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: stable@vger.kernel.org [3.17]
The function should have been exported with EXPORT_SYMBOL_GPL()
as part of commit 92d18a6851 ("drivers/vfio: Fix EEH build error").
Suggested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The MSIx vector table lives in device memory, which may be cleared as
part of a backdoor device reset. This is the case on the IBM IPR HBA
when the BIST is run on the device. When assigned to a QEMU guest,
the guest driver does a pci_save_state(), issues a BIST, then does a
pci_restore_state(). The BIST clears the MSIx vector table, but due
to the way interrupts are configured the pci_restore_state() does not
restore the vector table as expected. Eventually this results in an
EEH error on Power platforms when the device attempts to signal an
interrupt with the zero'd table entry.
Fix the problem by restoring the host cached MSI message prior to
enabling each vector.
Reported-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIO allows devices to be safely handed off to userspace by putting
them behind an IOMMU configured to ensure DMA and interrupt isolation.
This enables userspace KVM clients, such as kvmtool and qemu, to further
map the device into a virtual machine.
With IOMMUs such as the ARM SMMU, it is then possible to provide SMMU
translation services to the guest operating system, which are nested
with the existing translation installed by VFIO. However, enabling this
feature means that the IOMMU driver must be informed that the VFIO domain
is being created for the purposes of nested translation.
This patch adds a new IOMMU type (VFIO_TYPE1_NESTING_IOMMU) to the VFIO
type-1 driver. The new IOMMU type acts identically to the
VFIO_TYPE1v2_IOMMU type, but additionally sets the DOMAIN_ATTR_NESTING
attribute on its IOMMU domains.
Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In PCIe r1.0, sec 5.10.2, bit 0 of the Uncorrectable Error Status, Mask,
and Severity Registers was for "Training Error." In PCIe r1.1, sec 7.10.2,
bit 0 was redefined to be "Undefined."
Rename PCI_ERR_UNC_TRAIN to PCI_ERR_UNC_UND to reflect this change.
No functional change.
[bhelgaas: changelog]
Signed-off-by: Chen, Gong <gong.chen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
The existing vfio_pci_open() fails upon error returned from
vfio_spapr_pci_eeh_open(), which breaks POWER7's P5IOC2 PHB
support which this patch brings back.
The patch fixes the issue by dropping the return value of
vfio_spapr_pci_eeh_open().
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The VFIO related components could be built as dynamic modules.
Unfortunately, CONFIG_EEH can't be configured to "m". The patch
fixes the build errors when configuring VFIO related components
as dynamic modules as follows:
CC [M] drivers/vfio/vfio_iommu_spapr_tce.o
In file included from drivers/vfio/vfio.c:33:0:
include/linux/vfio.h:101:43: warning: ‘struct pci_dev’ declared \
inside parameter list [enabled by default]
:
WRAP arch/powerpc/boot/zImage.pseries
WRAP arch/powerpc/boot/zImage.maple
WRAP arch/powerpc/boot/zImage.pmac
WRAP arch/powerpc/boot/zImage.epapr
MODPOST 1818 modules
ERROR: ".vfio_spapr_iommu_eeh_ioctl" [drivers/vfio/vfio_iommu_spapr_tce.ko]\
undefined!
ERROR: ".vfio_spapr_pci_eeh_open" [drivers/vfio/pci/vfio-pci.ko] undefined!
ERROR: ".vfio_spapr_pci_eeh_release" [drivers/vfio/pci/vfio-pci.ko] undefined!
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Each time a device is released, mark whether a local reset was
successful or whether a bus/slot reset is needed. If a reset is
needed and all of the affected devices are bound to vfio-pci and
unused, allow the reset. This is most useful when the userspace
driver is killed and releases all the devices in an unclean state,
such as when a QEMU VM quits.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Serializing open/release allows us to fix a refcnt error if we fail
to enable the device and lets us prevent devices from being unbound
or opened, giving us an opportunity to do bus resets on release. No
restriction added to serialize binding devices to vfio-pci while the
mutex is held though.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Our current open/release path looks like this:
vfio_pci_open
vfio_pci_enable
pci_enable_device
pci_save_state
pci_store_saved_state
vfio_pci_release
vfio_pci_disable
pci_disable_device
pci_restore_state
pci_enable_device() doesn't modify PCI_COMMAND_MASTER, so if a device
comes to us with it enabled, it persists through the open and gets
stored as part of the device saved state. We then restore that saved
state when released, which can allow the device to attempt to continue
to do DMA. When the group is disconnected from the domain, this will
get caught by the IOMMU, but if there are other devices in the group,
the device may continue running and interfere with the user. Even in
the former case, IOMMUs don't necessarily behave well and a stream of
blocked DMA can result in unpleasant behavior on the host.
Explicitly disable Bus Master as we're enabling the device and
slightly re-work release to make sure that pci_disable_device() is
the last thing that touches the device.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The patch adds new IOCTL commands for sPAPR VFIO container device
to support EEH functionality for PCI devices, which have been passed
through from host to somebody else via VFIO.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Alexander Graf <agraf@suse.de>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
According PCI local bus specification, the register of Message
Control for MSI (offset: 2, length: 2) has bit#0 to enable or
disable MSI logic and it shouldn't be part contributing to the
calculation of MSI interrupt count. The patch fixes the issue.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Coverity reports use of a tained scalar used as a loop boundary.
For the most part, any values passed from userspace for a DMA mapping
size, IOVA, or virtual address are valid, with some alignment
constraints. The size is ultimately bound by how many pages the user
is able to lock, IOVA is tested by the IOMMU driver when doing a map,
and the virtual address needs to pass get_user_pages. The only
problem I can find is that we do expect the __u64 user values to fit
within our variables, which might not happen on 32bit platforms. Add
a test for this and return error on overflow. Also propagate use of
the type-correct local variables throughout the function.
The above also points to the 'end' variable, which can be zero if
we're operating at the very top of the address space. We try to
account for this, but our loop botches it. Rework the loop to use
the remaining size as our loop condition rather than the IOVA vs end.
Detected by Coverity: CID 714659
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There's nothing we can do different if pci_load_and_free_saved_state()
fails, other than maybe print some log message, but the actual re-load
of the state is an unnecessary step here since we've only just saved
it. We can cleanup a coverity warning and eliminate the unnecessary
step by freeing the state ourselves.
Detected by Coverity: CID 753101
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When sizing the TPH capability we store the register containing the
table size into the 'dword' variable, but then use the uninitialized
'byte' variable to analyze the size. The table size is also actually
reported as an N-1 value, so correct sizing to account for this.
The round_up() for both TPH and DPA is unnecessary, remove it.
Detected by Coverity: CID 714665 & 715156
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
So there is no point in checking its return value, which will soon
disappear.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
- Allow the vfio-type1 IOMMU to support multiple domains within a container
- Plumb path to query whether all domains are cache-coherent
- Wire query into kvm-vfio device to avoid KVM x86 WBINVD emulation
- Always select CONFIG_ANON_INODES, vfio depends on it (Arnd)
The first patch also makes the vfio-type1 IOMMU driver completely independent
of the bus_type of the devices it's handling, which enables it to be used for
both vfio-pci and a future vfio-platform (and hopefully combinations involving
both simultaneously).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJTPbikAAoJECObm247sIsiG9cP/jnurW84YuAHmybzy3R4nMaa
8BWcQST+TyQ78GWvubFDcRt+vHmJUI4iFWPBIm1twBSVrMy6F1GcT/spmSne1Dfb
OvcfGEy59tGO+BeklHg5Qq2Hj1UzfeV9uQoy+PrpTF/sYzsyL6g5+O3Zv39SNvr0
zCwnO2JAKXIpQlkK3wVha6V13X07Z/+d2b4JSsvmON2cnlPzOUYNrgBvoSeV2IQe
3kMAU9qfAfQlpNDhRRZL/HshgWazCg4XZUp9UdNjUkOxrUp4vXHrVTUJnUGBDytk
8V9UGRKF3mik9PqpZJk4jLV5urgUVpnUR5747uqs0KF+9GWxClXvh3gp1XX8Zn7f
NCfDSMn/wrDr6fQBglKUadITaDhF49KV+J0cX083q46BbYxhAfgxv1I9UOvLreG6
SGAezlTB0mefa1BmSwJdfKwDBsuDOhbF0TsHC6zT7yFc8pqe3hl/vQzUyu3HtwuM
yPycQiQQmvOaymaiiBRwyLocwJbDNKPigDomv8NZ8Mcd97Fu9YsF4ydaxMJmmeKf
TffYS4z4tUElYiU2vv1ipB8T+ReCmdtj2cIvyfwTL9jycY9ouO4bJ56dOGWd3WNl
m7AVokbrrJQ03itUpFMuYjHAzTNUlXVvXlGr9n6L4VTR0RTL1zwJVrMVDsXdhxm4
XgDbVB5PrxinDAC1vJvI
=mnzO
-----END PGP SIGNATURE-----
Merge tag 'vfio-v3.15-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
"VFIO updates for v3.15 include:
- Allow the vfio-type1 IOMMU to support multiple domains within a
container
- Plumb path to query whether all domains are cache-coherent
- Wire query into kvm-vfio device to avoid KVM x86 WBINVD emulation
- Always select CONFIG_ANON_INODES, vfio depends on it (Arnd)
The first patch also makes the vfio-type1 IOMMU driver completely
independent of the bus_type of the devices it's handling, which
enables it to be used for both vfio-pci and a future vfio-platform
(and hopefully combinations involving both simultaneously)"
* tag 'vfio-v3.15-rc1' of git://github.com/awilliam/linux-vfio:
vfio: always select ANON_INODES
kvm/vfio: Support for DMA coherent IOMMUs
vfio: Add external user check extension interface
vfio/type1: Add extension to test DMA cache coherence of IOMMU
vfio/iommu_type1: Multi-IOMMU domain support
Enumeration
- Increment max correctly in pci_scan_bridge() (Andreas Noever)
- Clarify the "scan anyway" comment in pci_scan_bridge() (Andreas Noever)
- Assign CardBus bus number only during the second pass (Andreas Noever)
- Use request_resource_conflict() instead of insert_ for bus numbers (Andreas Noever)
- Make sure bus number resources stay within their parents bounds (Andreas Noever)
- Remove pci_fixup_parent_subordinate_busnr() (Andreas Noever)
- Check for child busses which use more bus numbers than allocated (Andreas Noever)
- Don't scan random busses in pci_scan_bridge() (Andreas Noever)
- x86: Drop pcibios_scan_root() check for bus already scanned (Bjorn Helgaas)
- x86: Use pcibios_scan_root() instead of pci_scan_bus_with_sysdata() (Bjorn Helgaas)
- x86: Use pcibios_scan_root() instead of pci_scan_bus_on_node() (Bjorn Helgaas)
- x86: Merge pci_scan_bus_on_node() into pcibios_scan_root() (Bjorn Helgaas)
- x86: Drop return value of pcibios_scan_root() (Bjorn Helgaas)
NUMA
- x86: Add x86_pci_root_bus_node() to look up NUMA node from PCI bus (Bjorn Helgaas)
- x86: Use x86_pci_root_bus_node() instead of get_mp_bus_to_node() (Bjorn Helgaas)
- x86: Remove mp_bus_to_node[], set_mp_bus_to_node(), get_mp_bus_to_node() (Bjorn Helgaas)
- x86: Use NUMA_NO_NODE, not -1, for unknown node (Bjorn Helgaas)
- x86: Remove acpi_get_pxm() usage (Bjorn Helgaas)
- ia64: Use NUMA_NO_NODE, not MAX_NUMNODES, for unknown node (Bjorn Helgaas)
- ia64: Remove acpi_get_pxm() usage (Bjorn Helgaas)
- ACPI: Fix acpi_get_node() prototype (Bjorn Helgaas)
Resource management
- i2o: Fix and refactor PCI space allocation (Bjorn Helgaas)
- Add resource_contains() (Bjorn Helgaas)
- Add %pR support for IORESOURCE_UNSET (Bjorn Helgaas)
- Mark resources as IORESOURCE_UNSET if we can't assign them (Bjorn Helgaas)
- Don't clear IORESOURCE_UNSET when updating BAR (Bjorn Helgaas)
- Check IORESOURCE_UNSET before updating BAR (Bjorn Helgaas)
- Don't try to claim IORESOURCE_UNSET resources (Bjorn Helgaas)
- Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit (Bjorn Helgaas)
- Don't enable decoding if BAR hasn't been assigned an address (Bjorn Helgaas)
- Add "weak" generic pcibios_enable_device() implementation (Bjorn Helgaas)
- alpha, microblaze, sh, sparc, tile: Use default pcibios_enable_device() (Bjorn Helgaas)
- s390: Use generic pci_enable_resources() (Bjorn Helgaas)
- Don't check resource_size() in pci_bus_alloc_resource() (Bjorn Helgaas)
- Set type in __request_region() (Bjorn Helgaas)
- Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region() (Bjorn Helgaas)
- Change pci_bus_alloc_resource() type_mask to unsigned long (Bjorn Helgaas)
- Log IDE resource quirk in dmesg (Bjorn Helgaas)
- Revert "[PATCH] Insert GART region into resource map" (Bjorn Helgaas)
PCI device hotplug
- Make check_link_active() non-static (Rajat Jain)
- Use link change notifications for hot-plug and removal (Rajat Jain)
- Enable link state change notifications (Rajat Jain)
- Don't disable the link permanently during removal (Rajat Jain)
- Don't check adapter or latch status while disabling (Rajat Jain)
- Disable link notification across slot reset (Rajat Jain)
- Ensure very fast hotplug events are also processed (Rajat Jain)
- Add hotplug_lock to serialize hotplug events (Rajat Jain)
- Remove a non-existent card, regardless of "surprise" capability (Rajat Jain)
- Don't turn slot off when hot-added device already exists (Yijing Wang)
MSI
- Keep pci_enable_msi() documentation (Alexander Gordeev)
- ahci: Fix broken single MSI fallback (Alexander Gordeev)
- ahci, vfio: Use pci_enable_msi_range() (Alexander Gordeev)
- Check kmalloc() return value, fix leak of name (Greg Kroah-Hartman)
- Fix leak of msi_attrs (Greg Kroah-Hartman)
- Fix pci_msix_vec_count() htmldocs failure (Masanari Iida)
Virtualization
- Device-specific ACS support (Alex Williamson)
Freescale i.MX6
- Wait for retraining (Marek Vasut)
Marvell MVEBU
- Use Device ID and revision from underlying endpoint (Andrew Lunn)
- Fix incorrect size for PCI aperture resources (Jason Gunthorpe)
- Call request_resource() on the apertures (Jason Gunthorpe)
- Fix potential issue in range parsing (Jean-Jacques Hiblot)
Renesas R-Car
- Check platform_get_irq() return code (Ben Dooks)
- Add error interrupt handling (Ben Dooks)
- Fix bridge logic configuration accesses (Ben Dooks)
- Register each instance independently (Magnus Damm)
- Break out window size handling (Magnus Damm)
- Make the Kconfig dependencies more generic (Magnus Damm)
Synopsys DesignWare
- Fix RC BAR to be single 64-bit non-prefetchable memory (Mohit Kumar)
Miscellaneous
- Remove unused SR-IOV VF Migration support (Bjorn Helgaas)
- Enable INTx if BIOS left them disabled (Bjorn Helgaas)
- Fix hex vs decimal typo in cpqhpc_probe() (Dan Carpenter)
- Clean up par-arch object file list (Liviu Dudau)
- Set IORESOURCE_ROM_SHADOW only for the default VGA device (Sander Eikelenboom)
- ACPI, ARM, drm, powerpc, pcmcia, PCI: Use list_for_each_entry() for bus traversal (Yijing Wang)
- Fix pci_bus_b() build failure (Paul Gortmaker)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJTOdAZAAoJEFmIoMA60/r8VYUQALRrReyMBk3pjRt/fKIX4Kwi
ydSo/YJeeKTN8K93fLw8bb8bdPItJScJFTfEa4Q2SpZezR/ecGXLowisy0BBaPHK
qtOyB8EqjkLS17GfyecIe9Nd2SIAI2De/0bchK3kDtIX1YlZB/k/tD3eCPMHDnnl
m8c5kAHKPQYd8g01I+S8nrtGHk/A33grfYpJXPZbcqyhE0lWU3SI8KDAGbcKzNHE
23Do0yNyd4nHIdixWlhETcNvzHn35Q/O38JJwW9Mf1aI9gusYuml6GFefCgu/iov
lxqp3CEW7iPZgQEgNbrQ0HzWn/durL2Trd6S/Yh6f2xbm1LGYKWh3LZUFLd3AQDd
INEpUgKsyb//nF3dtiyGnZlp0QykoqFyLo2AEDrb+ILTd4up5DeRY/m1UpjAXR5p
QicBmrDksHrSivPmMZwLx1DFQYKjQbdx5lOqy9hQM/Jmsr+N3/l7QBrbQWXks3JZ
NNAyn4RZHQB7UDQS/MmVPArs+JK5qaEDQD57QuOTlqgP19VY9C9E/l/aEqefjdFo
XOAm7CwGpB/iBAkIbE6ROEDiJArigRVHEfxLYeE/jtGOdRDCD1deWk+g3S8DWD7m
ZxWSgIVB00PMAmomczdg59YVFBhocgwPUa8/cw6yqzx2QKP4mWXIFZ/Sjau5I3tn
WWoxXlUirZfTJc29XnVy
=3mNS
-----END PGP SIGNATURE-----
Merge tag 'pci-v3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI changes from Bjorn Helgaas:
"Enumeration
- Increment max correctly in pci_scan_bridge() (Andreas Noever)
- Clarify the "scan anyway" comment in pci_scan_bridge() (Andreas Noever)
- Assign CardBus bus number only during the second pass (Andreas Noever)
- Use request_resource_conflict() instead of insert_ for bus numbers (Andreas Noever)
- Make sure bus number resources stay within their parents bounds (Andreas Noever)
- Remove pci_fixup_parent_subordinate_busnr() (Andreas Noever)
- Check for child busses which use more bus numbers than allocated (Andreas Noever)
- Don't scan random busses in pci_scan_bridge() (Andreas Noever)
- x86: Drop pcibios_scan_root() check for bus already scanned (Bjorn Helgaas)
- x86: Use pcibios_scan_root() instead of pci_scan_bus_with_sysdata() (Bjorn Helgaas)
- x86: Use pcibios_scan_root() instead of pci_scan_bus_on_node() (Bjorn Helgaas)
- x86: Merge pci_scan_bus_on_node() into pcibios_scan_root() (Bjorn Helgaas)
- x86: Drop return value of pcibios_scan_root() (Bjorn Helgaas)
NUMA
- x86: Add x86_pci_root_bus_node() to look up NUMA node from PCI bus (Bjorn Helgaas)
- x86: Use x86_pci_root_bus_node() instead of get_mp_bus_to_node() (Bjorn Helgaas)
- x86: Remove mp_bus_to_node[], set_mp_bus_to_node(), get_mp_bus_to_node() (Bjorn Helgaas)
- x86: Use NUMA_NO_NODE, not -1, for unknown node (Bjorn Helgaas)
- x86: Remove acpi_get_pxm() usage (Bjorn Helgaas)
- ia64: Use NUMA_NO_NODE, not MAX_NUMNODES, for unknown node (Bjorn Helgaas)
- ia64: Remove acpi_get_pxm() usage (Bjorn Helgaas)
- ACPI: Fix acpi_get_node() prototype (Bjorn Helgaas)
Resource management
- i2o: Fix and refactor PCI space allocation (Bjorn Helgaas)
- Add resource_contains() (Bjorn Helgaas)
- Add %pR support for IORESOURCE_UNSET (Bjorn Helgaas)
- Mark resources as IORESOURCE_UNSET if we can't assign them (Bjorn Helgaas)
- Don't clear IORESOURCE_UNSET when updating BAR (Bjorn Helgaas)
- Check IORESOURCE_UNSET before updating BAR (Bjorn Helgaas)
- Don't try to claim IORESOURCE_UNSET resources (Bjorn Helgaas)
- Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit (Bjorn Helgaas)
- Don't enable decoding if BAR hasn't been assigned an address (Bjorn Helgaas)
- Add "weak" generic pcibios_enable_device() implementation (Bjorn Helgaas)
- alpha, microblaze, sh, sparc, tile: Use default pcibios_enable_device() (Bjorn Helgaas)
- s390: Use generic pci_enable_resources() (Bjorn Helgaas)
- Don't check resource_size() in pci_bus_alloc_resource() (Bjorn Helgaas)
- Set type in __request_region() (Bjorn Helgaas)
- Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region() (Bjorn Helgaas)
- Change pci_bus_alloc_resource() type_mask to unsigned long (Bjorn Helgaas)
- Log IDE resource quirk in dmesg (Bjorn Helgaas)
- Revert "[PATCH] Insert GART region into resource map" (Bjorn Helgaas)
PCI device hotplug
- Make check_link_active() non-static (Rajat Jain)
- Use link change notifications for hot-plug and removal (Rajat Jain)
- Enable link state change notifications (Rajat Jain)
- Don't disable the link permanently during removal (Rajat Jain)
- Don't check adapter or latch status while disabling (Rajat Jain)
- Disable link notification across slot reset (Rajat Jain)
- Ensure very fast hotplug events are also processed (Rajat Jain)
- Add hotplug_lock to serialize hotplug events (Rajat Jain)
- Remove a non-existent card, regardless of "surprise" capability (Rajat Jain)
- Don't turn slot off when hot-added device already exists (Yijing Wang)
MSI
- Keep pci_enable_msi() documentation (Alexander Gordeev)
- ahci: Fix broken single MSI fallback (Alexander Gordeev)
- ahci, vfio: Use pci_enable_msi_range() (Alexander Gordeev)
- Check kmalloc() return value, fix leak of name (Greg Kroah-Hartman)
- Fix leak of msi_attrs (Greg Kroah-Hartman)
- Fix pci_msix_vec_count() htmldocs failure (Masanari Iida)
Virtualization
- Device-specific ACS support (Alex Williamson)
Freescale i.MX6
- Wait for retraining (Marek Vasut)
Marvell MVEBU
- Use Device ID and revision from underlying endpoint (Andrew Lunn)
- Fix incorrect size for PCI aperture resources (Jason Gunthorpe)
- Call request_resource() on the apertures (Jason Gunthorpe)
- Fix potential issue in range parsing (Jean-Jacques Hiblot)
Renesas R-Car
- Check platform_get_irq() return code (Ben Dooks)
- Add error interrupt handling (Ben Dooks)
- Fix bridge logic configuration accesses (Ben Dooks)
- Register each instance independently (Magnus Damm)
- Break out window size handling (Magnus Damm)
- Make the Kconfig dependencies more generic (Magnus Damm)
Synopsys DesignWare
- Fix RC BAR to be single 64-bit non-prefetchable memory (Mohit Kumar)
Miscellaneous
- Remove unused SR-IOV VF Migration support (Bjorn Helgaas)
- Enable INTx if BIOS left them disabled (Bjorn Helgaas)
- Fix hex vs decimal typo in cpqhpc_probe() (Dan Carpenter)
- Clean up par-arch object file list (Liviu Dudau)
- Set IORESOURCE_ROM_SHADOW only for the default VGA device (Sander Eikelenboom)
- ACPI, ARM, drm, powerpc, pcmcia, PCI: Use list_for_each_entry() for bus traversal (Yijing Wang)
- Fix pci_bus_b() build failure (Paul Gortmaker)"
* tag 'pci-v3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (108 commits)
Revert "[PATCH] Insert GART region into resource map"
PCI: Log IDE resource quirk in dmesg
PCI: Change pci_bus_alloc_resource() type_mask to unsigned long
PCI: Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region()
resources: Set type in __request_region()
PCI: Don't check resource_size() in pci_bus_alloc_resource()
s390/PCI: Use generic pci_enable_resources()
tile PCI RC: Use default pcibios_enable_device()
sparc/PCI: Use default pcibios_enable_device() (Leon only)
sh/PCI: Use default pcibios_enable_device()
microblaze/PCI: Use default pcibios_enable_device()
alpha/PCI: Use default pcibios_enable_device()
PCI: Add "weak" generic pcibios_enable_device() implementation
PCI: Don't enable decoding if BAR hasn't been assigned an address
PCI: Enable INTx in pci_reenable_device() only when MSI/MSI-X not enabled
PCI: Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit
PCI: Don't try to claim IORESOURCE_UNSET resources
PCI: Check IORESOURCE_UNSET before updating BAR
PCI: Don't clear IORESOURCE_UNSET when updating BAR
PCI: Mark resources as IORESOURCE_UNSET if we can't assign them
...
Conflicts:
arch/x86/include/asm/topology.h
drivers/ata/ahci.c
The vfio code cannot be built when CONFIG_ANON_INODES is
disabled, so this enforces the symbol to be enabled through
Kconfig.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Commit bf6bddf192 ("mm: introduce compaction and migration for
ballooned pages") introduces page_count(page) into memory compaction
which dereferences page->first_page if PageTail(page).
This results in a very rare NULL pointer dereference on the
aforementioned page_count(page). Indeed, anything that does
compound_head(), including page_count() is susceptible to racing with
prep_compound_page() and seeing a NULL or dangling page->first_page
pointer.
This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation. This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling. The patch then adds a store memory barrier to
prep_compound_page() to ensure page->first_page is set.
This is the safest way to ensure we see the head page that we are
expecting, PageTail(page) is already in the unlikely() path and the
memory barriers are unfortunately required.
Hugetlbfs is the exception, we don't enforce a store memory barrier
during init since no race is possible.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This lets us check extensions, particularly VFIO_DMA_CC_IOMMU using
the external user interface, allowing KVM to probe IOMMU coherency.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Now that the type1 IOMMU backend can support IOMMU_CACHE, we need to
be able to test whether coherency is currently enforced. Add an
extension for this.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We currently have a problem that we cannot support advanced features
of an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee
that those features will be supported by all of the hardware units
involved with the domain over its lifetime. For instance, the Intel
VT-d architecture does not require that all DRHDs support snoop
control. If we create a domain based on a device behind a DRHD that
does support snoop control and enable SNP support via the IOMMU_CACHE
mapping option, we cannot then add a device behind a DRHD which does
not support snoop control or we'll get reserved bit faults from the
SNP bit in the pagetables. To add to the complexity, we can't know
the properties of a domain until a device is attached.
We could pass this problem off to userspace and require that a
separate vfio container be used, but we don't know how to handle page
accounting in that case. How do we know that a page pinned in one
container is the same page as a different container and avoid double
billing the user for the page.
The solution is therefore to support multiple IOMMU domains per
container. In the majority of cases, only one domain will be required
since hardware is typically consistent within a system. However, this
provides us the ability to validate compatibility of domains and
support mixed environments where page table flags can be different
between domains.
To do this, our DMA tracking needs to change. We currently try to
coalesce user mappings into as few tracking entries as possible. The
problem then becomes that we lose granularity of user mappings. We've
never guaranteed that a user is able to unmap at a finer granularity
than the original mapping, but we must honor the granularity of the
original mapping. This coalescing code is therefore removed, allowing
only unmaps covering complete maps. The change in accounting is
fairly small here, a typical QEMU VM will start out with roughly a
dozen entries, so it's arguable if this coalescing was ever needed.
We also move IOMMU domain creation to the point where a group is
attached to the container. An interesting side-effect of this is that
we now have access to the device at the time of domain creation and
can probe the devices within the group to determine the bus_type.
This finally makes vfio_iommu_type1 completely device/bus agnostic.
In fact, each IOMMU domain can host devices on different buses managed
by different physical IOMMUs, and present a single DMA mapping
interface to the user. When a new domain is created, mappings are
replayed to bring the IOMMU pagetables up to the state of the current
container. And of course, DMA mapping and unmapping automatically
traverse all of the configured IOMMU domains.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Varun Sethi <Varun.Sethi@freescale.com>
pci_enable_msix() and pci_enable_msi_block() have been deprecated; use
pci_enable_msix_range() and pci_enable_msi_range() instead.
[bhelgaas: changelog]
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Pull powerpc updates from Ben Herrenschmidt:
"So here's my next branch for powerpc. A bit late as I was on vacation
last week. It's mostly the same stuff that was in next already, I
just added two patches today which are the wiring up of lockref for
powerpc, which for some reason fell through the cracks last time and
is trivial.
The highlights are, in addition to a bunch of bug fixes:
- Reworked Machine Check handling on kernels running without a
hypervisor (or acting as a hypervisor). Provides hooks to handle
some errors in real mode such as TLB errors, handle SLB errors,
etc...
- Support for retrieving memory error information from the service
processor on IBM servers running without a hypervisor and routing
them to the memory poison infrastructure.
- _PAGE_NUMA support on server processors
- 32-bit BookE relocatable kernel support
- FSL e6500 hardware tablewalk support
- A bunch of new/revived board support
- FSL e6500 deeper idle states and altivec powerdown support
You'll notice a generic mm change here, it has been acked by the
relevant authorities and is a pre-req for our _PAGE_NUMA support"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (121 commits)
powerpc: Implement arch_spin_is_locked() using arch_spin_value_unlocked()
powerpc: Add support for the optimised lockref implementation
powerpc/powernv: Call OPAL sync before kexec'ing
powerpc/eeh: Escalate error on non-existing PE
powerpc/eeh: Handle multiple EEH errors
powerpc: Fix transactional FP/VMX/VSX unavailable handlers
powerpc: Don't corrupt transactional state when using FP/VMX in kernel
powerpc: Reclaim two unused thread_info flag bits
powerpc: Fix races with irq_work
Move precessing of MCE queued event out from syscall exit path.
pseries/cpuidle: Remove redundant call to ppc64_runlatch_off() in cpu idle routines
powerpc: Make add_system_ram_resources() __init
powerpc: add SATA_MV to ppc64_defconfig
powerpc/powernv: Increase candidate fw image size
powerpc: Add debug checks to catch invalid cpu-to-node mappings
powerpc: Fix the setup of CPU-to-Node mappings during CPU online
powerpc/iommu: Don't detach device without IOMMU group
powerpc/eeh: Hotplug improvement
powerpc/eeh: Call opal_pci_reinit() on powernv for restoring config space
powerpc/eeh: Add restore_config operation
...
- Remove unnecessary and dangerous use of device_lock
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJS4T1aAAoJECObm247sIsiLMkQAKn11qSLk880dJO+cfW2VpqL
vm3WreShOxNsKTEQfcbrnJtS+kYGkiyo8+Vve377QDIg+xeXpNZa0PbAEfh4HyuJ
NH+vn0FYDf3rzF4Z9snm2HOHWCAsqLCE0guyMwwhUYH+fks4+YTJWCm4YWZOXbIS
aI++G+5LBxGzMRbotoovqUhQy474fpQN/GFIlIZHP2bGtdeXiRuH0qH3BxVKafOF
50RtAgJsLm1Nb+S20Umsz+llL3sop1wcPLKOlgCXqd+545CJAoq/qqEPjUtOrr+S
qXvXhb0XM7zo73H2WzkVmS09HZWX4sMoHBYp8D6ToOIFF01LOVsXBKduPCPpTD7E
zohhjgag9/RQ99l2B153IIIv0Bsg7b4YXBQ6qkWFoJolPL94LTq1PXk0f9mNhyPo
mSqatcAeI5qtjqu9ncSd4YYTdBgp7SJcgwji8fI44tvzzpz8iheVUrpwcU1UumAK
TpXLBoTLXllK1op3u9xgFrF4ISBMl7lZeZp3c1/1YRga7Ch6SdlB0tcLPfuSBRF0
1Qb5jQrWz4qt/oZSkapcFALXQDgwoK8am9I0a5YXdhy9gSYm+o/TLwG5pTEnF4In
xxuibmmJAmNcTJ9q6rUKjLLU/TqRKin+vyqW6is41IredgvNX8XDS3YokA5Fgv01
mu1s7odFi08xjPRuYvmq
=t9+R
-----END PGP SIGNATURE-----
Merge tag 'vfio-v3.14-rc1' of git://github.com/awilliam/linux-vfio
Pull vfio update from Alex Williamson:
- convert to misc driver to support module auto loading
- remove unnecessary and dangerous use of device_lock
* tag 'vfio-v3.14-rc1' of git://github.com/awilliam/linux-vfio:
vfio-pci: Don't use device_lock around AER interrupt setup
vfio: Convert control interface to misc driver
misc: Reserve minor for VFIO
PCI resets will attempt to take the device_lock for any device to be
reset. This is a problem if that lock is already held, for instance
in the device remove path. It's not sufficient to simply kill the
user process or skip the reset if called after .remove as a race could
result in the same deadlock. Instead, we handle all resets as "best
effort" using the PCI "try" reset interfaces. This prevents the user
from being able to induce a deadlock by triggering a reset.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
device_lock is much too prone to lockups. For instance if we have a
pending .remove then device_lock is already held. If userspace
attempts to modify AER signaling after that point, a deadlock occurs.
eventfd setup/teardown is already protected in vfio with the igate
mutex. AER is not a high performance interrupt, so we can also use
the same mutex to protect signaling versus setup races.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The powerpc iommu uses a hardcoded page size of 4K. This patch changes
the name of the IOMMU_PAGE_* macros to reflect the hardcoded values. A
future patch will use the existing names to support dynamic page
sizes.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This change allows us to support module auto loading using devname
support in userspace tools. With this, /dev/vfio/vfio will always
be present and opening it will cause the vfio module to load. This
should avoid needing to configure the system to statically load
vfio in order to get libvirt to correctly detect support for it.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
These are set of two capability registers, it's pretty much given that
they're registers, so reflect their purpose in the name.
Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
In vfio_iommu_type1.c there is a bug in vfio_dma_do_map, when checking
that pages are not already mapped. Since the check is being done in a
for loop nested within the main loop, breaking out of it does not create
the intended behavior. If the underlying IOMMU driver returns a non-NULL
value, this will be ignored and mapping the DMA range will be attempted
anyway, leading to unpredictable behavior.
This interracts badly with the ARM SMMU driver issue fixed in the patch
that was submitted with the title:
"[PATCH 2/2] ARM: SMMU: return NULL on error in arm_smmu_iova_to_phys"
Both fixes are required in order to use the vfio_iommu_type1 driver
with an ARM SMMU.
This patch refactors the function slightly, in order to also make this
kind of bug less likely.
Signed-off-by: Antonios Motakis <a.motakis@virtualopensystems.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The current VFIO_DEVICE_RESET interface only maps to PCI use cases
where we can isolate the reset to the individual PCI function. This
means the device must support FLR (PCIe or AF), PM reset on D3hot->D0
transition, device specific reset, or be a singleton device on a bus
for a secondary bus reset. FLR does not have widespread support,
PM reset is not very reliable, and bus topology is dictated by the
system and device design. We need to provide a means for a user to
induce a bus reset in cases where the existing mechanisms are not
available or not reliable.
This device specific extension to VFIO provides the user with this
ability. Two new ioctls are introduced:
- VFIO_DEVICE_PCI_GET_HOT_RESET_INFO
- VFIO_DEVICE_PCI_HOT_RESET
The first provides the user with information about the extent of
devices affected by a hot reset. This is essentially a list of
devices and the IOMMU groups they belong to. The user may then
initiate a hot reset by calling the second ioctl. We must be
careful that the user has ownership of all the affected devices
found via the first ioctl, so the second ioctl takes a list of file
descriptors for the VFIO groups affected by the reset. Each group
must have IOMMU protection established for the ioctl to succeed.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Having PCIe/PCI-X capability isn't enough to assume that there are
extended capabilities. Both specs define that the first capability
header is all zero if there are no extended capabilities. Testing
for this avoids an erroneous message about hiding capability 0x0 at
offset 0x100.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
eventfd_fget() tests to see whether the file is an eventfd file, which
we then immediately pass to eventfd_ctx_fileget(), which again tests
whether the file is an eventfd file. Simplify slightly by using
fdget() so that we only test that we're looking at an eventfd once.
fget() could also be used, but fdget() makes use of fget_light() for
another slight optimization.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add the default O_CLOEXEC flag for device file descriptors. This is
generally considered a safer option as it allows the user a race free
option to decide whether file descriptors are inherited across exec,
with the default avoiding file descriptor leaks.
Reported-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Macro get_unused_fd() is used to allocate a file descriptor with
default flags. Those default flags (0) can be "unsafe":
O_CLOEXEC must be used by default to not leak file descriptor
across exec().
Instead of macro get_unused_fd(), functions anon_inode_getfd()
or get_unused_fd_flags() should be used with flags given by userspace.
If not possible, flags should be set to O_CLOEXEC to provide userspace
with a default safe behavor.
In a further patch, get_unused_fd() will be removed so that
new code start using anon_inode_getfd() or get_unused_fd_flags()
with correct flags.
This patch replaces calls to get_unused_fd() with equivalent call to
get_unused_fd_flags(0) to preserve current behavor for existing code.
The hard coded flag value (0) should be reviewed on a per-subsystem basis,
and, if possible, set to O_CLOEXEC.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Link: http://lkml.kernel.org/r/cover.1376327678.git.ydroneaud@opteya.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIO is designed to be used via ioctls on file descriptors
returned by VFIO.
However in some situations support for an external user is required.
The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
use the existing VFIO groups for exclusive access in real/virtual mode
on a host to avoid passing map/unmap requests to the user space which
would made things pretty slow.
The protocol includes:
1. do normal VFIO init operation:
- opening a new container;
- attaching group(s) to it;
- setting an IOMMU driver for a container.
When IOMMU is set for a container, all groups in it are
considered ready to use by an external user.
2. User space passes a group fd to an external user.
The external user calls vfio_group_get_external_user()
to verify that:
- the group is initialized;
- IOMMU is set for it.
If both checks passed, vfio_group_get_external_user()
increments the container user counter to prevent
the VFIO group from disposal before KVM exits.
3. The external user calls vfio_external_user_iommu_id()
to know an IOMMU ID. PPC64 KVM uses it to link logical bus
number (LIOBN) with IOMMU ID.
4. When the external KVM finishes, it calls
vfio_group_put_external_user() to release the VFIO group.
This call decrements the container user counter.
Everything gets released.
The "vfio: Limit group opens" patch is also required for the consistency.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If an attempt is made to unbind a device from vfio-pci while that
device is in use, the request is blocked until the device becomes
unused. Unfortunately, that unbind path still grabs the device_lock,
which certain things like __pci_reset_function() also want to take.
This means we need to try to acquire the locks ourselves and use the
pre-locked version, __pci_reset_function_locked().
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Remove debugging WARN_ON if we get a spurious notify for a group that
no longer exists. No reports of anyone hitting this, but it would
likely be a race and not a bug if they did.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
BUS_NOTIFY_DEL_DEVICE triggers IOMMU drivers to remove devices from
their iommu group, but there's really nothing we can do about it at
this point. If the device is in use, then the vfio sub-driver will
block the device_del from completing until it's released. If the
device is not in use or not owned by a vfio sub-driver, then we
really don't care that it's being removed.
The current code can be triggered just by unloading an sr-iov driver
(ex. igb) while the VFs are attached to vfio-pci because it makes an
incorrect assumption about the ordering of driver remove callbacks
vs the DEL_DEVICE notification.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Largely hugepage support for vfio/type1 iommu and surrounding cleanups and fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIcBAABAgAGBQJR2uNvAAoJECObm247sIsiJRYQAJK15MfXgJq2PtBABNvFUAOG
nqUvLgBgM5Ow1NI0Rzh9jkNohNqCvXDFGaWXXnsaX83hIpi59GFK31W2E3SiFCj3
xISA9SUnm7Kjt9LAF6HTNz805zBkshIOk4MCx6HlezVWSRlWwT3rZzI4dI2fMvl8
iPRk1Ion3QSQui99HWfXv/rtezAIzgZqsFqPC6DjWRfN7LcdEtKtcQwnrSb5GGY9
3TIRY9IRYTSfJ2yjSz5f5258JxoDG5sR8dTMkgG2Gm92iGvGcPGpzQWPzVc4t+TO
PdTqtv9ftEyAJKsYTFjPIod8XbzJBa1FSPadVAIfwF0JCDcsSFjoWGp+RzMQQSF8
MK3VsnQ/pqJfs2nJHDQbWbKu0qWYPntvOCdojZ4679ceDTd0t515npfYeDQuX8yU
fAA5rB46mDXjyxikTP574NdnkcGjbAj7EOCp7s+WTsVPGQQ3mId/3fQw0Wg7bE6v
jaJqdRj70SNTRHs8DFLQhvSZgpef4RzepE4sRBZqzY4vWd4riNcAC3Got+F2rQy3
X4hcHHU/5LGLoGMxOJQmuBfKVM8RAgikq6w2RfttVMLeKCknKtJ29OnotKilvILh
W8nAOGxRnkmONFfHakNJtLl5tQJ4FQXc2cG8OeIIhHgheJjUxL72/zv8bBxOo7rY
jUBjtZ5riQXc/ck4FEGI
=9+Jh
-----END PGP SIGNATURE-----
Merge tag 'vfio-v3.11' of git://github.com/awilliam/linux-vfio
Pull vfio updates from Alex Williamson:
"Largely hugepage support for vfio/type1 iommu and surrounding cleanups
and fixes"
* tag 'vfio-v3.11' of git://github.com/awilliam/linux-vfio:
vfio/type1: Fix leak on error path
vfio: Limit group opens
vfio/type1: Fix missed frees and zero sized removes
vfio: fix documentation
vfio: Provide module option to disable vfio_iommu_type1 hugepage support
vfio: hugepage support for vfio_iommu_type1
vfio: Convert type1 iommu to use rbtree
Pull powerpc updates from Ben Herrenschmidt:
"This is the powerpc changes for the 3.11 merge window. In addition to
the usual bug fixes and small updates, the main highlights are:
- Support for transparent huge pages by Aneesh Kumar for 64-bit
server processors. This allows the use of 16M pages as transparent
huge pages on kernels compiled with a 64K base page size.
- Base VFIO support for KVM on power by Alexey Kardashevskiy
- Wiring up of our nvram to the pstore infrastructure, including
putting compressed oopses in there by Aruna Balakrishnaiah
- Move, rework and improve our "EEH" (basically PCI error handling
and recovery) infrastructure. It is no longer specific to pseries
but is now usable by the new "powernv" platform as well (no
hypervisor) by Gavin Shan.
- I fixed some bugs in our math-emu instruction decoding and made it
usable to emulate some optional FP instructions on processors with
hard FP that lack them (such as fsqrt on Freescale embedded
processors).
- Support for Power8 "Event Based Branch" facility by Michael
Ellerman. This facility allows what is basically "userspace
interrupts" for performance monitor events.
- A bunch of Transactional Memory vs. Signals bug fixes and HW
breakpoint/watchpoint fixes by Michael Neuling.
And more ... I appologize in advance if I've failed to highlight
something that somebody deemed worth it."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
pstore: Add hsize argument in write_buf call of pstore_ftrace_call
powerpc/fsl: add MPIC timer wakeup support
powerpc/mpic: create mpic subsystem object
powerpc/mpic: add global timer support
powerpc/mpic: add irq_set_wake support
powerpc/85xx: enable coreint for all the 64bit boards
powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx
powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig
powerpc/mpic: Add get_version API both for internal and external use
powerpc: Handle both new style and old style reserve maps
powerpc/hw_brk: Fix off by one error when validating DAWR region end
powerpc/pseries: Support compression of oops text via pstore
powerpc/pseries: Re-organise the oops compression code
pstore: Pass header size in the pstore write callback
powerpc/powernv: Fix iommu initialization again
powerpc/pseries: Inform the hypervisor we are using EBB regs
powerpc/perf: Add power8 EBB support
powerpc/perf: Core EBB support for 64-bit book3s
powerpc/perf: Drop MMCRA from thread_struct
powerpc/perf: Don't enable if we have zero events
...
We also don't handle unpinning zero pages as an error on other exits
so we can fix that inconsistency by rolling in the next conditional
return.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_group_fops_open attempts to limit concurrent sessions by
disallowing opens once group->container is set. This really doesn't
do what we want and allow for inconsistent behavior, for instance a
group can be opened twice, then a container set giving the user two
file descriptors to the group. But then it won't allow more to be
opened. There's not much reason to have the group opened multiple
times since most access is through devices or the container, so
complete what the original code intended and only allow a single
instance.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With hugepage support we can only properly aligned and sized ranges.
We only guarantee that we can unmap the same ranges mapped and not
arbitrary sub-ranges. This means we might not free anything or might
free more than requested. The vfio unmap interface started storing
the unmapped size to return to userspace to handle this. This patch
fixes a few places where we don't properly handle those cases, moves
a memory allocation to a place where failure is an option and checks
our loops to make sure we don't get into an infinite loop trying to
remove an overlap.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add a module option to vfio_iommu_type1 to disable IOMMU hugepage
support. This causes iommu_map to only be called with single page
mappings, disabling the IOMMU driver's ability to use hugepages.
This option can be enabled by loading vfio_iommu_type1 with
disable_hugepages=1 or dynamically through sysfs. If enabled
dynamically, only new mappings are restricted.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We currently send all mappings to the iommu in PAGE_SIZE chunks,
which prevents the iommu from enabling support for larger page sizes.
We still need to pin pages, which means we step through them in
PAGE_SIZE chunks, but we can batch up contiguous physical memory
chunks to allow the iommu the opportunity to use larger pages. The
approach here is a bit different that the one currently used for
legacy KVM device assignment. Rather than looking at the vma page
size and using that as the maximum size to pass to the iommu, we
instead simply look at whether the next page is physically
contiguous. This means we might ask the iommu to map a 4MB region,
while legacy KVM might limit itself to a maximum of 2MB.
Splitting our mapping path also allows us to be smarter about locked
memory because we can more easily unwind if the user attempts to
exceed the limit. Therefore, rather than assuming that a mapping
will result in locked memory, we test each page as it is pinned to
determine whether it locks RAM vs an mmap'd MMIO region. This should
result in better locking granularity and less locked page fudge
factors in userspace.
The unmap path uses the same algorithm as legacy KVM. We don't want
to track the pfn for each mapping ourselves, but we need the pfn in
order to unpin pages. We therefore ask the iommu for the iova to
physical address translation, ask it to unpin a page, and see how many
pages were actually unpinned. iommus supporting large pages will
often return something bigger than a page here, which we know will be
physically contiguous and we can unpin a batch of pfns. iommus not
supporting large mappings won't see an improvement in batching here as
they only unmap a page at a time.
With this change, we also make a clarification to the API for mapping
and unmapping DMA. We can only guarantee unmaps at the same
granularity as used for the original mapping. In other words,
unmapping a subregion of a previous mapping is not guaranteed and may
result in a larger or smaller unmapping than requested. The size
field in the unmapping structure is updated to reflect this.
Previously this was unmodified on mapping, always returning the the
requested unmap size. This is now updated to return the actual unmap
size on success, allowing userspace to appropriately track mappings.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We need to keep track of all the DMA mappings of an iommu container so
that it can be automatically unmapped when the user releases the file
descriptor. We currently do this using a simple list, where we merge
entries with contiguous iovas and virtual addresses. Using a tree for
this is a bit more efficient and allows us to use common code instead
of inventing our own.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The enables VFIO on the pSeries platform, enabling user space
programs to access PCI devices directly.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
VFIO implements platform independent stuff such as
a PCI driver, BAR access (via read/write on a file descriptor
or direct mapping when possible) and IRQ signaling.
The platform dependent part includes IOMMU initialization
and handling. This implements an IOMMU driver for VFIO
which does mapping/unmapping pages for the guest IO and
provides information about DMA window (required by a POWER
guest).
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
devtmpfs_delete_node() calls devnode() callback with mode==NULL but
vfio still tries to write there.
The patch fixes this.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Changes include extension to support PCI AER notification to userspace, byte granularity of PCI config space and access to unarchitected PCI config space, better protection around IOMMU driver accesses, default file mode fix, and a few misc cleanups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIcBAABAgAGBQJRgox9AAoJECObm247sIsizJwP/23KprR6BuRt168Obo+xC/lp
kj1E4Hd4XHx6T/5XRexa4GqhmyLqgoqbS19uj/K6ebY9rouVKo0V6OYNFDQiFR/F
yEGjYIBKV9/eBZMQI4RbZpZKGDXTeFrXyjsac1m51JrR3st8zsvZlNPE/TjzhSfz
jMnsCC99ZwIFjmptw/yWwgInswy1n9H9iICS14Xn1x7v71iyOE32reTG6M9HsPHe
Xm5F0K5iLQX8qoISHAomrTnFmCLxp5Y2qhh7nZWmC2gCbPBEnGdqx4prwzJayAaC
DAN8osaYldoKQuwI4wFYMICHxxCEFk8xU58FeKCmeQJZgomefVAoRKf86gAEsSLR
XHptBQOktWVKNFhzZWyOuAX3iua/hmWcOa05I6inycV1x2ZAGlNhvJnj1iKIphLk
+neBFAD9wDcAJZdXkfudq0jJ9jJTSAWBv8d+hozNfkjTVhF+wRnhZ7V6dZfb9+W6
kPGbgwqKApLoOabQbxbZ5Ftr6S7344prB/HAMywGa+xZfoxoYPxBHCi0rSTvUTWy
oWLtzrKRbjBDc0qF4eEK9RZlmVt2ZmCUnUB3kbWED4mrJAHu4LlFghnAloePrIZ3
kXlMARWbyw/E+1WIMO4Ito5+1s3zVsJJgzVsAQDJkTQw2aYDhns73/Y25Dj7x7QS
BKebrXnh2xVCN6OIu+eX
=F0Cc
-----END PGP SIGNATURE-----
Merge tag 'vfio-for-v3.10' of git://github.com/awilliam/linux-vfio
Pull vfio updates from Alex Williamson:
"Changes include extension to support PCI AER notification to
userspace, byte granularity of PCI config space and access to
unarchitected PCI config space, better protection around IOMMU driver
accesses, default file mode fix, and a few misc cleanups."
* tag 'vfio-for-v3.10' of git://github.com/awilliam/linux-vfio:
vfio: Set container device mode
vfio: Use down_reads to protect iommu disconnects
vfio: Convert container->group_lock to rwsem
PCI/VFIO: use pcie_flags_reg instead of access PCI-E Capabilities Register
vfio-pci: Enable raw access to unassigned config space
vfio-pci: Use byte granularity in config map
vfio: make local function vfio_pci_intx_unmask_handler() static
VFIO-AER: Vfio-pci driver changes for supporting AER
VFIO: Wrapper for getting reference to vfio_device
Minor 0 is the VFIO container device (/dev/vfio/vfio). On it's own
the container does not provide a user with any privileged access. It
only supports API version check and extension check ioctls. Only by
attaching a VFIO group to the container does it gain any access. Set
the mode of the container to allow access.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If a group or device is released or a container is unset from a group
it can race against file ops on the container. Protect these with
down_reads to allow concurrent users.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reported-by: Michael S. Tsirkin <mst@redhat.com>
All current users are writers, maintaining current mutual exclusion.
This lets us add read users next.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We now cache the MSI/MSI-X capability offsets in the struct pci_dev,
so no need to find the capabilities again.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
PCI_MSIX_FLAGS_BIRMASK is mis-named because the BIR mask is in the
Table Offset register, not the flags ("Message Control" per spec)
register.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Currently, we use pcie_flags_reg to cache PCI-E Capabilities Register,
because PCI-E Capabilities Register bits are almost read-only. This patch
use pcie_caps_reg() instead of another access PCI-E Capabilities Register.
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Devices like be2net hide registers between the gaps in capabilities
and architected regions of PCI config space. Our choices to support
such devices is to either build an ever growing and unmanageable white
list or rely on hardware isolation to protect us. These registers are
really no different than MMIO or I/O port space registers, which we
don't attempt to regulate, so treat PCI config space in the same way.
Reported-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Gavin Shan <shangw@linux.vnet.ibm.com>
The config map previously used a byte per dword to map regions of
config space to capabilities. Modulo a bug where we round the length
of capabilities down instead of up, this theoretically works well and
saves space so long as devices don't try to hide registers in the gaps
between capabilities. Unfortunately they do exactly that so we need
byte granularity on our config space map. Increase the allocation of
the config map and split accesses at capability region boundaries.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Gavin Shan <shangw@linux.vnet.ibm.com>
The VFIO_DEVICE_SET_IRQS ioctl takes a start and count parameter, both
of which are unsigned. We attempt to bounds check these, but fail to
account for the case where start is a very large number, allowing
start + count to wrap back into the valid range. Bounds check both
start and start + count.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_pci_intx_unmask_handler() was not declared. It should be static.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio drivers call kmalloc or kzalloc, but do not
include <linux/slab.h>, which causes build errors on
ARM.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: kvm@vger.kernel.org
- New VFIO_SET_IRQ ioctl option to pass the eventfd that is signaled when
an error occurs in the vfio_pci_device
- Register pci_error_handler for the vfio_pci driver
- When the device encounters an error, the error handler registered by
the vfio_pci driver gets invoked by the AER infrastructure
- In the error handler, signal the eventfd registered for the device.
- This results in the qemu eventfd handler getting invoked and
appropriate action taken for the guest.
Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil@hp.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
- Added vfio_device_get_from_dev() as wrapper to get
reference to vfio_device from struct device.
- Added vfio_device_data() as a wrapper to get device_data from
vfio_device.
Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil@hp.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Convert to the much saner new idr interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
PCI defines display class VGA regions at I/O port address 0x3b0, 0x3c0
and MMIO address 0xa0000. As these are non-overlapping, we can ignore
the I/O port vs MMIO difference and expose them both in a single
region. We make use of the VGA arbiter around each access to
configure chipset access as necessary.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We give the user access to change the power state of the device but
certain transitions result in an uninitialized state which the user
cannot resolve. To fix this we need to mark the PowerState field of
the PMCSR register read-only and effect the requested change on behalf
of the user. This has the added benefit that pdev->current_state
remains accurate while controlled by the user.
The primary example of this bug is a QEMU guest doing a reboot where
the device it put into D3 on shutdown and becomes unusable on the next
boot because the device did a soft reset on D3->D0 (NoSoftRst-).
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
pcieport does nice things like manage AER and we know it doesn't do
DMA or expose any user accessible devices on the host. It also keeps
the Memory, I/O, and Busmaster bits enabled, which is pretty handy
when trying to use anyting below it. Devices owned by pcieport cannot
be given to users via vfio, but we can tolerate them not being owned
by vfio-pci.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio_dev_present is meant to give us a wait_event callback so that we
can block removing a device from vfio until it becomes unused. The
root of this check depends on being able to get the iommu group from
the device. Unfortunately if the BUS_NOTIFY_DEL_DEVICE notifier has
fired then the device-group reference is no longer searchable and we
fail the lookup.
We don't need to go to such extents for this though. We have a
reference to the device, from which we can acquire a reference to the
group. We can then use the group reference to search for the device
and properly block removal.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We can actually handle MMIO and I/O port from the same access function
since PCI already does abstraction of this. The ROM BAR only requires
a minor difference, so it gets included too. vfio_pci_config_readwrite
gets renamed for consistency.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The read and write functions are nearly identical, combine them
and convert to a switch statement. This also makes it easy to
narrow the scope of when we use the io/mem accessors in case new
regions are added.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
A read from a range hidden from the user (ex. MSI-X vector table)
attempts to fill the user buffer up to the end of the excluded range
instead of up to the requested count. Fix it.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: stable@vger.kernel.org
Devices making use of PM reset are getting incorrectly identified as
not supporting reset because pci_pm_reset() fails unless the device is
in D0 power state. When first attached to vfio_pci devices are
typically in an unknown power state. We can fix this by explicitly
setting the power state or simply calling pci_enable_device() before
attempting a pci_reset_function(). We need to enable the device
anyway, so move this up in our vfio_pci_enable() function, which also
simplifies the error path a bit.
Note that pci_disable_device() does not explicitly set the power
state, so there's no need to re-order vfio_pci_disable().
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The two labels for error recovery in function vfio_pci_init() is out of
order, so fix it.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Comments from dev_driver_string(),
/* dev->driver can change to NULL underneath us because of unbinding,
* so be careful about accessing it.
*/
So use ACCESS_ONCE() to guard access to dev->driver field.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
On error recovery path in function vfio_create_group(), it should
unregister the IOMMU notifier for the new VFIO group. Otherwise it may
cause invalid memory access later when handling bus notifications.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Move the device reset to the end of our disable path, the device
should already be stopped from pci_disable_device(). This also allows
us to manipulate the save/restore to avoid the save/reset/restore +
save/restore that we had before.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The virq_disabled flag tracks the userspace view of INTx masking
across interrupt mode changes, but we're not consistently applying
this to the interrupt and masking handler notion of the device.
Currently if the user sets DisINTx while in MSI or MSIX mode, then
returns to INTx mode (ex. rebooting a qemu guest), the hardware has
DisINTx+, but the management of INTx thinks it's enabled, making it
impossible to actually clear DisINTx. Fix this by updating the
handler state when INTx is re-enabled.
Cc: stable@vger.kernel.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We need to be ready to recieve an interrupt as soon as we call
request_irq, so our eventfd context setting needs to be moved
earlier. Without this, an interrupt from our device or one
sharing the interrupt line can pass a NULL into eventfd_signal
and oops.
Cc: stable@vger.kernel.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Our mmap path mistakely relied on vma->vm_pgoff to get set in
remap_pfn_range. After b3b9c293, that path only applies to
copy-on-write mappings. Set it in our own code.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The VM_RESERVED flag was killed off in commit 314e51b985 ("mm: kill
vma flag VM_RESERVED and mm->reserved_vm counter"), and replaced by the
proper semantic flags (eg "don't core-dump" etc). But there was a new
use of VM_RESERVED that got missed by the merge.
Fix the remaining use of VM_RESERVED in the vfio_pci driver, replacing
the VM_RESERVED flag with VM_DONTEXPAND | VM_DONTDUMP.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation,org>
vfoi-pci supports a mechanism like KVM's irqfd for unmasking an
interrupt through an eventfd. There are two ways to shutdown this
interface: 1) close the eventfd, 2) ioctl (such as disabling the
interrupt). Both of these do the release through a workqueue,
which can result in a segfault if two jobs get queued for the same
virqfd.
Fix this by protecting the pointer to these virqfds by a spinlock.
The vfio pci device will therefore no longer have a reference to it
once the release job is queued under lock. On the ioctl side, we
still flush the workqueue to ensure that any outstanding releases
are completed.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
It's not critical (anymore) since another thread closing the file will block
on ->device_lock before it gets to dropping the final reference, but it's
definitely cleaner that way...
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
we really need to make sure that dropping the last reference happens
under the group->device_lock; otherwise a loop (under device_lock)
might find vfio_device instance that is being freed right now, has
already dropped the last reference and waits on device_lock to exclude
the sucker from the list.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add PCI device support for VFIO. PCI devices expose regions
for accessing config space, I/O port space, and MMIO areas
of the device. PCI config access is virtualized in the kernel,
allowing us to ensure the integrity of the system, by preventing
various accesses while reducing duplicate support across various
userspace drivers. I/O port supports read/write access while
MMIO also supports mmap of sufficiently sized regions. Support
for INTx, MSI, and MSI-X interrupts are provided using eventfds to
userspace.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This VFIO IOMMU backend is designed primarily for AMD-Vi and Intel
VT-d hardware, but is potentially usable by anything supporting
similar mapping functionality. We arbitrarily call this a Type1
backend for lack of a better name. This backend has no IOVA
or host memory mapping restrictions for the user and is optimized
for relatively static mappings. Mapped areas are pinned into system
memory.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIO is a secure user level driver for use with both virtual machines
and user level drivers. VFIO makes use of IOMMU groups to ensure the
isolation of devices in use, allowing unprivileged user access. It's
intended that VFIO will replace KVM device assignment and UIO drivers
(in cases where the target platform includes a sufficiently capable
IOMMU).
New in this version of VFIO is support for IOMMU groups managed
through the IOMMU core as well as a rework of the API, removing the
group merge interface. We now go back to a model more similar to
original VFIO with UIOMMU support where the file descriptor obtained
from /dev/vfio/vfio allows access to the IOMMU, but only after a
group is added, avoiding the previous privilege issues with this type
of model. IOMMU support is also now fully modular as IOMMUs have
vastly different interface requirements on different platforms. VFIO
users are able to query and initialize the IOMMU model of their
choice.
Please see the follow-on Documentation commit for further description
and usage example.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>