By default VFIO disables mapping of MSIX BAR to the userspace as
the userspace may program it in a way allowing spurious interrupts;
instead the userspace uses the VFIO_DEVICE_SET_IRQS ioctl.
In order to eliminate guessing from the userspace about what is
mmapable, VFIO also advertises a sparse list of regions allowed to mmap.
This works fine as long as the system page size equals to the MSIX
alignment requirement which is 4KB. However with a bigger page size
the existing code prohibits mapping non-MSIX parts of a page with MSIX
structures so these parts have to be emulated via slow reads/writes on
a VFIO device fd. If these emulated bits are accessed often, this has
serious impact on performance.
This allows mmap of the entire BAR containing MSIX vector table.
This removes the sparse capability for PCI devices as it becomes useless.
As the userspace needs to know for sure whether mmapping of the MSIX
vector containing data can succeed, this adds a new capability -
VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - which explicitly tells the userspace
that the entire BAR can be mmapped.
This does not touch the MSIX mangling in the BAR read/write handlers as
we are doing this just to enable direct access to non MSIX registers.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw - fixup whitespace, trim function name]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio_info_add_capability() helper requires the caller to pass a
capability ID, which it then uses to fill in header fields, assuming
hard coded versions. This makes for an awkward and rigid interface.
The only thing we want this helper to do is allocate sufficient
space in the caps buffer and chain this capability into the list.
Reduce it to that simple task.
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
At the moment VFIO rightfully assumes that INTx is supported if
the interrupt pin is not set to zero in the device config space.
However if that is not the case (the pin is not zero but pdev->irq is),
vfio_intx_enable() fails.
In order to prevent the userspace from trying to enable INTx when we know
that it cannot work, let's mask the PCI_INTERRUPT_PIN register.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
MRRS defines the maximum read request size a device is allowed to
make. Drivers will often increase this to allow more data transfer
with a single request. Completions to this request are bound by the
MPS setting for the bus. Aside from device quirks (none known), it
doesn't seem to make sense to set an MRRS value less than MPS, yet
this is a likely scenario given that user drivers do not have a
system-wide view of the PCI topology. Virtualize MRRS such that the
user can set MRRS >= MPS, but use MPS as the floor value that we'll
write to hardware.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With virtual PCI-Express chipsets, we now see userspace/guest drivers
trying to match the physical MPS setting to a virtual downstream port.
Of course a lone physical device surrounded by virtual interconnects
cannot make a correct decision for a proper MPS setting. Instead,
let's virtualize the MPS control register so that writes through to
hardware are disallowed. Userspace drivers like QEMU assume they can
write anything to the device and we'll filter out anything dangerous.
Since mismatched MPS can lead to AER and other faults, let's add it
to the kernel side rather than relying on userspace virtualization to
handle it.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Root complex integrated endpoints do not have a link and therefore may
use a smaller PCIe capability in config space than we expect when
building our config map. Add a case for these to avoid reporting an
erroneous overlap.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Device lock bites again; if a device .remove() callback races a user
calling ioctl(VFIO_GROUP_GET_DEVICE_FD), the unbind request will hold
the device lock, but the user ioctl may have already taken a vfio_device
reference. In the case of a PCI device, the initial open will attempt
to reset the device, which again attempts to get the device lock,
resulting in deadlock. Use the trylock PCI reset interface and return
error on the open path if reset fails due to lock contention.
Link: https://lkml.org/lkml/2017/7/25/381
Reported-by: Wen Congyang <wencongyang2@huawei.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
XXV710 has the same broken INTx behavior as the rest of the X/XL710
series, the interrupt status register is not wired to report pending
INTx interrupts, thus we never associate the interrupt to the device.
Extend the device IDs to include these so that we hide that the
device supports INTx at all to the user.
Reported-by: Stefan Assmann <sassmann@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Here, pci_iomap can fail, handle this case release selected
pci regions and return -ENOMEM.
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Using ancient compilers (gcc-4.5 or older) on ARM, we get a link
failure with the vfio-pci driver:
ERROR: "__aeabi_lcmp" [drivers/vfio/pci/vfio-pci.ko] undefined!
The reason is that the compiler tries to do a comparison of
a 64-bit range. This changes it to convert to a 32-bit number
explicitly first, as newer compilers do for themselves.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJYUt1vAAoJEFmIoMA60/r8abgP/3R+5Lsk5/kfAHk5/2Mtqbvg
mZ0eDUpY9GbUeMjSq84Nr2H8u7d+1AJCCu8KtDJYZCmjZpnSp2SuE2PS5JoGC7zC
fintD24jlIF4/J5+HeVXXmbfr3xATxvpTuiSLEi8sLBRJ3KRIswhMSwoPwOyeTQw
v/EclWKPGYcI5Zp0oigY9/Jd3q3lQ17KXppi/0dDoLh7PNOFvEHItXWzmf++u/NP
iYT9R1xmzEsy0/HRd6hiwPT2xA8YsAXxgobhHooUgh1FWmZ02Tg1WjgDemOW4lVh
kNIUcsLczh7wZCceogrrJ+pwb9+NyyIyKuHPv6OG3ieyz1IZdznaj1fAE5HJYiPo
eVS7cP1S6DyV3Y5qFj5F2dSRS7T4GXdXG5mNhmeCpUHs0vfzSCG36jLmhTy8UIxs
1rCf5oFa+uU9q0okfH8VtcGOXqWjGgyxTSGGfF71HUMLnPbsci2fxC2cO6svzIX7
wDY0uxOzpyMIYMuQR6iz7VqvAwEaZ+7pfMIrWWdDcQ9/5tCNJ49cLuKaThPL4bVu
juiGBQtnTLg8tjrhjDL9tQiJpuVIweVXyyQ1fvZoVXkMLlhVCF2ttirvwFUit2PB
84OlevQZ+9QdE/qalrWbv4qzhesuiwu0avkzjGoqg6tWTF0epu2AHI2vqy6UBYEG
tcfJPEcz1019PKZNSvWy
=ut0k
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
"PCI changes:
- add support for PCI on ARM64 boxes with ACPI. We already had this
for theoretical spec-compliant hardware; now we're adding quirks
for the actual hardware (Cavium, HiSilicon, Qualcomm, X-Gene)
- add runtime PM support for hotplug ports
- enable runtime suspend for Intel UHCI that uses platform-specific
wakeup signaling
- add yet another host bridge registration interface. We hope this is
extensible enough to subsume the others
- expose device revision in sysfs for DRM
- to avoid device conflicts, make sure any VF BAR updates are done
before enabling the VF
- avoid unnecessary link retrains for ASPM
- allow INTx masking on Mellanox devices that support it
- allow access to non-standard VPD for Chelsio devices
- update Broadcom iProc support for PAXB v2, PAXC v2, inbound DMA,
etc
- update Rockchip support for max-link-speed
- add NVIDIA Tegra210 support
- add Layerscape LS1046a support
- update R-Car compatibility strings
- add Qualcomm MSM8996 support
- remove some uninformative bootup messages"
* tag 'pci-v4.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (115 commits)
PCI: Enable access to non-standard VPD for Chelsio devices (cxgb3)
PCI: Expand "VPD access disabled" quirk message
PCI: pciehp: Remove loading message
PCI: hotplug: Remove hotplug core message
PCI: Remove service driver load/unload messages
PCI/AER: Log AER IRQ when claiming Root Port
PCI/AER: Log errors with PCI device, not PCIe service device
PCI/AER: Remove unused version macros
PCI/PME: Log PME IRQ when claiming Root Port
PCI/PME: Drop unused support for PMEs from Root Complex Event Collectors
PCI: Move config space size macros to pci_regs.h
x86/platform/intel-mid: Constify mid_pci_platform_pm
PCI/ASPM: Don't retrain link if ASPM not possible
PCI: iproc: Skip check for legacy IRQ on PAXC buses
PCI: pciehp: Leave power indicator on when enabling already-enabled slot
PCI: pciehp: Prioritize data-link event over presence detect
PCI: rcar: Add gen3 fallback compatibility string for pcie-rcar
PCI: rcar: Use gen2 fallback compatibility last
PCI: rcar-gen2: Use gen2 fallback compatibility last
PCI: rockchip: Move the deassert of pm/aclk/pclk after phy_init()
..
Move PCI configuration space size macros (PCI_CFG_SPACE_SIZE and
PCI_CFG_SPACE_EXP_SIZE) from drivers/pci/pci.h to
include/uapi/linux/pci_regs.h so they can be used by more drivers and
eliminate duplicate definitions.
[bhelgaas: Expand comment to include PCI-X details]
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
As of commit d97ffe2368 ("PCI: Fix return value from
pci_user_{read,write}_config_*()") it's unnecessary to call
pcibios_err_to_errno() to fixup the return value from these functions.
pcibios_err_to_errno() already does simple passthrough of -errno values,
therefore no functional change is expected.
[aw: changelog]
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Update msix_sparse_mmap_cap() to use vfio_info_add_capability()
Update region type capability to use vfio_info_add_capability()
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The VFIO_DEVICE_SET_IRQS ioctl did not sufficiently sanitize
user-supplied integers, potentially allowing memory corruption. This
patch adds appropriate integer overflow checks, checks the range bounds
for VFIO_IRQ_SET_DATA_NONE, and also verifies that only single element
in the VFIO_IRQ_SET_DATA_TYPE_MASK bitmask is set.
VFIO_IRQ_SET_ACTION_TYPE_MASK is already correctly checked later in
vfio_pci_set_irqs_ioctl().
Furthermore, a kzalloc is changed to a kcalloc because the use of a
kzalloc with an integer multiplication allowed an integer overflow
condition to be reached without this patch. kcalloc checks for overflow
and should prevent a similar occurrence.
Signed-off-by: Vlad Tsyrklevich <vlad@tsyrklevich.net>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Simplify the interrupt setup by using the new PCI layer helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The MSI/X shutdown path can gratuitously enable INTx, which is not
something we want to happen if we're dealing with broken INTx device.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We use a BAR restore trick to try to detect when a user has performed
a device reset, possibly through FLR or other backdoors, to put things
back into a working state. This is important for backdoor resets, but
we can actually just virtualize the "front door" resets provided via
PCIe and AF FLR. Set these bits as virtualized + writable, allowing
the default write to set them in vconfig, then we can simply check the
bit, perform an FLR of our own, and clear the bit. We don't actually
have the granularity in PCI to specify the type of reset we want to
do, but generally devices don't implement both PCIe and AF FLR and
we'll favor these over other types of reset, so we should generally
lineup. We do test whether the device provides the requested FLR type
to stay consistent with hardware capabilities though.
This seems to fix several instance of devices getting into bad states
with userspace drivers, like dpdk, running inside a VM.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Greg Rose <grose@lightfleet.com>
There are multiple cases in vfio_pci_set_ctx_trigger_single() where
we assume we can safely read from our data pointer without actually
checking whether the user has passed any data via the count field.
VFIO_IRQ_SET_DATA_NONE in particular is entirely broken since we
attempt to pull an int32_t file descriptor out before even checking
the data type. The other data types assume the data pointer contains
one element of their type as well.
In part this is good news because we were previously restricted from
doing much sanitization of parameters because it was missed in the
past and we didn't want to break existing users. Clearly DATA_NONE
is completely broken, so it must not have any users and we can fix
it up completely. For DATA_BOOL and DATA_EVENTFD, we'll just
protect ourselves, returning error when count is zero since we
previously would have oopsed.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reported-by: Chris Thompson <the_cartographer@hotmail.com>
Cc: stable@vger.kernel.org
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Current vfio-pci implementation disallows to mmap
sub-page(size < PAGE_SIZE) MMIO BARs because these BARs' mmio
page may be shared with other BARs. This will cause some
performance issues when we passthrough a PCI device with
this kind of BARs. Guest will be not able to handle the mmio
accesses to the BARs which leads to mmio emulations in host.
However, not all sub-page BARs will share page with other BARs.
We should allow to mmap the sub-page MMIO BARs which we can
make sure will not share page with other BARs.
This patch adds support for this case. And we try to add a
dummy resource to reserve the remainder of the page which
hot-add device's BAR might be assigned into. But it's not
necessary to handle the case when the BAR is not page aligned.
Because we can't expect the BAR will be assigned into the same
location in a page in guest when we passthrough the BAR. And
it's hard to access this BAR in userspace because we have
no way to get the BAR's location in a page.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The size of the VPD area is not necessarily 4-byte aligned, so a
pci_vpd_read() might return less than 4 bytes. Zero our buffer and
accept anything other than an error. Intel X710 NICs exercise this.
Fixes: 4e1a635552 ("vfio/pci: Use kernel VPD access functions")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Both the INTx and MSI/X disable paths do an eventfd_ctx_put() for the
trigger eventfd before calling vfio_virqfd_disable() any potential
mask and unmask eventfds. This opens a use-after-free race where an
inopportune irqfd can reference the freed signalling eventfd. Reorder
to avoid this possibility.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
PCI-Express spec says that reading 4 bytes at offset 100h should return
zero if there is no extended capability so VFIO reads this dword to
know if there are extended capabilities.
However it is not always possible to access the extended space so
generic PCI code in pci_cfg_space_size_ext() checks if
pci_read_config_dword() can read beyond 100h and if the check fails,
it sets the config space size to 100h.
VFIO does its own extended capabilities check by reading at offset 100h
which may produce 0xffffffff which VFIO treats as the extended config
space presense and calls vfio_ecap_init() which fails to parse
capabilities (which is expected) but right before the exit, it writes
zero at offset 100h which is beyond the buffer allocated for
vdev->vconfig (which is 256 bytes) which leads to random memory
corruption.
This makes VFIO only check for the extended capabilities if
the discovered config size is more than 256 bytes.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If a device is reset without the memory or i/o bits enabled in the
command register we may not detect it, potentially leaving the device
without valid BAR programming. Add an additional test to check the
BARs on each write to the command register.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
INTx masking has two components, the first is that we need the ability
to prevent the device from continuing to assert INTx. This is
provided via the DisINTx bit in the command register and is the only
thing we can really probe for when testing if INTx masking is
supported. The second component is that the device needs to indicate
if INTx is asserted via the interrupt status bit in the device status
register. With these two features we can generically determine if one
of the devices we own is asserting INTx, signal the user, and mask the
interrupt while the user services the device.
Generally if one or both of these components is broken we resort to
APIC level interrupt masking, which requires an exclusive interrupt
since we have no way to determine the source of the interrupt in a
shared configuration. This often makes it difficult or impossible to
configure the system for userspace use of the device, for an interrupt
mode that the user may not need.
One possible configuration of broken INTx masking is that the DisINTx
support is fully functional, but the interrupt status bit never
signals interrupt assertion. In this case we do have the ability to
prevent the device from asserting INTx, but lack the ability to
identify the interrupt source. For this case we can simply pretend
that the device lacks INTx support entirely, keeping DisINTx set on
the physical device, virtualizing this bit for the user, and
virtualizing the interrupt pin register to indicate no INTx support.
We already support virtualization of the DisINTx bit and already
virtualize the interrupt pin for platforms without INTx support. By
tying these components together, setting DisINTx on open and reset,
and identifying devices broken in this particular way, we can provide
support for them w/o the handicap of APIC level INTx masking.
Intel i40e (XL710/X710) 10/20/40GbE NICs have been identified as being
broken in this specific way. We leave the vfio-pci.nointxmask option
as a mechanism to bypass this support, enabling INTx on the device
with all the requirements of APIC level masking.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: John Ronciak <john.ronciak@intel.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Various enablers for assignment of Intel graphics devices and future
support of vGPU devices (Alex Williamson). This includes
- Handling the vfio type1 interface as an API rather than a specific
implementation, allowing multiple type1 providers.
- Capability chains, similar to PCI device capabilities, that allow
extending ioctls. Extensions here include device specific regions
and sparse mmap descriptions. The former is used to expose non-PCI
regions for IGD, including the OpRegion (particularly the Video
BIOS Table), and read only PCI config access to the host and LPC
bridge as drivers often depend on identifying those devices.
Sparse mmaps here are used to describe the MSIx vector table,
which vfio has always protected from mmap, but never had an API to
explicitly define that protection. In future vGPU support this is
expected to allow the description of PCI BARs that may mix direct
access and emulated access within a single region.
- The ability to expose the shadow ROM as an option ROM as IGD use
cases may rely on the ROM even though the physical device does not
make use of a PCI option ROM BAR.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJW6aT1AAoJECObm247sIsiiP4P/1xf7Z08/2QWVFQzex9CLcZk
+/iJlyb/fTpPVQE+NTKPz3Qh5h6ZhSd/57s85IUqq0T6tgVPkoGx8kkyCjBaw2y1
yMezXZlQqJdZqGzQNI4OiHWvO+/vGxYKjQMfUnMlDM6dJgz4lGncGFoSouFPa3Vp
mB12hGxrlk1cfIdb+C1KbfZcEdS0WhtigQtz8flBKgOfO+hYWmUO+CClJBhVw8Z4
RNcWNAxFfLuwUPVsPb6uOLG2g65SC2vmQ9k0Tnknf1znV3PFFVjITf0aM6uChLNP
S3SgqtPX+6yOFyCuSEs8UKhhmCbeQmAyKgt5BpxV3Rw3OMP4PsVAehr82vQmSj6g
2o96pR2s8MDPBr8eG7gdRe4DQe3PonpLkpDfaghcpYqhkGEqNVeW5/GjiOzGQqD3
xMshzxJ1Iz7DOHkQRUVqOfupDB0TusJmTVKwvXe6yIYL9pjkUS/sbN9U563HYSES
JTV68TMj0VKfKwD3XKYXvGH3km1sL4i5NMlAUrsDtsMkGlXEswoGbj82Mjc8+jUo
BvWQTJb+kouJQ88VhsO2abg1UrO9E6u82iHFHy9fEObxE8KH7pvROlS93ihMT1Wv
WQNuUcltdpHMRVX0BDknaPs3YtC3/TGgm3RcU5SZPbv/ys1471ZmJxMlAAKcfITr
SuvkMTYElF5b1pigv46c
=/lJn
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.6-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
"Various enablers for assignment of Intel graphics devices and future
support of vGPU devices (Alex Williamson). This includes
- Handling the vfio type1 interface as an API rather than a specific
implementation, allowing multiple type1 providers.
- Capability chains, similar to PCI device capabilities, that allow
extending ioctls. Extensions here include device specific regions
and sparse mmap descriptions. The former is used to expose non-PCI
regions for IGD, including the OpRegion (particularly the Video
BIOS Table), and read only PCI config access to the host and LPC
bridge as drivers often depend on identifying those devices.
Sparse mmaps here are used to describe the MSIx vector table, which
vfio has always protected from mmap, but never had an API to
explicitly define that protection. In future vGPU support this is
expected to allow the description of PCI BARs that may mix direct
access and emulated access within a single region.
- The ability to expose the shadow ROM as an option ROM as IGD use
cases may rely on the ROM even though the physical device does not
make use of a PCI option ROM BAR"
* tag 'vfio-v4.6-rc1' of git://github.com/awilliam/linux-vfio:
vfio/pci: return -EFAULT if copy_to_user fails
vfio/pci: Expose shadow ROM as PCI option ROM
vfio/pci: Intel IGD host and LCP bridge config space access
vfio/pci: Intel IGD OpRegion support
vfio/pci: Enable virtual register in PCI config space
vfio/pci: Add infrastructure for additional device specific regions
vfio: Define device specific region type capability
vfio/pci: Include sparse mmap capability for MSI-X table regions
vfio: Define sparse mmap capability for regions
vfio: Add capability chain helpers
vfio: Define capability chains
vfio: If an IOMMU backend fails, keep looking
vfio/pci: Fix unsigned comparison overflow
Calling return copy_to_user(...) in an ioctl will not
do the right thing if there's a pagefault:
copy_to_user returns the number of bytes not copied
in this case.
Fix up vfio to do
return copy_to_user(...)) ?
-EFAULT : 0;
everywhere.
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The copy_to_user() function returns the number of bytes that were not
copied but we want to return -EFAULT on error here.
Fixes: 188ad9d6cb ('vfio/pci: Include sparse mmap capability for MSI-X table regions')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Integrated graphics may have their ROM shadowed at 0xc0000 rather than
implement a PCI option ROM. Make this ROM appear to the user using
the ROM BAR.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Provide read-only access to PCI config space of the PCI host bridge
and LPC bridge through device specific regions. This may be used to
configure a VM with matching register contents to satisfy driver
requirements. Providing this through the vfio file descriptor removes
an additional userspace requirement for access through pci-sysfs and
removes the CAP_SYS_ADMIN requirement that doesn't appear to apply to
the specific devices we're accessing.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is the first consumer of vfio device specific resource support,
providing read-only access to the OpRegion for Intel graphics devices.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Typically config space for a device is mapped out into capability
specific handlers and unassigned space. The latter allows direct
read/write access to config space. Sometimes we know about registers
living in this void space and would like an easy way to virtualize
them, similar to how BAR registers are managed. To do this, create
one more pseudo (fake) PCI capability to be handled as purely virtual
space. Reads and writes are serviced entirely from virtual config
space.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add support for additional regions with indexes started after the
already defined fixed regions. Device specific code can register
these regions with the new vfio_pci_register_dev_region() function.
The ops structure per region currently only includes read/write
access and a release function, allowing automatic cleanup when the
device is closed. mmap support is only missing here because it's
not needed by the first user queued for this support.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
vfio-pci has never allowed the user to directly mmap the MSI-X vector
table, but we've always relied on implicit knowledge of the user that
they cannot do this. Now that we have capability chains that we can
expose in the region info ioctl and a sparse mmap capability that
represents the sub-areas within the region that can be mmap'd, we can
make the mmap constraints more explicit.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed versus unsigned comparisons are implicitly cast to unsigned,
which result in a couple possible overflows. For instance (start +
count) might overflow and wrap, getting through our validation test.
Also when unwinding setup, -1 being compared as unsigned doesn't
produce the intended stop condition. Fix both of these and also fix
vfio_msi_set_vector_signal() to validate parameters before using the
vector index, though none of the callers should pass bad indexes
anymore.
Reported-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Tested-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system. There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines. However, there are still those users
that want userspace drivers even under those conditions. The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has. In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.
This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver. This
should make it very clear that this mode is not safe. Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode. Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container. Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported. This patch includes no-iommu support for the vfio-pci bus
driver only.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Revert commit 033291eccb ("vfio: Include No-IOMMU mode") due to lack
of a user. This was originally intended to fill a need for the DPDK
driver, but uptake has been slow so rather than support an unproven
kernel interface revert it and revisit when userspace catches up.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This pci_error_handlers structure is never modified, like all the other
pci_error_handlers structures, so declare it as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
- Use kernel interfaces for VPD emulation (Alex Williamson)
- Platform fix for releasing IRQs (Eric Auger)
- Type1 IOMMU always advertises PAGE_SIZE support when smaller
mapping sizes are available (Eric Auger)
- Platform fixes for incorrectly using copies of structures rather
than pointers to structures (James Morse)
- Rework platform reset modules, fix leak, and add AMD xgbe reset
module (Eric Auger)
- Fix vfio_device_get_from_name() return value (Joerg Roedel)
- No-IOMMU interface (Alex Williamson)
- Fix potential out of bounds array access in PCI config handling
(Dan Carpenter)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWRg8+AAoJECObm247sIsitkwP+wc6cRzBpeGxiufZl9Ci7JV9
G0qNHBm8tAYDwm1uJATcyZC303ad2B3gaYSO6msTFUXTg3d9ZDtUOWhoTNPs/px1
5dF38DREzqYHpyC9HT2Qj3i9G+ejvgg+SoyxBOTOIw/dmq3tGZhUz1Sj+wuvQTwr
XXBKsWltZ+8wwXmQXmOWI4L3m7Xhs8NwAl5iLJ3UiltpW9zZzuPtoKnQCfMYUcmh
hIJg52t0WPSLyn47UvecUQcqxaO+QYELa7UN84fnQAihk+ewMpzg5blTebayFdu3
f2WC3ivxbamebw74LaRfQFjx4mT+DI0aXYtraC600PVe7gdVXB66QMNNpPhBwAy5
wpfeFpTKU5gC+LHmrIMUS2/A4sdNfUBw44CS8+Lm2D6bQAblPv/C5xQV1rz9HADv
f4/D3Y0TUKSYArewtBHTC0mnXdkZwetttBoy6/zQBl8vkelhoJ3GPcVa8FEZCIuT
2MSS17I3ftJ1enfynicF+Wstn/H5lWcuRBdg5wTLHIuhFn6MiEVxfIuSEx9JfjIb
NGZO7y5JiJ0b5QRCG0tFznwceU/cql/3oRqOGXqaf1cQ1Ag3JOIAUzxknFoJQUj1
XYe+Im1eMaugjj39J3+m5EYNKT3nh/bBLD/V3iWYpgoZtQrmQQm5nu0JsQo88/JR
je0BuJioCuPlO/Wj/KYw
=bI62
-----END PGP SIGNATURE-----
Merge tag 'vfio-v4.4-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Use kernel interfaces for VPD emulation (Alex Williamson)
- Platform fix for releasing IRQs (Eric Auger)
- Type1 IOMMU always advertises PAGE_SIZE support when smaller mapping
sizes are available (Eric Auger)
- Platform fixes for incorrectly using copies of structures rather than
pointers to structures (James Morse)
- Rework platform reset modules, fix leak, and add AMD xgbe reset
module (Eric Auger)
- Fix vfio_device_get_from_name() return value (Joerg Roedel)
- No-IOMMU interface (Alex Williamson)
- Fix potential out of bounds array access in PCI config handling (Dan
Carpenter)
* tag 'vfio-v4.4-rc1' of git://github.com/awilliam/linux-vfio:
vfio/pci: make an array larger
vfio: Include No-IOMMU mode
vfio: Fix bug in vfio_device_get_from_name()
VFIO: platform: reset: AMD xgbe reset module
vfio: platform: reset: calxedaxgmac: fix ioaddr leak
vfio: platform: add dev_info on device reset
vfio: platform: use list of registered reset function
vfio: platform: add compat in vfio_platform_device
vfio: platform: reset: calxedaxgmac: add reset function registration
vfio: platform: introduce module_vfio_reset_handler macro
vfio: platform: add capability to register a reset function
vfio: platform: introduce vfio-platform-base module
vfio/platform: store mapped memory in region, instead of an on-stack copy
vfio/type1: handle case where IOMMU does not support PAGE_SIZE size
VFIO: platform: clear IRQ_NOAUTOEN when de-assigning the IRQ
vfio/pci: Use kernel VPD access functions
vfio: Whitelist PCI bridges
Smatch complains about a possible out of bounds error:
drivers/vfio/pci/vfio_pci_config.c:1241 vfio_cap_init()
error: buffer overflow 'pci_cap_length' 20 <= 20
The problem is that pci_cap_length[] was defined as large enough to
hold "PCI_CAP_ID_AF + 1" elements. The code in vfio_cap_init() assumes
it has PCI_CAP_ID_MAX + 1 elements. Originally, PCI_CAP_ID_AF and
PCI_CAP_ID_MAX were the same but then we introduced PCI_CAP_ID_EA in
commit f80b0ba959 ("PCI: Add Enhanced Allocation register entries")
so now the array is too small.
Let's fix this by making the array size PCI_CAP_ID_MAX + 1. And let's
make a similar change to pci_ext_cap_length[] for consistency. Also
both these arrays can be made const.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system. There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines. However, there are still those users
that want userspace drivers even under those conditions. The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has. In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.
This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver. This
should make it very clear that this mode is not safe. Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode. Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container. Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported. This patch includes no-iommu support for the vfio-pci bus
driver only.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
The PCI VPD capability operates on a set of window registers in PCI
config space. Writing to the address register triggers either a read
or write, depending on the setting of the PCI_VPD_ADDR_F bit within
the address register. The data register provides either the source
for writes or the target for reads.
This model is susceptible to being broken by concurrent access, for
which the kernel has adopted a set of access functions to serialize
these registers. Additionally, commits like 932c435cab ("PCI: Add
dev_flags bit to access VPD through function 0") and 7aa6ca4d39
("PCI: Add VPD function 0 quirk for Intel Ethernet devices") indicate
that VPD registers can be shared between functions on multifunction
devices creating dependencies between otherwise independent devices.
Fortunately it's quite easy to emulate the VPD registers, simply
storing copies of the address and data registers in memory and
triggering a VPD read or write on writes to the address register.
This allows vfio users to avoid seeing spurious register changes from
accesses on other devices and enables the use of shared quirks in the
host kernel. We can theoretically still race with access through
sysfs, but the window of opportunity is much smaller.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Mark Rustad <mark.d.rustad@intel.com>
This patch adds the registration/unregistration of an
irq_bypass_producer for MSI/MSIx on vfio pci devices.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Testing the driver for a PCI device is racy, it can be all but
complete in the release path and still report the driver as ours.
Therefore we can't trust drvdata to be valid. This race can sometimes
be seen when one port of a multifunction device is being unbound from
the vfio-pci driver while another function is being released by the
user and attempting a bus reset. The device in the remove path is
found as a dependent device for the bus reset of the release path
device, the driver is still set to vfio-pci, but the drvdata has
already been cleared, resulting in a null pointer dereference.
To resolve this, fix vfio_device_get_from_dev() to not take the
dev_get_drvdata() shortcut and instead traverse through the
iommu_group, vfio_group, vfio_device path to get a reference we
can trust. Once we have that reference, we know the device isn't
in transition and we can test to make sure the driver is still what
we expect, so that we don't interfere with devices we don't own.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Log some clues indicating whether the user is receiving device
request interfaces or not listening. This can help indicate why a
driver unbind is blocked or explain why QEMU automatically unplugged
a device from the VM.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We can save some power by putting devices that are bound to vfio-pci
but not in use by the user in the D3hot power state. Devices get
woken into D0 when opened by the user. Resets return the device to
D0, so we need to re-apply the low power state after a bus reset.
It's tempting to try to use D3cold, but we have no reason to inhibit
hotplug of idle devices and we might get into a loop of having the
device disappear before we have a chance to try to use it.
A new module parameter allows this feature to be disabled if there are
devices that misbehave as a result of this change.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As indicated in the comment, this is not entirely uncommon and
causes user concern for no reason.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>