Two defects work together result in KVM device passthrough randomly can't
work:
1. iommu_snooping is not initialized to zero when vm_iommu_init() called.
So it is possible to get a random value.
2. One line added by commit 2c2e2c38("IOMMU Identity Mapping Support")
change the code path, let it bypass domain_update_iommu_cap(), as well as
missing the increment of domain iommu reference count.
The latter is also likely to cause a leak of domains on repeated VMM
assignment and deassignment.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
The physical address passed to domain_pfn_mapping() should be rounded
down to the start of the MM page, not the VT-d page.
This issue causes kernel panic on PAGE_SIZE>VTD_PAGE_SIZE platforms e.g. ia64
platforms.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
In domain_sg_mapping(), use aligned_nrpages() instead of hand-coded
rounding code for calculating the size of each sg elem. This means that
on IA64 we correctly round up to the MM page size, not just to the VT-d
page size.
Also remove the incorrect mm_to_dma_pfn() when intel_map_sg() calls
domain_sg_mapping() -- the 'size' variable is in VT-d pages already.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
This function has traditionally used "insert_resource()", because before
commit cebd78a8c5 ("Fix pci_claim_resource") it used to just insert the
resource into whatever root resource tree that was indicated by
"pcibios_select_root()".
So there Matthew fixed it to actually look up the proper parent
resource, which means that now it's actively wrong to then traverse the
resource tree any more: we already know exactly where the new resource
should go.
And when we then did commit a76117dfd6 ("x86: Use pci_claim_resource"),
which changed the x86 PCI code from the open-coded
pr = pci_find_parent_resource(dev, r);
if (!pr || request_resource(pr, r) < 0) {
to using
if (pci_claim_resource(dev, idx) < 0) {
that "insert_resource()" now suddenly became a problem, and causes a
regression covered by
http://bugzilla.kernel.org/show_bug.cgi?id=13891
which this fixes.
Reported-and-tested-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Andrew Patterson <andrew.patterson@hp.com>
Cc: Linux PCI <linux-pci@vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Remove smp_lock.h from files which don't need it (including some headers!)
* Add smp_lock.h to files which do need it
* Make smp_lock.h include conditional in hardirq.h
It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT
This will make hardirq.h inclusion cheaper for every PREEMPT=n config
(which includes allmodconfig/allyesconfig, BTW)
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After some API change, intel_iommu_unmap_range() introduced a assumption that
parameter size != 0, otherwise the dma_pte_clean_range() would have a
overflowed argument. But the user like KVM don't have this assumption before,
then some BUG() triggered.
Fix it by ignoring size = 0.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
PCI: Fix IRQ swizzling for ARI-enabled devices
ia64/PCI: adjust section annotation for pcibios_setup()
x86/PCI: get root CRS before scanning children
x86/PCI: fix boundary checking when using root CRS
PCI MSI: Fix restoration of MSI/MSI-X mask states in suspend/resume
PCI MSI: Unmask MSI if setup failed
PCI MSI: shorten PCI_MSIX_ENTRY_* symbol names
PCI: make pci_name() take const argument
PCI: More PATA quirks for not entering D3
PCI: fix kernel-doc warnings
PCI: check if bus has a proper bridge device before triggering SBR
PCI: remove pci_dac_dma_... APIs on mn10300
PCI ECRC: Remove unnecessary semicolons
PCI MSI: Return if alloc_msi_entry for MSI-X failed
Our current strategy for pass-through mode is to put all devices into
the 1:1 domain at startup (which is before we know what their dma_mask
will be), and only _later_ take them out of that domain, if it turns out
that they really can't address all of memory.
However, when there are a bunch of PCI devices behind a bridge, they all
end up with the same source-id on their DMA transactions, and hence in
the same IOMMU domain. This means that we _can't_ easily move them from
the 1:1 domain into their own domain at runtime, because there might be DMA
in-flight from their siblings.
So we have to adjust our pass-through strategy: For PCI devices not on
the root bus, and for the bridges which will take responsibility for
their transactions, we have to start up _out_ of the 1:1 domain, just in
case.
This fixes the BUG() we see when we have 32-bit-capable devices behind a
PCI-PCI bridge, and use the software identity mapping.
It does mean that we might end up using 'normal' mapping mode for some
devices which could actually live with the faster 1:1 mapping -- but
this is only for PCI devices behind bridges, which presumably aren't the
devices for which people are most concerned about performance.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
At boot time, the dma_mask won't have been set on any devices, so we
assume that all devices will be 64-bit capable (and thus get a 1:1 map).
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
This should fix kernel.org bug #11821, where the dcdbas driver makes up
a platform device and then uses dma_alloc_coherent() on it, in an
attempt to get memory < 4GiB.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
We need to give people a little more time to fix the broken drivers.
Re-introduce this, but tied in properly with the 'iommu=pt' support this
time. Change the config option name and make it default to 'no' too.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
We do this twice, and it's about to get more complicated. This makes the
code slightly clearer about what it's doing, too.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
When we reattach a device to the si_domain (because it's been removed
from a VM), we weren't calling domain_context_mapping() to actually tell
the hardware about that.
We should really put the call to domain_context_mapping() into
domain_add_dev_info() -- we never call the latter without also doing the
former, and we can keep the error paths simple that way. But that's a
cleanup which can wait for 2.6.32 now.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
We should check iommu_dummy() _first_, because that means it's attached
to an iommu that we've just disabled completely. At the moment, we might
try to put the device into the identity mapping domain.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
The aligned_nrpages() function rounds up to the next VM page, but
returns its result as a number of DMA pages.
Purely theoretical except on IA64, which doesn't boot with VT-d right
now anyway.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
* git://git.infradead.org/iommu-2.6: (38 commits)
intel-iommu: Don't keep freeing page zero in dma_pte_free_pagetable()
intel-iommu: Introduce first_pte_in_page() to simplify PTE-setting loops
intel-iommu: Use cmpxchg64_local() for setting PTEs
intel-iommu: Warn about unmatched unmap requests
intel-iommu: Kill superfluous mapping_lock
intel-iommu: Ensure that PTE writes are 64-bit atomic, even on i386
intel-iommu: Make iommu=pt work on i386 too
intel-iommu: Performance improvement for dma_pte_free_pagetable()
intel-iommu: Don't free too much in dma_pte_free_pagetable()
intel-iommu: dump mappings but don't die on pte already set
intel-iommu: Combine domain_pfn_mapping() and domain_sg_mapping()
intel-iommu: Introduce domain_sg_mapping() to speed up intel_map_sg()
intel-iommu: Simplify __intel_alloc_iova()
intel-iommu: Performance improvement for domain_pfn_mapping()
intel-iommu: Performance improvement for dma_pte_clear_range()
intel-iommu: Clean up iommu_domain_identity_map()
intel-iommu: Remove last use of PHYSICAL_PAGE_MASK, for reserving PCI BARs
intel-iommu: Make iommu_flush_iotlb_psi() take pfn as argument
intel-iommu: Change aligned_size() to aligned_nrpages()
intel-iommu: Clean up intel_map_sg(), remove domain_page_mapping()
...
Check dma_pte_present() and only free the page if there _is_ one.
Kind of surprising that there was no warning about this.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
On Wed, 2009-07-01 at 16:59 -0700, Linus Torvalds wrote:
> I also _really_ hate how you do
>
> (unsigned long)pte >> VTD_PAGE_SHIFT ==
> (unsigned long)first_pte >> VTD_PAGE_SHIFT
Kill this, in favour of just looking to see if the incremented pte
pointer has 'wrapped' onto the next page. Which means we have to check
it _after_ incrementing it, not before.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
For many purposes, including interrupt-swizzling, devices with ARI
enabled behave as if they have one device (number 0) and 256 functions.
This probably hasn't bitten us in practice because all ARI devices I've
seen are also IOV devices, and IOV devices are required to use MSI.
This isn't guaranteed, and there are legitimate reasons to use ARI
without IOV, and hence potentially use pin-based interrupts.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This would have found the bug in i386 pci_unmap_addr() a long time ago.
We shouldn't just silently return without doing anything.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Since we're using cmpxchg64() anyway (because that's the only way to do
an atomic 64-bit store on i386), we might as well ditch the extra
locking and just use cmpxchg64() to ensure that we don't add the page
twice.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
This fixes kernel.org bug #13584. The IOVA code attempted to optimise
the insertion of new ranges into the rbtree, with the unfortunate result
that some ranges just didn't get inserted into the tree at all. Then
those ranges would be handed out more than once, and things kind of go
downhill from there.
Introduced after 2.6.25 by ddf02886cb
("PCI: iova RB tree setup tweak").
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: mark gross <mgross@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As with other functions, batch the CPU data cache flushes and don't keep
recalculating PTE addresses.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
The loop condition was wrong -- we should free a PMD only if its
_entire_ range is within the range we're intending to clear. The
early-termination condition was right, but not the loop.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Instead of calling domain_pfn_mapping() repeatedly with single or
small numbers of pages, just pass the sglist in. It can optimise the
number of cache flushes like domain_pfn_mapping() does, and gives a huge
speedup for large scatterlists.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
There are 2 problems on mask states in suspend/resume.
[1]:
It is better to restore the mask states of MSI/MSI-X to initial states
(MSI is unmasked, MSI-X is masked) when we release the device.
The pci_msi_shutdown() does the restoration of mask states for MSI,
while the msi_free_irqs() does it for MSI-X. In other words, in the
"disable" path both of MSI and MSI-X are handled, but in the "shutdown"
path only MSI is handled.
MSI:
pci_disable_msi()
=> pci_msi_shutdown()
[ mask states for MSI restored ]
=> msi_set_enable(dev, pos, 0);
=> msi_free_irqs()
MSI-X:
pci_disable_msix()
=> pci_msix_shutdown()
=> msix_set_enable(dev, 0);
=> msix_free_all_irqs
=> msi_free_irqs()
[ mask states for MSI-X restored ]
This patch moves the masking for MSI-X from msi_free_irqs() to
pci_msix_shutdown().
This change has some positive side effects:
- It prevents OS from touching mask states before reading preserved
bits in the register, which can be happen if msi_free_irqs() is
called from error path in msix_capability_init().
- It also prevents touching the register after turning off MSI-X in
"disable" path, which can be a problem on some devices.
[2]:
We have cache of the mask state in msi_desc, which is automatically
updated when msi/msix_mask_irq() is called. This cached states are
used for the resume.
But since what need to be restored in the resume is the states before
the shutdown on the suspend, calling msi/msix_mask_irq() from
pci_msi/msix_shutdown() is not appropriate.
This patch introduces __msi/msix_mask_irq() that do mask as same
as msi/msix_mask_irq() but does not update cached state, for use
in pci_msi/msix_shutdown().
[updated: get rid of msi/msix_mask_irq_nocache() (proposed by Matthew Wilcox)]
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
The initial state of mask register of MSI is unmasked. We set it
masked before calling arch_setup_msi_irqs(). If arch_setup_msi_irq()
fails, it is better to restore the state of the mask register.
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
These names are too long! Drop _OFFSET to save some bytes/lines.
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
The ALi loses some state if it goes into D3. Unfortunately even with the
chipset documents I can't figure out how to restore some bits of it.
The VIA one saves/restores apparently fine but the ACPI _GTM methods break
on some platforms if we do this and this causes cable misdetections.
These are both effectively regressions as historically nothing matched the
devices and then decided not to bind to them. Nowdays something is binding
to all sorts of devices and a result they get dumped into D3.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Acked-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
For devices attached to the root bus, we can't trigger Secondary Bus
Reset because there is no bridge device associated with the bus. So
need to check bus->self again NULL first before using it.
Reviewed-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: Yu Zhao <yu.zhao@intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
In current code it continues setup even if alloc_msi_entry() for MSI-X
is failed due to lack of memory. It means arch_setup_msi_irqs() might
be called with msi_desc entries less than its argument nvec.
At least x86's arch_setup_msi_irqs() uses list_for_each_entry() for
dev->msi_list that suspected to have entries same numbers as nvec, and
it doesn't check the number of allocated vectors and passed arg nvec.
Therefore it will result in success of pci_enable_msix(), with less
vectors allocated than requested.
This patch fixes the error route to return -ENOMEM, instead of continuing
the setup (proposed by Matthew Wilcox).
Note that there is no iounmap in msi_free_irqs() if no msi_disc is
allocated.
Reviewed-by: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
There's no need for the separate iommu_alloc_iova() function, and
certainly not for it to be global. Remove the underscores while we're at
it.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
As with dma_pte_clear_range(), don't keep flushing a single PTE at a
time. And also micro-optimise the setting of PTE values rather than
using the helper functions to do all the masking.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
It's a bit silly to repeatedly call domain_flush_cache() for each PTE
individually, as we clear it. Instead, batch them up and flush a whole
range at a time. We might as well refrain from recalculating the PTE
address from scratch each time round the loop too.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
This is fairly broken anyway -- it doesn't take hotplug into account.
We should probably be checking page_is_ram() instead.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Most of its callers are having to shift for themselves anyway, so we might
as well do it in iommu_flush_iotlb_psi().
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
... and use it in the trivial cases; the other callers want individual
(and bisectable) attention, since I screwed them up the first time...
Make the BUG_ON() happen on too-large virtual address rather than
physical address, too. That's the one we care about.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>