Add support for dumping the kernel address space layout to the console.
User can enable CONFIG_DEBUG_VM to dump the virtual memory region into
dmesg buffer during boot-up.
Signed-off-by: Yash Shah <yash.shah@sifive.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
[paul.walmsley@sifive.com: dropped .init/.text/.data/.bss prints;
added PCI legacy I/O region display]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Enable more debugging options in the RISC-V defconfigs to help kernel
developers catch problems with patches earlier in the development
cycle.
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Since MIPS architecture has a sparse syscall array, select the
HAVE_SPARSE_SYSCALL_NR to save space.
Link: http://lkml.kernel.org/r/20191115234314.21599-2-hnaveed@wavecomp.com
Signed-off-by: Hassan Naveed <hnaveed@wavecomp.com>
Reviewed-by: Paul Burton <paulburton@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently, a lot of memory is wasted for architectures like MIPS when
init_ftrace_syscalls() allocates the array for syscalls using kcalloc.
This is because syscalls numbers start from 4000, 5000 or 6000 and
array elements up to that point are unused.
Fix this by using a data structure more suited to storing sparsely
populated arrays. The XARRAY data structure, implemented using radix
trees, is much more memory efficient for storing the syscalls in
question.
Link: http://lkml.kernel.org/r/20191115234314.21599-1-hnaveed@wavecomp.com
Signed-off-by: Hassan Naveed <hnaveed@wavecomp.com>
Reviewed-by: Paul Burton <paulburton@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Minor conflict in drivers/s390/net/qeth_l2_main.c, kept the lock
from commit c8183f5489 ("s390/qeth: fix potential deadlock on
workqueue flush"), removed the code which was removed by commit
9897d583b0 ("s390/qeth: consolidate some duplicated HW cmd code").
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
JZ4760/JZ4770/JZ4775/X1000/X1500 has an abandoned huge page tlb,
this mode is not compatible with the MIPS standard, it will cause
tlbmiss and into an infinite loop (line 21 in the tlb-funcs.S)
when starting the init process. write 0xa9000000 to cp0 register 5
sel 4 to disable this function to prevent getting stuck. Confirmed
by Ingenic, this operation will not adversely affect processors
without HPTLB function.
Signed-off-by: Zhou Yanjie <zhouyanjie@zoho.com>
Acked-by: Paul Cercueil <paul@crapouillou.net>
Signed-off-by: Paul Burton <paulburton@kernel.org>
Cc: linux-mips@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: ralf@linux-mips.org
Cc: jhogan@kernel.org
Cc: jiaxun.yang@flygoat.com
Cc: gregkh@linuxfoundation.org
Cc: malat@debian.org
Cc: tglx@linutronix.de
Cc: chenhc@lemote.com
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
$ sed -e 's/^ /\t/' -i */Kconfig
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Stafford Horne <shorne@gmail.com>
debugfs is broadly useful, so enable it in the RISC-V defconfigs.
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
If number of CPUs are limited by the kernel commandline parameter nr_cpus
assignment of interrupts accourding to numa rules might not be possibe.
As a fallback use one of the online CPUs as interrupt destination.
Fixes: 69a07a41d9 ("MIPS: SGI-IP27: rework HUB interrupts")
Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Signed-off-by: Paul Burton <paulburton@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: linux-mips@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Implement primitives necessary for the 4th level folding, add walks of p4d
level where appropriate, replace 5leve-fixup.h with pgtable-nop4d.h and
drop usage of __ARCH_USE_5LEVEL_HACK.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Paul Burton <paulburton@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: linux-mips@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Mike Rapoport <rppt@kernel.org>
The __pXd_offset() macros are identical to the pXd_index() macros and there
is no point to keep both of them. All architectures define and use
pXd_index() so let's keep only those to make mips consistent with the rest
of the kernel.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Paul Burton <paulburton@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: linux-mips@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Mike Rapoport <rppt@kernel.org>
With CONFIG_MIPS_VA_BITS_48=y the build fails miserably:
CC arch/mips/kernel/asm-offsets.s
In file included from arch/mips/include/asm/pgtable.h:644,
from include/linux/mm.h:99,
from arch/mips/kernel/asm-offsets.c:15:
include/asm-generic/pgtable.h:16:2: error: #error CONFIG_PGTABLE_LEVELS is not consistent with __PAGETABLE_{P4D,PUD,PMD}_FOLDED
#error CONFIG_PGTABLE_LEVELS is not consistent with __PAGETABLE_{P4D,PUD,PMD}_FOLDED
^~~~~
include/asm-generic/pgtable.h:390:28: error: unknown type name 'p4d_t'; did you mean 'pmd_t'?
static inline int p4d_same(p4d_t p4d_a, p4d_t p4d_b)
^~~~~
pmd_t
[ ... more such errors ... ]
scripts/Makefile.build:99: recipe for target 'arch/mips/kernel/asm-offsets.s' failed
make[2]: *** [arch/mips/kernel/asm-offsets.s] Error 1
This happens because when CONFIG_MIPS_VA_BITS_48 enables 4th level of the
page tables, but neither pgtable-nop4d.h nor 5level-fixup.h are included to
cope with the 5th level.
Replace #ifdef conditions around includes of the pgtable-nop{m,u}d.h with
explicit CONFIG_PGTABLE_LEVELS and add include of 5level-fixup.h for the
case when CONFIG_PGTABLE_LEVELS==4
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Paul Burton <paulburton@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: linux-mips@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Mike Rapoport <rppt@kernel.org>
This changes system reboot for socfpga to issue a cold reboot by
default instead of a warm reboot.
Warm reboot can still be used by setting reboot_mode to
REBOOT_WARM (e.g. via kernel command line 'reboot='), but this
patch ensures cold reboot is issued for both REBOOT_COLD and
REBOOT_HARD.
Also, cold reboot is more fail safe than warm reboot has some
issues at least fo CSEL=0 and BSEL=qspi, where the boot rom does
not set the qspi clock to a valid range.
Signed-off-by: Simon Goldschmidt <simon.k.r.goldschmidt@gmail.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
It's not valid to call crypto_unregister_skciphers() without a prior
call to crypto_register_skciphers().
Fixes: 84e03fa39f ("crypto: x86/chacha - expose SIMD ChaCha routine as library function")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The API will be used by the hv_balloon and hv_vmbus drivers.
Balloon up/down and hot-add of memory must not be active if the user
wants the Linux VM to support hibernation, because they are incompatible
with hibernation according to Hyper-V team, e.g. upon suspend the
balloon VSP doesn't save any info about the ballooned-out pages (if any);
so, after Linux resumes, Linux balloon VSC expects that the VSP will
return the pages if Linux is under memory pressure, but the VSP will
never do that, since the VSP thinks it never stole the pages from the VM.
So, if the user wants Linux VM to support hibernation, Linux must forbid
balloon up/down and hot-add, and the only functionality of the balloon VSC
driver is reporting the VM's memory pressure to the host.
Ideally, when Linux detects that the user wants it to support hibernation,
the balloon VSC should tell the VSP that it does not support ballooning
and hot-add. However, the current version of the VSP requires the VSC
should support these capabilities, otherwise the capability negotiation
fails and the VSC can not load at all, so with the later changes to the
VSC driver, Linux VM still reports to the VSP that the VSC supports these
capabilities, but the VSC ignores the VSP's requests of balloon up/down
and hot add, and reports an error to the VSP, when applicable. BTW, in
the future the balloon VSP driver will allow the VSC to not support the
capabilities of balloon up/down and hot add.
The ACPI S4 state is not a must for hibernation to work, because Linux is
able to hibernate as long as the system can shut down. However in practice
we decide to artificially use the presence of the virtual ACPI S4 state as
an indicator of the user's intent of using hibernation, because Linux VM
must find a way to know if the user wants to use the hibernation feature
or not.
By default, Hyper-V does not enable the virtual ACPI S4 state; on recent
Hyper-V hosts (e.g. RS5, 19H1), the administrator is able to enable the
state for a VM by WMI commands.
Once all the vmbus and VSC patches for the hibernation feature are
accepted, an extra patch will be submitted to forbid hibernation if the
virtual ACPI S4 state is absent, i.e. hv_is_hibernation_supported() is
false.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Hyper-V assumes page size to be 4K. While this assumption holds true on
x86 architecture, it might not be true for ARM64 architecture. Hence
define hyper-v specific function to allocate a zeroed page which can
have a different implementation on ARM64 architecture to handle the
conflict between hyper-v's assumed page size and actual guest page size.
Signed-off-by: Himadri Pandya <himadri18.07@gmail.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Commit d3c6161956 ("ARM: 8788/1: ftrace: remove old mcount support")
removed the old mcount support, but forget to remove these three
declarations. This patch removes them.
Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Since $(NM) variable can be easily overridden for the whole build, it's
better to use it instead of $(CROSS_COMPILE)nm. The use of $(CROSS_COMPILE)
prefixed variables where their calculated equivalents can be used is
incorrect. This fixes issues with builds where $(NM) is set to llvm-nm.
Link: https://github.com/ClangBuiltLinux/linux/issues/766
Signed-off-by: Dmitry Golovin <dima@golovin.in>
Suggested-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: Matthias Maennich <maennich@google.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
- Ensure PAN is re-enabled following user fault in uaccess routines
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAl3WdTEQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNE43B/9ow5C7aV1Jmb9dk3zfkDkVuviCkv9STpNW
YosKSAHwGvsfmBd/f+kd/ipqnt/fBLbXHqRderbayDW7xundCP8Qgl51N2RwJ92O
si3rPHW62xjbv0U3XqY9h74wH4jfox3KAT8c5EBcLuZtmV3gTb1txWDQ0PfQZMpN
0cmfTRPHB7LKJ/hupEyq6iVPYx+mV1ajenW1uzBOQhu+c8jC5Uf9kDn/D93rCYCg
7NL88odhP0Oyt6HDcR5LB+nl7fz+q0jXzxyF8qHDWNitV4aovU9rLMpDpn4YVvYd
DcFEyGdwYSxExLFrBuvmevZ7+2biT3qWQcPuNQGCcqWQSu38ULFK
=r7nI
-----END PGP SIGNATURE-----
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Will Deacon:
"Ensure PAN is re-enabled following user fault in uaccess routines.
After I thought we were done for 5.4, we had a report this week of a
nasty issue that has been shown to leak data between different user
address spaces thanks to corruption of entries in the TLB. In
hindsight, we should have spotted this in review when the PAN code was
merged back in v4.3, but hindsight is 20/20 and I'm trying not to beat
myself up too much about it despite being fairly miserable.
Anyway, the fix is "obvious" but the actual failure is more more
subtle, and is described in the commit message. I've included a fairly
mechanical follow-up patch here as well, which moves this checking out
into the C wrappers which is what we do for {get,put}_user() already
and allows us to remove these bloody assembly macros entirely. The
patches have passed kernelci [1] [2] [3] and CKI [4] tests over night,
as well as some targetted testing [5] for this particular issue.
The first patch is tagged for stable and should be applied to 4.14,
4.19 and 5.3. I have separate backports for 4.4 and 4.9, which I'll
send out once this has landed in your tree (although the original
patch applies cleanly, it won't build for those two trees).
Thanks to Pavel Tatashin for reporting this and Mark Rutland for
helping to diagnose the issue and review/test the solution"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: uaccess: Remove uaccess_*_not_uao asm macros
arm64: uaccess: Ensure PAN is re-enabled after unhandled uaccess fault
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
$ sed -e 's/^ /\t/' -i */Kconfig
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
[paul.walmsley@sifive.com: use two leading spaces for help text to align
with common arch/ practice]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
When the NMI lands on an ESPFIX_SS, we are on the entry stack and must
swizzle, otherwise we'll run do_nmi() on the entry stack, which is
BAD.
Also, similar to the normal exception path, we need to correct the
ESPFIX magic before leaving the entry stack, otherwise pt_regs will
present a non-flat stack pointer.
Tested by running sigreturn_32 concurrent with perf-record.
Fixes: e5862d0515 ("x86/entry/32: Leave the kernel via trampoline stack")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@kernel.org
Right now, we do some fancy parts of the exception entry path while SS
might have a nonzero base: we fill in regs->ss and regs->sp, and we
consider switching to the kernel stack. This results in regs->ss and
regs->sp referring to a non-flat stack and it may result in
overflowing the entry stack. The former issue means that we can try to
call iret_exc on a non-flat stack, which doesn't work.
Tested with selftests/x86/sigreturn_32.
Fixes: 45d7b25574 ("x86/entry/32: Enter the kernel via trampoline stack")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@kernel.org
This will allow us to get percpu access working before FIXUP_FRAME,
which will allow us to unwind ESPFIX earlier.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@kernel.org
When re-building the IRET frame we use %eax as an destination %esp,
make sure to then also match the segment for when there is a nonzero
SS base (ESPFIX).
[peterz: Changelog and minor edits]
Fixes: 3c88c692c2 ("x86/stackframe/32: Provide consistent pt_regs")
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@kernel.org
As reported by Lai, the commit 3c88c692c2 ("x86/stackframe/32:
Provide consistent pt_regs") wrecked the IRET EXTABLE entry by making
.Lirq_return not point at IRET.
Fix this by placing IRET_FRAME in RESTORE_REGS, to mirror how
FIXUP_FRAME is part of SAVE_ALL.
Fixes: 3c88c692c2 ("x86/stackframe/32: Provide consistent pt_regs")
Reported-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@kernel.org
The entry stack in the cpu entry area is protected against overflow by the
readonly GDT on 64-bit, but on 32-bit the GDT needs to be writeable and
therefore does not trigger a fault on stack overflow.
Add a guard page.
Fixes: c482feefe1 ("x86/entry/64: Make cpu_entry_area.tss read-only")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@kernel.org
Commit 945fd17ab6 ("x86/cpu_entry_area: Sync cpu_entry_area to
initial_page_table") introduced the sync for the initial page table for
32bit.
sync_initial_page_table() uses clone_pgd_range() which does the update for
the kernel page table. If PTI is enabled it also updates the user space
page table counterpart, which is assumed to be in the next page after the
target PGD.
At this point in time 32-bit did not have PTI support, so the user space
page table update was not taking place.
The support for PTI on 32-bit which was introduced later on, did not take
that into account and missed to add the user space counter part for the
initial page table.
As a consequence sync_initial_page_table() overwrites any data which is
located in the page behing initial_page_table causing random failures,
e.g. by corrupting doublefault_tss and wreckaging the doublefault handler
on 32bit.
Fix it by adding a "user" page table right after initial_page_table.
Fixes: 7757d607c6 ("x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Cc: stable@kernel.org
The double fault TSS was missing GS setup, which is needed for stack
canaries to work.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@kernel.org
Drop the rbt_memtype_*() call rbt_ prefix, as we no longer use
an rbtree directly.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20191121011601.20611-4-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Get rid of the passing the rb_root down the helper calls; there
is only one: &memtype_rbroot.
No change in functionality.
[ mingo: Fixed the changelog which described a different version of the patch. ]
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20191121011601.20611-3-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With some considerations, the custom pat_rbtree implementation can be
simplified to use most of the generic interval_tree machinery:
- The tree inorder traversal can slightly differ when there are key
('start') collisions in the tree due to one going left and another right.
This, however, only affects the output of debugfs' pat_memtype_list file.
- Generic interval trees are now fully closed [a, b], for which we need
to adjust the last endpoint (ie: end - 1).
- Erasing logic must remain untouched as well.
- In order for the types to remain u64, the 'memtype_interval' calls are
introduced, as opposed to simply using struct interval_tree.
In addition, the PAT tree might potentially also benefit by the fast overlap
detection for the insertion case when looking up the first overlapping node
in the tree.
No change in behavior is intended.
Finally, I've tested this on various servers, via sanity warnings, running
side by side with the current version and so far see no differences in the
returned pointer node when doing memtype_rb_lowest_match() lookups.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20191121011601.20611-2-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Using a mask to represent bus DMA constraints has a set of limitations.
The biggest one being it can only hold a power of two (minus one). The
DMA mapping code is already aware of this and treats dev->bus_dma_mask
as a limit. This quirk is already used by some architectures although
still rare.
With the introduction of the Raspberry Pi 4 we've found a new contender
for the use of bus DMA limits, as its PCIe bus can only address the
lower 3GB of memory (of a total of 4GB). This is impossible to represent
with a mask. To make things worse the device-tree code rounds non power
of two bus DMA limits to the next power of two, which is unacceptable in
this case.
In the light of this, rename dev->bus_dma_mask to dev->bus_dma_limit all
over the tree and treat it as such. Note that dev->bus_dma_limit should
contain the higher accessible DMA address.
Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
In addition to using vcsi regulator for the display, looks like droid4 is
using vcsi regulator to trigger off mode internally with the PMIC firmware
when the SoC enters deeper idle states. This is configured in the Motorola
Mapphone Linux kernel sources as "zerov_regulator".
As we currently don't support off mode during idle for omap4, we must
prevent vcsi from being disabled when the display is blanked to prevent
the PMIC change to off mode. Otherwise the device will hang on entering
idle when the display is blanked.
Before commit 089b3f61ec ("regulator: core: Let boot-on regulators be
powered off"), the boot-on regulators never got disabled like they should
and vcsi did not get turned off on idle.
Let's fix the issue by setting vcsi to always-on for now. Later on we may
want to claim the vcsi regulator also in the PM code if needed.
Fixes: 089b3f61ec ("regulator: core: Let boot-on regulators be powered off")
Cc: Merlijn Wajer <merlijn@wizzup.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Sebastian Reichel <sre@kernel.org>
Signed-off-by: Tony Lindgren <tony@atomide.com>
The AMD FCH USB XHCI Controller advertises support for generating PME#
while in D0. When in D0, it does signal PME# for USB 3.0 connect events,
but not for USB 2.0 or USB 1.1 connect events, which means the controller
doesn't wake correctly for those events.
00:10.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller [1022:7914] (rev 20) (prog-if 30 [XHCI])
Subsystem: Dell FCH USB XHCI Controller [1028:087e]
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Clear PCI_PM_CAP_PME_D0 in dev->pme_support to indicate the device will not
assert PME# from D0 so we don't rely on it.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203673
Link: https://lore.kernel.org/r/20190902145252.32111-1-kai.heng.feng@canonical.com
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
Update arch/x86/pci/Makefile replacing the deprecated EXTRA_CFLAGS with the
ccflags-y matching recommendation per section 3.7 of
Documentation/kbuild/makefiles.txt.
Link: https://lore.kernel.org/r/20190819060532.17093-1-kw@linux.com
Signed-off-by: Krzysztof Wilczynski <kw@linux.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Add SPDX GPL-2.0 to numachip.c, which referred to the kernel default
"COPYING" file, which specifies GPL version 2.
Remove the boilerplate language referring to the GPL and "COPYING", relying
on the assertion in b24413180f ("License cleanup: add SPDX GPL-2.0
license identifier to files with no license") that the SPDX identifier may
be used instead of the full boilerplate text.
[bhelgaas: split to separate patch]
Link: https://lore.kernel.org/r/20190828135322.10370-1-kw@linux.com
Signed-off-by: Krzysztof Wilczynski <kw@linux.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Preparatory work for shattering mmu.c into multiple files. Besides making it easier
to follow, this will also make it possible to write unit tests for various parts.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to Intel SDM section 28.3.3.3/28.3.3.4 Guidelines for Use
of the INVVPID/INVEPT Instruction, the hypervisor needs to execute
INVVPID/INVEPT X in case CPU executes VMEntry with VPID/EPTP X and
either: "Virtualize APIC accesses" VM-execution control was changed
from 0 to 1, OR the value of apic_access_page was changed.
In the nested case, the burden falls on L1, unless L0 enables EPT in
vmcs02 but L1 enables neither EPT nor VPID in vmcs12. For this reason
prepare_vmcs02() and load_vmcs12_host_state() have special code to
request a TLB flush in case L1 does not use EPT but it uses
"virtualize APIC accesses".
This special case however is not necessary. On a nested vmentry the
physical TLB will already be flushed except if all the following apply:
* L0 uses VPID
* L1 uses VPID
* L0 can guarantee TLB entries populated while running L1 are tagged
differently than TLB entries populated while running L2.
If the first condition is false, the processor will flush the TLB
on vmentry to L2. If the second or third condition are false,
prepare_vmcs02() will request KVM_REQ_TLB_FLUSH. However, even
if both are true, no extra TLB flush is needed to handle the APIC
access page:
* if L1 doesn't use VPID, the second condition doesn't hold and the
TLB will be flushed anyway.
* if L1 uses VPID, it has to flush the TLB itself with INVVPID and
section 28.3.3.3 doesn't apply to L0.
* even INVEPT is not needed because, if L0 uses EPT, it uses different
EPTP when running L2 than L1 (because guest_mode is part of mmu-role).
In this case SDM section 28.3.3.4 doesn't apply.
Similarly, examining nested_vmx_vmexit()->load_vmcs12_host_state(),
one could note that L0 won't flush TLB only in cases where SDM sections
28.3.3.3 and 28.3.3.4 don't apply. In particular, if L0 uses different
VPIDs for L1 and L2 (i.e. vmx->vpid != vmx->nested.vpid02), section
28.3.3.3 doesn't apply.
Thus, remove this flush from prepare_vmcs02() and nested_vmx_vmexit().
Side-note: This patch can be viewed as removing parts of commit
fb6c819843 ("kvm: vmx: Flush TLB when the APIC-access address changes”)
that is not relevant anymore since commit
1313cc2bd8 ("kvm: mmu: Add guest_mode to kvm_mmu_page_role”).
i.e. The first commit assumes that if L0 use EPT and L1 doesn’t use EPT,
then L0 will use same EPTP for both L0 and L1. Which indeed required
L0 to execute INVEPT before entering L2 guest. This assumption is
not true anymore since when guest_mode was added to mmu-role.
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes gcc '-Wunused-but-set-variable' warning:
arch/x86/kvm/x86.c: In function kvm_make_scan_ioapic_request_mask:
arch/x86/kvm/x86.c:7911:7: warning: variable called set but not
used [-Wunused-but-set-variable]
It is not used since commit 7ee30bc132 ("KVM: x86: deliver KVM
IOAPIC scan request to target vCPUs")
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Fixes: 7ee30bc132 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vmcs->apic_access_page is simply a token that the hypervisor puts into
the PFN of a 4KB EPTE (or PTE if using shadow-paging) that triggers
APIC-access VMExit or APIC virtualization logic whenever a CPU running
in VMX non-root mode read/write from/to this PFN.
As every write either triggers an APIC-access VMExit or write is
performed on vmcs->virtual_apic_page, the PFN pointed to by
vmcs->apic_access_page should never actually be touched by CPU.
Therefore, there is no need to mark vmcs02->apic_access_page as dirty
after unpin it on L2->L1 emulated VMExit or when L1 exit VMX operation.
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If X86_FEATURE_RTM is disabled, the guest should not be able to access
MSR_IA32_TSX_CTRL. We can therefore use it in KVM to force all
transactions from the guest to abort.
Tested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The current guest mitigation of TAA is both too heavy and not really
sufficient. It is too heavy because it will cause some affected CPUs
(those that have MDS_NO but lack TAA_NO) to fall back to VERW and
get the corresponding slowdown. It is not really sufficient because
it will cause the MDS_NO bit to disappear upon microcode update, so
that VMs started before the microcode update will not be runnable
anymore afterwards, even with tsx=on.
Instead, if tsx=on on the host, we can emulate MSR_IA32_TSX_CTRL for
the guest and let it run without the VERW mitigation. Even though
MSR_IA32_TSX_CTRL is quite heavyweight, and we do not want to write
it on every vmentry, we can use the shared MSR functionality because
the host kernel need not protect itself from TSX-based side-channels.
Tested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Because KVM always emulates CPUID, the CPUID clear bit
(bit 1) of MSR_IA32_TSX_CTRL must be emulated "manually"
by the hypervisor when performing said emulation.
Right now neither kvm-intel.ko nor kvm-amd.ko implement
MSR_IA32_TSX_CTRL but this will change in the next patch.
Reviewed-by: Jim Mattson <jmattson@google.com>
Tested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
"Shared MSRs" are guest MSRs that are written to the host MSRs but
keep their value until the next return to userspace. They support
a mask, so that some bits keep the host value, but this mask is
only used to skip an unnecessary MSR write and the value written
to the MSR is always the guest MSR.
Fix this and, while at it, do not update smsr->values[slot].curr if
for whatever reason the wrmsr fails. This should only happen due to
reserved bits, so the value written to smsr->values[slot].curr
will not match when the user-return notifier and the host value will
always be restored. However, it is untidy and in rare cases this
can actually avoid spurious WRMSRs on return to userspace.
Cc: stable@vger.kernel.org
Reviewed-by: Jim Mattson <jmattson@google.com>
Tested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM does not implement MSR_IA32_TSX_CTRL, so it must not be presented
to the guests. It is also confusing to have !ARCH_CAP_TSX_CTRL_MSR &&
!RTM && ARCH_CAP_TAA_NO: lack of MSR_IA32_TSX_CTRL suggests TSX was not
hidden (it actually was), yet the value says that TSX is not vulnerable
to microarchitectural data sampling. Fix both.
Cc: stable@vger.kernel.org
Tested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Allow non-ISV data aborts to be reported to userspace
- Allow injection of data aborts from userspace
- Expose stolen time to guests
- GICv4 performance improvements
- vgic ITS emulation fixes
- Simplify FWB handling
- Enable halt pool counters
- Make the emulated timer PREEMPT_RT compliant
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAl3VZ0EPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDwikQAJ/lQT97zEKV3dnpD/jjmEic/3QvTGljS4p+
pbwZAzyoSMc09lAK2pkaGRRc7euPnp4uLdRS4SenToVzmUQCzuxpQEEMdV/wjp4V
WLQ1WnTEAhYkm7k5MVo4uy3eD7nVWHWXgfQJvzL4EYZ5R/gd9NzBrnAc6LLV6hp9
0eLXIYrGFa1GESzF6P6sBDJhYpqVUcQlTI8I43kZH3iCC4+OsBxIkhHREZYsELhW
MZJIM9ZCskg2tvPC4UysaFiGjBYUJNJ0V+fFOrhyGzludP8i8rNRwgA60mznwvNw
V4N6/gLlkGK7nLqP+noUcU2wTBnIu389TcWWsX47CuDUzawCd8Fb9kX2zYauQSyS
ujE0uzoo/nhPFysh9OVVeLUZ6o/wotXoMp2t32t1c5h9N1hISEJvAWavMxTY6KzF
NEn9hWFjNcgBoArz9GKn9p2nBQpCDvu+2SlI4nL/qgZ7lPC4O3U1uq9myCOLj/gu
Can/u5EAwgyIBDVcEPHV+vP2GjyeERdXprGiG2VJTYlbHsdjgISTR+5Fy32KdGlP
YygeZxJtzretr3AYsWqD6Mri30FDSoYy9rUOWBpa+ZHbJPac0M+uKOqntV1OxPX9
QUkmNEdJcDr8fkcKxnEZ/MaZxFTGPp4vfhiT4A7dUkWTFq7ajvGo8IwN1d7PvWxS
LFMij1Js
=SxBH
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm updates for Linux 5.5:
- Allow non-ISV data aborts to be reported to userspace
- Allow injection of data aborts from userspace
- Expose stolen time to guests
- GICv4 performance improvements
- vgic ITS emulation fixes
- Simplify FWB handling
- Enable halt pool counters
- Make the emulated timer PREEMPT_RT compliant
Conflicts:
include/uapi/linux/kvm.h
Move the static keyword to the front of declaration of
cpu_pmu_of_device_ids, and resolve the following compiler
warning that can be seen when building with warnings
enabled (W=1):
arch/nds32/kernel/perf_event_cpu.c:1122:1: warning:
‘static’ is not at beginning of declaration [-Wold-style-declaration]
Signed-off-by: Krzysztof Wilczynski <kw@linux.com>
Acked-by: Greentime Hu <green.hu@gmail.com>
Signed-off-by: Greentime Hu <green.hu@gmail.com>
This patch fixes some spelling typo in Kconfig.cpu
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Greentime Hu <green.hu@gmail.com>
Signed-off-by: Greentime Hu <green.hu@gmail.com>
These patterns are cleaned-up by the top-level Makefile
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Greentime Hu <green.hu@gmail.com>
Signed-off-by: Greentime Hu <green.hu@gmail.com>
We need to check the host page size is big enough to accomodate the
EQ. Let's do this before taking a reference on the EQ page to avoid
a potential leak if the check fails.
Cc: stable@vger.kernel.org # v5.2
Fixes: 13ce3297c5 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration")
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The EQ page is allocated by the guest and then passed to the hypervisor
with the H_INT_SET_QUEUE_CONFIG hcall. A reference is taken on the page
before handing it over to the HW. This reference is dropped either when
the guest issues the H_INT_RESET hcall or when the KVM device is released.
But, the guest can legitimately call H_INT_SET_QUEUE_CONFIG several times,
either to reset the EQ (vCPU hot unplug) or to set a new EQ (guest reboot).
In both cases the existing EQ page reference is leaked because we simply
overwrite it in the XIVE queue structure without calling put_page().
This is especially visible when the guest memory is backed with huge pages:
start a VM up to the guest userspace, either reboot it or unplug a vCPU,
quit QEMU. The leak is observed by comparing the value of HugePages_Free in
/proc/meminfo before and after the VM is run.
Ideally we'd want the XIVE code to handle the EQ page de-allocation at the
platform level. This isn't the case right now because the various XIVE
drivers have different allocation needs. It could maybe worth introducing
hooks for this purpose instead of exposing XIVE internals to the drivers,
but this is certainly a huge work to be done later.
In the meantime, for easier backport, fix both vCPU unplug and guest reboot
leaks by introducing a wrapper around xive_native_configure_queue() that
does the necessary cleanup.
Reported-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org # v5.2
Fixes: 13ce3297c5 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration")
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Greg Kurz <groug@kaod.org>
Tested-by: Lijun Pan <ljp@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
On PowerNV the PCIe topology is (currently) managed by the powernv platform
code in Linux in cooperation with the platform firmware. Linux's native
PCIe port service drivers operate independently of both and this can cause
problems.
The main issue is that the portbus driver will conflict with the platform
specific hotplug driver (pnv_php) over ownership of the MSI used to notify
the host when a hotplug event occurs. The portbus driver claims this MSI on
behalf of the individual port services because the same interrupt is used
for hotplug events, PMEs (on root ports), and link bandwidth change
notifications. The portbus driver will always claim the interrupt even if
the individual port service drivers, such as pciehp, are compiled out.
The second, bigger, problem is that the hotplug port service driver
fundamentally does not work on PowerNV. The platform assumes that all
PCI devices have a corresponding arch-specific handle derived from the DT
node for the device (pci_dn) and without one the platform will not allow
a PCI device to be enabled. This problem is largely due to historical
baggage, but it can't be resolved without significant re-factoring of the
platform PCI support.
We can fix these problems in the interim by setting the
"pcie_ports_disabled" flag during platform initialisation. The flag
indicates the platform owns the PCIe ports which stops the portbus driver
from being registered.
This does have the side effect of disabling all port services drivers
that is: AER, PME, BW notifications, hotplug, and DPC. However, this is
not a huge disadvantage on PowerNV since these services are either unused
or handled through other means.
Fixes: 66725152fb ("PCI/hotplug: PowerPC PowerNV PCI hotplug driver")
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191118065553.30362-1-oohall@gmail.com
There is a config item CONFIG_SIMPLE_GPIO which
provides simple memory mapped GPIOs specific to powerpc.
However, the only platform which selects this option is
mpc5200, and this platform doesn't use it.
There are three boards calling simple_gpiochip_init(), but
as they don't select CONFIG_SIMPLE_GPIO, this is just a nop.
Simple_gpio is just redundant with the generic MMIO GPIO
driver which can be found in driver/gpio/ and selected via
CONFIG_GPIO_GENERIC_PLATFORM, so drop simple_gpio driver.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/bf930402613b41b42d0441b784e0cc43fc18d1fb.1572529632.git.christophe.leroy@c-s.fr
Daniel Borkmann says:
====================
pull-request: bpf-next 2019-11-20
The following pull-request contains BPF updates for your *net-next* tree.
We've added 81 non-merge commits during the last 17 day(s) which contain
a total of 120 files changed, 4958 insertions(+), 1081 deletions(-).
There are 3 trivial conflicts, resolve it by always taking the chunk from
196e8ca74886c433:
<<<<<<< HEAD
=======
void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
>>>>>>> 196e8ca748
<<<<<<< HEAD
void *bpf_map_area_alloc(u64 size, int numa_node)
=======
static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
>>>>>>> 196e8ca748
<<<<<<< HEAD
if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
=======
/* kmalloc()'ed memory can't be mmap()'ed */
if (!mmapable && size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>>>>>>> 196e8ca748
The main changes are:
1) Addition of BPF trampoline which works as a bridge between kernel functions,
BPF programs and other BPF programs along with two new use cases: i) fentry/fexit
BPF programs for tracing with practically zero overhead to call into BPF (as
opposed to k[ret]probes) and ii) attachment of the former to networking related
programs to see input/output of networking programs (covering xdpdump use case),
from Alexei Starovoitov.
2) BPF array map mmap support and use in libbpf for global data maps; also a big
batch of libbpf improvements, among others, support for reading bitfields in a
relocatable manner (via libbpf's CO-RE helper API), from Andrii Nakryiko.
3) Extend s390x JIT with usage of relative long jumps and loads in order to lift
the current 64/512k size limits on JITed BPF programs there, from Ilya Leoshkevich.
4) Add BPF audit support and emit messages upon successful prog load and unload in
order to have a timeline of events, from Daniel Borkmann and Jiri Olsa.
5) Extension to libbpf and xdpsock sample programs to demo the shared umem mode
(XDP_SHARED_UMEM) as well as RX-only and TX-only sockets, from Magnus Karlsson.
6) Several follow-up bug fixes for libbpf's auto-pinning code and a new API
call named bpf_get_link_xdp_info() for retrieving the full set of prog
IDs attached to XDP, from Toke Høiland-Jørgensen.
7) Add BTF support for array of int, array of struct and multidimensional arrays
and enable it for skb->cb[] access in kfree_skb test, from Martin KaFai Lau.
8) Fix AF_XDP by using the correct number of channels from ethtool, from Luigi Rizzo.
9) Two fixes for BPF selftest to get rid of a hang in test_tc_tunnel and to avoid
xdping to be run as standalone, from Jiri Benc.
10) Various BPF selftest fixes when run with latest LLVM trunk, from Yonghong Song.
11) Fix a memory leak in BPF fentry test run data, from Colin Ian King.
12) Various smaller misc cleanups and improvements mostly all over BPF selftests and
samples, from Daniel T. Lee, Andre Guedes, Anders Roxell, Mao Wenan, Yue Haibing.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The valid memory address check in dma_capable only makes sense when mapping
normal memory, not when using dma_map_resource to map a device resource.
Add a new boolean argument to dma_capable to exclude that check for the
dma_map_resource case.
Fixes: b12d66278d ("dma-direct: check for overflows on 32 bit DMA addresses")
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Support for calling the DMA API functions without a valid device pointer
was removed a while ago, so remove the stale support for that from the
powerpc __phys_to_dma / __dma_to_phys helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Currently each architectures that wants to override dma_to_phys and
phys_to_dma also has to provide dma_capable. But there isn't really
any good reason for that. powerpc and mips just have copies of the
generic one minus the latests fix, and the arm one was the inspiration
for said fix, but misses the bus_dma_mask handling.
Make all architectures use the generic version instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Reviewed-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
These are pure cache maintainance routines, so drop the unused
struct device argument.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Add CONFIG_TI_CPSW_SWITCHDEV option to enable new cpsw switchdev driver
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add DT nodes for new cpsw switchdev driver for am571x-idk board for now to
enable testing of the new solution.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add DT nodes for new cpsw switch dev driver.
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is safer and simpler to drop the uaccess assembly macros in favour of
inline C functions. Although this bloats the Image size slightly, it
aligns our user copy routines with '{get,put}_user()' and generally
makes the code a lot easier to reason about.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
[will: tweaked commit message and changed temporary variable names]
Signed-off-by: Will Deacon <will@kernel.org>
A number of our uaccess routines ('__arch_clear_user()' and
'__arch_copy_{in,from,to}_user()') fail to re-enable PAN if they
encounter an unhandled fault whilst accessing userspace.
For CPUs implementing both hardware PAN and UAO, this bug has no effect
when both extensions are in use by the kernel.
For CPUs implementing hardware PAN but not UAO, this means that a kernel
using hardware PAN may execute portions of code with PAN inadvertently
disabled, opening us up to potential security vulnerabilities that rely
on userspace access from within the kernel which would usually be
prevented by this mechanism. In other words, parts of the kernel run the
same way as they would on a CPU without PAN implemented/emulated at all.
For CPUs not implementing hardware PAN and instead relying on software
emulation via 'CONFIG_ARM64_SW_TTBR0_PAN=y', the impact is unfortunately
much worse. Calling 'schedule()' with software PAN disabled means that
the next task will execute in the kernel using the page-table and ASID
of the previous process even after 'switch_mm()', since the actual
hardware switch is deferred until return to userspace. At this point, or
if there is a intermediate call to 'uaccess_enable()', the page-table
and ASID of the new process are installed. Sadly, due to the changes
introduced by KPTI, this is not an atomic operation and there is a very
small window (two instructions) where the CPU is configured with the
page-table of the old task and the ASID of the new task; a speculative
access in this state is disastrous because it would corrupt the TLB
entries for the new task with mappings from the previous address space.
As Pavel explains:
| I was able to reproduce memory corruption problem on Broadcom's SoC
| ARMv8-A like this:
|
| Enable software perf-events with PERF_SAMPLE_CALLCHAIN so userland's
| stack is accessed and copied.
|
| The test program performed the following on every CPU and forking
| many processes:
|
| unsigned long *map = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
| MAP_SHARED | MAP_ANONYMOUS, -1, 0);
| map[0] = getpid();
| sched_yield();
| if (map[0] != getpid()) {
| fprintf(stderr, "Corruption detected!");
| }
| munmap(map, PAGE_SIZE);
|
| From time to time I was getting map[0] to contain pid for a
| different process.
Ensure that PAN is re-enabled when returning after an unhandled user
fault from our uaccess routines.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Cc: <stable@vger.kernel.org>
Fixes: 338d4f49d6 ("arm64: kernel: Add support for Privileged Access Never")
Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
[will: rewrote commit message]
Signed-off-by: Will Deacon <will@kernel.org>
These changes just clean up few typos, and there is one non-critical
correction of missing put_device() after calling of_platform_populate()
for display controller.
Naturally none of this is urgent and can be merged when suitable.
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAvFiEEkgNvrZJU/QSQYIcQG9Q+yVyrpXMFAl3K6rMRHHRvbnlAYXRv
bWlkZS5jb20ACgkQG9Q+yVyrpXPrZg//Vt0zrGT7F7dyiyl+BCxUrnoejBa4aWJc
OUmdvWv8umKCepzn4QwrOTogLBgZes1Le15mPfbktmU8lSIH9KuZGm/KmZB0mO0S
uJCuoHY6gMqRF9hAhMXFcfsXcuBc0ytZVA9yOLhUP9VAsxySo7kKnfMLJ/W6xjWr
4bLW894/lDPJ3ENUrJcUR1breV6VRnYOJ6dqE3cNLSA3BNNcGIwfOJK7Ob/owkXN
k/3nTI+g6h6Pzm5WlJxt78cyhgRY1AGOZ8+UlBywgSMJmja4SaysZ1Ga+6rwjKQ9
mJFWe6b13lAr9gIu1f88Yw/zb3jYVpjhEdXUGf8VeFDTxgrFnDt6N8FuM3LbTagU
L2jsjSegC78qqYUDiFDOD+2w9V7LIxqu7G9wy27BXJqB15beE5aGFMJGnMTfhid6
xfVBUl9Q+G4dZi4TNG+SeT3pJJqmBuv80hlcsCVSC5mNoF6bvXa9zJo99DUhjsHX
tjyIILl1BsAImSbQq2LzDE+CtSjRKpovBNxtGO+2C9MeBKW7LI0+lNvTUIcKvHVJ
xvEMFdEAUE7SY0/3zKczLkHlIHP+NVolo+7+YgwAjQ28rKqcgKQ1nTLZIOT2wnQw
NDiKUJ5nTy3To8NPUzwdY645vzaY5NI+XRqRhUbk5OAgqRjkcq6T8cyA7FY6Db0b
3R99xi1BcQI=
=2Ql6
-----END PGP SIGNATURE-----
Merge tag 'omap-for-v5.5/soc-late-signed' into omap-for-v5.5/ti-sysc-late
Few late SoC changes for v5.5 merge window
These changes just clean up few typos, and there is one non-critical
correction of missing put_device() after calling of_platform_populate()
for display controller.
Naturally none of this is urgent and can be merged when suitable.
The DRA7 CPSW MDIO functional clock (gmac_clkctrl DRA7_GMAC_GMAC_CLKCTRL 0)
is specified incorrectly, which is caused incorrect MDIO bus clock
configuration MDCLK. The correct CPSW MDIO functional clock is
gmac_main_clk (125MHz), which is the same as CPSW fck. Hence fix it.
Fixes: 1faa415c9c ("ARM: dts: Add fck for cpsw mdio for omap variants")
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
am57xx-beagle-x15 revb1 and revc have 3.3V connected to the eMMC I/O
lines. Update the pinmux name to reflect this.
Signed-off-by: Faiz Abbas <faiz_abbas@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
The MMC card detection GPIO polarity is active low on TAO3530, like in many
other similar boards. Now the card is not detected and it is unable to
mount rootfs from an SD card.
Fix this by using the correct polarity.
This incorrect polarity was defined already in the commit 30d95c6d70
("ARM: dts: omap3: Add Technexion TAO3530 SOM omap3-tao3530.dtsi") in v3.18
kernel and later changed to use defined GPIO constants in v4.4 kernel by
the commit 3a637e008e ("ARM: dts: Use defined GPIO constants in flags
cell for OMAP2+ boards").
While the latter commit did not introduce the issue I'm marking it with
Fixes tag due the v4.4 kernels still being maintained.
Fixes: 3a637e008e ("ARM: dts: Use defined GPIO constants in flags cell for OMAP2+ boards")
Cc: linux-stable <stable@vger.kernel.org> # 4.4+
Signed-off-by: Jarkko Nikula <jarkko.nikula@bitmer.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
kmemleak is used internally for a long time and as there isn't
any issue with it we can finally enable it in upstream.
Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Now have pixel clock PLL driver and we can change pixel clock rate
so we don't need to enforce one exact video mode. Moreover enforcing
video mode is harmful in case of we enforce mode which isn't
supported by the monitor we are using.
Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Use PLL driver instead of fixed-clock for PGU pixel clock.
That allows us to support wider range of graphic modes.
Signed-off-by: Eugeniy Paltsev <Eugeniy.Paltsev@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Linux-next commit titled "perf/core: Optimize perf_init_event()"
changed the semantics of PMU device driver registration.
It was done to speed up the lookup/handling of PMU device driver
specific events. It also enforces that only one PMU device
driver will be registered of type PERF_EVENT_RAW.
This change added these line in function perf_pmu_register():
...
+ ret = idr_alloc(&pmu_idr, pmu, max, 0, GFP_KERNEL);
+ if (ret < 0)
goto free_pdc;
+
+ WARN_ON(type >= 0 && ret != type);
The warn_on generates a message. We have 3 PMU device drivers,
each registered as type PERF_TYPE_RAW.
The cf_diag device driver (arch/s390/kernel/perf_cpumf_cf_diag.c)
always hits the WARN_ON because it is the second PMU device driver
(after sampling device driver arch/s390/kernel/perf_cpumf_sf.c)
which is registered as type 4 (PERF_TYPE_RAW).
So when the sampling device driver is registered, ret has value 4.
When cf_diag device driver is registered with type 4,
ret has value of 5 and WARN_ON fires.
Adjust the PMU device drivers for s390 to support the new
semantics required by perf_pmu_register().
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Since commit 1313cc2bd8 ("kvm: mmu: Add guest_mode to kvm_mmu_page_role"),
guest_mode was added to mmu-role and therefore if L0 use EPT, it will
always run L1 and L2 with different EPTP. i.e. EPTP01!=EPTP02.
Because TLB entries are tagged with EP4TA, KVM can assume
TLB entries populated while running L2 are tagged differently
than TLB entries populated while running L1.
Therefore, update nested_has_guest_tlb_tag() to consider if
L0 use EPT instead of if L1 use EPT.
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The function is only used in kvm.ko module.
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When L1 guest uses 5-level paging, it fails vm-entry to L2 due to
invalid host-state. It needs to add CR4_LA57 bit to nested CR4_FIXED1
MSR.
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Not zeroing the bitmap used for identifying the destination vCPUs for an
IOAPIC scan request in fixed delivery mode could lead to waking up unwanted
vCPUs. This patch zeroes the vCPU bitmap before passing it to
kvm_bitmap_or_dest_vcpus(), which is responsible for setting the bitmap
with the bits corresponding to the destination vCPUs.
Fixes: 7ee30bc132c6("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If an SMT capable system is not IPL'ed from the first CPU the setup of
the physical to logical CPU mapping is broken: the IPL core gets CPU
number 0, but then the next core gets CPU number 1. Correct would be
that all SMT threads of CPU 0 get the subsequent logical CPU numbers.
This is important since a lot of code (like e.g. the CPU topology
code) assumes that CPU maps are setup like this. If the mapping is
broken the system will not IPL due to broken topology masks:
[ 1.716341] BUG: arch topology broken
[ 1.716342] the SMT domain not a subset of the MC domain
[ 1.716343] BUG: arch topology broken
[ 1.716344] the MC domain not a subset of the BOOK domain
This scenario can usually not happen since LPARs are always IPL'ed
from CPU 0 and also re-IPL is intiated from CPU 0. However older
kernels did initiate re-IPL on an arbitrary CPU. If therefore a re-IPL
from an old kernel into a new kernel is initiated this may lead to
crash.
Fix this by setting up the physical to logical CPU mapping correctly.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
vdso_per_cpu_data lowcore value is only needed for fully functional
exception handlers, which are activated in setup_lowcore_dat_off. The
same function does init vdso_per_cpu_data via vdso_alloc_boot_cpu.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Currently if the kernel is built with CONFIG_TRACE_IRQFLAGS and KASAN
and used as crash kernel it crashes itself due to
trace_hardirqs_off/trace_hardirqs_on being called with DAT off. This
happens because trace_hardirqs_off/trace_hardirqs_on are instrumented and
kasan code tries to perform access to shadow memory to validate memory
accesses. Kasan shadow memory is populated with vmemmap, so all accesses
require DAT on.
memcpy_real could be called with DAT on or off (with kasan enabled DAT
is set even before early code is executed).
Make sure that trace_hardirqs_off/trace_hardirqs_on are called with DAT
on and only actual __memcpy_real is called with DAT off.
Also annotate __memcpy_real and _memcpy_real with __no_sanitize_address
to avoid further problems due to switching DAT off.
Reviewed-by: Philipp Rudo <prudo@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
s390_crypto_shash_parmsize() return type is int, it
should not be stored in a unsigned variable, which
compared with zero.
Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: 3c2eb6b76c ("s390/crypto: Support for SHA3 via CPACF (MSA6)")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Joerg Schmidbauer <jschmidb@linux.vnet.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This can be had with two instead of six insns, by just checking the high
CS.RPL bit.
Also adjust the comment - there would be no #GP in the mentioned cases, as
there's no segment limit violation or alike. Instead there'd be #PF, but
that one reports the target EIP of said branch, not the address of the
branch insn itself.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Link: https://lkml.kernel.org/r/a5986837-01eb-7bf8-bf42-4d3084d6a1f5@suse.com
Now that SS:ESP always get saved by SAVE_ALL, this also needs to be
accounted for in xen_iret_crit_fixup(). Otherwise the old_ax value gets
interpreted as EFLAGS, and hence VM86 mode appears to be active all the
time, leading to random "vm86_32: no user_vm86: BAD" log messages alongside
processes randomly crashing.
Since following the previous model (sitting after SAVE_ALL) would further
complicate the code _and_ retain the dependency of xen_iret_crit_fixup() on
frame manipulations done by entry_32.S, switch things around and do the
adjustment ahead of SAVE_ALL.
Fixes: 3c88c692c2 ("x86/stackframe/32: Provide consistent pt_regs")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Stable Team <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/32d8713d-25a7-84ab-b74b-aa3e88abce6b@suse.com
Once again RPL checks have been introduced which don't account for a 32-bit
kernel living in ring 1 when running in a PV Xen domain. The case in
FIXUP_FRAME has been preventing boot.
Adjust BUG_IF_WRONG_CR3 as well to guard against future uses of the macro
on a code path reachable when running in PV mode under Xen; I have to admit
that I stopped at a certain point trying to figure out whether there are
present ones.
Fixes: 3c88c692c2 ("x86/stackframe/32: Provide consistent pt_regs")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Stable Team <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/0fad341f-b7f5-f859-d55d-f0084ee7087e@suse.com
A 'struct device_type' instance can carry default attributes for the
device. Use this facility to remove the export of
nvdimm_bus_attribute_group and put the responsibility on the core rather
than leaf implementations to define this attribute.
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/157309903815.1582359.6418211876315050283.stgit@dwillia2-desk3.amr.corp.intel.com
A 'struct device_type' instance can carry default attributes for the
device. Use this facility to remove the export of
nvdimm_attribute_group and put the responsibility on the core rather
than leaf implementations to define this attribute.
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/157309903201.1582359.10966209746585062329.stgit@dwillia2-desk3.amr.corp.intel.com
A 'struct device_type' instance can carry default attributes for the
device. Use this facility to remove the export of
nd_mapping_attribute_group and put the responsibility on the core rather
than leaf implementations to define this attribute.
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/157309902686.1582359.6749533709859492704.stgit@dwillia2-desk3.amr.corp.intel.com
A 'struct device_type' instance can carry default attributes for the
device. Use this facility to remove the export of
nd_region_attribute_group and put the responsibility on the core rather
than leaf implementations to define this attribute.
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/157309902169.1582359.16828508538444551337.stgit@dwillia2-desk3.amr.corp.intel.com
A 'struct device_type' instance can carry default attributes for the
device. Use this facility to remove the export of
nd_numa_attribute_group and put the responsibility on the core rather
than leaf implementations to define this attribute.
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/157401269537.43284.14411189404186877352.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
x86/insn:
Adrian Hunter:
- Add some more Intel instructions to the opcode map:
cldemote, encls, enclu, enclv, enqcmd, enqcmds, movdir64b,
movdiri, pconfig, tpause, umonitor, umwait, wbnoinvd.
- The instruction decoding can be tested using the perf tools'
"x86 instruction decoder - new instructions" test as folllows:
$ perf test -v "new " 2>&1 | grep -i cldemote
Decoded ok: 0f 1c 00 cldemote (%eax)
Decoded ok: 0f 1c 05 78 56 34 12 cldemote 0x12345678
Decoded ok: 0f 1c 84 c8 78 56 34 12 cldemote 0x12345678(%eax,%ecx,8)
Decoded ok: 0f 1c 00 cldemote (%rax)
Decoded ok: 41 0f 1c 00 cldemote (%r8)
Decoded ok: 0f 1c 04 25 78 56 34 12 cldemote 0x12345678
Decoded ok: 0f 1c 84 c8 78 56 34 12 cldemote 0x12345678(%rax,%rcx,8)
Decoded ok: 41 0f 1c 84 c8 78 56 34 12 cldemote 0x12345678(%r8,%rcx,8)
$ perf test -v "new " 2>&1 | grep -i tpause
Decoded ok: 66 0f ae f3 tpause %ebx
Decoded ok: 66 0f ae f3 tpause %ebx
Decoded ok: 66 41 0f ae f0 tpause %r8d
callchains:
Adrian Hunter:
- Fix segfault in thread__resolve_callchain_sample().
perf probe:
- Line fixes to show only lines where probes can be used with 'perf probe -L',
and when reporting them via 'perf probe -l'.
- Support multiprobe events.
perf scripts python:
Adrian Hunter:
- Fix use of TRUE with SQLite < 3.23 in exported-sql-viewer.py.
perf maps:
- Trim 'struct map' by removing the rb_node member for sorting
by map name, as that is only needed for processing kernel maps,
and only when classifying symbols by section at load time.
Sort them by name using qsort() and do lookups using bsearch()
when map_groups__find_by_name() is used.
perf parse:
Ian Rogers:
- Report initial event parsing error, providing a less cryptic message
to state that a PMU wasn't found in the system.
perf vendor events:
James Clark:
- Fix commas so that PMU event files for arm64, power8 and power nine
become valid JSON.
libtraceevent:
Konstantin Khlebnikov:
- Fix parsing of event %o and %X argument types.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQR2GiIUctdOfX2qHhGyPKLppCJ+JwUCXdPSQwAKCRCyPKLppCJ+
J1FQAQD4pSumHTHczHm9xoYuOxFAoNlT7VaML6YXJb82+pvcCwD9FuVBmWUcBoa6
/Bmn3Iy4j37aZhQCDLdByaaKVF8n9g0=
=ST42
-----END PGP SIGNATURE-----
Merge tag 'perf-core-for-mingo-5.5-20191119' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
x86/insn:
Adrian Hunter:
- Add some more Intel instructions to the opcode map:
cldemote, encls, enclu, enclv, enqcmd, enqcmds, movdir64b,
movdiri, pconfig, tpause, umonitor, umwait, wbnoinvd.
- The instruction decoding can be tested using the perf tools'
"x86 instruction decoder - new instructions" test as folllows:
$ perf test -v "new " 2>&1 | grep -i cldemote
Decoded ok: 0f 1c 00 cldemote (%eax)
Decoded ok: 0f 1c 05 78 56 34 12 cldemote 0x12345678
Decoded ok: 0f 1c 84 c8 78 56 34 12 cldemote 0x12345678(%eax,%ecx,8)
Decoded ok: 0f 1c 00 cldemote (%rax)
Decoded ok: 41 0f 1c 00 cldemote (%r8)
Decoded ok: 0f 1c 04 25 78 56 34 12 cldemote 0x12345678
Decoded ok: 0f 1c 84 c8 78 56 34 12 cldemote 0x12345678(%rax,%rcx,8)
Decoded ok: 41 0f 1c 84 c8 78 56 34 12 cldemote 0x12345678(%r8,%rcx,8)
$ perf test -v "new " 2>&1 | grep -i tpause
Decoded ok: 66 0f ae f3 tpause %ebx
Decoded ok: 66 0f ae f3 tpause %ebx
Decoded ok: 66 41 0f ae f0 tpause %r8d
callchains:
Adrian Hunter:
- Fix segfault in thread__resolve_callchain_sample().
perf probe:
- Line fixes to show only lines where probes can be used with 'perf probe -L',
and when reporting them via 'perf probe -l'.
- Support multiprobe events.
perf scripts python:
Adrian Hunter:
- Fix use of TRUE with SQLite < 3.23 in exported-sql-viewer.py.
perf maps:
- Trim 'struct map' by removing the rb_node member for sorting
by map name, as that is only needed for processing kernel maps,
and only when classifying symbols by section at load time.
Sort them by name using qsort() and do lookups using bsearch()
when map_groups__find_by_name() is used.
perf parse:
Ian Rogers:
- Report initial event parsing error, providing a less cryptic message
to state that a PMU wasn't found in the system.
perf vendor events:
James Clark:
- Fix commas so that PMU event files for arm64, power8 and power nine
become valid JSON.
libtraceevent:
Konstantin Khlebnikov:
- Fix parsing of event %o and %X argument types.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 99e98d3fb1 ("cpuidle: Consolidate disabled state checks")
overlooked the fact that the imx6q and tegra20 cpuidle drivers use
the "disabled" field in struct cpuidle_state for quirks which trigger
after the initialization of cpuidle, so reading the initial value of
that field is not sufficient for those drivers.
In order to allow them to implement the quirks without using the
"disabled" field in struct cpuidle_state, introduce a new helper
function and modify them to use it.
Fixes: 99e98d3fb1 ("cpuidle: Consolidate disabled state checks")
Reported-by: Len Brown <lenb@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
On mpc83xx with a QE, IMMR is 2Mbytes and aligned on 2Mbytes boundarie.
On mpc83xx without a QE, IMMR is 1Mbyte and 1Mbyte aligned.
Each driver will map a part of it to access the registers it needs.
Some drivers will map the same part of IMMR as other drivers.
In order to reduce TLB misses, map the full IMMR with a BAT. If it is
2Mbytes aligned, map 2Mbytes. If there is no QE, the upper part will
remain unused, but it doesn't harm as it is mapped as guarded memory.
When the IMMR is not aligned on a 2Mbytes boundarie, only map 1Mbyte.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/269a00951328fb6fa1be2fa3cbc76c19745019b7.1568665466.git.christophe.leroy@c-s.fr
Modify back __set_fixmap() to using __fix_to_virt() instead
of fix_to_virt() otherwise the following happens because it
seems GCC doesn't see idx as a builtin const.
CC mm/early_ioremap.o
In file included from ./include/linux/kernel.h:11:0,
from mm/early_ioremap.c:11:
In function ‘fix_to_virt’,
inlined from ‘__set_fixmap’ at ./arch/powerpc/include/asm/fixmap.h:87:2,
inlined from ‘__early_ioremap’ at mm/early_ioremap.c:156:4:
./include/linux/compiler.h:350:38: error: call to ‘__compiletime_assert_32’ declared with attribute error: BUILD_BUG_ON failed: idx >= __end_of_fixed_addresses
_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
^
./include/linux/compiler.h:331:4: note: in definition of macro ‘__compiletime_assert’
prefix ## suffix(); \
^
./include/linux/compiler.h:350:2: note: in expansion of macro ‘_compiletime_assert’
_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
^
./include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’
#define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
^
./include/linux/build_bug.h:50:2: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
^
./include/asm-generic/fixmap.h:32:2: note: in expansion of macro ‘BUILD_BUG_ON’
BUILD_BUG_ON(idx >= __end_of_fixed_addresses);
^
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Fixes: 4cfac2f9c7 ("powerpc/mm: Simplify __set_fixmap()")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/f4984c615f90caa3277775a68849afeea846850d.1568295907.git.christophe.leroy@c-s.fr
Since commit f86ef74ed9 ("powerpc/8xx: Fix vaddr for IMMR early
remap"), the IMMR area has been mapped at startup with fixmap.
Use that fixmap directly instead of calling ioremap(), this
avoids calling ioremap() early before the slab is available.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/f816ccdbd15b97cf43c5a8c7cc8dfa8db58ff036.1568294935.git.christophe.leroy@c-s.fr
Now that jump and long displacement ranges are no longer a problem,
remove the limit on JITed image size. In practice it's still limited by
2G, but with verifier allowing "only" 1M instructions, it's not an
issue.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191118180340.68373-7-iii@linux.ibm.com
If literal pool grows past 524287 mark, it's no longer possible to use
long displacement to reference literal pool entries. In JIT setting
maintaining multiple literal pool registers is next to impossible, since
we operate on one instruction at a time.
Therefore, fall back to loading literal pool entry using PC-relative
addressing, and then using a register-register form of the following
machine instruction.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191118180340.68373-6-iii@linux.ibm.com
lg and lgrl have the same performance characteristics, but the former
requires a base register and is subject to long displacement range
limits, while the latter does not. Therefore, lgrl is totally superior
to lg and should be used instead whenever possible.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191118180340.68373-5-iii@linux.ibm.com
Currently literal pool register is loaded using basr, which makes it
point not to the beginning of the literal pool, but rather to the next
instruction. In case JITed code is larger than 512k, this renders
literal pool register absolutely useless due to long displacement range
restrictions.
The solution is to use larl to make literal pool register point to the
very beginning of the literal pool. This makes it always possible to
address 512k worth of literal pool entries using long displacement.
However, for short programs, in which the entire literal pool is covered
by basr-generated base, it is still beneficial to use basr, since it is
4 bytes shorter than larl.
Detect situations when basr-generated base does not cover the entire
literal pool, and in such cases use larl instead.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191118180340.68373-4-iii@linux.ibm.com
When literal pool size exceeds 512k, it's no longer possible to
reference all the entries in it using a single base register and long
displacement. Therefore, PC-relative lgfrl and lgrl instructions need to
be used.
Unfortunately, they require their arguments to be aligned to 4- and
8-byte boundaries respectively. This generates certain overhead due to
necessary padding bytes. Grouping 4- and 8-byte entries together reduces
the maximum overhead to 6 bytes (2 for aligning 4-byte entries and 4 for
aligning 8-byte entries).
While in theory it is possible to detect whether or not alignment is
needed by comparing the literal pool size with 512k, in practice this
leads to having two ways of emitting constants, making the code more
complicated.
Prefer code simplicity over trivial size saving, and always group and
align literal pool entries.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191118180340.68373-3-iii@linux.ibm.com
Currently maximum JITed code size is limited to 64k, because JIT can
emit only relative short branches, whose range is limited by 64k in both
directions.
Teach JIT to use relative long branches. There are no compare+branch
relative long instructions, so using relative long branches consumes
more space due to having to having to emit an explicit comparison
instruction. Therefore do this only when relative short branch is not
enough.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20191118180340.68373-2-iii@linux.ibm.com
Most 8xx registers have specific names, so just include
reg_8xx.h all the time in reg.h in order to have them defined
even when CONFIG_PPC_8xx is not selected. This will avoid
the need for #ifdefs in C code.
Guard SPRN_ICTRL in an #ifdef CONFIG_PPC_8xx as this register
has same name but different meaning and different spr number as
another register in the mpc7450.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/dd82934ad91aab607d0eb7e626c14e6ac0d654eb.1567068137.git.christophe.leroy@c-s.fr
Commit d2f15e0979 ("powerpc/32: always populate page tables for
Abatron BDI.") wrongly sets page tables for any PPC32 for using BDI,
and does't update them after init (remove RX on init section, set
text and rodata read-only)
Only the 8xx requires page tables to be populated for using the BDI.
They also need to be populated in order to see the mappings in
/sys/kernel/debug/kernel_page_tables
On BOOK3S_32, pages that are not mapped by page tables are mapped
by BATs. The BDI knows BATs and they can be viewed in
/sys/kernel/debug/powerpc/block_address_translation
Only set pagetables for RAM and IMMR on the 8xx and properly update
them at the end of init.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c8610942203e0d93fcb02ad20c57edd3adb4c9d3.1566554029.git.christophe.leroy@c-s.fr
powerpc always selects CONFIG_MMU and CONFIG_MMU is not checked
anywhere else in powerpc code.
Drop the #ifdef and the alternative part of is_ioremap_addr()
Fixes: 9bd3bb6703 ("mm/nvdimm: add is_ioremap_addr and use that to check ioremap address")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/de395e444fb8dd7a6365c3314d78e15ebb3d7d1b.1566382245.git.christophe.leroy@c-s.fr
The removed calgary IOMMU driver was the only user of this header file.
Reported-by: Jon Mason <jdmason@kudzu.us>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Autoloading of Falcon IDE driver modules requires converting these
drivers to platform drivers.
Add platform device for Falcon IDE interface in Atari platform setup
code. Use this in the pata_falcon driver in place of the simple
platform device set up on the fly.
Convert falconide driver to use the same platform device that is used
by pata_falcon also. (With the introduction of a platform device for
the Atari Falcon IDE interface, the old Falcon IDE driver no longer
loads (resource already claimed by the platform device)).
Tested (as built-in driver) on my Atari Falcon.
Signed-off-by: Michael Schmitz <schmitzmic@gmail.com>
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Link: https://lore.kernel.org/r/1573008449-8226-1-git-send-email-schmitzmic@gmail.com
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
BIOSen -> BIOSes; paing -> paging. Append to 640 its proper unit "Kb".
encomapssing -> encompassing.
[ bp: Merge into a single patch, fix one more typo, massage. ]
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Robert Richter <rrichter@marvell.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191118070012.27850-1-caoj.fnst@cn.fujitsu.com
This allows just loading the kernel at a pre-set address without
qemu going bonkers trying to map the ELF file.
Contains a contribution from Aurabindo Jayamohanan to reuse the
PAGE_OFFSET definition.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anup Patel <anup@brainfault.org>
[paul.walmsley@sifive.com: fixed checkpatch issue; minor commit
message fix]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
The kernel runs in M-mode without using page tables, and thus can't run
bare metal without help from additional firmware.
Most of the patch is just stubbing out code not needed without page
tables, but there is an interesting detail in the signals implementation:
- The normal RISC-V syscall ABI only implements rt_sigreturn as VDSO
entry point, but the ELF VDSO is not supported for nommu Linux.
We instead copy the code to call the syscall onto the stack.
In addition to enabling the nommu code a new defconfig for a small
kernel image that can run in nommu mode on qemu is also provided, to run
a kernel in qemu you can use the following command line:
qemu-system-riscv64 -smp 2 -m 64 -machine virt -nographic \
-kernel arch/riscv/boot/loader \
-drive file=rootfs.ext2,format=raw,id=hd0 \
-device virtio-blk-device,drive=hd0
Contains contributions from Damien Le Moal <Damien.LeMoal@wdc.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anup Patel <anup@brainfault.org>
[paul.walmsley@sifive.com: updated to apply; add CONFIG_MMU guards
around PCI_IOBASE definition to fix build issues; fixed checkpatch
issues; move the PCI_IO_* and VMEMMAP address space macros along
with the others; resolve sparse warning]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
When we get booted we want a clear slate without any leaks from previous
supervisors or the firmware. Flush the instruction cache and then clear
all registers to known good values. This is really important for the
upcoming nommu support that runs on M-mode, but can't really harm when
running in S-mode either. Vaguely based on the concepts from opensbi.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
When in M-Mode, we can use the mhartid CSR to get the ID of the running
HART. Doing so, direct M-Mode boot without firmware is possible.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Atish Patra <atish.patra@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
RISC-V has the concept of a cpu level interrupt controller. The
interface for it is split between a standardized part that is exposed
as bits in the mstatus/sstatus register and the mie/mip/sie/sip
CRS. But the bit to actually trigger IPIs is not standardized and
just mentioned as implementable using MMIO.
Add support for IPIs using MMIO using the SiFive clint layout (which
is also shared by Ariane, Kendryte and the Qemu virt platform).
Additionally the MMIO block also supports the time value and timer
compare registers, so they are also set up using the same OF node.
Support for other layouts should also be relatively easy to add in the
future.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anup Patel <anup@brainfault.org>
[paul.walmsley@sifive.com: update include guard format; fix checkpatch
issues; minor commit message cleanup]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
A 'struct device_type' instance can carry default attributes for the
device. Use this facility to remove the export of
nd_device_attribute_group and put the responsibility on the core rather
than leaf implementations to define this attribute.
For regions this creates a new nd_region_attribute_groups[] added to the
per-region device-type instances.
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Oliver O'Halloran" <oohall@gmail.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/157309901138.1582359.12909354140826530394.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This removes the warnings about the fact that the 4 pci bridges (i.e.
the 4 pci hosts) don't have any ranges.
Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
Signed-off-by: Scott Wood <oss@buserror.net>
Caching dates is never a good idea ;-)
Fixes: e7affb1dba ("powerpc/cache: add cache flush operation for various e500")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Scott Wood <oss@buserror.net>
Since commit 302c059f2e (QE: use subsys_initcall to init qe),
mpc85xx_qe_init() has done nothing apart from possibly emitting a
pr_err(). As part of reducing the amount of QE-related code in
arch/powerpc/ (and eventually support QE on other architectures),
remove this low-hanging fruit.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Scott Wood <oss@buserror.net>
Change all phy-connection-type properties to phy-mode that are better
supported by the fman driver.
Use the more readable fixed-link node for the 2 sgmii links.
Change the RGMII link to rgmii-id as the clock delays are added by the
phy.
Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
Acked-by: Madalin Bucur <madalin.bucur@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
This ports the SUPERCOP implementation for usage in kernel space. In
addition to the usual header, macro, and style changes required for
kernel space, it makes a few small changes to the code:
- The stack alignment is relaxed to 16 bytes.
- Superfluous mov statements have been removed.
- ldr for constants has been replaced with movw.
- ldreq has been replaced with moveq.
- The str epilogue has been made more idiomatic.
- SIMD registers are not pushed and popped at the beginning and end.
- The prologue and epilogue have been made idiomatic.
- A hole has been removed from the stack, saving 32 bytes.
- We write-back the base register whenever possible for vld1.8.
- Some multiplications have been reordered for better A7 performance.
There are more opportunities for cleanup, since this code is from qhasm,
which doesn't always do the most opportune thing. But even prior to
extensive hand optimizations, this code delivers significant performance
improvements (given in get_cycles() per call):
----------- -------------
| generic C | this commit |
------------ ----------- -------------
| Cortex-A7 | 49136 | 22395 |
------------ ----------- -------------
| Cortex-A17 | 17326 | 4983 |
------------ ----------- -------------
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
[ardb: - move to arch/arm/crypto
- wire into lib/crypto framework
- implement crypto API KPP hooks ]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This comes from Dan Bernstein and Peter Schwabe's public domain NEON
code, and is included here in raw form so that subsequent commits that
fix these up for the kernel can see how it has changed. This code does
have some entirely cosmetic formatting differences, adding indentation
and so forth, so that when we actually port it for use in the kernel in
the subsequent commit, it's obvious what's changed in the process.
This code originates from SUPERCOP 20180818, available at
<https://bench.cr.yp.to/supercop.html>.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This implementation is the fastest available x86_64 implementation, and
unlike Sandy2x, it doesn't requie use of the floating point registers at
all. Instead it makes use of BMI2 and ADX, available on recent
microarchitectures. The implementation was written by Armando
Faz-Hernández with contributions (upstream) from Samuel Neves and me,
in addition to further changes in the kernel implementation from us.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
[ardb: - move to arch/x86/crypto
- wire into lib/crypto framework
- implement crypto API KPP hooks ]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
These implementations from Samuel Neves support AVX and AVX-512VL.
Originally this used AVX-512F, but Skylake thermal throttling made
AVX-512VL more attractive and possible to do with negligable difference.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Co-developed-by: Samuel Neves <sneves@dei.uc.pt>
[ardb: move to arch/x86/crypto, wire into lib/crypto framework]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In order to use 128-bit integer arithmetic in C code, the architecture
needs to have declared support for it by setting ARCH_SUPPORTS_INT128,
and it requires a version of the toolchain that supports this at build
time. This is why all existing tests for ARCH_SUPPORTS_INT128 also test
whether __SIZEOF_INT128__ is defined, since this is only the case for
compilers that can support 128-bit integers.
Let's fold this additional test into the Kconfig declaration of
ARCH_SUPPORTS_INT128 so that we can also use the symbol in Makefiles,
e.g., to decide whether a certain object needs to be included in the
first place.
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305 implementation for
MIPS authored by Andy Polyakov, a prior 64-bit only version of which has been
contributed by him to the OpenSSL project. The file 'poly1305-mips.pl' is taken
straight from this upstream GitHub repository [0] at commit
d22ade312a7af958ec955620b0d241cf42c37feb, and already contains all the changes
required to build it as part of a Linux kernel module.
[0] https://github.com/dot-asm/cryptogams
Co-developed-by: Andy Polyakov <appro@cryptogams.org>
Signed-off-by: Andy Polyakov <appro@cryptogams.org>
Co-developed-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305 implementation
for NEON authored by Andy Polyakov, and contributed by him to the OpenSSL
project. The file 'poly1305-armv4.pl' is taken straight from this upstream
GitHub repository [0] at commit ec55a08dc0244ce570c4fc7cade330c60798952f,
and already contains all the changes required to build it as part of a
Linux kernel module.
[0] https://github.com/dot-asm/cryptogams
Co-developed-by: Andy Polyakov <appro@cryptogams.org>
Signed-off-by: Andy Polyakov <appro@cryptogams.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This is a straight import of the OpenSSL/CRYPTOGAMS Poly1305 implementation
for NEON authored by Andy Polyakov, and contributed by him to the OpenSSL
project. The file 'poly1305-armv8.pl' is taken straight from this upstream
GitHub repository [0] at commit ec55a08dc0244ce570c4fc7cade330c60798952f,
and already contains all the changes required to build it as part of a
Linux kernel module.
[0] https://github.com/dot-asm/cryptogams
Co-developed-by: Andy Polyakov <appro@cryptogams.org>
Signed-off-by: Andy Polyakov <appro@cryptogams.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Implement the arch init/update/final Poly1305 library routines in the
accelerated SIMD driver for x86 so they are accessible to users of
the Poly1305 library interface as well.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Remove the dependency on the generic Poly1305 driver. Instead, depend
on the generic library so that we only reuse code without pulling in
the generic skcipher implementation as well.
While at it, remove the logic that prefers the non-SIMD path for short
inputs - this is no longer necessary after recent FPU handling changes
on x86.
Since this removes the last remaining user of the routines exported
by the generic shash driver, unexport them and make them static.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In preparation of exposing a Poly1305 library interface directly from
the accelerated x86 driver, align the state descriptor of the x86 code
with the one used by the generic driver. This is needed to make the
library interface unified between all implementations.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Move the core Poly1305 routines shared between the generic Poly1305
shash driver and the Adiantum and NHPoly1305 drivers into a separate
library so that using just this pieces does not pull in the crypto
API pieces of the generic Poly1305 routine.
In a subsequent patch, we will augment this generic library with
init/update/final routines so that Poyl1305 algorithm can be used
directly without the need for using the crypto API's shash abstraction.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This integrates the accelerated MIPS 32r2 implementation of ChaCha
into both the API and library interfaces of the kernel crypto stack.
The significance of this is that, in addition to becoming available
as an accelerated library implementation, it can also be used by
existing crypto API code such as Adiantum (for block encryption on
ultra low performance cores) or IPsec using chacha20poly1305. These
are use cases that have already opted into using the abstract crypto
API. In order to support Adiantum, the core assembler routine has
been adapted to take the round count as a function argument rather
than hardcoding it to 20.
Co-developed-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This imports the accelerated MIPS 32r2 ChaCha20 implementation from the
Zinc patch set.
Co-developed-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Expose the accelerated NEON ChaCha routine directly as a symbol
export so that users of the ChaCha library API can use it directly.
Given that calls into the library API will always go through the
routines in this module if it is enabled, switch to static keys
to select the optimal implementation available (which may be none
at all, in which case we defer to the generic implementation for
all invocations).
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Instead of falling back to the generic ChaCha skcipher driver for
non-SIMD cases, use a fast scalar implementation for ARM authored
by Eric Biggers. This removes the module dependency on chacha-generic
altogether, which also simplifies things when we expose the ChaCha
library interface from this module.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Expose the accelerated NEON ChaCha routine directly as a symbol
export so that users of the ChaCha library API can use it directly.
Given that calls into the library API will always go through the
routines in this module if it is enabled, switch to static keys
to select the optimal implementation available (which may be none
at all, in which case we defer to the generic implementation for
all invocations).
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Depend on the generic ChaCha library routines instead of pulling in the
generic ChaCha skcipher driver, which is more than we need, and makes
managing the dependencies between the generic library, generic driver,
accelerated library and driver more complicated.
While at it, drop the logic to prefer the scalar code on short inputs.
Turning the NEON on and off is cheap these days, and one major use case
for ChaCha20 is ChaCha20-Poly1305, which is guaranteed to hit the scalar
path upon every invocation (when doing the Poly1305 nonce generation)
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Wire the existing x86 SIMD ChaCha code into the new ChaCha library
interface, so that users of the library interface will get the
accelerated version when available.
Given that calls into the library API will always go through the
routines in this module if it is enabled, switch to static keys
to select the optimal implementation available (which may be none
at all, in which case we defer to the generic implementation for
all invocations).
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In preparation of extending the x86 ChaCha driver to also expose the ChaCha
library interface, drop the dependency on the chacha_generic crypto driver
as a non-SIMD fallback, and depend on the generic ChaCha library directly.
This way, we only pull in the code we actually need, without registering
a set of ChaCha skciphers that we will never use.
Since turning the FPU on and off is cheap these days, simplify the SIMD
routine by dropping the per-page yield, which makes for a cleaner switch
to the library API as well. This also allows use to invoke the skcipher
walk routines in non-atomic mode.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Currently, our generic ChaCha implementation consists of a permute
function in lib/chacha.c that operates on the 64-byte ChaCha state
directly [and which is always included into the core kernel since it
is used by the /dev/random driver], and the crypto API plumbing to
expose it as a skcipher.
In order to support in-kernel users that need the ChaCha streamcipher
but have no need [or tolerance] for going through the abstractions of
the crypto API, let's expose the streamcipher bits via a library API
as well, in a way that permits the implementation to be superseded by
an architecture specific one if provided.
So move the streamcipher code into a separate module in lib/crypto,
and expose the init() and crypt() routines to users of the library.
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull x86 fixes from Ingo Molnar:
"Two fixes: disable unreliable HPET on Intel Coffe Lake platforms, and
fix a lockdep splat in the resctrl code"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Fix potential lockdep warning
x86/quirks: Disable HPET on Intel Coffe Lake platforms
- Add proper cleanup on optee_enumerate_devices() failure
- Make sure to register kernel allocations of dynamic shared memory
-----BEGIN PGP SIGNATURE-----
iQJOBAABCgA4FiEEcK3MsDvGvFp6zV9ztbC4QZeP7NMFAl3OgAAaHGplbnMud2lr
bGFuZGVyQGxpbmFyby5vcmcACgkQtbC4QZeP7NO+ZQ/9HIvaUqStpOVVPqRVx8UU
E2qNHleaVf72KIDfNgtGw2iovEaDr8Js2tWqXHz9f9awUJV5pJw098BEh45OC4KA
bqAgfX7Gxi9evBb0zPN42dXY7PSNlBYF7vmPs9zfTHlplGtigWbv5A9U34J/GZcX
7RXZU/IigYst7s00cu9CdT3XSkrCsx2M1Ecg7rrbkBirGCX8p4veRZ1XtYpJFGUF
xA1Xstq6x1pYEkcLRsX0k/R2GlvNKIgPjmAA8zh8GHEWNMcxeMcv+Gd4nAVJN9S0
eRr6jAJuJ597/6jS2HzCq/3ygNwgKKvdlthuDMizCzDkufo1wQdxAKdt3rCXzCr8
zPFAzRpaeHX5gep94wjpj/YqV70a4miS09OVbpkuL7a8kbOdkdcQHKAVvF7BNJvF
2Wp3Ytlnh9pfnl1A3bMp7tmdUY3LlW1ob+LtoS7+PmpxrQYp7XB3gPE6G3OF3xCS
hKSOF5BgreeMZpw/oX1IH7zMLrQcjf6kZShtcrZTFVw84bXSIXpt30KCsBA1Hdzf
EyygrIMgVEPOnIxiiaVrtd34f2LXc8YdiMj3N00vHzljFsMh2wcy7covEbxt62/7
3a3Jqz23ngRXFFQQaT9iNc73WKST4iuyIPTjcrwqc8VDlSvkq95SQVArj3bfafmC
0lhFIW4hnCjsHSrFqcgEZVg=
=E+qa
-----END PGP SIGNATURE-----
Merge tag 'tee-fixes-for-v5.4' of git://git.linaro.org/people/jens.wiklander/linux-tee into arm/fixes
Two OP-TE driver fixes:
- Add proper cleanup on optee_enumerate_devices() failure
- Make sure to register kernel allocations of dynamic shared memory
* tag 'tee-fixes-for-v5.4' of git://git.linaro.org/people/jens.wiklander/linux-tee: (591 commits)
tee: optee: fix device enumeration error handling
tee: optee: Fix dynamic shm pool allocations
Linux 5.4-rc3
tracing: Initialize iter->seq after zeroing in tracing_read_pipe()
tracing/hwlat: Don't ignore outer-loop duration when calculating max_latency
tracing/hwlat: Report total time spent in all NMIs during the sample
recordmcount: Fix nop_mcount() function
tracing: Do not create tracefs files if tracefs lockdown is in effect
tracing: Add locked_down checks to the open calls of files created for tracefs
tracing: Add tracing_check_open_get_tr()
tracing: Have trace events system open call tracing_open_generic_tr()
tracing: Get trace_array reference for available_tracers files
ftrace: Get a reference counter for the trace_array on filter files
tracefs: Revert ccbd54ff54 ("tracefs: Restrict tracefs when the kernel is locked down")
perf/x86/cstate: Add Tiger Lake CPU support
perf/x86/msr: Add Tiger Lake CPU support
perf/x86/intel: Add Tiger Lake CPU support
perf/x86/cstate: Update C-state counters for Ice Lake
perf/x86/msr: Add new CPU model numbers for Ice Lake
perf/x86/cstate: Add Comet Lake CPU support
...
Link: https://lore.kernel.org/r/20191115105353.GA26176@jax
Signed-off-by: Olof Johansson <olof@lixom.net>
This adds recently merged drivers to the aspeed g4 and g5 defconfigs,
and to the multi v5 and v7 shared defconfigs.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+nHMAt9PCBDH63wBa3ZZB4FHcJ4FAl3MldMACgkQa3ZZB4FH
cJ5eoA//QmS9ldAtiC02bCOY9iO7e9z2yrWb7z6EHgULx5gP2YOo9n4Vp/2R9kcB
yNrhoM/ichwkB0enq+7V802/IK5R80sr4C1C7uT8pCmBob1g3wQ8KbmyeoSCMhyb
c+psF5D6+tB5Ahqmw12gMulCUtXi9rOgHFkRfxlvbJl5hYzoR5oPJZZidqCgttOe
5SHK8O1DXeWIUb4zsggMtqb58BBA7QBkZmdIzcNaaLpnuwX+CeudvB5kkNCP/XPu
xdXF3XC8kIBKsLwanINdFgZKPxFFHrEYjFDPG5gEpAEhKH0Tbri5YGuohr464QFb
dRJikoBWGrSMcWaqXJfECxYJisBHG9xiMG5vigicSuhqJPUqo+rogMTOw+Katzx2
yF067law+MW/8JyxD/sGXQLjKq/D60KgoILEjJyEo7PKdY1AfIqdC+WaivdDveDI
JjHD75rgif/72lezdE9XBYhKdbe1FecvqpQ602aIPLl6hat+zrOp/G1UJJLFPZ+V
ffh6yKeZ4hzl+SZziV3UOQ333v/31Ww+yeCEwfeodyGNIhjALLgAmSm8j5m/IhHO
f2otJDewd8ex2AZKOrSvuqQKzxg+D27u6+qy5QWYYHS3ZvSBGxRkDdmT2i1BtAZw
gPpxJFHppFn3TNGftlXNbJhyv4sx29Ts3Ci5MbEMQBzaDK8Jgy8=
=/4iy
-----END PGP SIGNATURE-----
Merge tag 'aspeed-5.5-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/joel/aspeed into arm/defconfig
ASPEED device tree updates for 5.5
This adds recently merged drivers to the aspeed g4 and g5 defconfigs,
and to the multi v5 and v7 shared defconfigs.
* tag 'aspeed-5.5-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/joel/aspeed:
ARM: config: multi_v5: ASPEED SDHCI, SGPIO
ARM: configs: multi_v7: ASPEED network, gpio, FSI
ARM: config: aspeed-g4: Add MMC, and cleanup
ARM: config: aspeed-g5: Add SGPIO and FSI drivers
ARM: config: aspeed-g5: Enable 8250_DW quirks
Link: https://lore.kernel.org/r/CACPK8Xc3a=6irkTSfwyPrr=nWv_fa9nMxthtC2AyqFJEWvuMwg@mail.gmail.com
Signed-off-by: Olof Johansson <olof@lixom.net>
Since MDS and TAA mitigations are inter-related for processors that are
affected by both vulnerabilities, the followiing confusing messages can
be printed in the kernel log:
MDS: Vulnerable
MDS: Mitigation: Clear CPU buffers
To avoid the first incorrect message, defer the printing of MDS
mitigation after the TAA mitigation selection has been done. However,
that has the side effect of printing TAA mitigation first before MDS
mitigation.
[ bp: Check box is affected/mitigations are disabled first before
printing and massage. ]
Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Mark Gross <mgross@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191115161445.30809-3-longman@redhat.com
For MDS vulnerable processors with TSX support, enabling either MDS or
TAA mitigations will enable the use of VERW to flush internal processor
buffers at the right code path. IOW, they are either both mitigated
or both not. However, if the command line options are inconsistent,
the vulnerabilites sysfs files may not report the mitigation status
correctly.
For example, with only the "mds=off" option:
vulnerabilities/mds:Vulnerable; SMT vulnerable
vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable
The mds vulnerabilities file has wrong status in this case. Similarly,
the taa vulnerability file will be wrong with mds mitigation on, but
taa off.
Change taa_select_mitigation() to sync up the two mitigation status
and have them turned off if both "mds=off" and "tsx_async_abort=off"
are present.
Update documentation to emphasize the fact that both "mds=off" and
"tsx_async_abort=off" have to be specified together for processors that
are affected by both TAA and MDS to be effective.
[ bp: Massage and add kernel-parameters.txt change too. ]
Fixes: 1b42f01741 ("x86/speculation/taa: Add mitigation for TSX Async Abort")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: linux-doc@vger.kernel.org
Cc: Mark Gross <mgross@linux.intel.com>
Cc: <stable@vger.kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191115161445.30809-2-longman@redhat.com
Jump directly to restore_regs_and_return_to_kernel instead of making
a pointless extra jump through .Lparanoid_exit_restore
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191023123117.779277679@linutronix.de
The C reimplementation of SYSENTER left that unused ENTRY() label
around. Remove it.
Fixes: 5f310f739b ("x86/entry/32: Re-implement SYSENTER using the new C path")
Originally-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191023123117.686514045@linutronix.de
On platforms implementing CPU power management, the CPUidle subsystem
can allow CPUs to enter idle states where local timers logic is lost on power
down. To keep the software timers functional the kernel relies on an
always-on broadcast timer to be present in the platform to relay the
interrupt signalling the timer expiries.
For platforms implementing CPU core gating that do not implement an always-on
HW timer or implement it in a broken way, this patch adds code to initialize
the kernel hrtimer based clock event device upon boot (which can be chosen as
tick broadcast device by the kernel).
It relies on a dynamically chosen CPU to be always powered-up. This CPU then
relays the timer interrupt to CPUs in deep-idle states through its HW local
timer device.
Having a CPU always-on has implications on power management platform
capabilities and makes CPUidle suboptimal, since at least a CPU is kept
always in a shallow idle state by the kernel to relay timer interrupts,
but at least leaves the kernel with a functional system with some working
power management capabilities.
The hrtimer based clock event device is unconditionally registered, but
has the lowest possible rating such that any broadcast-capable HW clock
event device present will be chosen in preference as the tick broadcast
device.
Signed-off-by: Benjamin Gaignard <benjamin.gaignard@st.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
When the default processor handling was added to the function
cpu_v7_spectre_init() it only excluded other ARM implemented processor
cores. The Broadcom Brahma B53 core is not implemented by ARM so it
ended up falling through into the set of processors that attempt to use
the ARM_SMCCC_ARCH_WORKAROUND_1 service to harden the branch predictor.
Since this workaround is not necessary for the Brahma-B53 this commit
explicitly checks for it and prevents it from applying a branch
predictor hardening workaround.
Fixes: 10115105cb ("ARM: spectre-v2: add firmware based hardening")
Signed-off-by: Doug Berger <opendmb@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
commit 6690e86be8 ("sched/x86: Save [ER]FLAGS on context switch")
re-introduced the flags saving on context switch to prevent AC leakage.
The pushf/popf instructions are right among the callee saved register
section, so the comment explaining the save/restore is not entirely
correct.
Add a seperate comment to pushf/popf explaining the reason.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If iopl() is disabled, then providing ioperm() does not make much sense.
Rename the config option and disable/enable both syscalls with it. Guard
the code with #ifdefs where appropriate.
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The IOPL emulation via the I/O bitmap is sufficient. Remove the legacy
cruft dealing with the (e)flags based IOPL mechanism.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com> (Paravirt and Xen parts)
Acked-by: Andy Lutomirski <luto@kernel.org>
The access to the full I/O port range can be also provided by the TSS I/O
bitmap, but that would require to copy 8k of data on scheduling in the
task. As shown with the sched out optimization TSS.io_bitmap_base can be
used to switch the incoming task to a preallocated I/O bitmap which has all
bits zero, i.e. allows access to all I/O ports.
Implementing this allows to provide an iopl() emulation mode which restricts
the IOPL level 3 permissions to I/O port access but removes the STI/CLI
permission which is coming with the hardware IOPL mechansim.
Provide a config option to switch IOPL to emulation mode, make it the
default and while at it also provide an option to disable IOPL completely.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
The comment for the sys_iopl() implementation is outdated and actively
misleading in some parts. Fix it up.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
The I/O bitmap is duplicated on fork. That's wasting memory and slows down
fork. There is no point to do so. As long as the bitmap is not modified it
can be shared between threads and processes.
Add a refcount and just share it on fork. If a task modifies the bitmap
then it has to do the duplication if and only if it is shared.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
If ioperm() results in a bitmap with all bits set (no permissions to any
I/O port), then handling that bitmap on context switch and exit to user
mode is pointless. Drop it.
Move the bitmap exit handling to the ioport code and reuse it for both the
thread exit path and dropping it. This allows to reuse this code for the
upcoming iopl() emulation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
There is no point to update the TSS bitmap for tasks which use I/O bitmaps
on every context switch. It's enough to update it right before exiting to
user space.
That reduces the context switch bitmap handling to invalidating the io
bitmap base offset in the TSS when the outgoing task has TIF_IO_BITMAP
set. The invaldiation is done on purpose when a task with an IO bitmap
switches out to prevent any possible leakage of an activated IO bitmap.
It also removes the requirement to update the tasks bitmap atomically in
ioperm().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add a globally unique sequence number which is incremented when ioperm() is
changing the I/O bitmap of a task. Store the new sequence number in the
io_bitmap structure and compare it with the sequence number of the I/O
bitmap which was last loaded on a CPU. Only update the bitmap if the
sequence is different.
That should further reduce the overhead of I/O bitmap scheduling when there
are only a few I/O bitmap users on the system.
The 64bit sequence counter is sufficient. A wraparound of the sequence
counter assuming an ioperm() call every nanosecond would require about 584
years of uptime.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
No point in having all the data in thread_struct, especially as upcoming
changes add more.
Make the bitmap in the new struct accessible as array of longs and as array
of characters via a union, so both the bitmap functions and the update
logic can avoid type casts.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Move the non hardware portion of I/O bitmap data into a seperate struct for
readability sake.
Originally-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There is no requirement to update the TSS I/O bitmap when a thread using it is
scheduled out and the incoming thread does not use it.
For the permission check based on the TSS I/O bitmap the CPU calculates the memory
location of the I/O bitmap by the address of the TSS and the io_bitmap_base member
of the tss_struct. The easiest way to invalidate the I/O bitmap is to switch the
offset to an address outside of the TSS limit.
If an I/O instruction is issued from user space the TSS limit causes #GP to be
raised in the same was as valid I/O bitmap with all bits set to 1 would do.
This removes the extra work when an I/O bitmap using task is scheduled out
and puts the burden on the rare I/O bitmap users when they are scheduled
in.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If ioperm() is invoked the first time and the @turn_on argument is 0, then
there is no point to allocate a bitmap just to clear permissions which are
not set.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
On the first allocation of a task the I/O bitmap needs to be
allocated. After the allocation it is installed as an empty bitmap and
immediately afterwards updated.
Avoid that and just do the initial updates (store bitmap pointer, set TIF
flag and make TSS limit valid) in the update path unconditionally. If the
bitmap was already active this is redundant but harmless.
Preparatory change for later optimizations in the context switch code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The BUILD_BUG_ON(IO_BITMAP_OFFSET - 1 == 0x67) in the VMX code is bogus in
two aspects:
1) This wants to be in generic x86 code simply to catch issues even when
VMX is disabled in Kconfig.
2) The IO_BITMAP_OFFSET is not the right thing to check because it makes
asssumptions about the layout of tss_struct. Nothing requires that the
I/O bitmap is placed right after x86_tss, which is the hardware mandated
tss structure. It pointlessly makes restrictions on the struct
tss_struct layout.
The proper thing to check is:
- Offset of x86_tss in tss_struct is 0
- Size of x86_tss == 0x68
Move it to the other build time TSS checks and make it do the right thing.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Similar to copy_thread_tls() the 32bit and 64bit implementations of
cpu_init() are very similar and unification avoids duplicate changes in the
future.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
While looking at the TSS io bitmap it turned out that any change in that
area would require identical changes to copy_thread_tls(). The 32 and 64
bit variants share sufficient code to consolidate them into a common
function to avoid duplication of upcoming modifications.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
The active() callback of the IO bitmap regset divides the IO bitmap size by
the word size (32/64 bit). As the I/O bitmap size is in bytes the active
check fails for bitmap sizes of 1-3 bytes on 32bit and 1-7 bytes on 64bit.
Use DIV_ROUND_UP() instead.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Allow FENTRY/FEXIT BPF programs to attach to other BPF programs of any type
including their subprograms. This feature allows snooping on input and output
packets in XDP, TC programs including their return values. In order to do that
the verifier needs to track types not only of vmlinux, but types of other BPF
programs as well. The verifier also needs to translate uapi/linux/bpf.h types
used by networking programs into kernel internal BTF types used by FENTRY/FEXIT
BPF programs. In some cases LLVM optimizations can remove arguments from BPF
subprograms without adjusting BTF info that LLVM backend knows. When BTF info
disagrees with actual types that the verifiers sees the BPF trampoline has to
fallback to conservative and treat all arguments as u64. The FENTRY/FEXIT
program can still attach to such subprograms, but it won't be able to recognize
pointer types like 'struct sk_buff *' and it won't be able to pass them to
bpf_skb_output() for dumping packets to user space. The FENTRY/FEXIT program
would need to use bpf_probe_read_kernel() instead.
The BPF_PROG_LOAD command is extended with attach_prog_fd field. When it's set
to zero the attach_btf_id is one vmlinux BTF type ids. When attach_prog_fd
points to previously loaded BPF program the attach_btf_id is BTF type id of
main function or one of its subprograms.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-18-ast@kernel.org
BPF trampoline can be made to work with existing 5 bytes of BPF program
prologue, but let's add 5 bytes of NOPs to the beginning of every JITed BPF
program to make BPF trampoline job easier. They can be removed in the future.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-14-ast@kernel.org
Introduce BPF trampoline concept to allow kernel code to call into BPF programs
with practically zero overhead. The trampoline generation logic is
architecture dependent. It's converting native calling convention into BPF
calling convention. BPF ISA is 64-bit (even on 32-bit architectures). The
registers R1 to R5 are used to pass arguments into BPF functions. The main BPF
program accepts only single argument "ctx" in R1. Whereas CPU native calling
convention is different. x86-64 is passing first 6 arguments in registers
and the rest on the stack. x86-32 is passing first 3 arguments in registers.
sparc64 is passing first 6 in registers. And so on.
The trampolines between BPF and kernel already exist. BPF_CALL_x macros in
include/linux/filter.h statically compile trampolines from BPF into kernel
helpers. They convert up to five u64 arguments into kernel C pointers and
integers. On 64-bit architectures this BPF_to_kernel trampolines are nops. On
32-bit architecture they're meaningful.
The opposite job kernel_to_BPF trampolines is done by CAST_TO_U64 macros and
__bpf_trace_##call() shim functions in include/trace/bpf_probe.h. They convert
kernel function arguments into array of u64s that BPF program consumes via
R1=ctx pointer.
This patch set is doing the same job as __bpf_trace_##call() static
trampolines, but dynamically for any kernel function. There are ~22k global
kernel functions that are attachable via nop at function entry. The function
arguments and types are described in BTF. The job of btf_distill_func_proto()
function is to extract useful information from BTF into "function model" that
architecture dependent trampoline generators will use to generate assembly code
to cast kernel function arguments into array of u64s. For example the kernel
function eth_type_trans has two pointers. They will be casted to u64 and stored
into stack of generated trampoline. The pointer to that stack space will be
passed into BPF program in R1. On x86-64 such generated trampoline will consume
16 bytes of stack and two stores of %rdi and %rsi into stack. The verifier will
make sure that only two u64 are accessed read-only by BPF program. The verifier
will also recognize the precise type of the pointers being accessed and will
not allow typecasting of the pointer to a different type within BPF program.
The tracing use case in the datacenter demonstrated that certain key kernel
functions have (like tcp_retransmit_skb) have 2 or more kprobes that are always
active. Other functions have both kprobe and kretprobe. So it is essential to
keep both kernel code and BPF programs executing at maximum speed. Hence
generated BPF trampoline is re-generated every time new program is attached or
detached to maintain maximum performance.
To avoid the high cost of retpoline the attached BPF programs are called
directly. __bpf_prog_enter/exit() are used to support per-program execution
stats. In the future this logic will be optimized further by adding support
for bpf_stats_enabled_key inside generated assembly code. Introduction of
preemptible and sleepable BPF programs will completely remove the need to call
to __bpf_prog_enter/exit().
Detach of a BPF program from the trampoline should not fail. To avoid memory
allocation in detach path the half of the page is used as a reserve and flipped
after each attach/detach. 2k bytes is enough to call 40+ BPF programs directly
which is enough for BPF tracing use cases. This limit can be increased in the
future.
BPF_TRACE_FENTRY programs have access to raw kernel function arguments while
BPF_TRACE_FEXIT programs have access to kernel return value as well. Often
kprobe BPF program remembers function arguments in a map while kretprobe
fetches arguments from a map and analyzes them together with return value.
BPF_TRACE_FEXIT accelerates this typical use case.
Recursion prevention for kprobe BPF programs is done via per-cpu
bpf_prog_active counter. In practice that turned out to be a mistake. It
caused programs to randomly skip execution. The tracing tools missed results
they were looking for. Hence BPF trampoline doesn't provide builtin recursion
prevention. It's a job of BPF program itself and will be addressed in the
follow up patches.
BPF trampoline is intended to be used beyond tracing and fentry/fexit use cases
in the future. For example to remove retpoline cost from XDP programs.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org
Add bpf_arch_text_poke() helper that is used by BPF trampoline logic to patch
nops/calls in kernel text into calls into BPF trampoline and to patch
calls/nops inside BPF programs too.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-4-ast@kernel.org
Refactor x86 JITing of LDX, STX, CALL instructions into separate helper
functions. No functional changes in LDX and STX helpers. There is a minor
change in CALL helper. It will populate target address correctly on the first
pass of JIT instead of second pass. That won't reduce total number of JIT
passes though.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20191114185720.1641606-3-ast@kernel.org
It looks like a section directive was using "Solaris style" to declare
the section flags. Replace this with the GNU style so that Clang's
integrated assembler can assemble this directive.
The modified instances were identified via:
$ ag \.section | grep #
Link: https://ftp.gnu.org/old-gnu/Manuals/gas-2.9.1/html_chapter/as_7.html#SEC119
Link: https://github.com/ClangBuiltLinux/linux/issues/744
Link: https://bugs.llvm.org/show_bug.cgi?id=43759
Link: https://reviews.llvm.org/D69296
Acked-by: Nicolas Pitre <nico@fluxnic.net>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Suggested-by: Fangrui Song <maskray@google.com>
Suggested-by: Jian Cai <jiancai@google.com>
Suggested-by: Peter Smith <peter.smith@linaro.org>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
With the release of Linux 5.1 has been added a new syscall,
clock_gettime64, that provided a 64 bit time value for a specified
clock_ID to make the kernel Y2038 safe on 32 bit architectures.
Update the arm specific vDSO library accordingly with what it has
been done for the kernel syscall exposing the clock_gettime64 entry
point.
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
The generic vDSO library provides an implementation of clock_getres() that
can be leveraged by each architecture.
Add clock_getres() entry point on arm to be on pair with arm64.
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
The arm vDSO library requires some adaptations to take advantage of
the newly introduced generic vDSO library.
Introduce the following changes:
- Modification vdso.c to be compliant with the common vdso datapage
- Use of lib/vdso for gettimeofday
- Implementation of elf note
Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
LLVM's integrated assembler does not accept r15 as mrc operand.
arch/arm/boot/compressed/head.S:1267:16: error: operand must be a register in range [r0, r14] or apsr_nzcv
1: mrc p15, 0, r15, c7, c14, 3 @ test,clean,invalidate D cache
^
Use APSR_nzcv instead of r15. The GNU assembler supports this
syntax since binutils 2.21 [0].
[0] https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commit;h=db472d6ff0f438a21b357249a9b48e4b74498076
Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
When ARMv8 cores are used in AArch32 mode, arch_hw_breakpoint_init()
in arch/arm/kernel/hw_breakpoint.c will be used.
From ARMv8 specification, v8 debug architecture versions defined:
* 0110 ARMv8, v8 Debug architecture.
* 0111 ARMv8.1, v8 Debug architecture, with Virtualization Host
Extensions.
* 1000 ARMv8.2, v8.2 Debug architecture.
* 1001 ARMv8.4, v8.4 Debug architecture.
So missing ARMv8.1/ARMv8.2/ARMv8.4 cases will cause
enable_monitor_mode() returns -ENODEV,and eventually
arch_hw_breakpoint_init() will fail.
Signed-off-by: Candle Sun <candle.sun@unisoc.com>
Signed-off-by: Nianfu Bai <nianfu.bai@unisoc.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
The system currently warns if the config conditions for
building return_address in arch/arm/kernel/return_address.c
are not met, leaving just an EXPORT_SYMBOL_GPL(return_address)
of a function defined to be 'static linline'.
This is a result of aeea3592a1 ("ARM: 8158/1: LLVMLinux: use static inline in ARM ftrace.h").
Since we're not going to build anything other than an exported
symbol for something that is already being defined to be an
inline-able return of NULL, just avoid building the code to
remove the following warning:
Fixes: aeea3592a1 ("ARM: 8158/1: LLVMLinux: use static inline in ARM ftrace.h")
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
In preparation for static_call and variable size jump_label support,
teach text_poke_bp() to emulate instructions, namely:
JMP32, JMP8, CALL, NOP2, NOP_ATOMIC5, INT3
The current text_poke_bp() takes a @handler argument which is used as
a jump target when the temporary INT3 is hit by a different CPU.
When patching CALL instructions, this doesn't work because we'd miss
the PUSH of the return address. Instead, teach poke_int3_handler() to
emulate an instruction, typically the instruction we're patching in.
This fits almost all text_poke_bp() users, except
arch_unoptimize_kprobe() which restores random text, and for that site
we have to build an explicit emulate instruction.
Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191111132457.529086974@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 8c7eebc10687af45ac8e40ad1bac0cf7893dba9f)
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The upcoming s390 branch length extension patches rely on "passes do
not increase code size" property in order to consistently choose between
short and long branches. Currently this property does not hold between
the first and the second passes for register save/restore sequences, as
well as various code fragments that depend on SEEN_* flags.
Generate the code during the first pass conservatively: assume register
save/restore sequences have the maximum possible length, and that all
SEEN_* flags are set.
Also refuse to JIT if this happens anyway (e.g. due to a bug), as this
might lead to verifier bypass once long branches are introduced.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191114151820.53222-1-iii@linux.ibm.com
Add PDMA support to (arch/riscv/boot/dts/sifive/fu540-c000.dtsi)
Signed-off-by: Green Wan <green.wan@sifive.com>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
The x86_capability array in cpuinfo_x86 is of type u32 and thus is
naturally aligned to 4 bytes. But, set_bit() and clear_bit() require the
array to be aligned to size of unsigned long (i.e. 8 bytes on 64-bit
systems).
The array pointer is handed into atomic bit operations. If the access is
not aligned to unsigned long then the atomic bit operations can end up
crossing a cache line boundary, which causes the CPU to do a full bus lock
as it can't lock both cache lines at once. The bus lock operation is heavy
weight and can cause severe performance degradation.
The upcoming #AC split lock detection mechanism will issue warnings for
this kind of access.
Force the alignment of the array to unsigned long. This avoids the massive
code changes which would be required when converting the array data type to
unsigned long.
[ tglx: Rewrote changelog so it contains information WHY this is required ]
Suggested-by: David Laight <David.Laight@aculab.com>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190916223958.27048-4-tony.luck@intel.com
cpu_caps_cleared[] and cpu_caps_set[] are arrays of type u32 and therefore
naturally aligned to 4 bytes, which is also unsigned long aligned on
32-bit, but not on 64-bit.
The array pointer is handed into atomic bit operations. If the access not
aligned to unsigned long then the atomic bit operations can end up crossing
a cache line boundary, which causes the CPU to do a full bus lock as it
can't lock both cache lines at once. The bus lock operation is heavy weight
and can cause severe performance degradation.
The upcoming #AC split lock detection mechanism will issue warnings for
this kind of access.
Force the alignment of these arrays to unsigned long. This avoids the
massive code changes which would be required when converting the array data
type to unsigned long.
[ tglx: Rewrote changelog ]
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20190916223958.27048-2-tony.luck@intel.com
MAINTAINERS update for Broadcom MIPS systems.
-----BEGIN PGP SIGNATURE-----
iIwEABYIADQWIQRgLjeFAZEXQzy86/s+p5+stXUA3QUCXc4wwBYccGF1bGJ1cnRv
bkBrZXJuZWwub3JnAAoJED6nn6y1dQDdgA4BAPUIK91zzLqrG5krdu5Df/UDM3EN
XlhhDTX8tT7QHvufAP9Uyn9FkbCIPmupBp+6s7Q6OvHGyG8TYRxTehyMtdLiBA==
=McJE
-----END PGP SIGNATURE-----
Merge tag 'mips_fixes_5.4_4' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS fixes from Paul Burton:
"A fix and simplification for SGI IP27 exception handlers, and a small
MAINTAINERS update for Broadcom MIPS systems"
* tag 'mips_fixes_5.4_4' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MAINTAINERS: Remove Kevin as maintainer of BMIPS generic platforms
MIPS: SGI-IP27: fix exception handler replication
* Two updates to the IFU erratum
* selftests build fix
* Brown paper bag fix
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJdzm9YAAoJEL/70l94x66DUWAIAKODbb0j7IB9RHD3lOJI9evz
m/+ajirYug2/vv78IkCo02sVssuHqUvqes60AlUGz7ZzGzVdwbSJ9dsu+e9ylq7M
ZJIySe1R31R9qKbjoszKmL8/Dc8oe4pCHvwPsVDLNHFMmgb7nbaBrQ7SnPK8iwas
N2ljMiq8gTRYmEmF7VOw2SZqm8BebjhxjfN8RVTOvDDkMozD15djw7mQ1BH0aISs
rFt8NH8qXhgbdHXMyQUnQrNySMeH5igrrPLFESOshjIcVq67gUKW6pMdDc+uIVM7
9SLqa8mY4PfuHaYAaljZ7PKs3EkcUt5Si0arP7gF+Y0WIObALlUMr2tNDeClBmY=
=VE3v
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more KVM fixes from Paolo Bonzini:
- fixes for CONFIG_KVM_COMPAT=n
- two updates to the IFU erratum
- selftests build fix
- brown paper bag fix
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Add a comment describing the /dev/kvm no_compat handling
KVM: x86/mmu: Take slots_lock when using kvm_mmu_zap_all_fast()
KVM: Forbid /dev/kvm being opened by a compat task when CONFIG_KVM_COMPAT=n
KVM: X86: Reset the three MSR list number variables to 0 in kvm_init_msr_list()
selftests: kvm: fix build with glibc >= 2.30
kvm: x86: disable shattered huge page recovery for PREEMPT_RT.
At the moment, the compilation of the old time32 system calls depends
purely on the architecture. As systems with new libc based on 64-bit
time_t are getting deployed, even architectures that previously supported
these (notably x86-32 and arm32 but also many others) no longer depend on
them, and removing them from a kernel image results in a smaller kernel
binary, the same way we can leave out many other optional system calls.
More importantly, on an embedded system that needs to keep working
beyond year 2038, any user space program calling these system calls
is likely a bug, so removing them from the kernel image does provide
an extra debugging help for finding broken applications.
I've gone back and forth on hiding this option unless CONFIG_EXPERT
is set. This version leaves it visible based on the logic that
eventually it will be turned off indefinitely.
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The itimer handling for the old alpha osf_setitimer/osf_getitimer
system calls is identical to the compat version of getitimer/setitimer,
so just use those directly.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
We store elapsed time for a crashed process in struct elf_prstatus using
'timeval' structures. Once glibc starts using 64-bit time_t, this becomes
incompatible with the kernel's idea of timeval since the structure layout
no longer matches on 32-bit architectures.
This changes the definition of the elf_prstatus structure to use
__kernel_old_timeval instead, which is hardcoded to the currently used
binary layout. There is no risk of overflow in y2038 though, because
the time values are all relative times, and can store up to 68 years
of process elapsed time.
There is a risk of applications breaking at build time when they
use the new kernel headers and expect the type to be exactly 'timeval'
rather than a structure that has the same fields as before. Those
applications have to be modified to deal with 64-bit time_t anyway.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
All of the remaining syscalls that pass a timeval (gettimeofday, utime,
futimesat) can trivially be changed to pass a __kernel_old_timeval
instead, which has a compatible layout, but avoids ambiguity with
the timeval type in user space.
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
There are two 'struct timeval' fields in 'struct rusage'.
Unfortunately the definition of timeval is now ambiguous when used in
user space with a libc that has a 64-bit time_t, and this also changes
the 'rusage' definition in user space in a way that is incompatible with
the system call interface.
While there is no good solution to avoid all ambiguity here, change
the definition in the kernel headers to be compatible with the kernel
ABI, using __kernel_old_timeval as an unambiguous base type.
In previous discussions, there was also a plan to add a replacement
for rusage based on 64-bit timestamps and nanosecond resolution,
i.e. 'struct __kernel_timespec'. I have patches for that as well,
if anyone thinks we should do that.
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The time_t definition may differ between user space and kernel space,
so replace time_t with an unambiguous 'long' for the mips and sparc.
The same structures also contain 'off_t', which has the same problem,
so replace that as well on those two architectures and powerpc.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
There are two structures based on time_t that conflict between libc and
kernel: timeval and timespec. Both are now renamed to __kernel_old_timeval
and __kernel_old_timespec.
For time_t, the old typedef is still __kernel_time_t. There is nothing
wrong with that name, but it would be nice to not use that going forward
as this type is used almost only in deprecated interfaces because of
the y2038 overflow.
In the IPC headers (msgbuf.h, sembuf.h, shmbuf.h), __kernel_time_t is only
used for the 64-bit variants, which are not deprecated.
Change these to a plain 'long', which is the same type as __kernel_time_t
on all 64-bit architectures anyway, to reduce the number of users of the
old type.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
As a preparation to stop using 'struct timespec' in the kernel,
change the powerpc vdso implementation:
- split up the vdso data definition to have equivalent members
for seconds and nanoseconds instead of an xtime structure
- use timespec64 as an intermediate for the xtime update
- change the asm-offsets definition to be based the appropriate
fixed-length types
This is only a temporary fix for changing the types, in order
to actually support a 64-bit safe vdso32 version of clock_gettime(),
the entire powerpc vdso should be replaced with the generic
lib/vdso/ implementation. If that happens first, this patch
becomes obsolete.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The nds32 vdso is now the last user of the deprecated timespec_add_ns().
Change it to an open-coded version like the one it already uses in
do_realtime(). What we should really do though is to use the
generic vdso implementation that is now used in x86. arm and mips.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Only x86 uses the 'time' syscall in vdso, so change that to
__kernel_old_time_t as a preparation for removing 'time_t' and
'__kernel_time_t' later.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
In order to remove 'timespec' completely from the kernel, all
internal uses should be converted to a y2038-safe type, while
those that are only for compatibity with existing user space
should be marked appropriately.
Change vdso to use __kernel_old_timespec in order to avoid
the deprecated type and mark these interfaces as outdated.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The gettimeofday() function in vdso uses the traditional 'timeval'
structure layout, which will be incompatible with future versions of
glibc on 32-bit architectures that use a 64-bit time_t.
This interface is problematic for y2038, when time_t overflows on 32-bit
architectures, but the plan so far is that a libc with 64-bit time_t
will not call into the gettimeofday() vdso helper at all, and only
have a method for entering clock_gettime(). This means we don't have
to fix it here, though we probably want to add a new clock_gettime()
entry point using a 64-bit version of 'struct timespec' at some point.
Changing the vdso code to use __kernel_old_timeval helps isolate
this usage from the other ones that still need to be fixed properly,
and it gets us closer to removing the 'timeval' definition from the
kernel sources.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
CFLAGS_REMOVE_*.o syntax is used to drop particular flags when
building objects from C files. It has no effect for assembly files.
vdso-note.o is compiled from the assembly file, vdso-note.S, hence
CFLAGS_REMOVE_vdso-note.o is meaningless.
Neither vvar.c nor vvar.S is found in the vdso directory. Since there
is no source file to create vvar.o, CFLAGS_REMOVE_vvar.o is also
meaningless.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191114154922.30365-1-yamada.masahiro@socionext.com
In IOAPIC fixed delivery mode instead of flushing the scan
requests to all vCPUs, we should only send the requests to
vCPUs specified within the destination field.
This patch introduces kvm_get_dest_vcpus_mask() API which
retrieves an array of target vCPUs by using
kvm_apic_map_get_dest_lapic() and then based on the
vcpus_idx, it sets the bit in a bitmap. However, if the above
fails kvm_get_dest_vcpus_mask() finds the target vCPUs by
traversing all available vCPUs. Followed by setting the
bits in the bitmap.
If we had different vCPUs in the previous request for the
same redirection table entry then bits corresponding to
these vCPUs are also set. This to done to keep
ioapic_handled_vectors synchronized.
This bitmap is then eventually passed on to
kvm_make_vcpus_request_mask() to generate a masked request
only for the target vCPUs.
This would enable us to reduce the latency overhead on isolated
vCPUs caused by the IPI to process due to KVM_REQ_IOAPIC_SCAN.
Suggested-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The L1 hypervisor may include the IA32_TIME_STAMP_COUNTER MSR in the
vmcs12 MSR VM-exit MSR-store area as a way of determining the highest
TSC value that might have been observed by L2 prior to VM-exit. The
current implementation does not capture a very tight bound on this
value. To tighten the bound, add the IA32_TIME_STAMP_COUNTER MSR to the
vmcs02 VM-exit MSR-store area whenever it appears in the vmcs12 VM-exit
MSR-store area. When L0 processes the vmcs12 VM-exit MSR-store area
during the emulation of an L2->L1 VM-exit, special-case the
IA32_TIME_STAMP_COUNTER MSR, using the value stored in the vmcs02
VM-exit MSR-store area to derive the value to be stored in the vmcs12
VM-exit MSR-store area.
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename function find_msr() to vmx_find_msr_index() in preparation for an
upcoming patch where we export it and use it in nested.c.
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename NR_AUTOLOAD_MSRS to NR_LOADSTORE_MSRS. This needs to be done
due to the addition of the MSR-autostore area that will be added in a
future patch. After that the name AUTOLOAD will no longer make sense.
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add the function read_and_check_msr_entry() which just pulls some code
out of nested_vmx_store_msr(). This will be useful as reusable code in
upcoming patches.
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Correct a small inaccuracy in the shattering of vmx.c, which becomes
visible now that pmu_intel.c includes nested.h.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The "load IA32_PERF_GLOBAL_CTRL" bit for VM-entry and VM-exit should
only be exposed to the guest if IA32_PERF_GLOBAL_CTRL is a valid MSR.
Create a new helper to allow pmu_refresh() to update the VM-Entry and
VM-Exit controls to ensure PMU values are initialized when performing
the is_valid_msr() check.
Suggested-by: Jim Mattson <jmattson@google.com>
Co-developed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add condition to prepare_vmcs02 which loads IA32_PERF_GLOBAL_CTRL on
VM-entry if the "load IA32_PERF_GLOBAL_CTRL" bit on the VM-entry control
is set. Use SET_MSR_OR_WARN() rather than directly writing to the field
to avoid overwrite by atomic_switch_perf_msrs().
Suggested-by: Jim Mattson <jmattson@google.com>
Co-developed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The existing implementation for loading the IA32_PERF_GLOBAL_CTRL MSR
on VM-exit was incorrect, as the next call to atomic_switch_perf_msrs()
could cause this value to be overwritten. Instead, call kvm_set_msr()
which will allow atomic_switch_perf_msrs() to correctly set the values.
Define a macro, SET_MSR_OR_WARN(), to set the MSR with kvm_set_msr()
and WARN on failure.
Suggested-by: Jim Mattson <jmattson@google.com>
Co-developed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a consistency check on nested vm-entry for host's
IA32_PERF_GLOBAL_CTRL from vmcs12. Per Intel's SDM Vol 3 26.2.2:
If the "load IA32_PERF_GLOBAL_CTRL"
VM-exit control is 1, bits reserved in the IA32_PERF_GLOBAL_CTRL
MSR must be 0 in the field for that register"
Suggested-by: Jim Mattson <jmattson@google.com>
Co-developed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add condition to nested_vmx_check_guest_state() to check the validity of
GUEST_IA32_PERF_GLOBAL_CTRL. Per Intel's SDM Vol 3 26.3.1.1:
If the "load IA32_PERF_GLOBAL_CTRL" VM-entry control is 1, bits
reserved in the IA32_PERF_GLOBAL_CTRL MSR must be 0 in the field for that
register.
Suggested-by: Jim Mattson <jmattson@google.com>
Co-developed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Create a helper function to check the validity of a proposed value for
IA32_PERF_GLOBAL_CTRL from the existing check in intel_pmu_set_msr().
Per Intel's SDM, the reserved bits in IA32_PERF_GLOBAL_CTRL must be
cleared for the corresponding host/guest state fields.
Suggested-by: Jim Mattson <jmattson@google.com>
Co-developed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When KVM emulates a nested VMEntry (L1->L2 VMEntry), it switches mmu root
page. If nEPT is used, this will happen from
kvm_init_shadow_ept_mmu()->__kvm_mmu_new_cr3() and otherwise it will
happpen from nested_vmx_load_cr3()->kvm_mmu_new_cr3(). Either case,
__kvm_mmu_new_cr3() will use fast_cr3_switch() in attempt to switch to a
previously cached root page.
In case fast_cr3_switch() finds a matching cached root page, it will
set it in mmu->root_hpa and request KVM_REQ_LOAD_CR3 such that on
next entry to guest, KVM will set root HPA in appropriate hardware
fields (e.g. vmcs->eptp). In addition, fast_cr3_switch() calls
kvm_x86_ops->tlb_flush() in order to flush TLB as MMU root page
was replaced.
This works as mmu->root_hpa, which vmx_flush_tlb() use, was
already replaced in cached_root_available(). However, this may
result in unnecessary INVEPT execution because a KVM_REQ_TLB_FLUSH
may have already been requested. For example, by prepare_vmcs02()
in case L1 don't use VPID.
Therefore, change fast_cr3_switch() to just request TLB flush on
next entry to guest.
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, a host perf_event is created for a vPMC functionality emulation.
It’s unpredictable to determine if a disabled perf_event will be reused.
If they are disabled and are not reused for a considerable period of time,
those obsolete perf_events would increase host context switch overhead that
could have been avoided.
If the guest doesn't WRMSR any of the vPMC's MSRs during an entire vcpu
sched time slice, and its independent enable bit of the vPMC isn't set,
we can predict that the guest has finished the use of this vPMC, and then
do request KVM_REQ_PMU in kvm_arch_sched_in and release those perf_events
in the first call of kvm_pmu_handle_event() after the vcpu is scheduled in.
This lazy mechanism delays the event release time to the beginning of the
next scheduled time slice if vPMC's MSRs aren't changed during this time
slice. If guest comes back to use this vPMC in next time slice, a new perf
event would be re-created via perf_event_create_kernel_counter() as usual.
Suggested-by: Wei Wang <wei.w.wang@intel.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The perf_event_create_kernel_counter() in the pmc_reprogram_counter() is
a heavyweight and high-frequency operation, especially when host disables
the watchdog (maximum 21000000 ns) which leads to an unacceptable latency
of the guest NMI handler. It limits the use of vPMUs in the guest.
When a vPMC is fully enabled, the legacy reprogram_*_counter() would stop
and release its existing perf_event (if any) every time EVEN in most cases
almost the same requested perf_event will be created and configured again.
For each vPMC, if the reuqested config ('u64 eventsel' for gp and 'u8 ctrl'
for fixed) is the same as its current config AND a new sample period based
on pmc->counter is accepted by host perf interface, the current event could
be reused safely as a new created one does. Otherwise, do release the
undesirable perf_event and reprogram a new one as usual.
It's light-weight to call pmc_pause_counter (disable, read and reset event)
and pmc_resume_counter (recalibrate period and re-enable event) as guest
expects instead of release-and-create again on any condition. Compared to
use the filterable event->attr or hw.config, a new 'u64 current_config'
field is added to save the last original programed config for each vPMC.
Based on this implementation, the number of calls to pmc_reprogram_counter
is reduced by ~82.5% for a gp sampling event and ~99.9% for a fixed event.
In the usage of multiplexing perf sampling mode, the average latency of the
guest NMI handler is reduced from 104923 ns to 48393 ns (~2.16x speed up).
If host disables watchdog, the minimum latecy of guest NMI handler could be
speed up at ~3413x (from 20407603 to 5979 ns) and at ~786x in the average.
Suggested-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce a new callback msr_idx_to_pmc that returns a struct kvm_pmc*,
and change kvm_pmu_is_valid_msr to return ".msr_idx_to_pmc(vcpu, msr) ||
.is_valid_msr(vcpu, msr)" and AMD just returns false from .is_valid_msr.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The leagcy pmu_ops->msr_idx_to_pmc is only called in kvm_pmu_rdpmc, so
this function actually receives the contents of ECX before RDPMC, and
translates it to a kvm_pmc. Let's clarify its semantic by renaming the
existing msr_idx_to_pmc to rdpmc_ecx_to_pmc, and is_valid_msr_idx to
is_valid_rdpmc_ecx; likewise for the wrapper kvm_pmu_is_valid_msr_idx.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When L1 don't use TPR-Shadow to run L2, L0 configures vmcs02 without
TPR-Shadow and install intercepts on CR8 access (load and store).
If L1 do not intercept L2 CR8 access, L0 intercepts on those accesses
will emulate load/store on L1's LAPIC TPR. If in this case L2 lowers
TPR such that there is now an injectable interrupt to L1,
apic_update_ppr() will request a KVM_REQ_EVENT which will trigger a call
to update_cr8_intercept() to update TPR-Threshold to highest pending IRR
priority.
However, this update to TPR-Threshold is done while active vmcs is
vmcs02 instead of vmcs01. Thus, when later at some point L0 will
emulate an exit from L2 to L1, L1 will still run with high
TPR-Threshold. This will result in every VMEntry to L1 to immediately
exit on TPR_BELOW_THRESHOLD and continue to do so infinitely until
some condition will cause KVM_REQ_EVENT to be set.
(Note that TPR_BELOW_THRESHOLD exit handler do not set KVM_REQ_EVENT
until apic_update_ppr() will notice a new injectable interrupt for PPR)
To fix this issue, change update_cr8_intercept() such that if L2 lowers
L1's TPR in a way that requires to lower L1's TPR-Threshold, save update
to TPR-Threshold and apply it to vmcs01 when L0 emulates an exit from
L2 to L1.
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This check is unnecessary as x86 update_cr8_intercept() which calls
this VMX/SVM specific callback already performs this check.
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are some duplicate code in kvm_pv_send_ipi when deal with ipi
bitmap. Add helper func to remove it, and eliminate odd out label,
get rid of unnecessary kvm_lapic_irq field init and so on.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When SYSCALL/SYSENTER ability check failed, cs and ss is inited but
remain not used. Delay initializing cs and ss until SYSCALL/SYSENTER
ability check passed.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel SDM section 25.2 OTHER CAUSES OF VM EXITS specifies the following
on INIT signals: "Such exits do not modify register state or clear pending
events as they would outside of VMX operation."
When commit 4b9852f4f3 ("KVM: x86: Fix INIT signal handling in various CPU states")
was applied, I interepted above Intel SDM statement such that
INIT_SIGNAL exit don’t consume the LAPIC INIT pending event.
However, when Nadav Amit run matching kvm-unit-test on a bare-metal
machine, it turned out my interpetation was wrong. i.e. INIT_SIGNAL
exit does consume the LAPIC INIT pending event.
(See: https://www.spinics.net/lists/kvm/msg196757.html)
Therefore, fix KVM code to behave as observed on bare-metal.
Fixes: 4b9852f4f3 ("KVM: x86: Fix INIT signal handling in various CPU states")
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 4b9852f4f3 ("KVM: x86: Fix INIT signal handling in various CPU states")
fixed KVM to also latch pending LAPIC INIT event when vCPU is in VMX
operation.
However, current API of KVM_SET_MP_STATE allows userspace to put vCPU
into KVM_MP_STATE_SIPI_RECEIVED or KVM_MP_STATE_INIT_RECEIVED even when
vCPU is in VMX operation.
Fix this by introducing a util method to check if vCPU state latch INIT
signals and use it in KVM_SET_MP_STATE handler.
Fixes: 4b9852f4f3 ("KVM: x86: Fix INIT signal handling in various CPU states")
Reported-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 4b9852f4f3 ("KVM: x86: Fix INIT signal handling in various CPU states")
fixed KVM to also latch pending LAPIC INIT event when vCPU is in VMX
operation.
However, current API of KVM_SET_VCPU_EVENTS defines this field as
part of SMM state and only set pending LAPIC INIT event if vCPU is
specified to be in SMM mode (events->smi.smm is set).
Change KVM_SET_VCPU_EVENTS handler to set pending LAPIC INIT event
by latched_init field regardless of if vCPU is in SMM mode or not.
Fixes: 4b9852f4f3 ("KVM: x86: Fix INIT signal handling in various CPU states")
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It's enough to check the value and issue the direct call.
After this commit is applied, here the most common retpolines executed
under a high resolution timer workload in the guest on a VMX host:
[..]
@[
trace_retpoline+1
__trace_retpoline+30
__x86_indirect_thunk_rax+33
do_syscall_64+89
entry_SYSCALL_64_after_hwframe+68
]: 267
@[]: 2256
@[
trace_retpoline+1
__trace_retpoline+30
__x86_indirect_thunk_rax+33
__kvm_wait_lapic_expire+284
vmx_vcpu_run.part.97+1091
vcpu_enter_guest+377
kvm_arch_vcpu_ioctl_run+261
kvm_vcpu_ioctl+559
do_vfs_ioctl+164
ksys_ioctl+96
__x64_sys_ioctl+22
do_syscall_64+89
entry_SYSCALL_64_after_hwframe+68
]: 2390
@[]: 33410
@total: 315707
Note the highest hit above is __delay so probably not worth optimizing
even if it would be more frequent than 2k hits per sec.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It's enough to check the exit value and issue a direct call to avoid
the retpoline for all the common vmexit reasons.
Of course CONFIG_RETPOLINE already forbids gcc to use indirect jumps
while compiling all switch() statements, however switch() would still
allow the compiler to bisect the case value. It's more efficient to
prioritize the most frequent vmexits instead.
The halt may be slow paths from the point of the guest, but not
necessarily so from the point of the host if the host runs at full CPU
capacity and no host CPU is ever left idle.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Eliminate wasteful call/ret non RETPOLINE case and unnecessary fentry
dynamic tracing hooking points.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
pci.h is not a UAPI header, so the __KERNEL__ ifdef is rather pointless.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191113071836.21041-4-hch@lst.de
This file only contains external declarations for two non-existing
function pointers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191113071836.21041-3-hch@lst.de
The calgary IOMMU was only used on high-end IBM systems in the early
x86_64 age and has no known users left. Remove it to avoid having to
touch it for pending changes to the DMA API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20191113071836.21041-2-hch@lst.de
Hyper-V has historically initialized stimer-based clockevents late in the
process of onlining a CPU because clockevents depend on stimer
interrupts. In the original Hyper-V design, stimer interrupts generate a
VMbus message, so the VMbus machinery must be running first, and VMbus
can't be initialized until relatively late. On x86/64, LAPIC timer based
clockevents are used during early initialization before VMbus and
stimer-based clockevents are ready, and again during CPU offlining after
the stimer clockevents have been shut down.
Unfortunately, this design creates problems when offlining CPUs for
hibernation or other purposes. stimer-based clockevents are shut down
relatively early in the offlining process, so clockevents_unbind_device()
must be used to fallback to the LAPIC-based clockevents for the remainder
of the offlining process. Furthermore, the late initialization and early
shutdown of stimer-based clockevents doesn't work well on ARM64 since there
is no other timer like the LAPIC to fallback to. So CPU onlining and
offlining doesn't work properly.
Fix this by recognizing that stimer Direct Mode is the normal path for
newer versions of Hyper-V on x86/64, and the only path on other
architectures. With stimer Direct Mode, stimer interrupts don't require any
VMbus machinery. stimer clockevents can be initialized and shut down
consistent with how it is done for other clockevent devices. While the old
VMbus-based stimer interrupts must still be supported for backward
compatibility on x86, that mode of operation can be treated as legacy.
So add a new Hyper-V stimer entry in the CPU hotplug state list, and use
that new state when in Direct Mode. Update the Hyper-V clocksource driver
to allocate and initialize stimer clockevents earlier during boot. Update
Hyper-V initialization and the VMbus driver to use this new design. As a
result, the LAPIC timer is no longer used during boot or CPU
onlining/offlining and clockevents_unbind_device() is not called. But
retain the old design as a legacy implementation for older versions of
Hyper-V that don't support Direct Mode.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Link: https://lkml.kernel.org/r/1573607467-9456-1-git-send-email-mikelley@microsoft.com
There is a new driver in drivers/mmc/host/omap_hsmmc.c
configured by CONFIG_MMC_OMAP_HS and the last user
was the pdata-quirks for pandora.
Suggested-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: H. Nikolaus Schaller <hns@goldelico.com>
Acked-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
With a wl1251 child node of mmc3 in the device tree decoded
in omap_hsmmc.c to handle special wl1251 initialization, we do
no longer need to instantiate the mmc3 through pdata quirks.
We also can remove the wlan regulator and reset/interrupt definitions
and do them through device tree.
Fixes: 81eef6ca92 ("mmc: omap_hsmmc: Use dma_request_chan() for requesting DMA channel")
Signed-off-by: H. Nikolaus Schaller <hns@goldelico.com>
Cc: <stable@vger.kernel.org> # v4.7+
Acked-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
This partly reverts the commit efdfeb079c ("regulator: fixed: Convert to
use GPIO descriptor only").
We must remove this from mainline first, so that the following patch
to remove the openpandora quirks for mmc3 and wl1251 cleanly applies
to stable v4.9, v4.14, v4.19 where the above mentioned patch is not yet
present.
Since the code affected is removed (no pandora gpios in pdata-quirks
and more), there will be no matching revert-of-the-revert.
Signed-off-by: H. Nikolaus Schaller <hns@goldelico.com>
Acked-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Since v4.7 the dma initialization requires that there is a
device tree property for "rx" and "tx" channels which is
not provided by the pdata-quirks initialization.
By conversion of the mmc3 setup to device tree this will
finally allows to remove the OpenPandora wlan specific omap3
data-quirks.
Fixes: 81eef6ca92 ("mmc: omap_hsmmc: Use dma_request_chan() for requesting DMA channel")
Signed-off-by: H. Nikolaus Schaller <hns@goldelico.com>
Cc: <stable@vger.kernel.org> # v4.7+
Acked-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
This enables the Gigabit Ethernet support on the Raspberry Pi 4.
The defined PHY mode is equivalent to the default register settings
in the downstream tree.
Signed-off-by: Matthias Brugger <mbrugger@suse.com>
Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
arm64 places the CMA in ZONE_DMA32, which is not good enough for the
Raspberry Pi 4 since it contains peripherals that can only address the
first GB of memory. Explicitly place the CMA into that area.
Signed-off-by: Nicolas Saenz Julienne <nsaenzjulienne@suse.de>
Acked-by: Stefan Wahren <wahrenst@gmx.net>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
The C compiler is allowing more checks to make sure that function pointers
are assigned to the correct prototype function. Unfortunately, the function
graph tracer uses a special name with its assigned ftrace_graph_return
function pointer that maps to a stub function used by the function tracer
(ftrace_stub). The ftrace_graph_return variable is compared to the
ftrace_stub in some archs to know if the function graph tracer is enabled or
not. This means we can not just simply create a new function stub that
compares it without modifying all the archs.
Instead, have the linker script create a function_graph_stub that maps to
ftrace_stub, and this way we can define the prototype for it to match the
prototype of ftrace_graph_return, and make the compiler checks all happy!
Link: http://lkml.kernel.org/r/20191015090055.789a0aed@gandalf.local.home
Cc: linux-sh@vger.kernel.org
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Reported-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
... or let function calls stick out and thus remain on a single line,
even if the 80 cols rule is violated by a couple of chars, for better
readability.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: x86@kernel.org
Link: https://lkml.kernel.org/r/20191114172200.19563-1-bp@alien8.de
When the crashkernel kernel command line option is specified, the low
1M memory will always be reserved now. Therefore, it's not necessary to
create a backup region anymore and also no need to copy the contents of
the first 640k to it.
Remove all the code related to handling that backup region.
[ bp: Massage commit message. ]
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: bhe@redhat.com
Cc: Dave Young <dyoung@redhat.com>
Cc: d.hatayama@fujitsu.com
Cc: dhowells@redhat.com
Cc: ebiederm@xmission.com
Cc: horms@verge.net.au
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jürgen Gross <jgross@suse.com>
Cc: kexec@lists.infradead.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: vgoyal@redhat.com
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191108090027.11082-3-lijiang@redhat.com
There used to be a bug in the video driver that caused the timings
for the LCD to calculate in a way on the DM3730 which made it hang.
The work around for this bug was to set
CONFIG_OMAP2_DSS_MIN_FCK_PER_PCK=4 in the kernel. This work around
is no longer needed as the video drivers have been corrected.
This patch removes the legacy note.
Signed-off-by: Adam Ford <aford173@gmail.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
The baseboard of the Logic PD Torpedo development kit has a socket
for a rechargable battery. The battery is monitored by a charger
which can communicate of the the 1-wire HDQ pin.
This patch enables the pinmux for the HDQ pin.
Signed-off-by: Adam Ford <aford173@gmail.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
We have local timers on Cortex-A9, so using the gptimer option makes no
sense. Let's just drop it for omap4 to simplify the timer options a bit.
If this is really needed, it can be still done by specifying dts properties
in the board specific file for assigned-clocks and assigned-clock-parents.
This gets us a bit closer to start dropping legacy platform data for
gptimers except for timer1 that is used for system clockevent.
Cc: Keerthy <j-keerthy@ti.com>
Cc: Tero Kristo <t-kristo@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Acquire the per-VM slots_lock when zapping all shadow pages as part of
toggling nx_huge_pages. The fast zap algorithm relies on exclusivity
(via slots_lock) to identify obsolete vs. valid shadow pages, because it
uses a single bit for its generation number. Holding slots_lock also
obviates the need to acquire a read lock on the VM's srcu.
Failing to take slots_lock when toggling nx_huge_pages allows multiple
instances of kvm_mmu_zap_all_fast() to run concurrently, as the other
user, KVM_SET_USER_MEMORY_REGION, does not take the global kvm_lock.
(kvm_mmu_zap_all_fast() does take kvm->mmu_lock, but it can be
temporarily dropped by kvm_zap_obsolete_pages(), so it is not enough
to enforce exclusivity).
Concurrent fast zap instances causes obsolete shadow pages to be
incorrectly identified as valid due to the single bit generation number
wrapping, which results in stale shadow pages being left in KVM's MMU
and leads to all sorts of undesirable behavior.
The bug is easily confirmed by running with CONFIG_PROVE_LOCKING and
toggling nx_huge_pages via its module param.
Note, until commit 4ae5acbc4936 ("KVM: x86/mmu: Take slots_lock when
using kvm_mmu_zap_all_fast()", 2019-11-13) the fast zap algorithm used
an ulong-sized generation instead of relying on exclusivity for
correctness, but all callers except the recently added set_nx_huge_pages()
needed to hold slots_lock anyways. Therefore, this patch does not have
to be backported to stable kernels.
Given that toggling nx_huge_pages is by no means a fast path, force it
to conform to the current approach instead of reintroducing the previous
generation count.
Fixes: b8e8c8303f ("kvm: mmu: ITLB_MULTIHIT mitigation", but NOT FOR STABLE)
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since commit 54b8ae66ae ("kbuild: change *FLAGS_<basetarget>.o to
take the path relative to $(obj)"), sparc allmodconfig fails to build
as follows:
CC arch/sparc/vdso/vdso32/vclock_gettime.o
unrecognized e_machine 18 arch/sparc/vdso/vdso32/vclock_gettime.o
arch/sparc/vdso/vdso32/vclock_gettime.o: failed
The cause of the breakage is that -pg flag not being dropped.
The vdso32 files are located in the vdso32/ subdirectory, but I missed
to update the Makefile.
I removed the meaningless CFLAGS_REMOVE_vdso-note.o since it is only
effective for C file.
vdso-note.o is compiled from assembly file:
arch/sparc/vdso/vdso-note.S
arch/sparc/vdso/vdso32/vdso-note.S
Fixes: 54b8ae66ae ("kbuild: change *FLAGS_<basetarget>.o to take the path relative to $(obj)")
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Tested-by: Anatoly Pugachev <matorola@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
When building allmodconfig KCONFIG_ALLCONFIG=$(pwd)/arch/arm64/configs/defconfig
CONFIG_CPU_BIG_ENDIAN gets enabled. Which tends not to be what most
people want. Another concern that has come up is that ACPI isn't built
for an allmodconfig kernel today since that also depends on !CPU_BIG_ENDIAN.
Rework so that we introduce a 'choice' and default the choice to
CPU_LITTLE_ENDIAN. That means that when we build an allmodconfig kernel
it will default to CPU_LITTLE_ENDIAN that most people tends to want.
Reviewed-by: John Garry <john.garry@huawei.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
On x86, purgatory() copies the first 640K of memory to a backup region
because the kernel needs those first 640K for the real mode trampoline
during boot, among others.
However, when SME is enabled, the kernel cannot properly copy the old
memory to the backup area but reads only its encrypted contents. The
result is that the crash tool gets invalid pointers when parsing vmcore:
crash> kmem -s|grep -i invalid
kmem: dma-kmalloc-512: slab:ffffd77680001c00 invalid freepointer:a6086ac099f0c5a4
kmem: dma-kmalloc-512: slab:ffffd77680001c00 invalid freepointer:a6086ac099f0c5a4
crash>
So reserve the remaining low 1M memory when the crashkernel option is
specified (after reserving real mode memory) so that allocated memory
does not fall into the low 1M area and thus the copying of the contents
of the first 640k to a backup region in purgatory() can be avoided
altogether.
This way, it does not need to be included in crash dumps or used for
anything except the trampolines that must live in the low 1M.
[ bp: Heavily rewrite commit message, flip check logic in
crash_reserve_low_1M().]
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: bhe@redhat.com
Cc: Dave Young <dyoung@redhat.com>
Cc: d.hatayama@fujitsu.com
Cc: dhowells@redhat.com
Cc: ebiederm@xmission.com
Cc: horms@verge.net.au
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jürgen Gross <jgross@suse.com>
Cc: kexec@lists.infradead.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: vgoyal@redhat.com
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191108090027.11082-2-lijiang@redhat.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=204793
Add a forward declaration of struct kimage to the crash.h header because
future changes will invoke a crash-specific function from the realmode
init path and the compiler will complain otherwise like this:
In file included from arch/x86/realmode/init.c:11:
./arch/x86/include/asm/crash.h:5:32: warning: ‘struct kimage’ declared inside\
parameter list will not be visible outside of this definition or declaration
5 | int crash_load_segments(struct kimage *image);
| ^~~~~~
./arch/x86/include/asm/crash.h:6:37: warning: ‘struct kimage’ declared inside\
parameter list will not be visible outside of this definition or declaration
6 | int crash_copy_backup_region(struct kimage *image);
| ^~~~~~
./arch/x86/include/asm/crash.h:7:39: warning: ‘struct kimage’ declared inside\
parameter list will not be visible outside of this definition or declaration
7 | int crash_setup_memmap_entries(struct kimage *image,
|
[ bp: Rewrite the commit message. ]
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: bhe@redhat.com
Cc: d.hatayama@fujitsu.com
Cc: dhowells@redhat.com
Cc: dyoung@redhat.com
Cc: ebiederm@xmission.com
Cc: horms@verge.net.au
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jürgen Gross <jgross@suse.com>
Cc: kexec@lists.infradead.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: vgoyal@redhat.com
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191108090027.11082-4-lijiang@redhat.com
Link: https://lkml.kernel.org/r/201910310233.EJRtTMWP%25lkp@intel.com
This is to augment commit 3f5a7896a5 ("x86/mce: Include the PPIN in MCE
records when available").
I'm also adding "synd" and "ipid" fields to struct xen_mce, in an
attempt to keep field offsets in sync with struct mce. These two fields
won't get populated for now, though.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Some versions of SoC MT7621 have three PCI express hosts. Some boards
make use of those PCI through the staging driver mt7621-pci. Recently
PCI support has been removed from MT7621 Soc kernel configuration due
to a build error. This makes imposible to compile staging driver and
produces a regression for gnubee based boards. Enable support for PCI
again but enable it only if staging mt7621-pci driver is selected.
Fixes: c4d48cf5e2 ("MIPS: ralink: deactivate PCI support for SOC_MT7621")
Cc: Hauke Mehrtens <hauke@hauke-m.de>
Cc: ralf@linux-mips.org
Cc: jhogan@kernel.org
Cc: john@phrozen.org
Cc: NeilBrown <neil@brown.name>
Cc: linux-mips@vger.kernel.org
Signed-off-by: Sergio Paracuellos <sergio.paracuellos@gmail.com>
Link: https://lore.kernel.org/r/20191019081233.7337-1-sergio.paracuellos@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On some systems that are vulnerable to Spectre v2, it is up to
software to flush the link stack (return address stack), in order to
protect against Spectre-RSB.
When exiting from a guest we do some house keeping and then
potentially exit to C code which is several stack frames deep in the
host kernel. We will then execute a series of returns without
preceeding calls, opening up the possiblity that the guest could have
poisoned the link stack, and direct speculative execution of the host
to a gadget of some sort.
To prevent this we add a flush of the link stack on exit from a guest.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In commit ee13cb249f ("powerpc/64s: Add support for software count
cache flush"), I added support for software to flush the count
cache (indirect branch cache) on context switch if firmware told us
that was the required mitigation for Spectre v2.
As part of that code we also added a software flush of the link
stack (return address stack), which protects against Spectre-RSB
between user processes.
That is all correct for CPUs that activate that mitigation, which is
currently Power9 Nimbus DD2.3.
What I got wrong is that on older CPUs, where firmware has disabled
the count cache, we also need to flush the link stack on context
switch.
To fix it we create a new feature bit which is not set by firmware,
which tells us we need to flush the link stack. We set that when
firmware tells us that either of the existing Spectre v2 mitigations
are enabled.
Then we adjust the patching code so that if we see that feature bit we
enable the link stack flush. If we're also told to flush the count
cache in software then we fall through and do that also.
On the older CPUs we don't need to do do the software count cache
flush, firmware has disabled it, so in that case we patch in an early
return after the link stack flush.
The naming of some of the functions is awkward after this patch,
because they're called "count cache" but they also do link stack. But
we'll fix that up in a later commit to ease backporting.
This is the fix for CVE-2019-18660.
Reported-by: Anthony Steinhauser <asteinhauser@google.com>
Fixes: ee13cb249f ("powerpc/64s: Add support for software count cache flush")
Cc: stable@vger.kernel.org # v4.4+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When running in M-mode we can't use the SBI to set the timer, and
don't have access to the time CSR as that usually is emulated by
M-mode. Instead provide code that directly accesses the MMIO for
the timer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anup Patel <anup@brainfault.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de> # for drivers/clocksource
[paul.walmsley@sifive.com: updated to apply; fixed checkpatch
issue; timex.h now includes asm/mmio.h to resolve header file
problems]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
The RISC-V ISA only supports flushing the instruction cache for the
local CPU core. Currently we always offload the remote TLB flushing to
the SBI, which then issues an IPI under the hoods. But with M-mode
we do not have an SBI so we have to do it ourselves. IPI to the
other nodes using the existing kernel helpers instead if we have
native clint support and thus can IPI directly from the kernel.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anup Patel <anup@brainfault.org>
[paul.walmsley@sifive.com: cleaned up code comment]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Move the sbi poweroff to a separate function and file that is only
compiled if CONFIG_SBI is set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Atish Patra <atish.patra@wdc.com>
[paul.walmsley@sifive.com: split the WFI fix into a separate patch]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
There is no SBI when we run in M-mode, so fail the compile for any code
trying to use SBI calls.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
When running in M-mode we can't use SBI based drivers. Add a new
CONFIG_RISCV_SBI that drivers that do SBI calls can depend on
instead.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
When applying commit 7a5ee6edb4 ("KVM: X86: Fix initialization of MSR
lists"), it forgot to reset the three MSR lists number varialbes to 0
while removing the useless conditionals.
Fixes: 7a5ee6edb4 (KVM: X86: Fix initialization of MSR lists)
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If a huge page is recovered (and becomes no executable) while another
thread is executing it, the resulting contention on mmu_lock can cause
latency spikes. Disabling recovery for PREEMPT_RT kernels fixes this
issue.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Objtool complains about the new ftrace direct trampoline code:
arch/x86/kernel/ftrace_64.o: warning: objtool: ftrace_regs_caller()+0x190: stack state mismatch: cfa1=7+16 cfa2=7+24
Typically, code has a deterministic stack layout, such that at a given
instruction address, the stack frame size is always the same.
That's not the case for the new ftrace_regs_caller() code after it
adjusts the stack for the direct case. Just plead ignorance and assume
it's always the non-direct path. Note this creates a tiny window for
ORC to get confused.
Link: http://lkml.kernel.org/r/20191108225100.ea3bhsbdf6oerj6g@treble
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
As testing for direct calls from the function graph tracer adds a little
overhead (which is a lot when tracing every function), add a counter that
can be used to test if function_graph tracer needs to test for a direct
caller or not.
It would have been nicer if we could use a static branch, but the static
branch logic fails when used within the function graph tracer trampoline.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Enable x86 to allow for register_ftrace_direct(), where a custom trampoline
may be called directly from an ftrace mcount/fentry location.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
rdtgroup_cpus_write() and mkdir_rdt_prepare() call
rdtgroup_kn_lock_live() -> kernfs_to_rdtgroup() to get 'rdtgrp', and
then call the rdt_last_cmd_{clear,puts,...}() functions which will check
if rdtgroup_mutex is held/requires its caller to hold rdtgroup_mutex.
But if 'rdtgrp' returned from kernfs_to_rdtgroup() is NULL,
rdtgroup_mutex is not held and calling rdt_last_cmd_{clear,puts,...}()
will result in a self-incurred, potential lockdep warning.
Remove the rdt_last_cmd_{clear,puts,...}() calls in these two paths.
Just returning error should be sufficient to report to the user that the
entry doesn't exist any more.
[ bp: Massage. ]
Fixes: 94457b36e8 ("x86/intel_rdt: Add diagnostics when writing the cpus file")
Fixes: cfd0f34e4c ("x86/intel_rdt: Add diagnostics when making directories")
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: pei.p.jia@intel.com
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/1573079796-11713-1-git-send-email-xiaochen.shen@intel.com
With recent optimizations to AUX and PT buffer management code (high order
AUX allocations, opportunistic Single Range Output), it is far more likely
now that the output MSRs won't need reprogramming on every sched-in.
To avoid needless WRMSRs of those registers, cache their values and only
write them when needed.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20191105082701.78442-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Most of PT implementations support Single Range Output mode, which is
an alternative to ToPA that can be used for a single contiguous buffer
and if we don't require an interrupt, that is, in AUX snapshot mode.
Now that perf core will use high order allocations for the AUX buffer,
in many cases the first condition will also be satisfied.
The two most obvious benefits of the Single Range Output mode over the
ToPA are:
* not having to allocate the ToPA table(s),
* not using the ToPA walk hardware.
Make use of this functionality where available and appropriate.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: https://lkml.kernel.org/r/20191105082701.78442-2-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add AUX sampling support to the PT PMU: implement an NMI-safe callback
that takes a snapshot of the buffer without touching the event states.
This is done for PT events that don't use PMIs, that is, snapshot mode
(RO mapping of the AUX area).
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: adrian.hunter@intel.com
Cc: mathieu.poirier@linaro.org
Link: https://lkml.kernel.org/r/20191025140835.53665-4-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PT trace is now enabled at the bottom of the event configuration
function that takes care of all configuration bits related to a given
event, including the address filter update. This is only needed where
the event configuration changes, that is, in ->add()/->start().
In the interrupt path we can use a lighter version that keeps the
configuration intact, since it hasn't changed, and only flips the
enable bit.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: adrian.hunter@intel.com
Cc: mathieu.poirier@linaro.org
Link: https://lkml.kernel.org/r/20191025140835.53665-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Like all other architectures such as x86 or arm64, include KASLR offset
in VMCOREINFO ELF notes to assist in debugging. After this, we can use
crash --kaslr option to parse vmcore generated from a kaslr kernel.
Note: The crash tool needs to support --kaslr too.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When kaslr is enabled, the kernel offset is different for every boot.
This brings some difficult to debug the kernel. Dump out the kernel
offset when panic so that we can easily debug the kernel.
This code is derived from x86/arm64 which has similar functionality.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
One may want to disable kaslr when boot, so provide a cmdline parameter
'nokaslr' to support this.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The original kernel still exists in the memory, clear it now.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
After we have the basic support of relocate the kernel in some
appropriate place, we can start to randomize the offset now.
Entropy is derived from the banner and timer, which will change every
build and boot. This not so much safe so additionally the bootloader may
pass entropy via the /chosen/kaslr-seed node in device tree.
We will use the first 512M of the low memory to randomize the kernel
image. The memory will be split in 64M zones. We will use the lower 8
bit of the entropy to decide the index of the 64M zone. Then we chose a
16K aligned offset inside the 64M zone to put the kernel in.
We also check if we will overlap with some areas like the dtb area, the
initrd area or the crashkernel area. If we cannot find a proper area,
kaslr will be disabled and boot from the original kernel.
Some pieces of code are derived from arch/x86/boot/compressed/kaslr.c or
arch/arm64/kernel/kaslr.c such as rotate_xor(). Credit goes to Kees and
Ard.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch add support to boot kernel from places other than KERNELBASE.
Since CONFIG_RELOCATABLE has already supported, what we need to do is
map or copy kernel to a proper place and relocate. Freescale Book-E
parts expect lowmem to be mapped by fixed TLB entries(TLB1). The TLB1
entries are not suitable to map the kernel directly in a randomized
region, so we chose to copy the kernel to a proper place and restart to
relocate.
The offset of the kernel was not randomized yet(a fixed 64M is set). We
will randomize it in the next patch.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <oss@buserror.net>
[mpe: Use PTRRELOC() in early_init()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a new helper reloc_kernel_entry() to jump back to the start of the
new kernel. After we put the new kernel in a randomized place we can use
this new helper to enter the kernel and begin to relocate again.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a new helper create_kaslr_tlb_entry() to create a tlb entry by the
virtual and physical address. This is a preparation to support boot kernel
at a randomized address.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now the kernel base is a fixed value - KERNELBASE. To support KASLR, we
need a variable to store the kernel base.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
These two variables are both defined in init_32.c and init_64.c. Move
them to init-common.c and make them __ro_after_init.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
M_IF_NEEDED is defined too many times. Move it to a common place and
rename it to MAS2_M_IF_NEEDED which is much readable.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Diana Craciun <diana.craciun@nxp.com>
Tested-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Scott Wood <oss@buserror.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
These changes just clean up few typos, and there is one non-critical
correction of missing put_device() after calling of_platform_populate()
for display controller.
Naturally none of this is urgent and can be merged when suitable.
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAvFiEEkgNvrZJU/QSQYIcQG9Q+yVyrpXMFAl3K6rMRHHRvbnlAYXRv
bWlkZS5jb20ACgkQG9Q+yVyrpXPrZg//Vt0zrGT7F7dyiyl+BCxUrnoejBa4aWJc
OUmdvWv8umKCepzn4QwrOTogLBgZes1Le15mPfbktmU8lSIH9KuZGm/KmZB0mO0S
uJCuoHY6gMqRF9hAhMXFcfsXcuBc0ytZVA9yOLhUP9VAsxySo7kKnfMLJ/W6xjWr
4bLW894/lDPJ3ENUrJcUR1breV6VRnYOJ6dqE3cNLSA3BNNcGIwfOJK7Ob/owkXN
k/3nTI+g6h6Pzm5WlJxt78cyhgRY1AGOZ8+UlBywgSMJmja4SaysZ1Ga+6rwjKQ9
mJFWe6b13lAr9gIu1f88Yw/zb3jYVpjhEdXUGf8VeFDTxgrFnDt6N8FuM3LbTagU
L2jsjSegC78qqYUDiFDOD+2w9V7LIxqu7G9wy27BXJqB15beE5aGFMJGnMTfhid6
xfVBUl9Q+G4dZi4TNG+SeT3pJJqmBuv80hlcsCVSC5mNoF6bvXa9zJo99DUhjsHX
tjyIILl1BsAImSbQq2LzDE+CtSjRKpovBNxtGO+2C9MeBKW7LI0+lNvTUIcKvHVJ
xvEMFdEAUE7SY0/3zKczLkHlIHP+NVolo+7+YgwAjQ28rKqcgKQ1nTLZIOT2wnQw
NDiKUJ5nTy3To8NPUzwdY645vzaY5NI+XRqRhUbk5OAgqRjkcq6T8cyA7FY6Db0b
3R99xi1BcQI=
=2Ql6
-----END PGP SIGNATURE-----
Merge tag 'omap-for-v5.5/soc-late-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into arm/soc
Few late SoC changes for v5.5 merge window
These changes just clean up few typos, and there is one non-critical
correction of missing put_device() after calling of_platform_populate()
for display controller.
Naturally none of this is urgent and can be merged when suitable.
* tag 'omap-for-v5.5/soc-late-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
ARM: OMAP2+: Add missing put_device() call in omapdss_init_of()
OMAP2: fixup doc comments in omap_device
ARM: OMAP1: drop duplicated dependency on ARCH_OMAP1
ARM: OMAP1: ams-delta FIQ: Fix a typo ("Initiaize")
Link: https://lore.kernel.org/r/pull-1573579927-542764@atomide.com-2
Signed-off-by: Olof Johansson <olof@lixom.net>
Let's configure omap3 rng as we can now easily do it with a dts change
only by tagging it with status = "disabled" for n900 where it's not
accessible. This improves the boot-up time with distros quite a bit with
proper entropy source around.
There's also another minor update for logicpd-torpedo that does not have
USB host pins wired.
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAvFiEEkgNvrZJU/QSQYIcQG9Q+yVyrpXMFAl3K69QRHHRvbnlAYXRv
bWlkZS5jb20ACgkQG9Q+yVyrpXNgXA/9FfTaWdyT67H1nlykEOxfv0EQg21uh/ZT
lWhSE8AzqDOhD1NYDjkRZ+DvWnDsofAbwqJmivIIOZir2sfLqYDBfyE7E0QFSA4P
2J3lzRvggkaKWpWfW/1eyPt9sWW47brnB7w0hjCYO9KTHfgWIBCk/3Mb7vAJKqNq
UBax7Bn+30k0mlZwGqOorBK6bHmS1/9rwE0wtl7SLi8dFjVnR7TcyqaMr+rP+VJb
LAo2yTL8Ok/xZwpXAE3WUOkcj0GIEoxMpdpIGBmLW73nhHJgIJ4/ujbDb3ev9Ikh
hLoMFqJxPONyRHM0QYxaVOaJpVo0DWmgOAYjHN+71mRMYA3/b2nSo/hMpGhFylja
kCxEiJhfJANBByblOGMy8SvxY5f26tBx0Pm19qGG3S5YHxcItf7nheltchlVUv/A
YflZT75QQ6lvqIbw9C4zD22jDb+n+uXR4XK+ZBJT3tItK68I2zWzDPOj0MNjdXGf
e7Ni2801P6YEYBsz3qdwygsOQOuVUjuZz/tTFMaWzQmkNChn6M06I6lLo6uW2mQY
ao4ZgghX7Rs94XNMjAz6idCly0sbCNpmOqOoMRwQO1yxmK+doAQOcFj2vA3+jaWl
xyg/jzE4AwdZabauHmFTlApv/0wCAXkZmJPhY4QXY3dfI3FSpeNRg2lJSWhQxitj
FbYWv5QgqM8=
=AwDa
-----END PGP SIGNATURE-----
Merge tag 'omap-for-v5.5/dt-late-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into arm/dt
Few late dts changes for omaps for v5.5 merge window
Let's configure omap3 rng as we can now easily do it with a dts change
only by tagging it with status = "disabled" for n900 where it's not
accessible. This improves the boot-up time with distros quite a bit with
proper entropy source around.
There's also another minor update for logicpd-torpedo that does not have
USB host pins wired.
* tag 'omap-for-v5.5/dt-late-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
ARM: dts: logicpd-torpedo: Disable USB Host
ARM: dts: Configure omap3 rng
Link: https://lore.kernel.org/r/pull-1573579927-542764@atomide.com
Signed-off-by: Olof Johansson <olof@lixom.net>
Currently it is not possible to distinguish the case when fadump is
supported by firmware and disabled in kernel and completely unsupported
using the kernel sysfs interface. User can investigate the devicetree
but it is more reasonable to provide sysfs files in case we get some
fadumpv2 in the future.
With this patch sysfs files are available whenever fadump is supported
by firmware.
There is duplicate message about lack of support by firmware in
fadump_reserve_mem and setup_fadump. Remove the duplicate message in
setup_fadump.
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Reviewed-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191107164757.15140-1-msuchanek@suse.de
Since commit ed1cd6deb0 ("powerpc: Activate CONFIG_THREAD_INFO_IN_TASK")
current_is_64bit() is quivalent to !is_32bit_task().
Remove the redundant function.
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190912194633.12045-1-msuchanek@suse.de
Currently when an EEH error is detected, the system log receives the
same (or almost the same) message twice:
EEH: PHB#0 failure detected, location: N/A
EEH: PHB#0 failure detected, location: N/A
or
EEH: eeh_dev_check_failure: Frozen PHB#0-PE#0 detected
EEH: Frozen PHB#0-PE#0 detected
This looks like a bug, but in fact the messages are from different
functions and mean slightly different things. So keep both but change
one of the messages slightly, so that it's clear they are different:
EEH: PHB#0 failure detected, location: N/A
EEH: Recovering PHB#0, location: N/A
or
EEH: eeh_dev_check_failure: Frozen PHB#0-PE#0 detected
EEH: Recovering PHB#0-PE#0
Signed-off-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/43817cb6e6631b0828b9a6e266f60d1f8ca8eb22.1571288375.git.sbobroff@linux.ibm.com
Changes the return variable to bool (as the return value) and
avoids doing a ternary operation before returning.
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190802133914.30413-1-leonardo@linux.ibm.com
The powerpc version of dma-mapping.h only contains a version of
get_arch_dma_ops that always return NULL. Replace it with the
asm-generic version that does the same.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190807150752.17894-1-hch@lst.de
When the machine crash handler is invoked, all interrupts are masked
but interrupts which have not been started yet do not have an ESB page
mapped in the Linux address space. This crashes the 'crash kexec'
sequence on sPAPR guests.
To fix, force the mapping of the ESB page when an interrupt is being
mapped in the Linux IRQ number space. This is done by setting the
initial state of the interrupt to OFF which is not necessarily the
case on PowerNV.
Fixes: 243e25112d ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191031063100.3864-1-clg@kaod.org
The FSF does not reside in "675 Mass Ave, Cambridge" anymore...
let's simply use proper SPDX identifiers instead.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190828060737.32531-1-thuth@redhat.com
Avoids confusion when printing Oops message like below
Faulting instruction address: 0xc00000000008bdb4
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
This was because we never clear the MMU_FTR_HPTE_TABLE feature flag
even if we run with radix translation. It was discussed that we should
look at this feature flag as an indication of the capability to run
hash translation and we should not clear the flag even if we run in
radix translation. All the code paths check for radix_enabled() check and
if found true consider we are running with radix translation. Follow the
same sequence for finding the MMU translation string to be used in Oops
message.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190711145814.17970-1-aneesh.kumar@linux.ibm.com
arch/powerpc/platforms/cell/spufs/inode.c:201:22:
warning: variable ctx set but not used [-Wunused-but-set-variable]
It is not used since commit 67cba9fd64 ("move
spu_forget() into spufs_rmdir()")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191023134423.15052-1-yuehaibing@huawei.com
The callback function of call_rcu() just calls a kfree(), so we
can use kfree_rcu() instead of call_rcu() + callback function.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190711141818.18044-1-yuehaibing@huawei.com
Fix sparse warnings:
arch/powerpc/platforms/powernv/opal-psr.c:20:1:
warning: symbol 'psr_mutex' was not declared. Should it be static?
arch/powerpc/platforms/powernv/opal-psr.c:27:3:
warning: symbol 'psr_attrs' was not declared. Should it be static?
arch/powerpc/platforms/powernv/opal-powercap.c:20:1:
warning: symbol 'powercap_mutex' was not declared. Should it be static?
arch/powerpc/platforms/powernv/opal-sensor-groups.c:20:1:
warning: symbol 'sg_mutex' was not declared. Should it be static?
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190702131733.44100-1-yuehaibing@huawei.com
These Kconfig options has been removed in commit 4c145dce26 ("xfrm:
make xfrm modes builtin") So there is no point to keep it in
defconfigs any longer.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
[mpe: Extract from cross arch patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190612071901.21736-1-yuehaibing@huawei.com
Remove .owner field if calls are used which set it automatically
Generated by: scripts/coccinelle/api/platform_no_drv_owner.cocci
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190218133950.95225-1-yuehaibing@huawei.com
There is no need to have the 'struct dentry *vpa_dir' variable static
since new value always be assigned before use it.
Fixes: c6c26fb55e ("powerpc/pseries: Export raw per-CPU VPA data via debugfs")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190218125644.87448-1-yuehaibing@huawei.com
Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE
for debugfs files.
Semantic patch information:
Rationale: DEFINE_SIMPLE_ATTRIBUTE + debugfs_create_file()
imposes some significant overhead as compared to
DEFINE_DEBUGFS_ATTRIBUTE + debugfs_create_file_unsafe().
Generated by: scripts/coccinelle/api/debugfs/debugfs_simple_attr.cocci
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1545705876-63132-1-git-send-email-yuehaibing@huawei.com
Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE
for debugfs files.
Semantic patch information:
Rationale: DEFINE_SIMPLE_ATTRIBUTE + debugfs_create_file()
imposes some significant overhead as compared to
DEFINE_DEBUGFS_ATTRIBUTE + debugfs_create_file_unsafe().
Generated by: scripts/coccinelle/api/debugfs/debugfs_simple_attr.cocci
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1543498518-107601-1-git-send-email-yuehaibing@huawei.com
rtas_parse_epow_errlog() should pass 'modifier' to
handle_system_shutdown, because event modifier only use
bottom 4 bits.
Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191023134838.21280-1-yuehaibing@huawei.com
On powerpc, watchpoint match range is double-word granular. On a
watchpoint hit, DAR is set to the first byte of overlap between actual
access and watched range. And thus it's quite possible that DAR does
not point inside user specified range. Ex, say user creates a
watchpoint with address range 0x1004 to 0x1007. So hw would be
configured to watch from 0x1000 to 0x1007. If there is a 4 byte access
from 0x1002 to 0x1005, DAR will point to 0x1002 and thus interrupt
handler considers it as extraneous, but it's actually not, because
part of the access belongs to what user has asked.
Instead of blindly ignoring the exception, get actual address range by
analysing an instruction, and ignore only if actual range does not
overlap with user specified range.
Note: The behavior is unchanged for 8xx.
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191017093204.7511-5-ravi.bangoria@linux.ibm.com
ptrace_set_debugreg() does not consider new length while overwriting
the watchpoint. Fix that. ppc_set_hwdebug() aligns watchpoint address
to doubleword boundary but does not change the length. If address
range is crossing doubleword boundary and length is less then 8, we
will lose samples from second doubleword. So fix that as well.
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191017093204.7511-4-ravi.bangoria@linux.ibm.com
Watchpoint match range is always doubleword(8 bytes) aligned on
powerpc. If the given range is crossing doubleword boundary, we need
to increase the length such that next doubleword also get
covered. Ex,
address len = 6 bytes
|=========.
|------------v--|------v--------|
| | | | | | | | | | | | | | | | |
|---------------|---------------|
<---8 bytes--->
In such case, current code configures hw as:
start_addr = address & ~HW_BREAKPOINT_ALIGN
len = 8 bytes
And thus read/write in last 4 bytes of the given range is ignored.
Fix this by including next doubleword in the length.
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191017093204.7511-3-ravi.bangoria@linux.ibm.com
The issue was showing "Mitigation" message via sysfs whatever the
state of "RFI Flush", but it should show "Vulnerable" when it is
disabled.
If you have "L1D private" feature enabled and not "RFI Flush" you are
vulnerable to meltdown attacks.
"RFI Flush" is the key feature to mitigate the meltdown whatever the
"L1D private" state.
SEC_FTR_L1D_THREAD_PRIV is a feature for Power9 only.
So the message should be as the truth table shows:
CPU | L1D private | RFI Flush | sysfs
----|-------------|-----------|-------------------------------------
P9 | False | False | Vulnerable
P9 | False | True | Mitigation: RFI Flush
P9 | True | False | Vulnerable: L1D private per thread
P9 | True | True | Mitigation: RFI Flush, L1D private per thread
P8 | False | False | Vulnerable
P8 | False | True | Mitigation: RFI Flush
Output before this fix:
# cat /sys/devices/system/cpu/vulnerabilities/meltdown
Mitigation: RFI Flush, L1D private per thread
# echo 0 > /sys/kernel/debug/powerpc/rfi_flush
# cat /sys/devices/system/cpu/vulnerabilities/meltdown
Mitigation: L1D private per thread
Output after fix:
# cat /sys/devices/system/cpu/vulnerabilities/meltdown
Mitigation: RFI Flush, L1D private per thread
# echo 0 > /sys/kernel/debug/powerpc/rfi_flush
# cat /sys/devices/system/cpu/vulnerabilities/meltdown
Vulnerable: L1D private per thread
Signed-off-by: Gustavo L. F. Walbon <gwalbon@linux.ibm.com>
Signed-off-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190502210907.42375-1-gwalbon@linux.ibm.com
The stress test for vpmsum implementations executes a long for loop in
the kernel. This blocks the scheduler, which prevents other tasks from
running, resulting in a warning.
This fix adds a call to cond_reshed() at the end of each loop, which
allows the scheduler to run other tasks as required.
Signed-off-by: Chris Smart <chris.smart@humanservices.gov.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191103233356.5472-1-chris.smart@humanservices.gov.au
Let's allow to test the implementation without needing HW support.
When "simulate=1" is specified when loading the module, we bypass all
HW checks and HW calls. The sysfs file "simulate_loan_target_kb" can
be used to simulate HW requests.
The simualtion mode can be activated using:
modprobe cmm debug=1 simulate=1
And the requested loan target can be changed using:
echo X > /sys/devices/system/cmm/cmm0/simulate_loan_target_kb
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191031142933.10779-11-david@redhat.com
balloon_page_alloc() will use GFP_HIGHUSER_MOVABLE in case we have
CONFIG_BALLOON_COMPACTION. This is now possible, as balloon pages are
movable with CONFIG_BALLOON_COMPACTION. Without
CONFIG_BALLOON_COMPACTION, GFP_HIGHUSER is used.
Note that apart from that, balloon_page_alloc() uses the following
flags:
__GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN
And current code used:
GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC
GFP_HIGHUSER/GFP_HIGHUSER_MOVABLE include
__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL | __GFP_HIGHMEM
GFP_NOIO is __GFP_RECLAIM.
With CONFIG_BALLOON_COMPACTION, we essentially add:
__GFP_IO | __GFP_FS | __GFP_HARDWALL | __GFP_HIGHMEM | __GFP_MOVABLE
Without CONFIG_BALLOON_COMPACTION, we essentially add:
__GFP_IO | __GFP_FS | __GFP_HARDWALL | __GFP_HIGHMEM
I assume this is fine, as this is what all other balloon compaction
users use. If it turns out to be a problem, we could add __GFP_MOVABLE
manually if we have CONFIG_BALLOON_COMPACTION.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191031142933.10779-10-david@redhat.com
We can now get rid of the cmm_lock and completely rely on the balloon
compaction internals, which now also manage the page list and the
lock.
Inflated/"loaned" pages are now movable. Memory blocks that contain
such pages can get offlined. Also, all such pages will be marked
PageOffline() and can therefore be excluded in memory dumps using
recent versions of makedumpfile.
Don't switch to balloon_page_alloc() yet (due to the GFP_NOIO). Will
do that separately to discuss this change in detail.
Signed-off-by: David Hildenbrand <david@redhat.com>
[mpe: Add isolated_pages-- in cmm_migratepage() as suggested by David]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191031142933.10779-9-david@redhat.com
When switching to balloon compaction, we want to drop the cmm_lock and
completely rely on the balloon compaction list lock internally.
loaned_pages is currently protected under the cmm_lock.
Note: Right now cmm_alloc_pages() and cmm_free_pages() can be called
at the same time, e.g., via the thread and a concurrent OOM notifier.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191031142933.10779-8-david@redhat.com
The memory isolate notifier was added to allow to offline memory
blocks that contain inflated/"loaned" pages. We can achieve the same
using the balloon compaction framework.
Get rid of the memory isolate notifier. Also, we can get rid of
cmm_mem_going_offline(), as we will never reach that code path now
when we have allocated memory in the balloon (allocated pages are
unmovable and will no longer be special-cased using the memory
isolation notifier).
Leave the memory notifier in place, so we can still back off in case
memory gets offlined.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191031142933.10779-7-david@redhat.com
adjust_managed_page_count() performs a totalram_pages_add(), but also
adjusts the managed pages of the zone. Let's use that instead, similar
to virtio-balloon. Use it before freeing a page.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191031142933.10779-6-david@redhat.com
We can simply store the pages in a list (page->lru), no need for a
separate data structure (+ complicated handling). This is how most
other balloon drivers store allocated pages without additional
tracking data.
For the notifiers, use page_to_pfn() to check if a page is in the
applicable range. Use page_to_phys() in plpar_page_set_loaned() and
plpar_page_set_active() (I assume due to the __pa() that's the right
thing to do).
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191031142933.10779-5-david@redhat.com
When unloading the module, one gets
------------[ cut here ]------------
Device 'cmm0' does not have a release() function, it is broken and must be fixed. See Documentation/kobject.txt.
WARNING: CPU: 0 PID: 19308 at drivers/base/core.c:1244 .device_release+0xcc/0xf0
...
We only have one static fake device. There is nothing to do when
releasing the device (via cmm_exit()).
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20191031142933.10779-2-david@redhat.com