The maximum nesting of the cpu topology is evaluated when /proc/sysinfo
is the first time read. This happens without a lock and a concurrent
reader on a different cpu can see and use an invalid intermediate value.
Besides the fact that this race is quite unlikely the worst thing that
could happen is that /proc/sysinfo would contain bogus information about
the machine's cpu topology.
Nevertheless this should be fixed. So move the detection code to the
early machine detection code and since now the value is early available
use it in the topology code as well.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add a couple of missing fields that were introduced with z196.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The current proc implementation of the /proc/sysinfo file writes all
informations contained in all system information blocks to a single
page.
This is done by calling sprintf all the time in the expectation that
everything will fit into a single page. This however is not necessarily
true if the configuration of a machine is very large.
So convert /proc/sysinfo to avoid writing into random memory regions.
For readability reasons a couple of lines are longer than 80 characters.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Any change to sysinfo.h causes a whole kernel recompile since sysinfo.h is
included by topology.h, which again is used nearly everywhere.
So remove that include and add a forward declaration instead.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Shorten the code for addressing mode initialization. Also add missing
__init annotations, since this code is only used during kernel initialization.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Renaming the globally visible variable "user_mode" to "addressing_mode" in
order to fix a name clash was not a good idea. (Commit 37fe1d73 "s390/mm:
rename user_mode variable to addressing_mode")
Looking at the code after a couple of weeks one thinks: addressing mode of
what?
So rename the variable again. This time to s390_user_mode. Which hopefully
makes more sense.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Change the default addressing mode so that user space runs in primary space
and the kernel runs in home space.
In addition remove the "switch_amode" kernel parameter so all users who
already specified they want the new default behaviour will stay in the
"switched" mode instead of in the opposite they intended.
If there is a need to switch addressing modes, this can be done with the
"user_mode" kernel parameter: user_mode=home
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
When a sigp instruction is issued it may store a status. This status is
currently stored in a per cpu field of the target cpu.
If multiple cpus issue a sigp instruction with the same target cpu
and a status is stored the result is not necessarily as expected.
Currently this is not an issue:
- on cpu hotplug no sigps, except "restart" and "sense" are sent to the
target cpu.
- on external call we don't look at the status if it is stored
- on sense running the condition code "status stored" is sufficient to
tell if a cpu is running or not
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add a line for each cpu cache to /proc/cpuinfo.
Since we only have information of private cpu caches in sysfs we
add a line for each cpu cache in /proc/cpuinfo which will also
contain information about shared caches.
For a z196 machine /proc/cpuinfo now looks like:
vendor_id : IBM/S390
bogomips per cpu: 14367.00
features : esan3 zarch stfle msa ldisp eimm dfp etf3eh highgprs
cache0 : level=1 type=Data scope=Private size=64K line_size=256 associativity=4
cache1 : level=1 type=Instruction scope=Private size=128K line_size=256 associativity=8
cache2 : level=2 type=Unified scope=Private size=1536K line_size=256 associativity=12
cache3 : level=3 type=Unified scope=Shared size=24576K line_size=256 associativity=12
cache4 : level=4 type=Unified scope=Shared size=196608K line_size=256 associativity=24
processor 0: version = FF, identification = 000123, machine = 2817
processor 1: version = FF, identification = 100123, machine = 2817
processor 2: version = FF, identification = 200123, machine = 2817
processor 3: version = FF, identification = 200123, machine = 2817
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Allow user-space processes to use transactional execution (TX).
If the TX facility is available user space programs can use
transactions for fine-grained serialization based on the data
objects that are referenced during a transaction. This is
useful for lockless data structures and speculative compiler
optimizations.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Allow user-space threads to use runtime instrumentation (RI). To enable RI
for a thread there is a new s390 specific system call, sys_s390_runtime_instr,
that takes as parameter a realtime signal number. If the RI facility is
available the system call sets up a control block for the calling thread with
the appropriate permissions for the thread to modify the control block.
The user-space thread can then use the store and modify RI instructions to
alter the control block and start/stop the instrumentation via RION/RIOFF.
If the user specified program buffer runs full RI triggers an external
interrupt. The external interrupt is translated to a real-time signal that
is delivered to the thread that enabled RI on that CPU. The number of
the real-time signal is the number specified in the RI system call. So,
user-space can select any available real-time signal number in case the
application itself uses real-time signals for other purposes.
The kernel saves the RI control blocks on task switch only if the running
thread was enabled for RI. Therefore, the performance impact on task switch
should be negligible if RI is not used.
RI is only enabled for user-space mode and is disabled for the supervisor
state.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add support for EADM interrupt statistics in /proc/interrupts.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
machine_crash_shutdown() was the only function in crash.c.
So move the function and delete one file.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Unify all our cpu hotplug notifiers to mask out the CPU_TASKS_FROZEN
bit, so we don't have to add all the *_FROZEN variant cases to the
notifiers.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Expose cpu cache topology via sysfs.
The created sysfs directory structure is compatible to what x86, ia64
and powerpc have.
On s390 we expose only information about cpu caches which are private
to a cpu via sysfs . Caches which are shared between cpus do not have
a sysfs representation.
The reason for that is that the file "shared_cpu_map" is mandatory
and only if running under LPAR it is possible to tell which cpus
share which cache. Second level hypervisors however do not and cannot
expose that information to guests.
In order to have a consistent view we made the choice to always only
expose information about private cpu caches via sysfs.
Example for a z196 cpu (cpu1 in /sys/devices/cpu):
cpu1/cache/index0/size -- 64K
cpu1/cache/index0/type -- Data
cpu1/cache/index0/level -- 1
cpu1/cache/index0/number_of_sets -- 64
cpu1/cache/index0/shared_cpu_map -- 00000000,00000002
cpu1/cache/index0/shared_cpu_list -- 1
cpu1/cache/index0/coherency_line_size -- 256
cpu1/cache/index0/ways_of_associativity -- 4
cpu1/cache/index1/size -- 128K
cpu1/cache/index1/type -- Instruction
cpu1/cache/index1/level -- 1
cpu1/cache/index1/number_of_sets -- 64
cpu1/cache/index1/shared_cpu_map -- 00000000,00000002
cpu1/cache/index1/shared_cpu_list -- 1
cpu1/cache/index1/coherency_line_size -- 256
cpu1/cache/index1/ways_of_associativity -- 8
cpu1/cache/index2/size -- 1536K
cpu1/cache/index2/type -- Unified
cpu1/cache/index2/level -- 2
cpu1/cache/index2/number_of_sets -- 512
cpu1/cache/index2/shared_cpu_map -- 00000000,00000002
cpu1/cache/index2/shared_cpu_list -- 1
cpu1/cache/index2/coherency_line_size -- 256
cpu1/cache/index2/ways_of_associativity -- 12
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
So far, large page support was completely disabled with
CONFIG_DEBUG_PAGEALLOC, although it would be sufficient if only the
large page kernel mapping was disabled. This patch enables large page
support with CONFIG_DEBUG_PAGEALLOC, while it prevents the large kernel
mapping in that case.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Our memcpy and memcmp variants were implemented by calling the corresponding
gcc builtin variants.
However gcc is free to replace a call to __builtin_memcmp with a call to memcmp
which, when called, will result in an endless recursion within memcmp.
So let's provide asm variants and also fix the variants that are used for
uncompressing the kernel image.
In addition remove all other occurences of builtin function calls.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The s390 implementation of the JIT compiler for packet filter speedup.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use a naming based on vtime as a prefix for virtual based
cputime accounting APIs:
- account_system_vtime() -> vtime_account()
- account_switch_vtime() -> vtime_task_switch()
It makes it easier to allow for further declension such
as vtime_account_system(), vtime_account_idle(), ... if we
want to find out the context we account to from generic code.
This also make it better to know on which subsystem these APIs
refer to.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
To help migrate archtectures over to the new update_vsyscall method,
redfine CONFIG_GENERIC_TIME_VSYSCALL as CONFIG_GENERIC_TIME_VSYSCALL_OLD
Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Since users will need to include timekeeper_internal.h, move
update_vsyscall definitions to timekeeper_internal.h.
Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Convert getresuid, getresgid, getuid, geteuid, getgid, getegid
Convert struct cred kuids and kgids into userspace uids and gids when
returning them.
These s390 system calls slipped through the cracks in my first
round of converstions :(
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
The bit for high gprs in the AT_HWCAP auxiliary vector field and the
highgprs tag in the output of /proc/cpuinfo should not be set for
31 bit kernels.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Merging critical fixes from upstream required for development.
* upstream/master: (809 commits)
libata: Add a space to " 2GB ATA Flash Disk" DMA blacklist entry
Revert "powerpc: Update g5_defconfig"
powerpc/perf: Use pmc_overflow() to detect rolled back events
powerpc: Fix VMX in interrupt check in POWER7 copy loops
powerpc: POWER7 copy_to_user/copy_from_user patch applied twice
powerpc: Fix personality handling in ppc64_personality()
powerpc/dma-iommu: Fix IOMMU window check
powerpc: Remove unnecessary ifdefs
powerpc/kgdb: Restore current_thread_info properly
powerpc/kgdb: Bail out of KGDB when we've been triggered
powerpc/kgdb: Do not set kgdb_single_step on ppc
powerpc/mpic_msgr: Add missing includes
powerpc: Fix null pointer deref in perf hardware breakpoints
powerpc: Fixup whitespace in xmon
powerpc: Fix xmon dl command for new printk implementation
xfs: check for possible overflow in xfs_ioc_trim
xfs: unlock the AGI buffer when looping in xfs_dialloc
xfs: fix uninitialised variable in xfs_rtbuf_get()
powerpc/fsl: fix "Failed to mount /dev: No such device" errors
powerpc/fsl: update defconfigs
...
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The archs that implement virtual cputime accounting all
flush the cputime of a task when it gets descheduled
and sometimes set up some ground initialization for the
next task to account its cputime.
These archs all put their own hooks in their context
switch callbacks and handle the off-case themselves.
Consolidate this by creating a new account_switch_vtime()
callback called in generic code right after a context switch
and that these archs must implement to flush the prev task
cputime and initialize the next task cputime related state.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
The native 31 bit and the compat behaviour for the mmap system calls differ:
In native 31 bit mode the passed in address for the mmap system call will be
unmodified passed to sys_mmap_pgoff().
In compat mode however the passed in address will be modified with
compat_ptr() which masks out the most significant bit.
The result is that in native 31 bit mode each mmap request (with MAP_FIXED)
will fail where the most significat bit is set, while in compat mode it
may succeed.
This odd behaviour was introduced with d3815898 "[S390] mmap: add missing
compat_ptr conversion to both mmap compat syscalls".
To restore a consistent behaviour accross native and compat mode this
patch functionally reverts the above mentioned commit.
Cc: stable@vger.kernel.org
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The compat wrappers incorrectly called the non compat versions of
the system process_vm system calls.
Cc: stable@vger.kernel.org
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
There are multiple errors in how sys_32_personality() handles personality
flags stored in top three bytes.
- directly comparing current->personality against PER_LINUX32 doesn't work
in cases when any of the personality flags stored in the top three bytes
are used.
- directly forcefully setting personality to PER_LINUX32 or PER_LINUX
discards any flags stored in the top three bytes
Fix the first one by properly using personality() macro to compare only
PER_MASK bytes.
Fix the second one by setting only the bits that should be set, instead of
overwriting the whole value.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
- bring back critical fixes (esp. aa67f6096c)
- provide an updated base for development
* upstream: (4334 commits)
missed mnt_drop_write() in do_dentry_open()
UBIFS: nuke pdflush from comments
gfs2: nuke pdflush from comments
drbd: nuke pdflush from comments
nilfs2: nuke write_super from comments
hfs: nuke write_super from comments
vfs: nuke pdflush from comments
jbd/jbd2: nuke write_super from comments
btrfs: nuke pdflush from comments
btrfs: nuke write_super from comments
ext4: nuke pdflush from comments
ext4: nuke write_super from comments
ext3: nuke write_super from comments
Documentation: fix the VM knobs descritpion WRT pdflush
Documentation: get rid of write_super
vfs: kill write_super and sync_supers
ACPI processor: Fix tick_broadcast_mask online/offline regression
ACPI: Only count valid srat memory structures
ACPI: Untangle a return statement for better readability
Linux 3.6-rc1
...
Signed-off-by: Avi Kivity <avi@redhat.com>
We use the user_mode() helper already at several places but also
have the open coded variant at other places.
Convert the code to always use the helper function.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Fix name clash with user_mode() define which is also used in common code.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Provide a new function, insn_to_mnemonic, by which e.g. kvm can obtain
a human-readable decoding of an instruction's opcode.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently the vmcmd shutdown action is parsed by the kernel and
if multiple cp commands have been specified, they are issued
separately with the cpcmd() function.
The underlying diagnose 8 instruction already allows to specify
multiple commands that are separated by 0x15. The ASCEBC() function
used by cpcmd() translates '\n' to 0x15. The '\n' character is
currently used as vmcmd command separator and therefore the vmcmd
string can be passed directly to the cpcmd() function.
Using the diagnose 8 command separation has the advantage that also
after disruptive commands that stop Linux, for example "def store",
additional commands can be executed.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use RO_DATA_SECTION instead of RODATA like several other archs do already.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Follow x86 and MIPS and sort the main exception table at build time.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
debug.c is not a module, so remove the module_exit function, since it
will never be called.
Also move the EXPORT_SYMBOL statements to the functions they belong to.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJQDRDNAAoJEI7yEDeUysxlkl8P/3C2AHx2webOU8sVzhfU6ONZ
ZoGevwBjyZIeJEmiWVpFTTEew1l0PXtpyOocXGNUXIddVnhXTQOKr/Scj4uFbmx8
ROqgK8NSX9+xOGrBPCoN7SlJkmp+m6uYtwYkl2SGnsEVLWMKkc7J7oqmszCcTQvN
UXMf7G47/Ul2NUSBdv4Yvizhl4kpvWxluiweDw3E/hIQKN0uyP7CY58qcAztw8nG
csZBAnnuPFwIAWxHXW3eBBv4UP138HbNDqJ/dujjocM6GnOxmXJmcZ6b57gh+Y64
3+w9IR4qrRWnsErb/I8inKLJ1Jdcf7yV2FmxYqR4pIXay2Yzo1BsvFd6EB+JavUv
pJpixrFiDDFoQyXlh4tGpsjpqdXNMLqyG4YpqzSZ46C8naVv9gKE7SXqlXnjyDlb
Llx3hb9Fop8O5ykYEGHi+gIISAK5eETiQl4yw9RUBDpxydH4qJtqGIbLiDy8y9wi
Xyi8PBlNl+biJFsK805lxURqTp/SJTC3+Zb7A7CzYEQm5xZw3W/CKZx1ZYBfpaa/
pWaP6tB7JwgLIVXi4HQayLWqMVwH0soZIn9yazpOEFv6qO8d5QH5RAxAW2VXE3n5
JDlrajar/lGIdiBVWfwTJLb86gv3QDZtIWoR9mZuLKeKWE/6PRLe7HQpG1pJovsm
2AsN5bS0BWq+aqPpZHa5
=pECD
-----END PGP SIGNATURE-----
Merge tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Avi Kivity:
"Highlights include
- full big real mode emulation on pre-Westmere Intel hosts (can be
disabled with emulate_invalid_guest_state=0)
- relatively small ppc and s390 updates
- PCID/INVPCID support in guests
- EOI avoidance; 3.6 guests should perform better on 3.6 hosts on
interrupt intensive workloads)
- Lockless write faults during live migration
- EPT accessed/dirty bits support for new Intel processors"
Fix up conflicts in:
- Documentation/virtual/kvm/api.txt:
Stupid subchapter numbering, added next to each other.
- arch/powerpc/kvm/booke_interrupts.S:
PPC asm changes clashing with the KVM fixes
- arch/s390/include/asm/sigp.h, arch/s390/kvm/sigp.c:
Duplicated commits through the kvm tree and the s390 tree, with
subsequent edits in the KVM tree.
* tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits)
KVM: fix race with level interrupts
x86, hyper: fix build with !CONFIG_KVM_GUEST
Revert "apic: fix kvm build on UP without IOAPIC"
KVM guest: switch to apic_set_eoi_write, apic_write
apic: add apic_set_eoi_write for PV use
KVM: VMX: Implement PCID/INVPCID for guests with EPT
KVM: Add x86_hyper_kvm to complete detect_hypervisor_platform check
KVM: PPC: Critical interrupt emulation support
KVM: PPC: e500mc: Fix tlbilx emulation for 64-bit guests
KVM: PPC64: booke: Set interrupt computation mode for 64-bit host
KVM: PPC: bookehv: Add ESR flag to Data Storage Interrupt
KVM: PPC: bookehv64: Add support for std/ld emulation.
booke: Added crit/mc exception handler for e500v2
booke/bookehv: Add host crit-watchdog exception support
KVM: MMU: document mmu-lock and fast page fault
KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint
KVM: MMU: trace fast page fault
KVM: MMU: fast path of handling guest page fault
KVM: MMU: introduce SPTE_MMU_WRITEABLE bit
KVM: MMU: fold tlb flush judgement into mmu_spte_update
...
Pull s390 changes from Martin Schwidefsky:
"No new functions, a few changes to make the code more robust, some
cleanups and bug fixes."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (21 commits)
s390/vtimer: rework virtual timer interface
s390/dis: Add the servc instruction to the disassembler.
s390/comments: unify copyright messages and remove file names
s390/lgr: Add init check to lgr_info_log()
s390/cpu init: use __get_cpu_var instead of per_cpu
s390/idle: reduce size of s390_idle_data structure
s390/idle: fix sequence handling vs cpu hotplug
s390/ap: resend enable adapter interrupt request.
s390/hypfs: Add missing get_next_ino()
s390/dasd: add shutdown action
s390/ipl: Fix ipib handling for "dumpreipl" shutdown action
s390/smp: make absolute lowcore / cpu restart parameter accesses more robust
s390/vmlogrdr: cleanup driver attribute usage
s390/vmlogrdr: cleanup device attribute usage
s390/ccwgroup: remove unused ccwgroup_device member
s390/cio/chp: cleanup attribute usage
s390/sigp: use sigp order code defines in assembly code
s390/smp: use sigp cpu status definitions
s390/smp/kvm: unifiy sigp definitions
s390/smp: remove redundant check
...
The current virtual timer interface is inherently per-cpu and hard to
use. The sole user of the interface is appldata which uses it to execute
a function after a specific amount of cputime has been used over all cpus.
Rework the virtual timer interface to hook into the cputime accounting.
This makes the interface independent from the CPU timer interrupts, and
makes the virtual timers global as opposed to per-cpu.
Overall the code is greatly simplified. The downside is that the accuracy
is not as good as the original implementation, but it is still good enough
for appldata.
Reviewed-by: Jan Glauber <jang@linux.vnet.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Remove the file name from the comment at top of many files. In most
cases the file name was wrong anyway, so it's rather pointless.
Also unify the IBM copyright statement. We did have a lot of sightly
different statements and wanted to change them one after another
whenever a file gets touched. However that never happened. Instead
people start to take the old/"wrong" statements to use as a template
for new files.
So unify all of them in one go.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
If lgr has not been initialized, the lgr_info_log() function currently
crashes because 'lgr_page' is not allocated. To fix this 'lgr_page'
is allocated statically now.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Just saves a couple of instructions.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The s390 idle accounting code uses a sequence counter which gets used
when the per cpu idle statistics get updated and read.
One assumption on read access is that only when the sequence counter is
even and did not change while reading all values the result is valid.
On cpu hotplug however the per cpu data structure gets initialized via
a cpu hotplug notifier on CPU_ONLINE.
CPU_ONLINE however is too late, since the onlined cpu is already running
and might access the per cpu data. Worst case is that the data structure
gets initialized while an idle thread is updating its idle statistics.
This will result in an uneven sequence counter after an update.
As a result user space tools like top, which access /proc/stat in order
to get idle stats, will busy loop waiting for the sequence counter to
become even again, which will never happen until the queried cpu will
update its idle statistics again. And even then the sequence counter
will only have an even value for a couple of cpu cycles.
Fix this by moving the initialization of the per cpu idle statistics
to cpu_init(). I prefer that solution in favor of changing the
notifier to CPU_UP_PREPARE, which would be a different solution to
the problem.
Cc: stable@vger.kernel.org
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The smp and the kvm code have different defines for the sigp order codes.
Let's just have a single place where these are defined.
Also move the sigp condition code and sigp cpu status bits to the new
sigp.h header file.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
condition code "status stored" for sigp sense running always implies
that only the "not running" status bit is set. Therefore no need to
check if it is set.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Fix problem that was introduced with patch "s390/smp: make absolute
lowcore / cpu restart parameter". After that patch the "dumpreipl"
shutdown action does not work any more. To fix the problem we have
to assign "reipl_block_actual" instead of "&reipl_block_actual"
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Setting the cpu restart parameters is done in three different fashions:
- directly setting the four parameters individually
- copying the four parameters with memcpy (using 4 * sizeof(long))
- copying the four parameters using a private structure
In addition code in entry*.S relies on a certain order of the restart
members of struct _lowcore.
Make all of this more robust to future changes by adding a
mem_absolute_assign(dest, val) define, which assigns val to dest
using absolute addressing mode. Also the load multiple instructions
in entry*.S have been split into separate load instruction so the
order of the struct _lowcore members doesn't matter anymore.
In addition move the prototypes of memcpy_real/absolute from uaccess.h
to processor.h. These memcpy* variants are not related to uaccess at all.
string.h doesn't seem to match as well, so lets use processor.h.
Also replace the eight byte array in struct _lowcore which represents a
misaliged u64 with a u64. The compiler will always create code that
handles the misaligned u64 correctly.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
For processing under KVM it is required to detect
the actual SCLP console type in order to set it as
preferred console.
Signed-off-by: Heinz Graalfs <graalfs@linux.vnet.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Peter Oberparleiter <peter.oberparleiter@de.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Use sigp order code defines in assembly code as well.
With this change all places that use sigp constants should
have been converted to use self describing defines instead
of directly using constants.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
We got them from the kvm code, so let's use them.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The smp and the kvm code have different defines for the sigp order codes.
Let's just have a single place where these are defined.
Also move the sigp condition code and sigp cpu status bits to the new
sigp.h header file.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
condition code "status stored" for sigp sense running always implies
that only the "not running" status bit is set. Therefore no need to
check if it is set.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
After
commit 5e8010cb50
s390: replace TIF_SIE with PF_VCPU
there is no need to load the thread info before sie_loop where
it is also loaded.
Get rid of this duplicate instruction.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Does block_sigmask() + tracehook_signal_handler(); called when
sigframe has been successfully built. All architectures converted
to it; block_sigmask() itself is gone now (merged into this one).
I'm still not too happy with the signature, but that's a separate
story (IMO we need a structure that would contain signal number +
siginfo + k_sigaction, so that get_signal_to_deliver() would fill one,
signal_delivered(), handle_signal() and probably setup...frame() -
take one).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Only 3 out of 63 do not. Renamed the current variant to __set_current_blocked(),
added set_current_blocked() that will exclude unblockable signals, switched
open-coded instances to it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
replace boilerplate "should we use ->saved_sigmask or ->blocked?"
with calls of obvious inlined helper...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
first fruits of ..._restore_sigmask() helpers: now we can take
boilerplate "signal didn't have a handler, clear RESTORE_SIGMASK
and restore the blocked mask from ->saved_mask" into a common
helper. Open-coded instances switched...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull second pile of signal handling patches from Al Viro:
"This one is just task_work_add() series + remaining prereqs for it.
There probably will be another pull request from that tree this
cycle - at least for helpers, to get them out of the way for per-arch
fixes remaining in the tree."
Fix trivial conflict in kernel/irq/manage.c: the merge of Andrew's pile
had brought in commit 97fd75b7b8 ("kernel/irq/manage.c: use the
pr_foo() infrastructure to prefix printks") which changed one of the
pr_err() calls that this merge moves around.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
keys: kill task_struct->replacement_session_keyring
keys: kill the dummy key_replace_session_keyring()
keys: change keyctl_session_to_parent() to use task_work_add()
genirq: reimplement exit_irq_thread() hook via task_work_add()
task_work_add: generic process-context callbacks
avr32: missed _TIF_NOTIFY_RESUME on one of do_notify_resume callers
parisc: need to check NOTIFY_RESUME when exiting from syscall
move key_repace_session_keyring() into tracehook_notify_resume()
TIF_NOTIFY_RESUME is defined on all targets now
Pull s390 patches from Heiko Carstens:
"A couple of s390 patches for the 3.5 merge window. Just a collection
of bug fixes and cleanups."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/uaccess: fix access_ok compile warnings
s390/cmpxchg: select HAVE_CMPXCHG_LOCAL option
s390/cmpxchg: fix sign extension bugs
s390/cmpxchg: fix 1 and 2 byte memory accesses
s390/cmpxchg: fix compile warnings specific to s390
s390/cmpxchg: add missing memory barrier to cmpxchg64
s390/cpu: remove cpu "capabilities" sysfs attribute
s390/kernel: Fix smp_call_ipl_cpu() for offline CPUs
s390/kernel: Introduce memcpy_absolute() function
s390/headers: replace __s390x__ with CONFIG_64BIT where possible
s390/headers: remove #ifdef __KERNEL__ from not exported headers
s390/irq: split irq stats for cpu-measurement alert facilities
s390/kexec: Move early_pgm_check_handler() to text section
s390/kdump: Use real mode for PSW restart and kexec
s390/kdump: Account /sys/kernel/kexec_crash_size changes in OS info
s390/kernel: Remove OS info init function call and diag 308 for kdump
It has been a big mistage to add the capabilities attribute to the
cpus in sysfs:
First the attribute only contains the cpu capability of primary cpus,
which however is not necessarily (or better: unlikely) the type of
cpu the kernel runs on, which is typically an IFL.
In addition all information that is necessary is available in
/proc/sysinfo already. So this attribute partially duplicated
informations.
So programs should look into the sysinfo file to retrieve all
informations they are interested in.
Since with this kernel release also the powersavings cpu attributes
are removed this seems to be a good opportunity to remove another
broken interface.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
If the IPL CPU is offline, currently the pcpu_delegate() function
used by smp_call_ipl_cpu() does not work because pcpu_delegate()
modifies the lowcore of the target CPU. In case of an offline
IPL CPU currently the prefix register is zero but pcpu->lowcore
still points to the old prefix page. Therefore the lowcore changes
done by pcpu_delegate() have no effect.
With this fix pcpu_delegate() now uses memcpy_absolute() and therefore
also prepares the absolute zero lowcore if the target CPU has prefix
register zero.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This patch introduces the new function memcpy_absolute() that allows to
copy memory using absolute addressing. This means that the prefix swap
does not apply when this function is used.
With this patch also all s390 kernel code that accesses absolute zero
now uses the new memcpy_absolute() function. The old and less generic
copy_to_absolute_zero() function is removed.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Pull first series of signal handling cleanups from Al Viro:
"This is just the first part of the queue (about a half of it);
assorted fixes all over the place in signal handling.
This one ends with all sigsuspend() implementations switched to
generic one (->saved_sigmask-based).
With this, a bunch of assorted old buglets are fixed and most of the
missing bits of NOTIFY_RESUME hookup are in place. Two more fixes sit
in arm and um trees respectively, and there's a couple of broken ones
that need obvious fixes - parisc and avr32 check TIF_NOTIFY_RESUME
only on one of two codepaths; fixes for that will happen in the next
series"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (55 commits)
unicore32: if there's no handler we need to restore sigmask, syscall or no syscall
xtensa: add handling of TIF_NOTIFY_RESUME
microblaze: drop 'oldset' argument of do_notify_resume()
microblaze: handle TIF_NOTIFY_RESUME
score: add handling of NOTIFY_RESUME to do_notify_resume()
m68k: add TIF_NOTIFY_RESUME and handle it.
sparc: kill ancient comment in sparc_sigaction()
h8300: missing checks of __get_user()/__put_user() return values
frv: missing checks of __get_user()/__put_user() return values
cris: missing checks of __get_user()/__put_user() return values
powerpc: missing checks of __get_user()/__put_user() return values
sh: missing checks of __get_user()/__put_user() return values
sparc: missing checks of __get_user()/__put_user() return values
avr32: struct old_sigaction is never used
m32r: struct old_sigaction is never used
xtensa: xtensa_sigaction doesn't exist
alpha: tidy signal delivery up
score: don't open-code force_sigsegv()
cris: don't open-code force_sigsegv()
blackfin: don't open-code force_sigsegv()
...
Pull user namespace enhancements from Eric Biederman:
"This is a course correction for the user namespace, so that we can
reach an inexpensive, maintainable, and reasonably complete
implementation.
Highlights:
- Config guards make it impossible to enable the user namespace and
code that has not been converted to be user namespace safe.
- Use of the new kuid_t type ensures the if you somehow get past the
config guards the kernel will encounter type errors if you enable
user namespaces and attempt to compile in code whose permission
checks have not been updated to be user namespace safe.
- All uids from child user namespaces are mapped into the initial
user namespace before they are processed. Removing the need to add
an additional check to see if the user namespace of the compared
uids remains the same.
- With the user namespaces compiled out the performance is as good or
better than it is today.
- For most operations absolutely nothing changes performance or
operationally with the user namespace enabled.
- The worst case performance I could come up with was timing 1
billion cache cold stat operations with the user namespace code
enabled. This went from 156s to 164s on my laptop (or 156ns to
164ns per stat operation).
- (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
Most uid/gid setting system calls treat these value specially
anyway so attempting to use -1 as a uid would likely cause
entertaining failures in userspace.
- If setuid is called with a uid that can not be mapped setuid fails.
I have looked at sendmail, login, ssh and every other program I
could think of that would call setuid and they all check for and
handle the case where setuid fails.
- If stat or a similar system call is called from a context in which
we can not map a uid we lie and return overflowuid. The LFS
experience suggests not lying and returning an error code might be
better, but the historical precedent with uids is different and I
can not think of anything that would break by lying about a uid we
can't map.
- Capabilities are localized to the current user namespace making it
safe to give the initial user in a user namespace all capabilities.
My git tree covers all of the modifications needed to convert the core
kernel and enough changes to make a system bootable to runlevel 1."
Fix up trivial conflicts due to nearby independent changes in fs/stat.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
userns: Silence silly gcc warning.
cred: use correct cred accessor with regards to rcu read lock
userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
userns: Convert cgroup permission checks to use uid_eq
userns: Convert tmpfs to use kuid and kgid where appropriate
userns: Convert sysfs to use kgid/kuid where appropriate
userns: Convert sysctl permission checks to use kuid and kgids.
userns: Convert proc to use kuid/kgid where appropriate
userns: Convert ext4 to user kuid/kgid where appropriate
userns: Convert ext3 to use kuid/kgid where appropriate
userns: Convert ext2 to use kuid/kgid where appropriate.
userns: Convert devpts to use kuid/kgid where appropriate
userns: Convert binary formats to use kuid/kgid where appropriate
userns: Add negative depends on entries to avoid building code that is userns unsafe
userns: signal remove unnecessary map_cred_ns
userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
userns: Convert stat to return values mapped from kuids and kgids
userns: Convert user specfied uids and gids in chown into kuids and kgid
userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
...
CPU-measurement alerts are generated for different CPU-measurement
facilities, for example, the sampling and counter facilities.
Split the irq stats according to available facilities.
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The early_pgm_check_handler() function is also used after the
init phase in s390_reset_system(). Therefore it must not be in
the init section.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Currently the PSW restart handler and kexec are executed in real
mode with DAT=off. For kexec/kdump the function setup_regs() is
called that uses the per-cpu variable "crash_notes". Because
there are situations when the per-cpu implementation uses vmalloc
memory, calling setup_regs() in real mode can cause a program
check interrupt.
To fix that problem this patch changes the following:
* Ensure that diag308_reset() does not change PSW bits to real mode
* Enable DAT in __do_restart() after we switched to an online CPU
* Enable DAT in __machine_kexec() after we switched to the IPL CPU
* Call setup_regs() before we switch to real mode and call purgatory
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The crashkernel size for kdump can be reduced at runtime with the
sysfs file "/sys/kernel/kexec_crash_size". Currently those changes
do not update the OS info crashkernel information that is used
for stand-alone kdump. With this fix now also the OS info crashkernel
information is updated correctly.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Because of a design change for stand-alone kdump the function that
was done by the OS info init function is moved to the boot loader
code. This has two implications that are implemented by this patch:
a) The OS info init function is no longer called by the kernel
b) The diag 308 subcode 1 reset is no longer done by the kdump boot code.
This is necessary because otherwise the operation that is done now
by the boot loader would be reversed. For the normal kexec based
kdump mechansim the reset is already done by the kdump trigger code
(e.g. panic or PSW restart).
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
guts of saved_sigmask-based sigsuspend/rt_sigsuspend. Takes
kernel sigset_t *.
Open-coded instances replaced with calling it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull security subsystem updates from James Morris:
"New notable features:
- The seccomp work from Will Drewry
- PR_{GET,SET}_NO_NEW_PRIVS from Andy Lutomirski
- Longer security labels for Smack from Casey Schaufler
- Additional ptrace restriction modes for Yama by Kees Cook"
Fix up trivial context conflicts in arch/x86/Kconfig and include/linux/filter.h
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
apparmor: fix long path failure due to disconnected path
apparmor: fix profile lookup for unconfined
ima: fix filename hint to reflect script interpreter name
KEYS: Don't check for NULL key pointer in key_validate()
Smack: allow for significantly longer Smack labels v4
gfp flags for security_inode_alloc()?
Smack: recursive tramsmute
Yama: replace capable() with ns_capable()
TOMOYO: Accept manager programs which do not start with / .
KEYS: Add invalidation support
KEYS: Do LRU discard in full keyrings
KEYS: Permit in-place link replacement in keyring list
KEYS: Perform RCU synchronisation on keys prior to key destruction
KEYS: Announce key type (un)registration
KEYS: Reorganise keys Makefile
KEYS: Move the key config into security/keys/Kconfig
KEYS: Use the compat keyctl() syscall wrapper on Sparc64 for Sparc32 compat
Yama: remove an unused variable
samples/seccomp: fix dependencies on arch macros
Yama: add additional ptrace scopes
...
Pull smp hotplug cleanups from Thomas Gleixner:
"This series is merily a cleanup of code copied around in arch/* and
not changing any of the real cpu hotplug horrors yet. I wish I'd had
something more substantial for 3.5, but I underestimated the lurking
horror..."
Fix up trivial conflicts in arch/{arm,sparc,x86}/Kconfig and
arch/sparc/include/asm/thread_info_32.h
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
um: Remove leftover declaration of alloc_task_struct_node()
task_allocator: Use config switches instead of magic defines
sparc: Use common threadinfo allocator
score: Use common threadinfo allocator
sh-use-common-threadinfo-allocator
mn10300: Use common threadinfo allocator
powerpc: Use common threadinfo allocator
mips: Use common threadinfo allocator
hexagon: Use common threadinfo allocator
m32r: Use common threadinfo allocator
frv: Use common threadinfo allocator
cris: Use common threadinfo allocator
x86: Use common threadinfo allocator
c6x: Use common threadinfo allocator
fork: Provide kmemcache based thread_info allocator
tile: Use common threadinfo allocator
fork: Provide weak arch_release_[task_struct|thread_info] functions
fork: Move thread info gfp flags to header
fork: Remove the weak insanity
sh: Remove cpu_idle_wait()
...
There is a small race window in the __switch_to code in regard to
the transfer of the TIF_MCCK_PENDING bit from the previous to the
next task. The bit is transferred before the task struct pointer
and the thread-info pointer for the next task has been stored to
lowcore. If a machine check sets the TIF_MCCK_PENDING bit between
the transfer code and the store of current/thread_info the bit
is still set for the previous task. And if the previous task has
terminated it can get lost. The effect is that a pending CRW is
not retrieved until the next machine checks sets TIF_MCCK_PENDING.
To fix this reorder __switch_to to first store the task struct
and thread-info pointer and then do the transfer of the bit.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use HAVE_MARCH_Z9_109_FEATURES to figure out if stckf is available
at compile time.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The store clock fast instruction saves a couple of instructions compared
to the store clock instruction. Always use stckf instead of stck if it
is available.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Remove the builtin tape ipl code. If somebody really wants to create a
tape which can be ipl'ed from, then this can be achieved by using zipl.
zipl can write an ipl record to a tape device and aftwards the kernel
image must be written to tape.
The steps are described in the "Linux on System z - Device Drivers,
Features, and Commands" book (SC33-8411).
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The code in entry[64].S calls do_signal only on return to user space.
user_mode(regs) is true for every calls to do_signal, it is unnecessary
to recheck user_mode at the start of do_signal and the legacy signal
stack switching path in get_sigframe is never reached.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Replace the check for TIF_SIE in the fault handler by a check for PF_VCPU.
With the last user of TIF_SIE gone we can now remove the bit.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
HANDLE_SIE_INTERCEPT is called early, use supervisor state and
instruction address to decide if the reset of the PSW to sie_loop
is required.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
To allow correct stack backtraces the backchain for the external
interrupt handler is now initialized with zero like it is already
done for example by io_int_handler().
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add missing #ifdep CONFIG_HOTPLUG_CPU to get rid of this one:
arch/s390/kernel/smp.c:229:13: warning: 'pcpu_free_lowcore'
defined but not used [-Wunused-function]
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
- Store uids and gids with kuid_t and kgid_t in struct kstat
- Convert uid and gids to userspace usable values with
from_kuid and from_kgid
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Same code. Use the generic version. The special Makefile treatment is
pointless anyway as init_task.o contains only data which is handled by
the linker script. So no point on being treated like head text.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Link: http://lkml.kernel.org/r/20120503085035.271439530@linutronix.de
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
iQEcBAABAgAGBQJPnb50AAoJEHm+PkMAQRiGAE0H/A4zFZIUGmF3miKPDYmejmrZ
oVDYxVAu6JHjHWhu8E3VsinvyVscowjV8dr15eSaQzmDmRkSHAnUQ+dB7Di7jLC2
MNopxsWjwyZ8zvvr3rFR76kjbWKk/1GYytnf7GPZLbJQzd51om2V/TY/6qkwiDSX
U8Tt7ihSgHAezefqEmWp2X/1pxDCEt+VFyn9vWpkhgdfM1iuzF39MbxSZAgqDQ/9
JJrBHFXhArqJguhENwL7OdDzkYqkdzlGtS0xgeY7qio2CzSXxZXK4svT6FFGA8Za
xlAaIvzslDniv3vR2ZKd6wzUwFHuynX222hNim3QMaYdXm012M+Nn1ufKYGFxI0=
=4d4w
-----END PGP SIGNATURE-----
Merge tag 'v3.4-rc5' into next
Linux 3.4-rc5
Merge to pull in prerequisite change for Smack:
86812bb0de
Requested by Casey.
As a first step to converting struct cred to be all kuid_t and kgid_t
values convert the group values stored in group_info to always be
kgid_t values. Unless user namespaces are used this change should
have no effect.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Link: http://lkml.kernel.org/r/20120420124557.652574928@linutronix.de
Preparatory patch to make the idle thread allocation for secondary
cpus generic.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20120420124556.964170564@linutronix.de
This change is inspired by
https://lkml.org/lkml/2012/4/16/14
which fixes the build warnings for arches that don't support
CONFIG_HAVE_ARCH_SECCOMP_FILTER.
In particular, there is no requirement for the return value of
secure_computing() to be checked unless the architecture supports
seccomp filter. Instead of silencing the warnings with (void)
a new static inline is added to encode the expected behavior
in a compiler and human friendly way.
v2: - cleans things up with a static inline
- removes sfr's signed-off-by since it is a different approach
v1: - matches sfr's original change
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Will Drewry <wad@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
The stfle() function writes into lowcore memory when stfl_fac_list
is initialized with "S390_lowcore.stfl_fac_list = 0". For older
compilers this triggers a lowcore exception. With newer compilers
and "-OXX" compile option the bug does not show up because
the "S390_lowcore.stfl_fac_list" initialization is removed by the
compiler. The reason for thatis the incorrect "=m"
(S390_lowcore.stfl_fac_list) constraint in the stfl inline assembly.
The following shows the disassembly of the stfle() optimized code
that is inlined in the lgr_info_get() function:
000000000011325c <lgr_info_get>:
11325c: eb 9f f0 60 00 24 stmg %r9,%r15,96(%r15)
113262: c0 d0 00 29 0e 47 larl %r13,634ef0 <servi..>
113268: a7 f1 3f c0 tml %r15,16320
11326c: b9 04 00 ef lgr %r14,%r15
113270: a7 84 00 01 je 113272 <lgr_info_g..>
113274: a7 fb ff c0 aghi %r15,-64
113278: b9 04 00 c2 lgr %r12,%r2
11327c: a7 29 00 01 lghi %r2,1
113280: e3 e0 f0 98 00 24 stg %r14,152(%r15)
113286: d7 97 c0 00 c0 00 xc 0(152,%r12),0(%r12)
11328c: c0 e5 00 28 db 4c brasl %r14,62e924 <add_e..>
113292: b2 b1 00 00 stfl 0
To fix the problem we now clear the S390_lowcore.stfl_fac_list at
startup in "head.S" for all machine types before lowcore protection
is enabled.
In addition to that the "=m" constraint is replaced by "+m".
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Fix these:
arch/s390/kernel/perf_cpum_cf.c:180:3: warning: format '%lx'
expects argument of type 'long unsigned int',
but argument 2 has type 'int' [-Wformat]
arch/s390/kernel/perf_cpum_cf.c: In function 'cpumf_pmu_disable':
arch/s390/kernel/perf_cpum_cf.c:205:3: warning: format '%lx'
expects argument of type 'long unsigned int',
but argument 2 has type 'int' [-Wformat]
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use braces for if/else/list_for_each_entry bodies if the body consists
of more than a single line. Otherwise I get confused and check if there
is something broken whenever I see these code snippets.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIVAwUAT3NKzROxKuMESys7AQKElw/+JyDxJSlj+g+nymkx8IVVuU8CsEwNLgRk
8KEnRfLhGtkXFLSJYWO6jzGo16F8Uqli1PdMFte/wagSv0285/HZaKlkkBVHdJ/m
u40oSjgT013bBh6MQ0Oaf8pFezFUiQB5zPOA9QGaLVGDLXCmgqUgd7exaD5wRIwB
ZmyItjZeAVnDfk1R+ZiNYytHAi8A5wSB+eFDCIQYgyulA1Igd1UnRtx+dRKbvc/m
rWQ6KWbZHIdvP1ksd8wHHkrlUD2pEeJ8glJLsZUhMm/5oMf/8RmOCvmo8rvE/qwl
eDQ1h4cGYlfjobxXZMHqAN9m7Jg2bI946HZjdb7/7oCeO6VW3FwPZ/Ic75p+wp45
HXJTItufERYk6QxShiOKvA+QexnYwY0IT5oRP4DrhdVB/X9cl2MoaZHC+RbYLQy+
/5VNZKi38iK4F9AbFamS7kd0i5QszA/ZzEzKZ6VMuOp3W/fagpn4ZJT1LIA3m4A9
Q0cj24mqeyCfjysu0TMbPtaN+Yjeu1o1OFRvM8XffbZsp5bNzuTDEvviJ2NXw4vK
4qUHulhYSEWcu9YgAZXvEWDEM78FXCkg2v/CrZXH5tyc95kUkMPcgG+QZBB5wElR
FaOKpiC/BuNIGEf02IZQ4nfDxE90QwnDeoYeV+FvNj9UEOopJ5z5bMPoTHxm4cCD
NypQthI85pc=
=G9mT
-----END PGP SIGNATURE-----
Merge tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system
Pull "Disintegrate and delete asm/system.h" from David Howells:
"Here are a bunch of patches to disintegrate asm/system.h into a set of
separate bits to relieve the problem of circular inclusion
dependencies.
I've built all the working defconfigs from all the arches that I can
and made sure that they don't break.
The reason for these patches is that I recently encountered a circular
dependency problem that came about when I produced some patches to
optimise get_order() by rewriting it to use ilog2().
This uses bitops - and on the SH arch asm/bitops.h drags in
asm-generic/get_order.h by a circuituous route involving asm/system.h.
The main difficulty seems to be asm/system.h. It holds a number of
low level bits with no/few dependencies that are commonly used (eg.
memory barriers) and a number of bits with more dependencies that
aren't used in many places (eg. switch_to()).
These patches break asm/system.h up into the following core pieces:
(1) asm/barrier.h
Move memory barriers here. This already done for MIPS and Alpha.
(2) asm/switch_to.h
Move switch_to() and related stuff here.
(3) asm/exec.h
Move arch_align_stack() here. Other process execution related bits
could perhaps go here from asm/processor.h.
(4) asm/cmpxchg.h
Move xchg() and cmpxchg() here as they're full word atomic ops and
frequently used by atomic_xchg() and atomic_cmpxchg().
(5) asm/bug.h
Move die() and related bits.
(6) asm/auxvec.h
Move AT_VECTOR_SIZE_ARCH here.
Other arch headers are created as needed on a per-arch basis."
Fixed up some conflicts from other header file cleanups and moving code
around that has happened in the meantime, so David's testing is somewhat
weakened by that. We'll find out anything that got broken and fix it..
* tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
Delete all instances of asm/system.h
Remove all #inclusions of asm/system.h
Add #includes needed to permit the removal of asm/system.h
Move all declarations of free_initmem() to linux/mm.h
Disintegrate asm/system.h for OpenRISC
Split arch_align_stack() out from asm-generic/system.h
Split the switch_to() wrapper out of asm-generic/system.h
Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
Create asm-generic/barrier.h
Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
Disintegrate asm/system.h for Xtensa
Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
Disintegrate asm/system.h for Tile
Disintegrate asm/system.h for Sparc
Disintegrate asm/system.h for SH
Disintegrate asm/system.h for Score
Disintegrate asm/system.h for S390
Disintegrate asm/system.h for PowerPC
Disintegrate asm/system.h for PA-RISC
Disintegrate asm/system.h for MN10300
...
Pull s390 patches part 2 from Martin Schwidefsky:
"Some minor improvements and one additional feature for the 3.4 merge
window: Hendrik added perf support for the s390 CPU counters."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
[S390] register cpu devices for SMP=n
[S390] perf: add support for s390x CPU counters
[S390] oprofile: Allow multiple users of the measurement alert interrupt
[S390] qdio: log all adapter characteristics
[S390] Remove unncessary export of arch_pick_mmap_layout
The motivation for this patchset was that I was looking at a way for a
qemu-kvm process, to exclude the guest memory from its core dump, which
can be quite large. There are already a number of filter flags in
/proc/<pid>/coredump_filter, however, these allow one to specify 'types'
of kernel memory, not specific address ranges (which is needed in this
case).
Since there are no more vma flags available, the first patch eliminates
the need for the 'VM_ALWAYSDUMP' flag. The flag is used internally by
the kernel to mark vdso and vsyscall pages. However, it is simple
enough to check if a vma covers a vdso or vsyscall page without the need
for this flag.
The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new
'VM_NODUMP' flag, which can be set by userspace using new madvise flags:
'MADV_DONTDUMP', and unset via 'MADV_DODUMP'. The core dump filters
continue to work the same as before unless 'MADV_DONTDUMP' is set on the
region.
The qemu code which implements this features is at:
http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch
In my testing the qemu core dump shrunk from 383MB -> 13MB with this
patch.
I also believe that the 'MADV_DONTDUMP' flag might be useful for
security sensitive apps, which might want to select which areas are
dumped.
This patch:
The VM_ALWAYSDUMP flag is currently used by the coredump code to
indicate that a vma is part of a vsyscall or vdso section. However, we
can determine if a vma is in one these sections by checking it against
the gate_vma and checking for a non-NULL return value from
arch_vma_name(). Thus, freeing a valuable vma bit.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Roland McGrath <roland@hack.frob.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a perf PMU to access the CPU-measurement counter facility CPUM CF.
CPUM CF provides multiple counter sets for measuring generic,
problem-state, and crypto activaties. Also an extended counter set for
the IBM System z10 and IBM z196 mainframes is available.
Counters from the basic and problem-state counter set are mapped to
generic perf hardware events. Other counters are accessible through
raw events.
For a list of available counter sets and counters, see:
- The Load-Program-Parameter and the CPU-Measurement Facilities (SA23-2260)
- The CPU-Measurement Facility Extended Counters Definition for
z10 and z196 (SA23-2261)
Reviewed-by: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Prepare the measurement facility which is currently only used by oprofile
for multiple users. To achieve that the measurement alert interrupt control
bit needs to be protected. The measurement alert definitions are moved
to a header file and an interrupt mask is added so that users can discard
interrupts if they are for a different measurement subsystem.
Reviewed-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Pull s390 patches from Martin Schwidefsky:
"The biggest patch is the rework of the smp code, something I wanted to
do for some time. There are some patches for our various dump methods
and one new thing: z/VM LGR detection. LGR stands for linux-guest-
relocation and is the guest migration feature of z/VM. For debugging
purposes we keep a log of the systems where a specific guest has lived."
Fix up trivial conflict in arch/s390/kernel/smp.c due to the scheduler
cleanup having removed some code next to removed s390 code.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
[S390] kernel: Pass correct stack for smp_call_ipl_cpu()
[S390] Ensure that vmcore_info pointer is never accessed directly
[S390] dasd: prevent validate server for offline devices
[S390] Remove monolithic build option for zcrypt driver.
[S390] stack dump: fix indentation in output
[S390] kernel: Add OS info memory interface
[S390] Use block_sigmask()
[S390] kernel: Add z/VM LGR detection
[S390] irq: external interrupt code passing
[S390] irq: set __ARCH_IRQ_EXIT_IRQS_DISABLED
[S390] zfcpdump: Implement async sdias event processing
[S390] Use copy_to_absolute_zero() instead of "stura/sturg"
[S390] rework idle code
[S390] rework smp code
[S390] rename lowcore field
[S390] Fix gcc 4.6.0 compile warning
Pull scheduler changes for v3.4 from Ingo Molnar
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
printk: Make it compile with !CONFIG_PRINTK
sched/x86: Fix overflow in cyc2ns_offset
sched: Fix nohz load accounting -- again!
sched: Update yield() docs
printk/sched: Introduce special printk_sched() for those awkward moments
sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer
sched: Cleanup cpu_active madness
sched: Fix load-balance wreckage
sched: Clean up parameter passing of proc_sched_autogroup_set_nice()
sched: Ditch per cgroup task lists for load-balancing
sched: Rename load-balancing fields
sched: Move load-balancing arguments into helper struct
sched/rt: Do not submit new work when PI-blocked
sched/rt: Prevent idle task boosting
sched/wait: Add __wake_up_all_locked() API
sched/rt: Document scheduler related skip-resched-check sites
sched/rt: Use schedule_preempt_disabled()
sched/rt: Add schedule_preempt_disabled()
sched/rt: Do not throttle when PI boosting
sched/rt: Keep period timer ticking when rt throttling is active
...
Pull RCU changes for v3.4 from Ingo Molnar. The major features of this
series are:
- making RCU more aggressive about entering dyntick-idle mode in order
to improve energy efficiency
- converting a few more call_rcu()s to kfree_rcu()s
- applying a number of rcutree fixes and cleanups to rcutiny
- removing CONFIG_SMP #ifdefs from treercu
- allowing RCU CPU stall times to be set via sysfs
- adding CPU-stall capability to rcutorture
- adding more RCU-abuse diagnostics
- updating documentation
- fixing yet more issues located by the still-ongoing top-to-bottom
inspection of RCU, this time with a special focus on the CPU-hotplug
code path.
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
rcu: Stop spurious warnings from synchronize_sched_expedited
rcu: Hold off RCU_FAST_NO_HZ after timer posted
rcu: Eliminate softirq-mediated RCU_FAST_NO_HZ idle-entry loop
rcu: Add RCU_NONIDLE() for idle-loop RCU read-side critical sections
rcu: Allow nesting of rcu_idle_enter() and rcu_idle_exit()
rcu: Remove redundant check for rcu_head misalignment
PTR_ERR should be called before its argument is cleared.
rcu: Convert WARN_ON_ONCE() in rcu_lock_acquire() to lockdep
rcu: Trace only after NULL-pointer check
rcu: Call out dangers of expedited RCU primitives
rcu: Rework detection of use of RCU by offline CPUs
lockdep: Add CPU-idle/offline warning to lockdep-RCU splat
rcu: No interrupt disabling for rcu_prepare_for_idle()
rcu: Move synchronize_sched_expedited() to rcutree.c
rcu: Check for illegal use of RCU from offlined CPUs
rcu: Update stall-warning documentation
rcu: Add CPU-stall capability to rcutorture
rcu: Make documentation give more realistic rcutorture duration
rcutorture: Permit holding off CPU-hotplug operations during boot
rcu: Print scheduling-clock information on RCU CPU stall-warning messages
...
Currently pcpu_devices->panic_stack is passed to pcpu_delegate() in
smp_call_ipl_cpu(). This is wrong because pcpu_delegate() expects
the bottom (high address) of the stack and pcpu_devices->panic_stack
points to the top (low address). We now pass the bottom of the stack
which is pcpu_devices->panic_stack + PAGE_SIZE.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Stepan found:
CPU0 CPUn
_cpu_up()
__cpu_up()
boostrap()
notify_cpu_starting()
set_cpu_online()
while (!cpu_active())
cpu_relax()
<PREEMPT-out>
smp_call_function(.wait=1)
/* we find cpu_online() is true */
arch_send_call_function_ipi_mask()
/* wait-forever-more */
<PREEMPT-in>
local_irq_enable()
cpu_notify(CPU_ONLINE)
sched_cpu_active()
set_cpu_active()
Now the purpose of cpu_active is mostly with bringing down a cpu, where
we mark it !active to avoid the load-balancer from moving tasks to it
while we tear down the cpu. This is required because we only update the
sched_domain tree after we brought the cpu-down. And this is needed so
that some tasks can still run while we bring it down, we just don't want
new tasks to appear.
On cpu-up however the sched_domain tree doesn't yet include the new cpu,
so its invisible to the load-balancer, regardless of the active state.
So instead of setting the active state after we boot the new cpu (and
consequently having to wait for it before enabling interrupts) set the
cpu active before we set it online and avoid the whole mess.
Reported-by: Stepan Moskovchenko <stepanm@codeaurora.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1323965362.18942.71.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The first line of a stack dump has a wrong (no) indentation.
Just fix this after more than 10 years.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
In order to allow kdump based stand-alone dump, some information
has to be passed from the old kernel to the new dump kernel. This
is done via a the struct "os_info" that contains the following fields:
* crashkernel base and size
* reipl block
* vmcoreinfo
* init function
A pointer to os_info is stored at a well known storage location
and the whole structure as well as all fields are secured with
checksums.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use the new helper function introduced in commit 5e6292c0f2
("signal: add block_sigmask() for adding sigmask to current->blocked")
which centralises the code for updating current->blocked after
successfully delivering a signal and reduces the amount of duplicate
code across architectures.
In the past some architectures got this code wrong, so using this
helper function should stop that from happening again.
Cc: Oleg Nesterov <oleg@redhat.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Currently the following mechanisms are available to move active
Linux on System z instances between machines:
* z/VM 6.2 SSI (Single System Image)
* Suspend/resume
For moving Linux instances in this patch the term LGR (Linux Guest
Relocation) is used. Because such an operation is critical, it
should be detectable from Linux. With this patch for both, a live
system and a kernel dump, the information about LGRs is accessible.
To identify a guest, stsi and stfle data is used. A new function
lgr_info_log() compares the current data (lgr_info_cur) with the
last recorded one (lgr_info_last). In case the two data sets differ,
lgr_info_cur is logged to the "lgr" s390dbf.
The following trigger points call lgr_info_log():
* panic
* die
* kdump
* LGR timer
* PSW restart
* QDIO recovery
* resume
This patch also changes the s390dbf hex_ascii view. Now only printable ASCII
characters are shown.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The external interrupt handlers have a parameter called ext_int_code.
Besides the name this paramter does not only contain the ext_int_code
but in addition also the "cpu address" (POP) which caused the external
interrupt.
To make the code a bit more obvious pass a struct instead so the called
function can easily distinguish between external interrupt code and
cpu address. The cpu address field however is named "subcode" since
some external interrupt sources do not pass a cpu address but a
different parameter (or none at all).
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use the new copy_to_absolute_zero() function instead of manual "stura"
and "sturg" to make the code shorter and more readable.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Whenever the cpu loads an enabled wait PSW it will appear as idle to the
underlying host system. The code in default_idle calls vtime_stop_cpu
which does the necessary voodoo to get the cpu time accounting right.
The udelay code just loads an enabled wait PSW. To correct this rework
the vtime_stop_cpu/vtime_start_cpu logic and move the difficult parts
to entry[64].S, vtime_stop_cpu can now be called from anywhere and
vtime_start_cpu is gone. The correction of the cpu time during wakeup
from an enabled wait PSW is done with a critical section in entry[64].S.
As vtime_start_cpu is gone, s390_idle_check can be removed as well.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Define struct pcpu and merge some of the NR_CPUS arrays into it, including
__cpu_logical_map, current_set and smp_cpu_state. Split smp related
functions to those operating on physical cpus and the functions operating
on a logical cpu number. Make the functions for physical cpus use a
pointer to a struct pcpu. This hides the knowledge about cpu addresses in
smp.c, entry[64].S and swsusp_asm64.S, thus remove the sigp.h header.
The PSW restart mechanism is used to start secondary cpus, calling a
function on an online cpu, calling a function on the ipl cpu, and for
the nmi signal. Replace the different assembler functions with a
single function restart_int_handler. The new entry point calls a function
whose pointer is stored in the lowcore of the target cpu and it can wait
for the source cpu to stop. This covers all existing use cases.
Overall the code is now simpler and there are ~380 lines less code.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The 16 bit value at the lowcore location with offset 0x84 is the
cpu address that is associated with an external interrupt. Rename
the field from cpu_addr to ext_cpu_addr to make that clear.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
With gcc 4.6.0 we get a false compile warning:
arch/s390/kernel/setup.c: In function 'setup_arch':
arch/s390/kernel/setup.c:767:3: warning: 'msg' may be used
uninitialized in this function [-Wuninitialized]
arch/s390/kernel/setup.c:753:8: note: 'msg' was declared here
This patch makes gcc quiet.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Pull s390 fixes from Martin Schwidefsky
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
[S390] memory hotplug: prevent memory zone interleave
[S390] crash_dump: remove duplicate include
[S390] KEYS: Enable the compat keyctl wrapper on s390x
The major features of this series are:
- making RCU more aggressive about entering dyntick-idle mode in order to
improve energy efficiency
- converting a few more call_rcu()s to kfree_rcu()s
- applying a number of rcutree fixes and cleanups to rcutiny
- removing CONFIG_SMP #ifdefs from treercu
- allowing RCU CPU stall times to be set via sysfs
- adding CPU-stall capability to rcutorture
- adding more RCU-abuse diagnostics
- updating documentation
- fixing yet more issues located by the still-ongoing top-to-bottom
inspection of RCU, this time with a special focus on the
CPU-hotplug code path.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The new is_compat_task() define for the !COMPAT case in
include/linux/compat.h conflicts with a similar define in
arch/s390/include/asm/compat.h.
This is the minimal patch which fixes the build issues.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
arch/s390/kernel/crash_dump.c included 'linux/crash_dump.h' twice,
remove the duplicate.
Signed-off-by: Danny Kukawka <danny.kukawka@bisect.de>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The 'poll()' system call timeout parameter is supposed to be 'int', not
'long'.
Now, the reason this matters is that right now 32-bit compat mode is
broken on at least x86-64, because the 32-bit code just calls
'sys_poll()' directly on x86-64, and the 32-bit argument will have been
zero-extended, turning a signed 'int' into a large unsigned 'long'
value.
We could just introduce a 'compat_sys_poll()' function for this, and
that may eventually be what we have to do, but since the actual standard
poll() semantics is *supposed* to be 'int', and since at least on x86-64
glibc sign-extends the argument before invocing the system call (so
nobody can actually use a 64-bit timeout value in user space _anyway_,
even in 64-bit binaries), the simpler solution would seem to be to just
fix the definition of the system call to match what it should have been
from the very start.
If it turns out that somebody somehow circumvents the user-level libc
64-bit sign extension and actually uses a large unsigned 64-bit timeout
despite that not being how poll() is supposed to work, we will need to
do the compat_sys_poll() approach.
Reported-by: Thomas Meyer <thomas@m3y3r.de>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The call_rcu() in unregister_external_interrupt() invokes
ext_int_hash_update(), which just does a kfree(). Convert the
call_rcu() to kfree_rcu(), allowing ext_int_hash_update() to
be eliminated.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
The conversion of the ktime to a value suitable for the clock comparator
does not take changes to wall_to_monotonic into account. In fact the
conversion just needs the boot clock (sched_clock_base_cc) and the
total_sleep_time.
This is applicable to 3.2+ kernels.
CC: stable@vger.kernel.org
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Avoid calling wake_up() from our NMI "bottom halve" from RCU extended
quiescent state in idle. wake_up() has RCU read-side critical sections
but this will be completely ignored by RCU if the cpu is in extended
quiescent state.
Which means that whatever object is being accessed from within the
read-side critical section can be freed concurrently from a different
cpu.
So make sure we leave extended quiescent state before calling wake_up().
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The vmlinux file for s390 contains a currently unused entry point,
which is specified in two different locations: the linker script
and the makefile. As it happens both definitions are different and
the linker file is broken (_start does not exist) and the makefile
specifies an entry point which makes no sense (the SALIPL loader
entry point).
So lets get rid of one definition (the makefile) and use the entry
point of all other ipl methods (0x10000 -> startup) to be consistent.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit: (29 commits)
audit: no leading space in audit_log_d_path prefix
audit: treat s_id as an untrusted string
audit: fix signedness bug in audit_log_execve_info()
audit: comparison on interprocess fields
audit: implement all object interfield comparisons
audit: allow interfield comparison between gid and ogid
audit: complex interfield comparison helper
audit: allow interfield comparison in audit rules
Kernel: Audit Support For The ARM Platform
audit: do not call audit_getname on error
audit: only allow tasks to set their loginuid if it is -1
audit: remove task argument to audit_set_loginuid
audit: allow audit matching on inode gid
audit: allow matching on obj_uid
audit: remove audit_finish_fork as it can't be called
audit: reject entry,always rules
audit: inline audit_free to simplify the look of generic code
audit: drop audit_set_macxattr as it doesn't do anything
audit: inline checks for not needing to collect aux records
audit: drop some potentially inadvisable likely notations
...
Use evil merge to fix up grammar mistakes in Kconfig file.
Bad speling and horrible grammar (and copious swearing) is to be
expected, but let's keep it to commit messages and comments, rather than
expose it to users in config help texts or printouts.
Every arch calls:
if (unlikely(current->audit_context))
audit_syscall_entry()
which requires knowledge about audit (the existance of audit_context) in
the arch code. Just do it all in static inline in audit.h so that arch's
can remain blissfully ignorant.
Signed-off-by: Eric Paris <eparis@redhat.com>
The audit system previously expected arches calling to audit_syscall_exit to
supply as arguments if the syscall was a success and what the return code was.
Audit also provides a helper AUDITSC_RESULT which was supposed to simplify things
by converting from negative retcodes to an audit internal magic value stating
success or failure. This helper was wrong and could indicate that a valid
pointer returned to userspace was a failed syscall. The fix is to fix the
layering foolishness. We now pass audit_syscall_exit a struct pt_reg and it
in turns calls back into arch code to collect the return value and to
determine if the syscall was a success or failure. We also define a generic
is_syscall_success() macro which determines success/failure based on if the
value is < -MAX_ERRNO. This works for arches like x86 which do not use a
separate mechanism to indicate syscall failure.
We make both the is_syscall_success() and regs_return_value() static inlines
instead of macros. The reason is because the audit function must take a void*
for the regs. (uml calls theirs struct uml_pt_regs instead of just struct
pt_regs so audit_syscall_exit can't take a struct pt_regs). Since the audit
function takes a void* we need to use static inlines to cast it back to the
arch correct structure to dereference it.
The other major change is that on some arches, like ia64, MIPS and ppc, we
change regs_return_value() to give us the negative value on syscall failure.
THE only other user of this macro, kretprobe_example.c, won't notice and it
makes the value signed consistently for the audit functions across all archs.
In arch/sh/kernel/ptrace_64.c I see that we were using regs[9] in the old
audit code as the return value. But the ptrace_64.h code defined the macro
regs_return_value() as regs[3]. I have no idea which one is correct, but this
patch now uses the regs_return_value() function, so it now uses regs[3].
For powerpc we previously used regs->result but now use the
regs_return_value() function which uses regs->gprs[3]. regs->gprs[3] is
always positive so the regs_return_value(), much like ia64 makes it negative
before calling the audit code when appropriate.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: H. Peter Anvin <hpa@zytor.com> [for x86 portion]
Acked-by: Tony Luck <tony.luck@intel.com> [for ia64]
Acked-by: Richard Weinberger <richard@nod.at> [for uml]
Acked-by: David S. Miller <davem@davemloft.net> [for sparc]
Acked-by: Ralf Baechle <ralf@linux-mips.org> [for mips]
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [for ppc]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (31 commits)
[S390] disassembler: mark exception causing instructions
[S390] Enable exception traces by default
[S390] return address of compat signals
[S390] sysctl: get rid of dead declaration
[S390] dasd: fix fixpoint divide exception in define_extent
[S390] dasd: add sanity check to detect path connection error
[S390] qdio: fix kernel panic for zfcp 31-bit
[S390] Add s390x description to Documentation/kdump/kdump.txt
[S390] Add VMCOREINFO_SYMBOL(high_memory) to vmcoreinfo
[S390] dasd: fix expiration handling for recovery requests
[S390] outstanding interrupts vs. smp_send_stop
[S390] ipc: call generic sys_ipc demultiplexer
[S390] zcrypt: Fix error return codes.
[S390] zcrypt: Rework length parameter checking.
[S390] cleanup trap handling
[S390] Remove Kerntypes leftovers
[S390] topology: increase poll frequency if change is anticipated
[S390] entry[64].S improvements
[S390] make arch/s390 subdirectories depend on config option
[S390] kvm: move cmf host id constant out of lowcore
...
Fix up conflicts in arch/s390/kernel/{smp.c,topology.c} due to the
sysdev removal clashing with "topology: get rid of ifdefs" which moved
some of that code around.
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
reiserfs: Properly display mount options in /proc/mounts
vfs: prevent remount read-only if pending removes
vfs: count unlinked inodes
vfs: protect remounting superblock read-only
vfs: keep list of mounts for each superblock
vfs: switch ->show_options() to struct dentry *
vfs: switch ->show_path() to struct dentry *
vfs: switch ->show_devname() to struct dentry *
vfs: switch ->show_stats to struct dentry *
switch security_path_chmod() to struct path *
vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
vfs: trim includes a bit
switch mnt_namespace ->root to struct mount
vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
vfs: opencode mntget() mnt_set_mountpoint()
vfs: spread struct mount - remaining argument of next_mnt()
vfs: move fsnotify junk to struct mount
vfs: move mnt_devname
vfs: move mnt_list to struct mount
vfs: switch pnode.h macros to struct mount *
...
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (73 commits)
arm: fix up some samsung merge sysdev conversion problems
firmware: Fix an oops on reading fw_priv->fw in sysfs loading file
Drivers:hv: Fix a bug in vmbus_driver_unregister()
driver core: remove __must_check from device_create_file
debugfs: add missing #ifdef HAS_IOMEM
arm: time.h: remove device.h #include
driver-core: remove sysdev.h usage.
clockevents: remove sysdev.h
arm: convert sysdev_class to a regular subsystem
arm: leds: convert sysdev_class to a regular subsystem
kobject: remove kset_find_obj_hinted()
m86k: gpio - convert sysdev_class to a regular subsystem
mips: txx9_sram - convert sysdev_class to a regular subsystem
mips: 7segled - convert sysdev_class to a regular subsystem
sh: dma - convert sysdev_class to a regular subsystem
sh: intc - convert sysdev_class to a regular subsystem
power: suspend - convert sysdev_class to a regular subsystem
power: qe_ic - convert sysdev_class to a regular subsystem
power: cmm - convert sysdev_class to a regular subsystem
s390: time - convert sysdev_class to a regular subsystem
...
Fix up conflicts with 'struct sysdev' removal from various platform
drivers that got changed:
- arch/arm/mach-exynos/cpu.c
- arch/arm/mach-exynos/irq-eint.c
- arch/arm/mach-s3c64xx/common.c
- arch/arm/mach-s3c64xx/cpu.c
- arch/arm/mach-s5p64x0/cpu.c
- arch/arm/mach-s5pv210/common.c
- arch/arm/plat-samsung/include/plat/cpu.h
- arch/powerpc/kernel/sysfs.c
and fix up cpu_is_hotpluggable() as per Greg in include/linux/cpu.h
This resolves the conflict in the arch/arm/mach-s3c64xx/s3c6400.c file,
and it fixes the build error in the arch/x86/kernel/microcode_core.c
file, that the merge did not catch.
The microcode_core.c patch was provided by Stephen Rothwell
<sfr@canb.auug.org.au> who was invaluable in the merge issues involved
with the large sysdev removal process in the driver-core tree.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
cpu: Export cpu_up()
rcu: Apply ACCESS_ONCE() to rcu_boost() return value
Revert "rcu: Permit rt_mutex_unlock() with irqs disabled"
docs: Additional LWN links to RCU API
rcu: Augment rcu_batch_end tracing for idle and callback state
rcu: Add rcutorture tests for srcu_read_lock_raw()
rcu: Make rcutorture test for hotpluggability before offlining CPUs
driver-core/cpu: Expose hotpluggability to the rest of the kernel
rcu: Remove redundant rcu_cpu_stall_suppress declaration
rcu: Adaptive dyntick-idle preparation
rcu: Keep invoking callbacks if CPU otherwise idle
rcu: Irq nesting is always 0 on rcu_enter_idle_common
rcu: Don't check irq nesting from rcu idle entry/exit
rcu: Permit dyntick-idle with callbacks pending
rcu: Document same-context read-side constraints
rcu: Identify dyntick-idle CPUs on first force_quiescent_state() pass
rcu: Remove dynticks false positives and RCU failures
rcu: Reduce latency of rcu_prepare_for_idle()
rcu: Eliminate RCU_FAST_NO_HZ grace-period hang
rcu: Avoid needlessly IPIing CPUs at GP end
...
If an exception happens the PSW either points to the instruction that
caused the exception or to the instruction that follows the exception
causing instruction, depending on the exception type.
Since the inkernel disassembler adds a ">" in front of the disassembly
many people assume incorrectly that the instruction that is pointed to
must be the cause of the exception. To make people aware that this is
not necessarily the case add a different character in front of the
disassembled instruction that precedes the current instructions.
The output now looks like this:
Krnl PSW : 0704200180000000 0000000000120de8 (test_function+0x0/0x100)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:2 PM:0 EA:3
Krnl GPRS: 000003ff00000000 0000000000120de4 000000000091bb40 0000000000000001
000003fffd2ea000 0000000030fb7df8 0000000030fb7f10 000003ffffa113c8
000000000091bb40 000003fffd2ea000 0000000000000002 0000000030fb7f10
000000003f290240 0000000000606220 00000000002cfb5c 0000000030fb7d58
Krnl Code: 0000000000120ddc: b90400a9 lgr %r10,%r9
0000000000120de0: a7f4ff88 brc 15,120cf0
#0000000000120de4: a7f40001 brc 15,120de6
>0000000000120de8: a7f13f80 tmll %r15,16256
0000000000120dec: eb8ff0580024 stmg %r8,%r15,88(%r15)
0000000000120df2: a7840001 brc 8,120df4
0000000000120df6: b90400ef lgr %r14,%r15
0000000000120dfa: a7fbffb8 aghi %r15,-72
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Enable exception traces by default so that early user space breakage
(e.g. broken code in initrd) can be easily indentified.
If not needed afterwards it can be disabled by writing '0' in one of
these two files:
/proc/sys/kernel/userprocess_debug
/proc/sys/debug/exception-trace
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
A 31-bit kernel always sets the high order bit in the return address
for a signal handler.
git commit d4e81b35b8 "[S390] allow all addressing modes" makes
sure that the high order bit is set in the signal return address for
standard signals of a 31-bit compat process but fails to do the same
for real-time signals. To make things consistent the bit needs to be
set by setup_rt_frame32 as well.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Currently the vmalloc_start address (or better end of real memory) for s390x
is obtained by makedumpfile using vmlist.addr symbol, which is not correct.
The correct vmalloc_start address can be obtained using 'high_memory' symbol.
This patch adds the high_memory symbol to vmcoreinfo.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The panic function will first print the panic message to the console,
then stop additional cpus with smp_send_stop and finally call the
function on the panic notifier list.
In case of an I/O based console the panic message will cause I/O to
be started and a function on the panic notifier list will wait for the
completion of the I/O. That does not work if an I/O completion interrupt
has already been delivered to a cpu that is then stopped by smp_send_stop.
To break this cyclic dependency add code to smp_send_stop that gives
the additional cpu the opportunity to complete outstanding interrupts.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Call generic IPC demultiplexer instead of having a nearly identical
s390 variant. Also make sure that native and compat handling now have
the same behaviour.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Move the program interruption code and the translation exception identifier
to the pt_regs structure as 'int_code' and 'int_parm_long' and make the
first level interrupt handler in entry[64].S store the two values. That
makes it possible to drop 'prot_addr' and 'trap_no' from the thread_struct
and to reduce the number of arguments to a lot of functions. Finally
un-inline do_trap. Overall this saves 5812 bytes in the .text section of
the 64 bit kernel.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Increase cpu topology change poll frequency if a change is anticipated.
Otherwise a user might be a bit confused to have to wait up to a minute
in order to see a change this should be visible immediatly.
However there is no guarantee that the change will happen during the
time frame the poll frequency is increased.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Another round of cleanup for entry[64].S, in particular the program check
handler looks more reasonable now. The code size for the 31 bit kernel
has been reduced by 616 byte and by 528 byte for the 64 bit version.
Even better the code is a bit faster as well.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
There is no reason for the cpu-measurement-facility host id constant to
reside in the lowcore where space is precious. Use an entry in the literal
pool in HANDLE_SIE_INTERCEPT and a stack slot in sie64a.
While we are at it replace the id -1 with 0 to indicate host execution.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cleanup z10 topology handling. This adds some more code but hopefully
the result is more readable and easier to maintain.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Remove all ifdefs from topology code and also only compile it for the
CONFIG_SCHED_BOOK case. The new code selects SCHED_MC if SCHED_BOOK is
selected. SCHED_MC without SCHED_BOOK is not possible anymore.
Furthermore various sysfs attributes are not available anymore for the
!SCHED_BOOK case. In particular all attributes that correspond to
CPU polarization.
But since all real world kernels have SCHED_BOOK selected anyway this
doesn't matter too much.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Currently, when smp_switch_to_ipl_cpu() is done, the backchain in the dump
analysis tool crash looks like the following:
#0 [1f746e70] __machine_kexec at 11dd92
#1 [1f746eb8] smp_restart_cpu at 11820e
#0 [00907eb0] cpu_idle at 10602e
#1 [00907ef8] start_kernel at 979a08
It would be good to see the registers of the interrupted function.
To achieve this, the backchain on the new stack has to be set to zero.
This looks then like the following:
#0 [1f746e70] __machine_kexec at 11dd8e
#1 [1f746eb8] smp_restart_cpu at 11820a
PSW: 0706000180000000 00000000005c6fe6 (vtime_stop_cpu+134)
GPRS: 0000000000000000 00000000005c6fe6 0000000001ad0228 0000000001ad0248
0000000000907f08 0000000001ad0b40 0000000000979344 0000000000000000
00000000009c0000 00000000009c0010 00000000009ab024 0000000001ad0200
0000000001ad0238 00000000005cc9d8 000000000010602e 0000000000907e68
#0 [00907eb0] cpu_idle at 10602e
#1 [00907ef8] start_kernel at 979a08
In addition to this, now also the correct PSW is stored in the pt_regs
structure that is located at the start of the panic stack.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The kernel address space of a 64 bit kernel currently uses a three level
page table and the vmemmap array has a fixed address and a fixed maximum
size. A three level page table is good enough for systems with less than
3.8TB of memory, for bigger systems four page table levels need to be
used. Each page table level costs a bit of performance, use 3 levels for
normal systems and 4 levels only for the really big systems.
To avoid bloating sparse.o too much set MAX_PHYSMEM_BITS to 46 for a
maximum of 64TB of memory.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This patch makes the create_mem_hole() function more readable and
fixes some minor bugs (e.g. off-by-one problems).
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The current code in setup_boot_command_line() uses a heuristic to
detect an EBCDIC command line. It checks if any of the bytes in
the command line has bit one (0x80) set. In that case it is assumed
that we have an EBCDIC string and the complete command line is
converted.
On s390 there are cases where the boot loader provides a kernel
command line that is NULL terminated, but has random data after
the NULL termination. In that case, setup_boot_command_line()
might misinterpret an ASCII string for an EBCDIC string. A
subsequent string conversion can then damage the ASCII string.
This patch solves the problem by checking for NULL termination.
If no EBCDIC character has been found until the the NULL
termination has been found, we now assume that we have an ASCII
string.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Mask the extint_code parameter of the smp external interrupt handler
to get the interruption code. Otherwise emergency call interrupts
erroneously might be accounted as emergency signal interrupts.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
and converts the devices to regular devices. The sysdev drivers are
implemented as subsystem interfaces now.
After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.
Userspace relies on events and generic sysfs subsystem infrastructure
from sysdev devices, which are made available with this conversion.
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@amd64.org>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Len Brown <lenb@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Those two APIs were provided to optimize the calls of
tick_nohz_idle_enter() and rcu_idle_enter() into a single
irq disabled section. This way no interrupt happening in-between would
needlessly process any RCU job.
Now we are talking about an optimization for which benefits
have yet to be measured. Let's start simple and completely decouple
idle rcu and dyntick idle logics to simplify.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
It is assumed that rcu won't be used once we switch to tickless
mode and until we restart the tick. However this is not always
true, as in x86-64 where we dereference the idle notifiers after
the tick is stopped.
To prepare for fixing this, add two new APIs:
tick_nohz_idle_enter_norcu() and tick_nohz_idle_exit_norcu().
If no use of RCU is made in the idle loop between
tick_nohz_enter_idle() and tick_nohz_exit_idle() calls, the arch
must instead call the new *_norcu() version such that the arch doesn't
need to call rcu_idle_enter() and rcu_idle_exit().
Otherwise the arch must call tick_nohz_enter_idle() and
tick_nohz_exit_idle() and also call explicitly:
- rcu_idle_enter() after its last use of RCU before the CPU is put
to sleep.
- rcu_idle_exit() before the first use of RCU after the CPU is woken
up.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The tick_nohz_stop_sched_tick() function, which tries to delay
the next timer tick as long as possible, can be called from two
places:
- From the idle loop to start the dytick idle mode
- From interrupt exit if we have interrupted the dyntick
idle mode, so that we reprogram the next tick event in
case the irq changed some internal state that requires this
action.
There are only few minor differences between both that
are handled by that function, driven by the ts->inidle
cpu variable and the inidle parameter. The whole guarantees
that we only update the dyntick mode on irq exit if we actually
interrupted the dyntick idle mode, and that we enter in RCU extended
quiescent state from idle loop entry only.
Split this function into:
- tick_nohz_idle_enter(), which sets ts->inidle to 1, enters
dynticks idle mode unconditionally if it can, and enters into RCU
extended quiescent state.
- tick_nohz_irq_exit() which only updates the dynticks idle mode
when ts->inidle is set (ie: if tick_nohz_idle_enter() has been called).
To maintain symmetry, tick_nohz_restart_sched_tick() has been renamed
into tick_nohz_idle_exit().
This simplifies the code and micro-optimize the irq exit path (no need
for local_irq_save there). This also prepares for the split between
dynticks and rcu extended quiescent state logics. We'll need this split to
further fix illegal uses of RCU in extended quiescent states in the idle
loop.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Hans-Christian Egtvedt <hans-christian.egtvedt@atmel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
s390 used early_node_map[] just to prime free_area_init_nodes(). Now
memblock can be used for the same purpose and early_node_map[] is
scheduled to be dropped. Use memblock instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux-s390@vger.kernel.org
git commit 20b40a794b "signal race with restarting system calls"
added code to the poke_user/poke_user_compat to reset the system call
restart information in the thread-info if the PSW address is changed.
The purpose of that change has been to workaround old gdbs that do
not know about the REGSET_SYSTEM_CALL. It turned out that this is not
a good idea, it makes the behaviour of the debuggee dependent on the
order of specific ptrace call, e.g. the REGSET_SYSTEM_CALL register
set needs to be written last. And the workaround does not really fix
old gdbs, inferior calls on interrupted restarting system calls do not
work either way.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The last breaking event address is a read-only value, the regset misses the
.set function. If a PTRACE_SETREGSET is done for NT_S390_LAST_BREAK we
get an oops due to a branch to zero:
Kernel BUG at 0000000000000002 verbose debug info unavailable
illegal operation: 0001 #1 SMP
...
Call Trace:
(<0000000000158294> ptrace_regset+0x184/0x188)
<00000000001595b6> ptrace_request+0x37a/0x4fc
<0000000000109a78> arch_ptrace+0x108/0x1fc
<00000000001590d6> SyS_ptrace+0xaa/0x12c
<00000000005c7a42> sysc_noemu+0x16/0x1c
<000003fffd5ec10c> 0x3fffd5ec10c
Last Breaking-Event-Address:
<0000000000158242> ptrace_regset+0x132/0x188
Add a nop .set function to prevent the branch to zero.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: stable@kernel.org
The TIF_SYSCALL bit needs to be cleared if the debugger changes the state
of the ptraced process in regard to the presence of a system call.
Otherwise the system call will be restarted although the debugger set up
an inferior call.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
In order to have the same behavior for kdump based stand-alone dump
as for the kexec method, the is_kdump_kernel() check (only true for
the kexec method) has to be replaced by the OLDMEM_BASE check (true
for both methods).
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Make sure that all cpus in a book on a z10 appear as book siblings
and not as core siblings. This fixes some performance regressions that
appeared after the book scheduling domain got introduced.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
In ESA mode STCKF is not defined even if the facility bit is enabled.
To prevent an illegal operation we must also check if we run a 64 bit kernel.
To make the check perform well add the STCKF bit to the machine flags.
Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
When the kernel is started in kdump mode, zfcpdump should not be
initialized because both dump methods can't be used at the same time.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
'readelf -n' on the s390 vmlinux file generates lots of warnings about
corrupt notes. The reason is that the 'NOTE' program header has incorrect
file and memory sizes. The problem is that the section following the
NOTES section do not switch to a different phdr and they get added to
the NOTE program section. Add a dummy entry to the linker script that
switches to the data phdr before the start of the RODATA section.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* 'upstream/jump-label-noearly' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
jump-label: initialize jump-label subsystem much earlier
x86/jump_label: add arch_jump_label_transform_static()
s390/jump-label: add arch_jump_label_transform_static()
jump_label: add arch_jump_label_transform_static() to optimise non-live code updates
sparc/jump_label: drop arch_jump_label_text_poke_early()
x86/jump_label: drop arch_jump_label_text_poke_early()
jump_label: if a key has already been initialized, don't nop it out
stop_machine: make stop_machine safe and efficient to call early
jump_label: use proper atomic_t initializer
Conflicts:
- arch/x86/kernel/jump_label.c
Added __init_or_module to arch_jump_label_text_poke_early vs
removal of that function entirely
- kernel/stop_machine.c
same patch ("stop_machine: make stop_machine safe and efficient
to call early") merged twice, with whitespace fix in one version
Currently it can happen that the pre-allocated ELF header contains a wrong
memory map which would result in errors when copying /proc/vmcore.
In order to still get a valid vmcore, we (temporarily) disable the error
checking in copy_oldmem_page(). This will then produce zero pages for those
memory regions.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
We use both the external call and emergency call IPIs to signal remote
cpus. Therefore it makes sense to account them differently withing
/proc/irqstats so we actually know what happened.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Fix three sparse warnings in math-emu / sysinfo:
arch/s390/kernel/sysinfo.c:448:17: error: return expression in void function
arch/s390/kernel/sysinfo.c:445:25: warning: shift too big (32) for type unsigned int
arch/s390/kernel/sysinfo.c:445:25: warning: shift too big (32) for type unsigned int
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Remove unnecessary code to avoid false positives from sparse, e.g.
arch/s390/kernel/compat_signal.c:221:61: warning: invalid access past the end of 'set32' (8 8)
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On sie_fault we need to switch back to user ASCE. Otherwise we get
interresting effects when exiting to "userspace" while the guest
space is still active.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use a sigp sense running to decide which signal processor order to use
for an ipi. If the target cpu is running use external call, if the target
cpu is not running use emergency signal.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add support for CHSC I/O interrupt statistics in /proc/interrupts.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The user space program can change its addressing mode between the
24-bit, 31-bit and the 64-bit mode if the kernel is 64 bit. Currently
the kernel always forces the standard amode on signal delivery and
signal return and on ptrace: 64-bit for a 64-bit process, 31-bit for
a compat process and 31-bit kernels. Change the signal and ptrace code
to allow the full range of addressing modes. Signal handlers are
run in the standard addressing mode for the process.
One caveat is that even an 31-bit compat process can switch to the
64-bit mode. The next signal will switch back into the 31-bit mode
and there is no room in the 31-bit compat signal frame to store the
information that the program came from the 64-bit mode.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Split out addressing mode bits from PSW_BASE_BITS, rename PSW_BASE_BITS
to PSW_MASK_BASE, get rid of psw_user32_bits, remove unused function
enabled_wait(), introduce PSW_MASK_USER, and drop PSW_MASK_MERGE macros.
Change psw_kernel_bits / psw_user_bits to contain only the bits that
are always set in the respective mode.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add an explicit TIF_SYSCALL bit that indicates if a task is inside
a system call. The svc_code in the pt_regs structure is now only
valid if TIF_SYSCALL is set. With this definition TIF_RESTART_SVC
can be replaced with TIF_SYSCALL. Overall do_signal is a bit more
readable and it saves a few lines of code.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
An instruction with an address right below the adress limit for the
current addressing mode will wrap. The instruction restart logic in
the protection fault handler and the signal code need to follow the
wrapping rules to find the correct instruction address.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
For a ERESTARTNOHAND/ERESTARTSYS/ERESTARTNOINTR restarting system call
do_signal will prepare the restart of the system call with a rewind of
the PSW before calling get_signal_to_deliver (where the debugger might
take control). For A ERESTART_RESTARTBLOCK restarting system call
do_signal will set -EINTR as return code.
There are two issues with this approach:
1) strace never sees ERESTARTNOHAND, ERESTARTSYS, ERESTARTNOINTR or
ERESTART_RESTARTBLOCK as the rewinding already took place or the
return code has been changed to -EINTR
2) if get_signal_to_deliver does not return with a signal to deliver
the restart via the repeat of the svc instruction is left in place.
This opens a race if another signal is made pending before the
system call instruction can be reexecuted. The original system call
will be restarted even if the second signal would have ended the
system call with -EINTR.
These two issues can be solved by dropping the early rewind of the
system call before get_signal_to_deliver has been called and by using
the TIF_RESTART_SVC magic to do the restart if no signal has to be
delivered. The only situation where the system call restart via the
repeat of the svc instruction is appropriate is when a SA_RESTART
signal is delivered to user space.
Unfortunately this breaks inferior calls by the debugger again. The
system call number and the length of the system call instruction is
lost over the inferior call and user space will see ERESTARTNOHAND/
ERESTARTSYS/ERESTARTNOINTR/ERESTART_RESTARTBLOCK. To correct this a
new ptrace interface is added to save/restore the system call number
and system call instruction length.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Remove the save_area_64 field from the 0xe00 - 0xf00 area in the lowcore.
Use a free slot in the save_area array instead.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This patch implements the crash_map_pages() function for s390.
KEXEC_CRASH_MEM_ALIGN is set to HPAGE_SIZE, in order to support
kernel mappings that use large pages. We also use HPAGE_SIZE alignment
for CONFIG_HUGETLB_PAGE=n in order to have the same 1 MiB alignment on
all s390 systems.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This patch defines for s390 an ABI defined pointer to the vmcoreinfo note at
a well known address. With this patch tools are able to find this information
in dumps created by stand-alone or hypervisor dump tools.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This patch provides the architecture specific part of the s390 kdump
support.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
PSW restart can be triggered on offline CPUs. If this happens, currently
the PSW restart code fails, because functions like smp_processor_id()
do not work on offline CPUs. This patch fixes this as follows:
If PSW restart is triggered on an offline CPU, the PSW restart (sigp restart)
is done a second time on another CPU that is online and the old CPU is
stopped afterwards.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use the ENTRY macro for the system call wrapper sys_setns_wrapper
similarly to the other wrappers.
Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
git commit 5e9a2692 "[S390] ptrace cleanup" introduced a regression
for the case when both a user PER set (e.g. a storage alteration trace) and
PTRACE_SINGLESTEP are active. The new code will overrule the user PER set
with a instruction-fetch PER set over the whole address space for ptrace
single stepping. The inferior process will be stopped after each instruction
with an instruction fetch event. Any other events that may have occurred
concurrently are not reported (e.g. storage alteration event) because the
control bits for them are not set. The solution is to merge the PER control
bits of the user PER set with the PER_EVENT_IFETCH control bit for
PTRACE_SINGLESTEP.
Cc: stable@kernel.org
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Fix this warning:
WARNING: vmlinux.o(.text+0x199b6): Section mismatch in reference from
the function alloc_masks() to the function .init.text:__alloc_bootmem()
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Current IRQ statistics support does not show detail counts for I/O
interrupts which are processed internally only. The result is a
summation count which is way off such as this one:
CPU0 CPU1 CPU2
I/O: 1331 710 442
[...]
QAI: 15 16 16 [I/O] QDIO Adapter Interrupt
QDI: 1 0 0 [I/O] QDIO Interrupt
DAS: 706 645 381 [I/O] DASD
C15: 26 10 0 [I/O] 3215
C70: 0 0 0 [I/O] 3270
TAP: 0 0 0 [I/O] Tape
VMR: 0 0 0 [I/O] Unit Record Devices
LCS: 0 0 0 [I/O] LCS
CLW: 0 0 0 [I/O] CLAW
CTC: 0 0 0 [I/O] CTC
APB: 0 0 0 [I/O] AP Bus
Fix this by moving I/O interrupt accounting into the common I/O layer.
Signed-off-by: Peter Oberparleiter <peter.oberparleiter@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
time, s390: Get rid of compile warning
dw_apb_timer: constify clocksource name
time: Cleanup old CONFIG_GENERIC_TIME references that snuck in
time: Change jiffies_to_clock_t() argument type to unsigned long
alarmtimers: Fix error handling
clocksource: Make watchdog reset lockless
posix-cpu-timers: Cure SMP accounting oddities
s390: Use direct ktime path for s390 clockevent device
clockevents: Add direct ktime programming function
clockevents: Make minimum delay adjustments configurable
nohz: Remove "Switched to NOHz mode" debugging messages
proc: Consider NO_HZ when printing idle and iowait times
nohz: Make idle/iowait counter update conditional
nohz: Fix update_ts_time_stat idle accounting
cputime: Clean up cputime_to_usecs and usecs_to_cputime macros
alarmtimers: Rework RTC device selection using class interface
alarmtimers: Add try_to_cancel functionality
alarmtimers: Add more refined alarm state tracking
alarmtimers: Remove period from alarm structure
alarmtimers: Remove interval cap limit hack
...
This allows jump-label entries to be cheaply updated on code which is
not yet live.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jan Glauber <jang@linux.vnet.ibm.com>
For s390 there is one additional byte associated with each page,
the storage key. This byte contains the referenced and changed
bits and needs to be included into the hibernation image.
If the storage keys are not restored to their previous state all
original pages would appear to be dirty. This can cause
inconsistencies e.g. with read-only filesystems.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
"s390: Use direct ktime path for s390 clockevent device" in linux-next
introduces this compile warning:
arch/s390/kernel/time.c: In function 's390_next_ktime':
arch/s390/kernel/time.c:118:2: warning:
comparison of distinct pointer types lacks a cast [enabled by default]
Just use a u64 instead of an s64 variable. This is not a problem since it
will always contain a positive value.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Link: http://lkml.kernel.org/r/1316675957-5538-1-git-send-email-heiko.carstens@de.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
598841ca99 ([S390] use gmap address
spaces for kvm guest images) changed kvm to use a separate address
space for kvm guests. This address space was switched in __vcpu_run
In some cases (preemption, page fault) there is the possibility that
this address space switch is lost.
The typical symptom was a huge amount of validity intercepts or
random guest addressing exceptions.
Fix this by doing the switch in sie_loop and sie_exit and saving the
address space in the gmap structure itself. Also use the preempt
notifier.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
The clock comparator on s390 uses the same format as the TOD clock.
If the value in the clock comparator is smaller than the current TOD
value an interrupt is pending. Use the CLOCK_EVT_FEAT_KTIME feature
to get the unmodified ktime of the next clockevent expiration and
use it to program the clock comparator without querying the TOD clock.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: john stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/20110823133143.153017933@de.ibm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The nfsservctl system call is now gone, so we should remove all
linkage for it.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The main purpose for PSW restart will be kdump. Therefore customers will
issue "system restart" for creating a dump. If kdump is not enabled,
currently "PSW restart" will reboot the system and then no dump can
be created any more. In order to still allow a manual stand-alone dump in
the case a user issues "PSW restart" on a system that has not enabled
kdump we now stop the system.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
reipl_fcp_kset was just initialized, so it appears that it should be tested
instead of reipl_kset.
Signed-off-by: Julia Lawall <julia@diku.dk>
Reported-by: Suman Saha <sumsaha@gmail.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
When IPL'ing from a block device and an NSS should be created we must
make sure that the kernel image and the initrd are in different 1MB
segments. Otherwise creating the NSS will fail.
So we make sure the initrd is 4MB behind the end of the kernel image
like we do already when IPL via the VM reader is performed.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
We should call set_restore_sigmask() instead of directly setting
TIF_RESTORE_SIGMASK. This change should have been done three years
earlier... see 4e4c22 "signals: add set_restore_sigmask".
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Remove pointless comments in startup_secondary(). There is not too much
value in having comments like e.g. "call cpu notifiers" just before a
call to notify_cpu*().
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
This is the same as fd8a7de1 "x86: cpu-hotplug: Prevent softirq wakeup
on wrong CPU".
Unlike on x86 this doesn't fix a bug on s390 since we do not have
threaded interrupt handlers. However we want to keep the same
initialization order like on x86. This should prevent bugs caused by
code which assumes (and relies on) the init order is the same on each
architecture.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Because of readability reasons we ignore the 80 character line limit
in asm offsets. Just one line per define, nothing else.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
The diagnose 308 call is the prefered method for clearing all ongoing I/O.
Therefore if it is available we use it instead of doing a manual reset.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
For kdump we need a store status function to save the registers for the
current CPU. Therefore this patch exports a function "store_status()".
In addition to that now also floating point registers are saved correctly.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
With this patch a new S390 shutdown trigger "restart" is added. If under
z/VM "systerm restart" is entered or under the HMC the "PSW restart" button
is pressed, the PSW located at 0 (31 bit) or 0x1a0 (64 bit) bit is loaded.
Now we execute do_restart() that processes the restart action that is
defined under /sys/firmware/shutdown_actions/on_restart. Currently the
following actions are possible: reipl (default), stop, vmcmd, dump, and
dump_reipl.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes all the module loader hook implementations in the
architecture specific code where the functionality is the same as that
now provided by the recently added default hooks.
Signed-off-by: Jonas Bonn <jonas@southpole.se>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Provide additional information on SIGTRAP by using a sig_info signal.
Use TRAP_BRKPT for breakpoints via illegal operation and TRAP_HWBKPT
for breakpoints via program event recording. Provide the address of
the instruction that caused the breakpoint via si_addr.
While we are at it get rid of tracehook_consider_fatal_signal.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The cpu measurement alerts that are used for instance by oprofile
for hardware sampling are not turned off on a cpu that is going
offline. Add the appropriate control register bit that should be
disabled to the list.
Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Do not set the cr0 enablement bit for iucv by default in head[31|64].S,
move the enablement to iucv_init in the iucv base layer.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The (un-)register_external_interrupt functions are not race safe if
more than one interrupt handler is added or deleted for an external
interrupt concurrently.
Make the registration / unregistration of external interrupts race safe
by using RCU and a spinlock. RCU is used to avoid a performance penalty
in the external interrupt handler, the register and unregister functions
are protected by the spinlock and are not performance critical.
call_rcu must be used since the SCLP driver uses the interface with
IRQs disabled. Also use the generic list implementation rather than
homebrewn list code.
Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add code that allows KVM to control the virtual memory layout that
is seen by a guest. The guest address space uses a second page table
that shares the last level pte-tables with the process page table.
If a page is unmapped from the process page table it is automatically
unmapped from the guest page table as well.
The guest address space mapping starts out empty, KVM can map any
individual 1MB segments from the process virtual memory to any 1MB
aligned location in the guest virtual memory. If a target segment in
the process virtual memory does not exist or is unmapped while a
guest mapping exists the desired target address is stored as an
invalid segment table entry in the guest page table.
The population of the guest page table is fault driven.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The alignment is missing for various global symbols in s390 assembly code.
With a recent gcc and an instruction like stgrl this can lead to a
specification exception if the instruction uses such a mis-aligned address.
Specify the alignment explicitely and while add it define __ALIGN for s390
and use the ENTRY define to save some lines of code.
Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The entry to / exit from sie has subtle dependencies to the first level
interrupt handler. Move the sie assembler code to entry64.S and replace
the SIE_HOOK callback with a test and the new _TIF_SIE bit.
In addition this patch fixes several problems in regard to the check for
the_TIF_EXIT_SIE bits. The old code checked the TIF bits before executing
the interrupt handler and it only modified the instruction address if it
pointed directly to the sie instruction. In both cases it could miss
a TIF bit that normally would cause an exit from the guest and would
reenter the guest context.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits)
ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction
connector: add an event for monitoring process tracers
ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED
ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task()
ptrace_init_task: initialize child->jobctl explicitly
has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/
ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop
ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/
ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented()
ptrace: ptrace_reparented() should check same_thread_group()
redefine thread_group_leader() as exit_signal >= 0
do not change dead_task->exit_signal
kill task_detached()
reparent_leader: check EXIT_DEAD instead of task_detached()
make do_notify_parent() __must_check, update the callers
__ptrace_detach: avoid task_detached(), check do_notify_parent()
kill tracehook_notify_death()
make do_notify_parent() return bool
ptrace: s/tracehook_tracer_task()/ptrace_parent()/
...
At this point, tracehooks aren't useful to mainline kernel and mostly
just add an extra layer of obfuscation. Although they have comments,
without actual in-kernel users, it is difficult to tell what are their
assumptions and they're actually trying to achieve. To mainline
kernel, they just aren't worth keeping around.
This patch kills the following trivial tracehooks.
* Ones testing whether task is ptraced. Replace with ->ptrace test.
tracehook_expect_breakpoints()
tracehook_consider_ignored_signal()
tracehook_consider_fatal_signal()
* ptrace_event() wrappers. Call directly.
tracehook_report_exec()
tracehook_report_exit()
tracehook_report_vfork_done()
* ptrace_release_task() wrapper. Call directly.
tracehook_finish_release_task()
* noop
tracehook_prepare_release_task()
tracehook_report_death()
This doesn't introduce any behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
The bit shift operation in smp_ctl_set_bit does not specify the type
of the shifted bit so integer is used as default. Therefore it is not
possible to set bits in the upper 32 bit of the control register if
the kernel runs in 64 bit mode. Fix this by specifying the type as
unsigned long.
Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
32bit and 64bit on x86 are tested and working. The rest I have looked
at closely and I can't find any problems.
setns is an easy system call to wire up. It just takes two ints so I
don't expect any weird architecture porting problems.
While doing this I have noticed that we have some architectures that are
very slow to get new system calls. cris seems to be the slowest where
the last system calls wired up were preadv and pwritev. avr32 is weird
in that recvmmsg was wired up but never declared in unistd.h. frv is
behind with perf_event_open being the last syscall wired up. On h8300
the last system call wired up was epoll_wait. On m32r the last system
call wired up was fallocate. mn10300 has recvmmsg as the last system
call wired up. The rest seem to at least have syncfs wired up which was
new in the 2.6.39.
v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
v4: Moved wiring up of the system call to another patch
v5: ported to v2.6.39-rc6
v6: rebased onto parisc-next and net-next to avoid syscall conflicts.
v7: ported to Linus's latest post 2.6.39 tree.
> arch/blackfin/include/asm/unistd.h | 3 ++-
> arch/blackfin/mach-common/entry.S | 1 +
Acked-by: Mike Frysinger <vapier@gentoo.org>
Oh - ia64 wiring looks good.
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge irq.c and s390_ext.c into irq.c. That way all external interrupt
related functions are together.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Interrupt sources like pfault, sclp, dasd_diag and virtio all use the
service signal external interrupt subclass mask in control register 0
to enable and disable the corresponding interrupt.
Because no reference counting is implemented each subsystem thinks it
is the only user of subclass and sets and clears the bit like it wants.
This leads to case that unloading the dasd diag module under z/VM
causes both sclp and pfault interrupts to be masked. The result will
be locked up system sooner or later.
Fix this by introducing a new way to set (register) and clear
(unregister) the service signal subclass mask bit in cr0.
Also convert all drivers.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: Unify input section names
percpu: Avoid extra NOP in percpu_cmpxchg16b_double
percpu: Cast away printk format warning
percpu: Always align percpu output section to PAGE_SIZE
Fix up fairly trivial conflict in arch/x86/include/asm/percpu.h as per Tejun
When disabling a cpu all external interrupt subclass masks in control
register 0 get cleared. However instead of the service signal subclass
mask bit an unused bit got cleared.
Accidently (or luckily) the service subclass mask gets cleared with the
pfault_fini() call that happens just before the rest of the subclass
mask bits get cleared.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Remove unsused includes from arch/s390/kernel/process.c.
Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On cpu hot remove a PFAULT CANCEL command is sent to the hypervisor
which in turn will cancel all outstanding pfault requests that have
been issued on that cpu (the same happens with a SIGP cpu reset).
The result is that we end up with uninterruptible processes where
the interrupt that would wake up these processes never arrives.
In order to solve this all processes which wait for a pfault
completion interrupt get woken up after a cpu hot remove. The worst
case that could happen is that they fault again and in turn need to
wait again.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>