This fixes two minor bugs: error handling in vhost,
and capability processing in virtio.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJW1w/9AAoJECgfDbjSjVRpOwMH/10g1iyVAF7FR3RgAgcutROy
mt7RNRGhcJrccB3imWvA5llHfpEQe6b2IGRRjLpy3cHOqoytSJ8F9GbAv/Ya6rBY
ZNsapZkEacLX3Byi/Tll9C6pP7eCHswveBwreXVYpxerTuViorqU0RQXHQCY1nBa
jAHr7eHp7PlSjAKlwUX181/cDKF35VMchCE+NfSmgDFnkOMb8E3TVXqV1wuSZSaf
Ci2IgULrQQC2rzxFg/ZQePweSP9cBrhpX3c3VkVa/N0Io4DOQWLg2KUbwzrs/mel
i4rlKngwwRO0rIsWU+J5hMq4Vqg+Zv6mVQGm3FFAJWle03rUOZPYFCPwbKA6HNU=
=HeDc
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull minor virtio/vhost fixes from Michael Tsirkin:
"This fixes two minor bugs: error handling in vhost, and capability
processing in virtio"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: fix error path in vhost_init_used()
virtio-pci: read the right virtio_pci_notify_cap field
* fix hang caused by fbconsole blink timer
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJW2CSsAAoJEPo9qoy8lh71hjcQAK071t3TPSzKc1+GmLy2yVnV
suaT3C+0uY1yha/V8kRUrwUA4p3wRSHDvr/jF1SSSTgc4J8mY/1mxXYIFmAxfCyb
t/6fKTvHqDFHtCTMCM5RXK+3cR/zTu74KqLmMRllkMa4da7Iyr5fBBnfdryo1yjg
Brj++PANYKXT1B7HR9vrUAL7sU+2y976/oGtAtqTSfRuo4rppXh3PFaNZHJFgdwI
T8wkVIOPJHt9WdSnmXkVzMYeeXRp4mHNev2+tD7OvcH/b21Y+b1aO/ebIbeXWk+T
Or3SWj/5Ckje368NMvIZKywuGcRA9QHMDNqDhBU+pIF8D8Z0TcSd9spV0v94/Z/l
1DK1NROcE0rjcJvCFjhLcgXX9upsVlEQKDR94SsbkgJAv/ToXP2umoUCu8VXlO7Y
tqvW8aU06cpi4/UXBq61riqVP2pOz6mYGWMFHw26tFpmAq3+NvFvW/Zp1UI0pMW/
k4hbWDetmxPJA0T1UJcRHLHeumkEsyHTUgMFbD9UcM6YGs6tbsHw2aWgbEAGYMIa
TMnM+vi+xRWThmUN558SBj7fZ17x/H7pJJllNvL6myn514BPBTmpmGCXQFxqOlc3
/6AQEdYOMhG5z3axqCT+Hbjg8SQMeY7gAROU/Ql7VETJZZsRtIdl2Q/bynsjO06S
8HOemQdC2DZbXRHCdb5t
=8zQX
-----END PGP SIGNATURE-----
Merge tag 'fbdev-fixes-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux
Pull fbdev fix from Tomi Valkeinen:
"Fix hang caused by fbconsole blink timer"
* tag 'fbdev-fixes-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux:
fbcon: set a default value to blink interval
Pull parisc fixes from Helge Deller:
"We wire up the copy_file_range syscall, fix two bugs in the parisc
ptrace code and have a trivial fix for floppy.h to clarify an
expression with parentheses"
* 'parisc-4.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Wire up copy_file_range syscall
parisc: Fix ptrace syscall number and return value modification
parisc: Use parentheses around expression in floppy.h
Pull cifs fixes from Steve French:
"Various small CIFS/SMB3 fixes for stable:
Fixes address oops that can occur when accessing Macs with SMB3, and
another problem found to Samba when read responses queued (e.g. with
gluster under Samba)"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Fix duplicate line introduced by clone_file_range patch
Fix cifs_uniqueid_to_ino_t() function for s390x
CIFS: Fix SMB2+ interim response processing for read requests
cifs: fix out-of-bounds access in lease parsing
The exit path will do some final updates to the VM of an exiting process
to inform others of the fact that the process is going away.
That happens, for example, for robust futex state cleanup, but also if
the parent has asked for a TID update when the process exits (we clear
the child tid field in user space).
However, at the time we do those final VM accesses, we've already
stopped accepting signals, so the usual "stop waiting for userfaults on
signal" code in fs/userfaultfd.c no longer works, and the process can
become an unkillable zombie waiting for something that will never
happen.
To solve this, just make handle_userfault() abort any user fault
handling if we're already in the exit path past the signal handling
state being dead (marked by PF_EXITING).
This VM special case is pretty ugly, and it is possible that we should
look at finalizing signals later (or move the VM final accesses
earlier). But in the meantime this is a fairly minimally intrusive fix.
Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't want side effects. If something fails, we rollback vq->is_le to
its previous value.
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Looks like a copy-paste bug. The value is used as an optimization and a
wrong value probably isn't causing any serious damage. Found when
porting this code to Windows.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Pull d_inode/d_flags race fix from Al Viro.
I love this fix. Not only does it fix the race in the dentry type
handling, it entirely gets rid of the nasty and subtle memory ordering
rules for d_type and d_inode, and replaces them with the basic dentry
locking rules (sequence numbers under RCU, d_lock elsewhere).
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
use ->d_seq to get coherency between ->d_inode and ->d_flags
Mike Frysinger reported that his ptrace testcase showed strange
behaviour on parisc: It was not possible to avoid a syscall and the
return value of a syscall couldn't be changed.
To modify a syscall number, we were missing to save the new syscall
number to gr20 which is then picked up later in assembly again.
The effect that the return value couldn't be changed is a side-effect of
another bug in the assembly code. When a process is ptraced, userspace
expects each syscall to report entrance and exit of a syscall. If a
syscall number was given which doesn't exist, we jumped to the normal
syscall exit code instead of informing userspace that the (non-existant)
syscall exits. This unexpected behaviour confuses userspace and thus the
bug was misinterpreted as if we can't change the return value.
This patch fixes both problems and was tested on 64bit kernel with
32bit userspace.
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: stable@vger.kernel.org # v4.0+
Tested-by: Mike Frysinger <vapier@gentoo.org>
David Binderman reported a style issue in the floppy.h header file:
arch/parisc/include/asm/floppy.h:221: (style) Boolean result is used in bitwise
operation. Clarify expression with parentheses.
Reported-by: David Binderman <dcb314@hotmail.com>
Cc: David Binderman <dcb314@hotmail.com>
Signed-off-by: Helge Deller <deller@gmx.de>
Pull sparc fixes from David Miller:
1) System call tracing doesn't handle register contents properly across
the trace. From Mike Frysinger.
2) Hook up copy_file_range
3) Build fix for 32-bit with newer tools.
4) New sun4v watchdog driver, from Wim Coekaerts.
5) Set context system call has to allow for servicable faults when we
flush the register windows to memory
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: Fix sparc64_set_context stack handling.
sparc32: Add -Wa,-Av8 to KBUILD_CFLAGS.
Add sun4v_wdt watchdog driver
sparc: Fix system call tracing register handling.
sparc: Hook up copy_file_range syscall.
Commit 04b38d6012 ("vfs: pull btrfs clone API to vfs layer")
added a duplicated line (in cifsfs.c) which causes a sparse compile
warning.
Signed-off-by: Steve French <steve.french@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Like a signal return, we should use synchronize_user_stack() rather
than flush_user_windows().
Reported-by: Ilya Malakhov <ilmalakhovthefirst@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Binutils used to be (erroneously) extremely permissive about
instruction usage. But that got fixed and if you don't properly tell
it to accept classes of instructions it will fail.
This uncovered a specs bug on sparc in gcc where it wouldn't pass the
proper options to binutils options.
Deal with this in the kernel build by adding -Wa,-Av8 to KBUILD_CFLAGS.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Games with ordering and barriers are way too brittle. Just
bump ->d_seq before and after updating ->d_inode and ->d_flags
type bits, so that verifying ->d_seq would guarantee they are
coherent.
Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This issue is caused by commit 02323db17e ("cifs: fix
cifs_uniqueid_to_ino_t not to ever return 0"), when BITS_PER_LONG
is 64 on s390x, the corresponding cifs_uniqueid_to_ino_t()
function will cast 64-bit fileid to 32-bit by using (ino_t)fileid,
because ino_t (typdefed __kernel_ino_t) is int type.
It's defined in arch/s390/include/uapi/asm/posix_types.h
#ifndef __s390x__
typedef unsigned long __kernel_ino_t;
...
#else /* __s390x__ */
typedef unsigned int __kernel_ino_t;
So the #ifdef condition is wrong for s390x, we can just still use
one cifs_uniqueid_to_ino_t() function with comparing sizeof(ino_t)
and sizeof(u64) to choose the correct execution accordingly.
Signed-off-by: Yadan Fan <ydfan@suse.com>
CC: stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
For interim responses we only need to parse a header and update
a number credits. Now it is done for all SMB2+ command except
SMB2_READ which is wrong. Fix this by adding such processing.
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Tested-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Pull perf fixes from Thomas Gleixner:
"A rather largish series of 12 patches addressing a maze of race
conditions in the perf core code from Peter Zijlstra"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Robustify task_function_call()
perf: Fix scaling vs. perf_install_in_context()
perf: Fix scaling vs. perf_event_enable()
perf: Fix scaling vs. perf_event_enable_on_exec()
perf: Fix ctx time tracking by introducing EVENT_TIME
perf: Cure event->pending_disable race
perf: Fix race between event install and jump_labels
perf: Fix cloning
perf: Only update context time when active
perf: Allow perf_release() with !event->ctx
perf: Do not double free
perf: Close install vs. exit race
Pull x86 fixes from Thomas Gleixner:
"This update contains:
- Hopefully the last ASM CLAC fixups
- A fix for the Quark family related to the IMR lock which makes
kexec work again
- A off-by-one fix in the MPX code. Ironic, isn't it?
- A fix for X86_PAE which addresses once more an unsigned long vs
phys_addr_t hickup"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mpx: Fix off-by-one comparison with nr_registers
x86/mm: Fix slow_virt_to_phys() for X86_PAE again
x86/entry/compat: Add missing CLAC to entry_INT80_32
x86/entry/32: Add an ASM_CLAC to entry_SYSENTER_32
x86/platform/intel/quark: Change the kernel's IMR lock bit to false
Pull scheduler fixlet from Thomas Gleixner:
"A trivial printk typo fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix trivial typo in printk() message
Pull irq fixes from Thomas Gleixner:
"Four small fixes for irqchip drivers:
- Add missing low level irq handler initialization on mxs, so
interrupts can acutally be delivered
- Add a missing barrier to the GIC driver
- Two fixes for the GIC-V3-ITS driver, addressing a double EOI write
and a cache flush beyond the actual region"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3: Add missing barrier to 32bit version of gic_read_iar()
irqchip/mxs: Add missing set_handle_irq()
irqchip/gicv3-its: Avoid cache flush beyond ITS_BASERn memory size
irqchip/gic-v3-its: Fix double ICC_EOIR write for LPI in EOImode==1
Here is one patch, for the android binder driver, to resolve a reported
problem. Turns out it has been around for a while (since 3.15), so it
is good to finally get it resolved.
It has been in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlbSgioACgkQMUfUDdst+ynUggCfdkzcX15M2vIiZsA3BLDcPY6L
pPcAnj1uCrqUUNRF3XgglbGXtp15sOe0
=R3bx
-----END PGP SIGNATURE-----
Merge tag 'staging-4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging/android fix from Greg KH:
"Here is one patch, for the android binder driver, to resolve a
reported problem. Turns out it has been around for a while (since
3.15), so it is good to finally get it resolved.
It has been in linux-next for a while with no reported issues"
* tag 'staging-4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
drivers: android: correct the size of struct binder_uintptr_t for BC_DEAD_BINDER_DONE
Here are a few USB fixes for 4.5-rc6
They fix a reported bug for some USB 3 devices by reverting the recent
patch, a MAINTAINERS change for some drivers, some new device ids, and
of course, the usual bunch of USB gadget driver fixes.
All have been in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlbSgVoACgkQMUfUDdst+ymghQCgiVUpZOu/hYeC/8CDdTZPLlpQ
oR4AoLftFf9OAKRgBCbiPOY99lG9f33y
=qZKR
-----END PGP SIGNATURE-----
Merge tag 'usb-4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are a few USB fixes for 4.5-rc6
They fix a reported bug for some USB 3 devices by reverting the recent
patch, a MAINTAINERS change for some drivers, some new device ids, and
of course, the usual bunch of USB gadget driver fixes.
All have been in linux-next for a while with no reported issues"
* tag 'usb-4.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
MAINTAINERS: drop OMAP USB and MUSB maintainership
usb: musb: fix DMA for host mode
usb: phy: msm: Trigger USB state detection work in DRD mode
usb: gadget: net2280: fix endpoint max packet for super speed connections
usb: gadget: gadgetfs: unregister gadget only if it got successfully registered
usb: gadget: remove driver from pending list on probe error
Revert "usb: hub: do not clear BOS field during reset device"
usb: chipidea: fix return value check in ci_hdrc_pci_probe()
usb: chipidea: error on overflow for port_test_write
USB: option: add "4G LTE usb-modem U901"
USB: cp210x: add IDs for GE B650V3 and B850V3 boards
USB: option: add support for SIM7100E
usb: musb: Fix DMA desired mode for Mentor DMA engine
usb: gadget: fsl_qe_udc: fix IS_ERR_VALUE usage
usb: dwc2: USB_DWC2 should depend on HAS_DMA
usb: dwc2: host: fix the data toggle error in full speed descriptor dma
usb: dwc2: host: fix logical omissions in dwc2_process_non_isoc_desc
usb: dwc3: Fix assignment of EP transfer resources
usb: dwc2: Add extra delay when forcing dr_mode
Calling return copy_to_user(...) in an ioctl will not
do the right thing if there's a pagefault:
copy_to_user returns the number of bytes not copied
in this case.
Fix up vfio to do
return copy_to_user(...)) ?
-EFAULT : 0;
everywhere.
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Pull vfs fixes from Al Viro.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
do_last(): ELOOP failure exit should be done after leaving RCU mode
should_follow_link(): validate ->d_seq after having decided to follow
namei: ->d_inode of a pinned dentry is stable only for positives
do_last(): don't let a bogus return value from ->open() et.al. to confuse us
fs: return -EOPNOTSUPP if clone is not supported
hpfs: don't truncate the file when delete fails
We didn't have a batch last week, so this one is slightly larger.
None of them are scary though, a handful of fixes for small DT pieces,
replacing properties with newer conventions.
Highlights:
- N900 fix for setting system revision
- onenand init fix to avoid filesystem corruption
- Clock fix for audio on Beaglebone-x15
- Fixes on shmobile to deal with CONFIG_DEBUG_RODATA (default y in 4.6)
+ misc smaller stuff.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJW0jMpAAoJEIwa5zzehBx3nGgP/3wlhTrIyFWTu2Oa3s+0dwFJ
nXNcHc/7egzRlcPZ/dWfyrQfVC4/Zko7tI+76vJ8vSZ5oZ+la6CC1ZymlVpxUo9y
mF8wyFnRU5sc5yeSSNH91RzJg2fSJWvcUJ/5zeUBkjKLc1AEAfyMXEjxDHptDI/L
s+/JRqhrF8xsnfBymSW2mW6u34Sxn76dVsofWNrSCge/+kVAM4km/PDneWKz/14Q
oLY9eFl6b0O5DJ/+5OSME0pnnRnJC/eD5+HYQSBIu3+RKgP5CH+xQDNeqf0GIdlI
7Y0cKbjFxT5fXfvE4KOKQuLKgAzCSRe1PwuJ8MTDE73kWsUAWN8McWkCYtCSufxU
KSPlgjfO1xWoSkVneK3NzcRWJoi6Ev0lZ0s6HuMvZJAoce9XrcIbZRQ7CP3Iu3Oj
iC8GxIgHyIJV95XABpliH5IVTRERTbXIOgR82dKQPxLU6cbCRbFs/GU2v7JQEjOS
exJDM5R08SSBC8MRxvWp09pwcfO44XIkQu4pdRJfpaFVwJYejTYOUDVYCcCg3s9O
ApXzQj6/A0QMnp1SAvPHbc3LqLq5mTzvt1j59TNA8Q0O4U4r20CBF+D7lb9KMlu/
GyJ2wSsxCwnBDVWDPtXGdE3z/K81H7nPRBzuL0dM80cF5gQNglOdAN47UoD/bBP6
1pR5h9K92LbV5NiToyPY
=xeuW
-----END PGP SIGNATURE-----
Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"We didn't have a batch last week, so this one is slightly larger.
None of them are scary though, a handful of fixes for small DT pieces,
replacing properties with newer conventions.
Highlights:
- N900 fix for setting system revision
- onenand init fix to avoid filesystem corruption
- Clock fix for audio on Beaglebone-x15
- Fixes on shmobile to deal with CONFIG_DEBUG_RODATA (default y in 4.6)
+ misc smaller stuff"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
MAINTAINERS: Extend info, add wiki and ml for meson arch
MAINTAINERS: alpine: add a new maintainer and update the entry
ARM: at91/dt: fix typo in sama5d2 pinmux descriptions
ARM: OMAP2+: Fix onenand initialization to avoid filesystem corruption
Revert "regulator: tps65217: remove tps65217.dtsi file"
ARM: shmobile: Remove shmobile_boot_arg
ARM: shmobile: Move shmobile_smp_{mpidr, fn, arg}[] from .text to .bss
ARM: shmobile: r8a7779: Remove remainings of removed SCU boot setup code
ARM: shmobile: Move shmobile_scu_base from .text to .bss
ARM: OMAP2+: Fix omap_device for module reload on PM runtime forbid
ARM: OMAP2+: Improve omap_device error for driver writers
ARM: DTS: am57xx-beagle-x15: Select SYS_CLK2 for audio clocks
ARM: dts: am335x/am57xx: replace gpio-key,wakeup with wakeup-source property
ARM: OMAP2+: Set system_rev from ATAGS for n900
ARM: dts: orion5x: fix the missing mtd flash on linkstation lswtgl
ARM: dts: kirkwood: use unique machine name for ds112
ARM: dts: imx6: remove bogus interrupt-parent from CAAM node
... otherwise d_is_symlink() above might have nothing to do with
the inode value we've got.
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
both do_last() and walk_component() risk picking a NULL inode out
of dentry about to become positive, *then* checking its flags and
seeing that it's not negative anymore and using (already stale by
then) value they'd fetched earlier. Usually ends up oopsing soon
after that...
Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... into returning a positive to path_openat(), which would interpret that
as "symlink had been encountered" and proceed to corrupt memory, etc.
It can only happen due to a bug in some ->open() instance or in some LSM
hook, etc., so we report any such event *and* make sure it doesn't trick
us into further unpleasantness.
Cc: stable@vger.kernel.org # v3.6+, at least
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-EBADF is a rather confusing error if an operations is not supported,
and nfsd gets rather upset about it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The delete opration can allocate additional space on the HPFS filesystem
due to btree split. The HPFS driver checks in advance if there is
available space, so that it won't corrupt the btree if we run out of space
during splitting.
If there is not enough available space, the HPFS driver attempted to
truncate the file, but this results in a deadlock since the commit
7dd29d8d86 ("HPFS: Introduce a global mutex
and lock it on every callback from VFS").
This patch removes the code that tries to truncate the file and -ENOSPC is
returned instead. If the user hits -ENOSPC on delete, he should try to
delete other files (that are stored in a leaf btree node), so that the
delete operation will make some space for deleting the file stored in
non-leaf btree node.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: stable@vger.kernel.org # 2.6.39+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
dax: move writeback calls into the filesystems
dax: give DAX clearing code correct bdev
ext4: online defrag not supported with DAX
ext2, ext4: only set S_DAX for regular inodes
block: disable block device DAX by default
ocfs2: unlock inode if deleting inode from orphan fails
mm: ASLR: use get_random_long()
drivers: char: random: add get_random_long()
mm: numa: quickly fail allocations for NUMA balancing on full nodes
mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED
As it is currently written ext4_dax_mkwrite() assumes that the call into
__dax_mkwrite() will not have to do a block allocation so it doesn't create
a journal entry. For a read that creates a zero page to cover a hole
followed by a write that actually allocates storage this is incorrect. The
ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls
get_blocks() to allocate storage.
Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault()
as this function already has all the logic needed to allocate a journal
entry and call __dax_fault().
Also update the ext2 fault handlers in this same way to remove duplicate
code and keep the logic between ext2 and ext4 the same.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
cycle.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABCAAGBQJW0OaVAAoJENidgRMleOc9+ioP/1SrApFcS6GcxUQDZGxV9wDo
R58NKT7EqYeLzcw3jasNwqbmgiBtpTwOO2LjqN0UrZtsh5pO7Vz9sZt2UC5vH+8I
Yd+BpqQ5pTs+Y7JgFsxNHX77azb15nqGzE2JEZerzferYTIOPWso8nNtgY6uSdFd
f9ScmoZZy8yw9kWKPKZolqEeMKVT3Q4Q1994hDxhzNwNt1LEtoYlvEQzwZggiEt9
568hTQ4G1nk8YdwkXb0eQgYTeOmC4VWK6Evh4QCVtIv+t81drVcV81AMKEDMEHyF
2kr+Sk3XoWfCf0WhsOMp12ASlKWeleLo1HjKS+Y920m3b16BqYC9mhIRL3zurX0c
O1ADytvFwDQfyo1JQB2BPPnoCvVP5/C6veJC2TWKmA/YTtDjA29MA4ORhN9kUmts
8TmRqz/QXUjKK+r0E1BBSzNJapIaX/S/bm2dytjWAGaeAi4x6OOn+F6oG2ot+HEW
euvMSIN5wxTg+UejvkwuUXpSY45LZp4VsKja8jicXwX6lxfd+yL3Iukw9fZU/X00
QRWR5FdZomZbavsuGKcb5FC4p+nyo8FZX5NC+RTjnMJ9tDYfyewrm8zGQ7O74zEG
Hxim1HSLUIf6ziJJ1S7GUzb40eEB+9byUmZrRqd1KOp+fmr+J3xKtYQ+IFi8r29M
AWWvm9Iq2baHvPZTr6Xq
=1q7R
-----END PGP SIGNATURE-----
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fix from Stephen Boyd:
"One small fix to keep OMAP platforms working across a suspend/resume
cycle"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: ti: omap3+: dpll: use non-locking version of clk_get_rate
Previously calls to dax_writeback_mapping_range() for all DAX filesystems
(ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
dax_writeback_mapping_range() needs a struct block_device, and it used
to get that from inode->i_sb->s_bdev. This is correct for normal inodes
mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
block devices and for XFS real-time files.
Instead, call dax_writeback_mapping_range() directly from the filesystem
->writepages function so that it can supply us with a valid block
device. This also fixes DAX code to properly flush caches in response
to sync(2).
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dax_clear_blocks() needs a valid struct block_device and previously it
was using inode->i_sb->s_bdev in all cases. This is correct for normal
inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
DAX raw block devices and for XFS real-time devices.
Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
its arguments to take a bdev and a sector instead of an inode and a
block. This better reflects what the function does, and it allows the
filesystem and raw block device code to pass in an appropriate struct
block_device.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Online defrag operations for ext4 are hard coded to use the page cache.
See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page()
When combined with DAX I/O, which circumvents the page cache, this can
result in data corruption. This was observed with xfstests ext4/307 and
ext4/308.
Fix this by only allowing online defrag for non-DAX files.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When S_DAX is set on an inode we assume that if there are pages attached
to the mapping (mapping->nrpages != 0), those pages are clean zero pages
that were used to service reads from holes. Any dirty data associated
with the inode should be in the form of DAX exceptional entries
(mapping->nrexceptional) that is written back via
dax_writeback_mapping_range().
With the current code, though, this isn't always true. For example,
ext2 and ext4 directory inodes can have S_DAX set, but have their dirty
data stored as dirty page cache entries. For these types of inodes,
having S_DAX set doesn't really make sense since their I/O doesn't
actually happen through the DAX code path.
Instead, only allow S_DAX to be set for regular inodes for ext2 and
ext4. This allows us to have strict DAX vs non-DAX paths in the
writeback code.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The recent *sync enabling discovered that we are inserting into the
block_device pagecache counter to the expectations of the dirty data
tracking for dax mappings. This can lead to data corruption.
We want to support DAX for block devices eventually, but it requires
wider changes to properly manage the pagecache.
dump_stack+0x85/0xc2
dax_writeback_mapping_range+0x60/0xe0
blkdev_writepages+0x3f/0x50
do_writepages+0x21/0x30
__filemap_fdatawrite_range+0xc6/0x100
filemap_write_and_wait+0x4a/0xa0
set_blocksize+0x70/0xd0
sb_set_blocksize+0x1d/0x50
ext4_fill_super+0x75b/0x3360
mount_bdev+0x180/0x1b0
ext4_mount+0x15/0x20
mount_fs+0x38/0x170
Mark the support broken so its disabled by default, but otherwise still
available for testing.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When doing append direct io cleanup, if deleting inode fails, it goes
out without unlocking inode, which will cause the inode deadlock.
This issue was introduced by commit cf1776a9e8 ("ocfs2: fix a tiny
race when truncate dio orohaned entry").
Signed-off-by: Guozhonghua <guozhonghua@h3c.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace calls to get_random_int() followed by a cast to (unsigned long)
with calls to get_random_long(). Also address shifting bug which, in
case of x86 removed entropy mask for mmap_rnd_bits values > 31 bits.
Signed-off-by: Daniel Cashman <dcashman@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Kralevich <nnk@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit d07e22597d ("mm: mmap: add new /proc tunable for mmap_base
ASLR") added the ability to choose from a range of values to use for
entropy count in generating the random offset to the mmap_base address.
The maximum value on this range was set to 32 bits for 64-bit x86
systems, but this value could be increased further, requiring more than
the 32 bits of randomness provided by get_random_int(), as is already
possible for arm64. Add a new function: get_random_long() which more
naturally fits with the mmap usage of get_random_int() but operates
exactly the same as get_random_int().
Also, fix the shifting constant in mmap_rnd() to be an unsigned long so
that values greater than 31 bits generate an appropriate mask without
overflow. This is especially important on x86, as its shift instruction
uses a 5-bit mask for the shift operand, which meant that any value for
mmap_rnd_bits over 31 acts as a no-op and effectively disables mmap_base
randomization.
Finally, replace calls to get_random_int() with get_random_long() where
appropriate.
This patch (of 2):
Add get_random_long().
Signed-off-by: Daniel Cashman <dcashman@android.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Kralevich <nnk@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 4167e9b2cf ("mm: remove GFP_THISNODE") removed the GFP_THISNODE
flag combination due to confusing semantics. It noted that
alloc_misplaced_dst_page() was one such user after changes made by
commit e97ca8e5b8 ("mm: fix GFP_THISNODE callers and clarify").
Unfortunately when GFP_THISNODE was removed, users of
alloc_misplaced_dst_page() started waking kswapd and entering direct
reclaim because the wrong GFP flags are cleared. The consequence is
that workloads that used to fit into memory now get reclaimed which is
addressed by this patch.
The problem can be demonstrated with "mutilate" that exercises memcached
which is software dedicated to memory object caching. The configuration
uses 80% of memory and is run 3 times for varying numbers of clients.
The results on a 4-socket NUMA box are
mutilate
4.4.0 4.4.0
vanilla numaswap-v1
Hmean 1 8394.71 ( 0.00%) 8395.32 ( 0.01%)
Hmean 4 30024.62 ( 0.00%) 34513.54 ( 14.95%)
Hmean 7 32821.08 ( 0.00%) 70542.96 (114.93%)
Hmean 12 55229.67 ( 0.00%) 93866.34 ( 69.96%)
Hmean 21 39438.96 ( 0.00%) 85749.21 (117.42%)
Hmean 30 37796.10 ( 0.00%) 50231.49 ( 32.90%)
Hmean 47 18070.91 ( 0.00%) 38530.13 (113.22%)
The metric is queries/second with the more the better. The results are
way outside of the noise and the reason for the improvement is obvious
from some of the vmstats
4.4.0 4.4.0
vanillanumaswap-v1r1
Minor Faults 1929399272 2146148218
Major Faults 19746529 3567
Swap Ins 57307366 9913
Swap Outs 50623229 17094
Allocation stalls 35909 443
DMA allocs 0 0
DMA32 allocs 72976349 170567396
Normal allocs 5306640898 5310651252
Movable allocs 0 0
Direct pages scanned 404130893 799577
Kswapd pages scanned 160230174 0
Kswapd pages reclaimed 55928786 0
Direct pages reclaimed 1843936 41921
Page writes file 2391 0
Page writes anon 50623229 17094
The vanilla kernel is swapping like crazy with large amounts of direct
reclaim and kswapd activity. The figures are aggregate but it's known
that the bad activity is throughout the entire test.
Note that simple streaming anon/file memory consumers also see this
problem but it's not as obvious. In those cases, kswapd is awake when
it should not be.
As there are at least two reclaim-related bugs out there, it's worth
spelling out the user-visible impact. This patch only addresses bugs
related to excessive reclaim on NUMA hardware when the working set is
larger than a NUMA node. There is a bug related to high kswapd CPU
usage but the reports are against laptops and other UMA hardware and is
not addressed by this patch.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [4.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>