We still need to call btrfs_end_transaction if we call btrfs_abort_transaction,
otherwise we hang and make me super grumpy. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While we are finishing a device replace operation we can have a concurrent
task trying to do a read repair operation, in which case it will call
btrfs_map_block() to get a struct btrfs_bio which can have a stripe that
points to the source device of the device replace operation. This allows
for the read repair task to dereference the stripe's device pointer after
the device replace operation has freed the source device, resulting in
an invalid memory access. This is similar to the problem solved by my
previous patch in the same series and named "Btrfs: fix race between
device replace and discard".
So fix this by surrounding the call to btrfs_map_block() and the code
that uses the returned struct btrfs_bio with calls to
btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the
proper serialization with the finishing phase of the device replace
operation.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
While we are finishing a device replace operation, we can make a discard
operation (fs mounted with -o discard) do an invalid memory access like
the one reported by the following trace:
[ 3206.384654] general protection fault: 0000 [#1] PREEMPT SMP
[ 3206.387520] Modules linked in: dm_mod btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis psmouse tpm ppdev sg parport_pc evdev i2c_piix4 parport
processor serio_raw i2c_core pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci
virtio_ring scsi_mod e1000 virtio floppy [last unloaded: btrfs]
[ 3206.388595] CPU: 14 PID: 29194 Comm: fsstress Not tainted 4.6.0-rc7-btrfs-next-29+ #1
[ 3206.388595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 3206.388595] task: ffff88017ace0100 ti: ffff880171b98000 task.ti: ffff880171b98000
[ 3206.388595] RIP: 0010:[<ffffffff8124d233>] [<ffffffff8124d233>] blkdev_issue_discard+0x5c/0x2a7
[ 3206.388595] RSP: 0018:ffff880171b9bb80 EFLAGS: 00010246
[ 3206.388595] RAX: ffff880171b9bc28 RBX: 000000000090d000 RCX: 0000000000000000
[ 3206.388595] RDX: ffffffff82fa1b48 RSI: ffffffff8179f46c RDI: ffffffff82fa1b48
[ 3206.388595] RBP: ffff880171b9bcc0 R08: 0000000000000000 R09: 0000000000000001
[ 3206.388595] R10: ffff880171b9bce0 R11: 000000000090f000 R12: ffff880171b9bbe8
[ 3206.388595] R13: 0000000000000010 R14: 0000000000004868 R15: 6b6b6b6b6b6b6b6b
[ 3206.388595] FS: 00007f6182e4e700(0000) GS:ffff88023fdc0000(0000) knlGS:0000000000000000
[ 3206.388595] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3206.388595] CR2: 00007f617c2bbb18 CR3: 000000017ad9c000 CR4: 00000000000006e0
[ 3206.388595] Stack:
[ 3206.388595] 0000000000004878 0000000000000000 0000000002400040 0000000000000000
[ 3206.388595] 0000000000000000 ffff880171b9bbe8 ffff880171b9bbb0 ffff880171b9bbb0
[ 3206.388595] ffff880171b9bbc0 ffff880171b9bbc0 ffff880171b9bbd0 ffff880171b9bbd0
[ 3206.388595] Call Trace:
[ 3206.388595] [<ffffffffa042899e>] btrfs_issue_discard+0x12f/0x143 [btrfs]
[ 3206.388595] [<ffffffffa042899e>] ? btrfs_issue_discard+0x12f/0x143 [btrfs]
[ 3206.388595] [<ffffffffa042e862>] btrfs_discard_extent+0x87/0xde [btrfs]
[ 3206.388595] [<ffffffffa04303b5>] btrfs_finish_extent_commit+0xb2/0x1df [btrfs]
[ 3206.388595] [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
[ 3206.388595] [<ffffffffa04464c4>] btrfs_commit_transaction+0x7fc/0x980 [btrfs]
[ 3206.388595] [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
[ 3206.388595] [<ffffffffa0459af6>] btrfs_sync_file+0x38f/0x428 [btrfs]
[ 3206.388595] [<ffffffff811a8292>] vfs_fsync_range+0x8c/0x9e
[ 3206.388595] [<ffffffff811a82c0>] vfs_fsync+0x1c/0x1e
[ 3206.388595] [<ffffffff811a8417>] do_fsync+0x31/0x4a
[ 3206.388595] [<ffffffff811a8637>] SyS_fsync+0x10/0x14
[ 3206.388595] [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
[ 3206.388595] [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14
[ 3206.388595] [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa
This happens because when we call btrfs_map_block() from
btrfs_discard_extent() to get a btrfs_bio structure, the device replace
operation has not finished yet, but before we use the device of one of the
stripes from the returned btrfs_bio structure, the device object is freed.
This is illustrated by the following diagram.
CPU 1 CPU 2
btrfs_dev_replace_start()
(...)
btrfs_dev_replace_finishing()
btrfs_start_transaction()
btrfs_commit_transaction()
(...)
btrfs_sync_file()
btrfs_start_transaction()
(...)
btrfs_commit_transaction()
btrfs_finish_extent_commit()
btrfs_discard_extent()
btrfs_map_block()
--> returns a struct btrfs_bio
with a stripe that has a
device field pointing to
source device of the replace
operation (the device that
is being replaced)
mutex_lock(&uuid_mutex)
mutex_lock(&fs_info->fs_devices->device_list_mutex)
mutex_lock(&fs_info->chunk_mutex)
btrfs_dev_replace_update_device_in_mapping_tree()
--> iterates the mapping tree and for each
extent map that has a stripe pointing to
the source device, it updates the stripe
to point to the target device instead
btrfs_rm_dev_replace_blocked()
--> waits for fs_info->bio_counter to go down to 0
btrfs_rm_dev_replace_remove_srcdev()
--> removes source device from the list of devices
mutex_unlock(&fs_info->chunk_mutex)
mutex_unlock(&fs_info->fs_devices->device_list_mutex)
mutex_unlock(&uuid_mutex)
btrfs_rm_dev_replace_free_srcdev()
--> frees the source device
--> iterates over all stripes
of the returned struct
btrfs_bio
--> for each stripe it
dereferences its device
pointer
--> it ends up finding a
pointer to the device
used as the source
device for the replace
operation and that was
already freed
So fix this by surrounding the call to btrfs_map_block(), and the code
that uses the returned struct btrfs_bio, with calls to
btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), so that
the finishing phase of the device replace operation blocks until the
the bio counter decreases to zero before it frees the source device.
This is the same approach we do at btrfs_map_bio() for example.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
For the benefit of every single caller, take osdc instead of map.
Also, now that osdc->osdmap can't ever be NULL, drop the check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
While iterating and copying extents from the source device, the device
replace code keeps adjusting a left cursor that is used to make sure that
once we finish processing a device extent, any future writes to extents
from the corresponding block group will get into both the source and
target devices. This left cursor is also used for resuming the device
replace operation at mount time.
However using this left cursor to decide whether writes go into both
devices or only the source device is not enough to guarantee we don't
miss copying extents into the target device. There are two cases where
the current approach fails. The first one is related to when there are
holes in the device and they get allocated for new block groups while
the device replace operation is iterating the device extents (more on
this explained below). The second one is that when that loop over the
device extents finishes, we start dellaloc, wait for all ordered extents
and then commit the current transaction, we might have got new block
groups allocated that are now using a device extent that has an offset
greater then or equals to the value of the left cursor, in which case
writes to extents belonging to these new block groups will get issued
only to the source device.
For the first case where the current approach of using a left cursor
fails, consider the source device currently has the following layout:
[ extent bg A ] [ hole, unallocated space ] [extent bg B ]
3Gb 4Gb 5Gb
While we are iterating the device extents from the source device using
the commit root of the device tree, the following happens:
CPU 1 CPU 2
<we are at transaction N>
scrub_enumerate_chunks()
--> searches the device tree for
extents belonging to the source
device using the device tree's
commit root
--> 1st iteration finds extent belonging to
block group A
--> sets block group A to RO mode
(btrfs_inc_block_group_ro)
--> sets cursor left to found_key.offset
which is 3Gb
--> scrub_chunk() starts
copies all allocated extents from
block group's A stripe at source
device into target device
btrfs_alloc_chunk()
--> allocates device extent
in the range [4Gb, 5Gb[
from the source device for
a new block group C
extent allocated from block
group C for a direct IO,
buffered write or btree node/leaf
extent is written to, perhaps
in response to a writepages()
call from the VM or directly
through direct IO
the write is made only against
the source device and not against
the target device because the
extent's offset is in the interval
[4Gb, 5Gb[ which is larger then
the value of cursor_left (3Gb)
--> scrub_chunks() finishes
--> updates left cursor from 3Gb to
4Gb
--> btrfs_dec_block_group_ro() sets
block group A back to RW mode
<we are still at transaction N>
--> 2nd iteration finds extent belonging to
block group B - it did not find the new
extent in the range [4Gb, 5Gb[ for block
group C because we are using the device
tree's commit root or even because the
block group's items are not all yet
inserted in the respective btrees, that is,
the block group is still attached to some
transaction handle's new_bgs list and
btrfs_create_pending_block_groups() was
not called yet against that transaction
handle, so the device extent items were
not yet inserted into the devices tree
<we are still at transaction N>
--> so we end not copying anything from the newly
allocated device extent from the source device
to the target device
So fix this by making __btrfs_map_block() always redirect writes to the
target device as well, independently of the left cursor's value. With
this change the left cursor is now used only for the purpose of tracking
progress and allow a mount operation to resume a device replace.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
After it finishes processing a device extent, the device replace code sets
back the block group to RW mode and then after that it sets the left cursor
to match the logical end address of the block group, so that future writes
into extents belonging to the block group go both the source (old) and
target (new) devices. However from the moment we turn the block group
back to RW mode we have a short time window, that lasts until we update
the left cursor's value, where extents can be allocated from the block
group and written to, in which case they will not be copied/written to
the target (new) device. Fix this by updating the left cursor's value
before turning the block group back to RW mode.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
We were assigning new values to fields of the device replace object
without holding the respective lock after processing each device extent.
This is important for the left cursor field which can be accessed by a
concurrent task running __btrfs_map_block (which, correctly, takes the
device replace lock).
So change these fields while holding the device replace lock.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
When we do a device replace, for each device extent we find from the
source device, we set the corresponding block group to readonly mode to
prevent writes into it from happening while we are copying the device
extent from the source to the target device. However just before we set
the block group to readonly mode some concurrent task might have already
allocated an extent from it or decided it could perform a nocow write
into one of its extents, which can make the device replace process to
miss copying an extent since it uses the extent tree's commit root to
search for extents and only once it finishes searching for all extents
belonging to the block group it does set the left cursor to the logical
end address of the block group - this is a problem if the respective
ordered extents finish while we are searching for extents using the
extent tree's commit root and no transaction commit happens while we
are iterating the tree, since it's the delayed references created by the
ordered extents (when they complete) that insert the extent items into
the extent tree (using the non-commit root of course).
Example:
CPU 1 CPU 2
btrfs_dev_replace_start()
btrfs_scrub_dev()
scrub_enumerate_chunks()
--> finds device extent belonging
to block group X
<transaction N starts>
starts buffered write
against some inode
writepages is run against
that inode forcing dellaloc
to run
btrfs_writepages()
extent_writepages()
extent_write_cache_pages()
__extent_writepage()
writepage_delalloc()
run_delalloc_range()
cow_file_range()
btrfs_reserve_extent()
--> allocates an extent
from block group X
(which is not yet
in RO mode)
btrfs_add_ordered_extent()
--> creates ordered extent Y
flush_epd_write_bio()
--> bio against the extent from
block group X is submitted
btrfs_inc_block_group_ro(bg X)
--> sets block group X to readonly
scrub_chunk(bg X)
scrub_stripe(device extent from srcdev)
--> keeps searching for extent items
belonging to the block group using
the extent tree's commit root
--> it never blocks due to
fs_info->scrub_pause_req as no
one tries to commit transaction N
--> copies all extents found from the
source device into the target device
--> finishes search loop
bio completes
ordered extent Y completes
and creates delayed data
reference which will add an
extent item to the extent
tree when run (typically
at transaction commit time)
--> so the task doing the
scrub/device replace
at CPU 1 misses this
and does not copy this
extent into the new/target
device
btrfs_dec_block_group_ro(bg X)
--> turns block group X back to RW mode
dev_replace->cursor_left is set to the
logical end offset of block group X
So fix this by waiting for all cow and nocow writes after setting a block
group to readonly mode.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
When it's finishing, the device replace code iterates all extent maps
representing block group and for each one that has a stripe that refers
to the source device, it replaces its device with the target device.
However when it replaces the source device with the target device it,
the target device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID),
only after its ID is changed to match the one from the source device.
This leads to races with the chunk removal code that can temporarly see
a device with an ID of 0ULL and then attempt to use that ID to remove
items from the device tree and fail, causing a transaction abort:
[ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to /dev/sde finished
[ 9238.594377] ------------[ cut here ]------------
[ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 btrfs_remove_chunk+0x2e5/0x793 [btrfs]
[ 9238.594403] BTRFS: Transaction aborted (error 1)
[ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg i2c_core se
rio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod fl
oppy [last unloaded: btrfs]
[ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 4.6.0-rc7-btrfs-next-29+ #1
[ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 9238.594421] 0000000000000000 ffff88017f1dbc60 ffffffff8126b42c ffff88017f1dbcb0
[ 9238.594422] 0000000000000000 ffff88017f1dbca0 ffffffff81052b14 00000ad37f1dbd18
[ 9238.594423] 0000000000000001 ffff88018068a558 ffff88005c4b9c00 ffff880233f60db0
[ 9238.594424] Call Trace:
[ 9238.594428] [<ffffffff8126b42c>] dump_stack+0x67/0x90
[ 9238.594430] [<ffffffff81052b14>] __warn+0xc2/0xdd
[ 9238.594432] [<ffffffff81052b7a>] warn_slowpath_fmt+0x4b/0x53
[ 9238.594434] [<ffffffff8116c311>] ? kmem_cache_free+0x128/0x188
[ 9238.594450] [<ffffffffa04d43f5>] btrfs_remove_chunk+0x2e5/0x793 [btrfs]
[ 9238.594452] [<ffffffff8108e456>] ? arch_local_irq_save+0x9/0xc
[ 9238.594464] [<ffffffffa04a26fa>] btrfs_delete_unused_bgs+0x317/0x382 [btrfs]
[ 9238.594476] [<ffffffffa04a961d>] cleaner_kthread+0x1ad/0x1c7 [btrfs]
[ 9238.594489] [<ffffffffa04a9470>] ? btree_invalidatepage+0x8e/0x8e [btrfs]
[ 9238.594490] [<ffffffff8106f403>] kthread+0xd4/0xdc
[ 9238.594494] [<ffffffff8149e242>] ret_from_fork+0x22/0x40
[ 9238.594495] [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286
[ 9238.594496] ---[ end trace 183efbe50275f059 ]---
The sequence of steps leading to this is like the following:
CPU 1 CPU 2
btrfs_dev_replace_finishing()
at this point
dev_replace->tgtdev->devid ==
BTRFS_DEV_REPLACE_DEVID (0ULL)
...
btrfs_start_transaction()
btrfs_commit_transaction()
btrfs_delete_unused_bgs()
btrfs_remove_chunk()
looks up for the extent map
corresponding to the chunk
lock_chunks() (chunk_mutex)
check_system_chunk()
unlock_chunks() (chunk_mutex)
locks fs_info->chunk_mutex
btrfs_dev_replace_update_device_in_mapping_tree()
--> iterates fs_info->mapping_tree and
replaces the device in every extent
map's map->stripes[] with
dev_replace->tgtdev, which still has
an id of 0ULL (BTRFS_DEV_REPLACE_DEVID)
iterates over all stripes from
the extent map
--> calls btrfs_free_dev_extent()
passing it the target device
that still has an ID of 0ULL
--> btrfs_free_dev_extent() fails
--> aborts current transaction
finishes setting up the target device,
namely it sets tgtdev->devid to the value
of srcdev->devid (which is necessarily > 0)
frees the srcdev
unlocks fs_info->chunk_mutex
So fix this by taking the device list mutex while processing the stripes
for the chunk's extent map. This is similar to the race between device
replace and block group creation that was fixed by commit 50460e3718
("Btrfs: fix race when finishing dev replace leading to transaction abort").
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
The self-test was updated to cover zero-length strings; the function
needs to be updated, too.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
Fixes: fcfd2fbf22 ("fs/namei.c: Add hashlen_string() function")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The original name was simply hash_string(), but that conflicted with a
function with that name in drivers/base/power/trace.c, and I decided
that calling it "hashlen_" was better anyway.
But you have to do it in two places.
[ This caused build errors for architectures that don't define
CONFIG_DCACHE_WORD_ACCESS - Linus ]
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: fcfd2fbf22 ("fs/namei.c: Add hashlen_string() function")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The HPFS filesystem used generic_show_options to produce string that is
displayed in /proc/mounts. However, there is a problem that the options
may disappear after remount. If we mount the filesystem with option1
and then remount it with option2, /proc/mounts should show both option1
and option2, however it only shows option2 because the whole option
string is replaced with replace_mount_options in hpfs_remount_fs.
To fix this bug, implement the hpfs_show_options function that prints
options that are currently selected.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c8f33d0bec ("affs: kstrdup() memory handling") checks if the
kstrdup function returns NULL due to out-of-memory condition.
However, if we are remounting a filesystem with no change to
filesystem-specific options, the parameter data is NULL. In this case,
kstrdup returns NULL (because it was passed NULL parameter), although no
out of memory condition exists. The mount syscall then fails with
ENOMEM.
This patch fixes the bug. We fail with ENOMEM only if data is non-NULL.
The patch also changes the call to replace_mount_options - if we didn't
pass any filesystem-specific options, we don't call
replace_mount_options (thus we don't erase existing reported options).
Fixes: c8f33d0bec ("affs: kstrdup() memory handling")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit ce657611ba ("hpfs: kstrdup() out of memory handling") checks if
the kstrdup function returns NULL due to out-of-memory condition.
However, if we are remounting a filesystem with no change to
filesystem-specific options, the parameter data is NULL. In this case,
kstrdup returns NULL (because it was passed NULL parameter), although no
out of memory condition exists. The mount syscall then fails with
ENOMEM.
This patch fixes the bug. We fail with ENOMEM only if data is non-NULL.
The patch also changes the call to replace_mount_options - if we didn't
pass any filesystem-specific options, we don't call
replace_mount_options (thus we don't erase existing reported options).
Fixes: ce657611ba ("hpfs: kstrdup() out of memory handling")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Various builds (such as i386:allmodconfig) fail with
fs/binfmt_aout.c:133:2: error: expected identifier or '(' before 'return'
fs/binfmt_aout.c:134:1: error: expected identifier or '(' before '}' token
[ Oops. My bad, I had stupidly thought that "allmodconfig" covered this
on x86-64 too, but it obviously doesn't. Egg on my face. - Linus ]
Fixes: 5d22fc25d4 ("mm: remove more IS_ERR_VALUE abuses")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull string hash improvements from George Spelvin:
"This series does several related things:
- Makes the dcache hash (fs/namei.c) useful for general kernel use.
(Thanks to Bruce for noticing the zero-length corner case)
- Converts the string hashes in <linux/sunrpc/svcauth.h> to use the
above.
- Avoids 64-bit multiplies in hash_64() on 32-bit platforms. Two
32-bit multiplies will do well enough.
- Rids the world of the bad hash multipliers in hash_32.
This finishes the job started in commit 689de1d6ca ("Minimal
fix-up of bad hashing behavior of hash_64()")
The vast majority of Linux architectures have hardware support for
32x32-bit multiply and so derive no benefit from "simplified"
multipliers.
The few processors that do not (68000, h8/300 and some models of
Microblaze) have arch-specific implementations added. Those
patches are last in the series.
- Overhauls the dcache hash mixing.
The patch in commit 0fed3ac866 ("namei: Improve hash mixing if
CONFIG_DCACHE_WORD_ACCESS") was an off-the-cuff suggestion.
Replaced with a much more careful design that's simultaneously
faster and better. (My own invention, as there was noting suitable
in the literature I could find. Comments welcome!)
- Modify the hash_name() loop to skip the initial HASH_MIX(). This
would let us salt the hash if we ever wanted to.
- Sort out partial_name_hash().
The hash function is declared as using a long state, even though
it's truncated to 32 bits at the end and the extra internal state
contributes nothing to the result. And some callers do odd things:
- fs/hfs/string.c only allocates 32 bits of state
- fs/hfsplus/unicode.c uses it to hash 16-bit unicode symbols not bytes
- Modify bytemask_from_count to handle inputs of 1..sizeof(long)
rather than 0..sizeof(long)-1. This would simplify users other
than full_name_hash"
Special thanks to Bruce Fields for testing and finding bugs in v1. (I
learned some humbling lessons about "obviously correct" code.)
On the arch-specific front, the m68k assembly has been tested in a
standalone test harness, I've been in contact with the Microblaze
maintainers who mostly don't care, as the hardware multiplier is never
omitted in real-world applications, and I haven't heard anything from
the H8/300 world"
* 'hash' of git://ftp.sciencehorizons.net/linux:
h8300: Add <asm/hash.h>
microblaze: Add <asm/hash.h>
m68k: Add <asm/hash.h>
<linux/hash.h>: Add support for architecture-specific functions
fs/namei.c: Improve dcache hash function
Eliminate bad hash multipliers from hash_32() and hash_64()
Change hash_64() return value to 32 bits
<linux/sunrpc/svcauth.h>: Define hash_str() in terms of hashlen_string()
fs/namei.c: Add hashlen_string() function
Pull out string hash to <linux/stringhash.h>
This is just the infrastructure; there are no users yet.
This is modelled on CONFIG_ARCH_RANDOM; a CONFIG_ symbol declares
the existence of <asm/hash.h>.
That file may define its own versions of various functions, and define
HAVE_* symbols (no CONFIG_ prefix!) to suppress the generic ones.
Included is a self-test (in lib/test_hash.c) that verifies the basics.
It is NOT in general required that the arch-specific functions compute
the same thing as the generic, but if a HAVE_* symbol is defined with
the value 1, then equality is tested.
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Cc: Philippe De Muyter <phdm@macq.eu>
Cc: linux-m68k@lists.linux-m68k.org
Cc: Alistair Francis <alistai@xilinx.com>
Cc: Michal Simek <michal.simek@xilinx.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: uclinux-h8-devel@lists.sourceforge.jp
Patch 0fed3ac866 improved the hash mixing, but the function is slower
than necessary; there's a 7-instruction dependency chain (10 on x86)
each loop iteration.
Word-at-a-time access is a very tight loop (which is good, because
link_path_walk() is one of the hottest code paths in the entire kernel),
and the hash mixing function must not have a longer latency to avoid
slowing it down.
There do not appear to be any published fast hash functions that:
1) Operate on the input a word at a time, and
2) Don't need to know the length of the input beforehand, and
3) Have a single iterated mixing function, not needing conditional
branches or unrolling to distinguish different loop iterations.
One of the algorithms which comes closest is Yann Collet's xxHash, but
that's two dependent multiplies per word, which is too much.
The key insights in this design are:
1) Barring expensive ops like multiplies, to diffuse one input bit
across 64 bits of hash state takes at least log2(64) = 6 sequentially
dependent instructions. That is more cycles than we'd like.
2) An operation like "hash ^= hash << 13" requires a second temporary
register anyway, and on a 2-operand machine like x86, it's three
instructions.
3) A better use of a second register is to hold a two-word hash state.
With careful design, no temporaries are needed at all, so it doesn't
increase register pressure. And this gets rid of register copying
on 2-operand machines, so the code is smaller and faster.
4) Using two words of state weakens the requirement for one-round mixing;
we now have two rounds of mixing before cancellation is possible.
5) A two-word hash state also allows operations on both halves to be
done in parallel, so on a superscalar processor we get more mixing
in fewer cycles.
I ended up using a mixing function inspired by the ChaCha and Speck
round functions. It is 6 simple instructions and 3 cycles per iteration
(assuming multiply by 9 can be done by an "lea" instruction):
x ^= *input++;
y ^= x; x = ROL(x, K1);
x += y; y = ROL(y, K2);
y *= 9;
Not only is this reversible, two consecutive rounds are reversible:
if you are given the initial and final states, but not the intermediate
state, it is possible to compute both input words. This means that at
least 3 words of input are required to create a collision.
(It also has the property, used by hash_name() to avoid a branch, that
it hashes all-zero to all-zero.)
The rotate constants K1 and K2 were found by experiment. The search took
a sample of random initial states (I used 1023) and considered the effect
of flipping each of the 64 input bits on each of the 128 output bits two
rounds later. Each of the 8192 pairs can be considered a biased coin, and
adding up the Shannon entropy of all of them produces a score.
The best-scoring shifts also did well in other tests (flipping bits in y,
trying 3 or 4 rounds of mixing, flipping all 64*63/2 pairs of input bits),
so the choice was made with the additional constraint that the sum of the
shifts is odd and not too close to the word size.
The final state is then folded into a 32-bit hash value by a less carefully
optimized multiply-based scheme. This also has to be fast, as pathname
components tend to be short (the most common case is one iteration!), but
there's some room for latency, as there is a fair bit of intervening logic
before the hash value is used for anything.
(Performance verified with "bonnie++ -s 0 -n 1536:-2" on tmpfs. I need
a better benchmark; the numbers seem to show a slight dip in performance
between 4.6.0 and this patch, but they're too noisy to quote.)
Special thanks to Bruce fields for diligent testing which uncovered a
nasty fencepost error in an earlier version of this patch.
[checkpatch.pl formatting complaints noted and respectfully disagreed with.]
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
Tested-by: J. Bruce Fields <bfields@redhat.com>
We'd like to make more use of the highly-optimized dcache hash functions
throughout the kernel, rather than have every subsystem create its own,
and a function that hashes basic null-terminated strings is required
for that.
(The name is to emphasize that it returns both hash and length.)
It's actually useful in the dcache itself, specifically d_alloc_name().
Other uses in the next patch.
full_name_hash() is also tweaked to make it more generally useful:
1) Take a "char *" rather than "unsigned char *" argument, to
be consistent with hash_name().
2) Handle zero-length inputs. If we want more callers, we don't want
to make them worry about corner cases.
Signed-off-by: George Spelvin <linux@sciencehorizons.net>
improvements of UBI and UBIFS.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJXSJl6AAoJEEtJtSqsAOnWbbAP/2ls5KpGsRSoOP4NsKo5+8k9
GsZ8hb0iN+UGDxMB5Jlj6O1ISNPz/8o7iGBuWa5OWFdh4Nn38sA1Qv996Rg3Ca3O
8KYEGAD7POWANfxCTmyQX/L8AsOP62B3diFktPyGrtYmLsVzx/AFN9/nM27ticdF
PQpUtLwvL/m7/oE6ymFCB4x34+DkTa7Jx4e+chVlF61CUipGc5I60VVX1Do+/DWt
S37ajjIxMJx0BO7Zs6vrcH9uGFSbvSpVBr9/sC16t0Fa1/vAa9pXieg3y5XZmtOl
+6NlwUxXVu5IjHkYYGZP8jsrA7bKJlflgCo/w6WKC6V0KECnBx+oSG9uMjDVPBSM
rc4VF4nevnNLNqFGBRWhxiYrRUCdB1TcWVDOzM16peNeUNJiWj/+4PbsiJQBwECH
ZnE/dBrjU7v+J8UmHsUJSveW6/5PluJf1loDzrip7gk5QD13wdpPo3VOycOj1vMD
iS6H/TBjtcEzWWh7qcBbtgkbHQH0g+w9YSzVy7ODFweuoq+4qCZ1davqzzZaRv3q
hylor1gea3b4GnH1EDjmXyQOOuIf2rLdJhou8WoZhmy2q3FrnE+89/CCOCMkeQMQ
qQr1lAbxD7A+8bYwSZ7Nvv6lrBUrat59DV6FHFc69MdkGT7J6z85ZOyd3lxRvJc/
XiZZkKhrYSKgV/iLFhHW
=L16y
-----END PGP SIGNATURE-----
Merge tag 'upstream-4.7-rc1' of git://git.infradead.org/linux-ubifs
Pull UBI/UBIFS updates from Richard Weinberger:
"This contains mostly cleanups and minor improvements of UBI and UBIFS"
* tag 'upstream-4.7-rc1' of git://git.infradead.org/linux-ubifs:
ubifs: ubifs_dump_inode: Fix dumping field bulk_read
UBI: Fix static volume checks when Fastmap is used
UBI: Set free_count to zero before walking through erase list
UBI: Silence an unintialized variable warning
UBI: Clean up return in ubi_remove_volume()
UBI: Modify wrong comment in ubi_leb_map function.
UBI: Don't read back all data in ubi_eba_copy_leb()
UBI: Add ro-mode sysfs attribute
Older versions of gcc don't understand named initializers inside a
anonymous structure or union member. It can be worked around by adding
the bracin gin the initializer for the anonymous member.
Without this, gcc 4.4.4 will fail the build with
CC fs/nfs/nfs4state.o
fs/nfs/nfs4state.c:69: error: unknown field ‘data’ specified in initializer
fs/nfs/nfs4state.c:69: warning: missing braces around initializer
fs/nfs/nfs4state.c:69: warning: (near initialization for ‘zero_stateid.<anonymous>.data’)
make[2]: *** [fs/nfs/nfs4state.o] Error 1
introduced in commit 93b717fd81 ("NFSv4: Label stateids with the type")
Reported-and-tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Anna Schumaker <Anna.Schumaker@netapp.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfs fixes from Al Viro:
"Followups to the parallel lookup work:
- update docs
- restore killability of the places that used to take ->i_mutex
killably now that we have down_write_killable() merged
- Additionally, it turns out that I missed a prerequisite for
security_d_instantiate() stuff - ->getxattr() wasn't the only thing
that could be called before dentry is attached to inode; with smack
we needed the same treatment applied to ->setxattr() as well"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
switch ->setxattr() to passing dentry and inode separately
switch xattr_handler->set() to passing dentry and inode separately
restore killability of old mutex_lock_killable(&inode->i_mutex) users
add down_write_killable_nested()
update D/f/directory-locking
smack ->d_instantiate() uses ->setxattr(), so to be able to call it before
we'd hashed the new dentry and attached it to inode, we need ->setxattr()
instances getting the inode as an explicit argument rather than obtaining
it from dentry.
Similar change for ->getxattr() had been done in commit ce23e64. Unlike
->getxattr() (which is used by both selinux and smack instances of
->d_instantiate()) ->setxattr() is used only by smack one and unfortunately
it got missed back then.
Reported-by: Seung-Woo Kim <sw0312.kim@samsung.com>
Tested-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull overlayfs update from Miklos Szeredi:
"The meat of this is a change to use the mounter's credentials for
operations that require elevated privileges (such as whiteout
creation). This fixes behavior under user namespaces as well as being
a nice cleanup"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: Do d_type check only if work dir creation was successful
ovl: update documentation
ovl: override creds with the ones from the superblock mounter
Pull btrfs cleanups and fixes from Chris Mason:
"We have another round of fixes and a few cleanups.
I have a fix for short returns from btrfs_copy_from_user, which
finally nails down a very hard to find regression we added in v4.6.
Dave is pushing around gfp parameters, mostly to cleanup internal apis
and make it a little more consistent.
The rest are smaller fixes, and one speelling fixup patch"
* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (22 commits)
Btrfs: fix handling of faults from btrfs_copy_from_user
btrfs: fix string and comment grammatical issues and typos
btrfs: scrub: Set bbio to NULL before calling btrfs_map_block
Btrfs: fix unexpected return value of fiemap
Btrfs: free sys_array eb as soon as possible
btrfs: sink gfp parameter to convert_extent_bit
btrfs: make state preallocation more speculative in __set_extent_bit
btrfs: untangle gotos a bit in convert_extent_bit
btrfs: untangle gotos a bit in __clear_extent_bit
btrfs: untangle gotos a bit in __set_extent_bit
btrfs: sink gfp parameter to set_record_extent_bits
btrfs: sink gfp parameter to set_extent_new
btrfs: sink gfp parameter to set_extent_defrag
btrfs: sink gfp parameter to set_extent_delalloc
btrfs: sink gfp parameter to clear_extent_dirty
btrfs: sink gfp parameter to clear_record_extent_bits
btrfs: sink gfp parameter to clear_extent_bits
btrfs: sink gfp parameter to set_extent_bits
btrfs: make find_workspace warn if there are no workspaces
btrfs: make find_workspace always succeed
...
The do_brk() and vm_brk() return value was "unsigned long" and returned
the starting address on success, and an error value on failure. The
reasons are entirely historical, and go back to it basically behaving
like the mmap() interface does.
However, nobody actually wanted that interface, and it causes totally
pointless IS_ERR_VALUE() confusion.
What every single caller actually wants is just the simpler integer
return of zero for success and negative error number on failure.
So just convert to that much clearer and more common calling convention,
and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most users of IS_ERR_VALUE() in the kernel are wrong, as they
pass an 'int' into a function that takes an 'unsigned long'
argument. This happens to work because the type is sign-extended
on 64-bit architectures before it gets converted into an
unsigned type.
However, anything that passes an 'unsigned short' or 'unsigned int'
argument into IS_ERR_VALUE() is guaranteed to be broken, as are
8-bit integers and types that are wider than 'unsigned long'.
Andrzej Hajda has already fixed a lot of the worst abusers that
were causing actual bugs, but it would be nice to prevent any
users that are not passing 'unsigned long' arguments.
This patch changes all users of IS_ERR_VALUE() that I could find
on 32-bit ARM randconfig builds and x86 allmodconfig. For the
moment, this doesn't change the definition of IS_ERR_VALUE()
because there are probably still architecture specific users
elsewhere.
Almost all the warnings I got are for files that are better off
using 'if (err)' or 'if (err < 0)'.
The only legitimate user I could find that we get a warning for
is the (32-bit only) freescale fman driver, so I did not remove
the IS_ERR_VALUE() there but changed the type to 'unsigned long'.
For 9pfs, I just worked around one user whose calling conventions
are so obscure that I did not dare change the behavior.
I was using this definition for testing:
#define IS_ERR_VALUE(x) ((unsigned long*)NULL == (typeof (x)*)NULL && \
unlikely((unsigned long long)(x) >= (unsigned long long)(typeof(x))-MAX_ERRNO))
which ends up making all 16-bit or wider types work correctly with
the most plausible interpretation of what IS_ERR_VALUE() was supposed
to return according to its users, but also causes a compile-time
warning for any users that do not pass an 'unsigned long' argument.
I suggested this approach earlier this year, but back then we ended
up deciding to just fix the users that are obviously broken. After
the initial warning that caused me to get involved in the discussion
(fs/gfs2/dir.c) showed up again in the mainline kernel, Linus
asked me to send the whole thing again.
[ Updated the 9p parts as per Al Viro - Linus ]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andrzej Hajda <a.hajda@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.org/lkml/2016/1/7/363
Link: https://lkml.org/lkml/2016/5/27/486
Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> # For nvmem part
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently direct writes inside i_size on a DIO_SKIP_HOLES filesystem are
not allowed to allocate blocks(get_more_blocks() sets 'create' to 0
before calling get_block() callback), if it's a sparse file, direct
writes fall back to buffered writes to avoid stale data exposure from
concurrent buffered read. But there're two cases that can result in
stale data exposure are not correctly detected.
1. The detection for "writing inside i_size" is not sufficient,
writes can be treated as "extending writes" wrongly. For example,
direct write 1FSB (file system block) to a 1FSB sparse file on
ext2/3/4, starting from offset 0, in this case it's writing inside
i_size, but 'create' is non-zero, because 'block_in_file' and
'(i_size_read(inode) >> blkbits' are both zero.
2. Direct writes starting from or beyong i_size (not inside i_size)
also could trigger block allocation and expose stale data. For
example, consider a sparse file with i_size of 2k, and a write to
offset 2k or 3k into the file, with a filesystem block size of 4k.
(Thanks to Jeff Moyer for pointing this case out in his review.)
The first problem can be demostrated by running ltp-aiodio test ADSP045
many times. When testing on extN filesystems, I see test failures
occasionally, buffered read could read non-zero (stale) data.
ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 1
dio_sparse 0 TINFO : Dirtying free blocks
dio_sparse 0 TINFO : Starting I/O tests
non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
non-zero read at offset 0
dio_sparse 0 TINFO : Killing childrens(s)
dio_sparse 1 TFAIL : dio_sparse.c:191: 1 children(s) exited abnormally
The second problem can also be reproduced easily by a hacked dio_sparse
program, which accepts an option to specify the write offset.
What we should really do is to disable block allocation for writes that
could result in filling holes inside i_size.
Link: http://lkml.kernel.org/r/1463156728-13357-1-git-send-email-guaneryu@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Two new messages are added to support negotiating hb timeout. Stop
nodes frmo talking an old version to mount as they will cause the
negotiation to fail.
Link: http://lkml.kernel.org/r/1464231615-27939-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hr_last_timeout_start should be set as the last time where hb is
still OK. When hb write timeout, hung time will be (jiffies -
hr_last_timeout_start).
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sometimes io error is returned when storage is down for a while. Like
for iscsi device, stroage is made offline when session timeout, and this
will make all io return -EIO. For this case, nodes shouldn't do
negotiate timeout but should fence self. So let nodes fence self when
o2hb_do_disk_heartbeat return an error, this is the same behavior with
o2hb without negotiate timer.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This message is used to re-queue write timeout timer and negotiate timer
when all nodes suffer a write hung to storage, this makes node not fence
self if storage down.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This message is sent to master node when non-master nodes's negotiate
timer expired. Master node records these nodes in a bitmap which is
used to do write timeout timer re-queue decision.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series of patches is to fix the issue that when storage down, all
nodes will fence self due to write timeout.
With this patch set, all nodes will keep going until storage back
online, except if the following issue happens, then all nodes will do as
before to fence self.
1. io error got
2. network between nodes down
3. nodes panic
This patch (of 6):
When storage down, all nodes will fence self due to write timeout. The
negotiate timer is designed to avoid this, with it node will wait until
storage up again.
Negotiate timer working in the following way:
1. The timer expires before write timeout timer, its timeout is half
of write timeout now. It is re-queued along with write timeout timer.
If expires, it will send NEGO_TIMEOUT message to master node(node with
lowest node number). This message does nothing but marks a bit in a
bitmap recording which nodes are negotiating timeout on master node.
2. If storage down, nodes will send this message to master node, then
when master node finds its bitmap including all online nodes, it sends
NEGO_APPROVL message to all nodes one by one, this message will
re-queue write timeout timer and negotiate timer. For any node doesn't
receive this message or meets some issue when handling this message, it
will be fenced. If storage up at any time, o2hb_thread will run and
re-queue all the timer, nothing will be affected by these two steps.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Gang He <ghe@suse.com>
Cc: rwxybh <rwxybh@126.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
d_type check requires successful creation of workdir as iterates
through work dir and expects work dir to be present in it. If that's
not the case, this check will always return d_type not supported even
if underlying filesystem might be supporting it.
So don't do this check if work dir creation failed in previous step.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
In user namespace the whiteout creation fails with -EPERM because the
current process isn't capable(CAP_SYS_ADMIN) when setting xattr.
A simple reproducer:
$ mkdir upper lower work merged lower/dir
$ sudo mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
$ unshare -m -p -f -U -r bash
Now as root in the user namespace:
\# touch merged/dir/{1,2,3} # this will force a copy up of lower/dir
\# rm -fR merged/*
This ends up failing with -EPERM after the files in dir has been
correctly deleted:
unlinkat(4, "2", 0) = 0
unlinkat(4, "1", 0) = 0
unlinkat(4, "3", 0) = 0
close(4) = 0
unlinkat(AT_FDCWD, "merged/dir", AT_REMOVEDIR) = -1 EPERM (Operation not
permitted)
Interestingly, if you don't place files in merged/dir you can remove it,
meaning if upper/dir does not exist, creating the char device file works
properly in that same location.
This patch uses ovl_sb_creator_cred() to get the cred struct from the
superblock mounter and override the old cred with these new ones so that
the whiteout creation is possible because overlay is wrong in assuming that
the creds it will get with prepare_creds will be in the initial user
namespace. The old cap_raise game is removed in favor of just overriding
the old cred struct.
This patch also drops from ovl_copy_up_one() the following two lines:
override_cred->fsuid = stat->uid;
override_cred->fsgid = stat->gid;
This is because the correct uid and gid are taken directly with the stat
struct and correctly set with ovl_set_attr().
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
- We use a bit in an exceptional radix tree entry as a lock bit and use it
similarly to how page lock is used for normal faults. This fixes races
between hole instantiation and read faults of the same index.
- Filesystem DAX PMD faults are disabled, and will be re-enabled when PMD
locking is implemented.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXRKwLAAoJEJ/BjXdf9fLB+BkP/3HBm05KlAKDklvnBIPFDMUK
hA7g2K6vuvaEDZXZQ1ioc1Ajf1sCpVip7shXJsojZqwWmRz0/4nneF7ytluW9AjS
dBX+0qCgKGH1fnwyGFF+MN7fuj7kGrSDz34lG0OObRN6/oKiVNb2svXiYKkT6J6C
AgsWlWRUpMy9jrn1u/FduMjDhk92Z3ojarexuicr0i8NUlBClCIrdCEmUMi4orSB
DuiIjestLOc7+mERBUwrXkzoh9v8Z0FpIgnDLWwpeEkAvJwWkGe5eXrBJwF+hEbi
RYfTrOYc7bBQLo22LRb8pdighjrx3OW9EpNCfEmLDOjM3cYBbMK/d2i/ww52H6IK
Mw6iS5rXdGgJtQIGL8N96HLFk+cDyZ8J8xNUCwbYYBJqgpMzxzVkL3vTm72tyFnl
InWhih+miCMbBPytQSRd6+1wZG2piJTv6SsFTd5K1OaiRmJhBJZG47t2QTBRBu7Y
5A4FGPtlraV+iDJvD6VLO1Tp8twxdLluOJ2BwdGeiKXiGh6LP+FGGFF3aFa5N4Ro
xSslCTX7Q1G66zXQwD4+IMWLwS1FDNymPkUSsF6RQo6qfAnl9SrmYTc4xJ4QXy92
sUdrWEz2OBTfxKNqbGyc/KrXKZT3RnEkJNft8snB2h6WTCdOPaNYs/yETUwiwkSc
CXpuQFrxm69QYwNsqVu1
=Pkd0
-----END PGP SIGNATURE-----
Merge tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull DAX locking updates from Ross Zwisler:
"Filesystem DAX locking for 4.7
- We use a bit in an exceptional radix tree entry as a lock bit and
use it similarly to how page lock is used for normal faults. This
fixes races between hole instantiation and read faults of the same
index.
- Filesystem DAX PMD faults are disabled, and will be re-enabled when
PMD locking is implemented"
* tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: Remove i_mmap_lock protection
dax: Use radix tree entry lock to protect cow faults
dax: New fault locking
dax: Allow DAX code to replace exceptional entries
dax: Define DAX lock bit for radix tree exceptional entry
dax: Make huge page handling depend of CONFIG_BROKEN
dax: Fix condition for filling of PMD holes
- Until now, dax has been disabled if media errors were found on
any device. This enables the use of DAX in the presence of these
errors by making all sector-aligned zeroing go through the driver.
- The driver (already) has the ability to clear errors on writes that
are sent through the block layer using 'DSMs' defined in ACPI 6.1.
Other misc changes:
- When mounting DAX filesystems, check to make sure the partition
is page aligned. This is a requirement for DAX, and previously, we
allowed such unaligned mounts to succeed, but subsequent reads/writes
would fail.
- Misc/cleanup fixes from Jan that remove unused code from DAX related to
zeroing, writeback, and some size checks.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXQ4GKAAoJEHr6Yb6juE3/zowP/iclIhgXXXMQJRUHJlePMXC8
15sGZ32JS1ak9g7vrsmNVEDNynfNtiMYdBxtUyRuj6xqgwdZvFk3F55KOCPtaeA1
+yADkgeRkTAcwzmHw9WQVEzBCqyzSisdrwtEfH817qdq9FJdH66x2Kos6i+HeAVr
5Q/e4gs7lKrjf384/QBl+wxNZOndJaQAPd2VRHQqx2A9F33v0ljdwRaUG1r4fjK2
dtmhcZCqdQyuAGXW3piTnZc5ZFc3DPqO4FkEfqkEK3lFOflK0fd8wMsAZRp/Jd0j
GJsgnVSWSqG0Dz476djlG0w8t2p5Jv1g9cKChV+ZZEdFLKWHCOUFqXNj8uI8I4k5
cOEKCHyJ3IwfSHhNQqktEWrQN4T8ZXhWtuc9GuV4UZYuqJqHci6EdR/YsWsJjV+L
lm/qvK4ipDS1pivxOy8KX/iN0z7Io8J9GXpStDx3g8iWjLlh4YYlbJLWeeRepo/z
aPlV/QAKcHiGY6jzLExrZIyCWkzwo6O+0p1Kxerv9/7K/32HWbOodZ+tC8eD+N25
pV69nCGf+u50T2TtIx1+iann4NC1r7zg5yqnT9AgpyZpiwR5joCDzI5sXW+D0rcS
vPtfM84Ccdeq/e6mvfIpZgR0/npQapKnrmUest0J7P2BFPHiFPji1KzZ7M+1aFOo
9R6JdrAj0Sc+FBa+cGzH
=v6Of
-----END PGP SIGNATURE-----
Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull misc DAX updates from Vishal Verma:
"DAX error handling for 4.7
- Until now, dax has been disabled if media errors were found on any
device. This enables the use of DAX in the presence of these
errors by making all sector-aligned zeroing go through the driver.
- The driver (already) has the ability to clear errors on writes that
are sent through the block layer using 'DSMs' defined in ACPI 6.1.
Other misc changes:
- When mounting DAX filesystems, check to make sure the partition is
page aligned. This is a requirement for DAX, and previously, we
allowed such unaligned mounts to succeed, but subsequent
reads/writes would fail.
- Misc/cleanup fixes from Jan that remove unused code from DAX
related to zeroing, writeback, and some size checks"
* tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: fix a comment in dax_zero_page_range and dax_truncate_page
dax: for truncate/hole-punch, do zeroing through the driver if possible
dax: export a low-level __dax_zero_page_range helper
dax: use sb_issue_zerout instead of calling dax_clear_sectors
dax: enable dax in the presence of known media errors (badblocks)
dax: fallback from pmd to pte on error
block: Update blkdev_dax_capable() for consistency
xfs: Add alignment check for DAX mount
ext2: Add alignment check for DAX mount
ext4: Add alignment check for DAX mount
block: Add bdev_dax_supported() for dax mount checks
block: Add vfs_msg() interface
dax: Remove redundant inode size checks
dax: Remove pointless writeback from dax_do_io()
dax: Remove zeroing from dax_io()
dax: Remove dead zeroing code from fault handlers
ext2: Avoid DAX zeroing to corrupt data
ext2: Fix block zeroing in ext2_get_blocks() for DAX
dax: Remove complete_unwritten argument
DAX: move RADIX_DAX_ definitions to dax.c
Previously, if a bad inode was found in ocfs2_iget(), -ESTALE was
returned back to the caller anyway. Since commit d2b9d71a2da7 ("ocfs2:
check/fix inode block for online file check") can handle with return
value from ocfs2_read_locked_inode() now, we know the exact errno
returned for us.
Link: http://lkml.kernel.org/r/1463970656-18413-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull Ceph updates from Sage Weil:
"This changeset has a few main parts:
- Ilya has finished a huge refactoring effort to sync up the
client-side logic in libceph with the user-space client code, which
has evolved significantly over the last couple years, with lots of
additional behaviors (e.g., how requests are handled when cluster
is full and transitions from full to non-full).
This structure of the code is more closely aligned with userspace
now such that it will be much easier to maintain going forward when
behavior changes take place. There are some locking improvements
bundled in as well.
- Zheng adds multi-filesystem support (multiple namespaces within the
same Ceph cluster)
- Zheng has changed the readdir offsets and directory enumeration so
that dentry offsets are hash-based and therefore stable across
directory fragmentation events on the MDS.
- Zheng has a smorgasbord of bug fixes across fs/ceph"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
ceph: fix wake_up_session_cb()
ceph: don't use truncate_pagecache() to invalidate read cache
ceph: SetPageError() for writeback pages if writepages fails
ceph: handle interrupted ceph_writepage()
ceph: make ceph_update_writeable_page() uninterruptible
libceph: make ceph_osdc_wait_request() uninterruptible
ceph: handle -EAGAIN returned by ceph_update_writeable_page()
ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM
ceph: block non-fatal signals for fault/page_mkwrite
ceph: make logical calculation functions return bool
ceph: tolerate bad i_size for symlink inode
ceph: improve fragtree change detection
ceph: keep leaf frag when updating fragtree
ceph: fix dir_auth check in ceph_fill_dirfrag()
ceph: don't assume frag tree splits in mds reply are sorted
ceph: fix inode reference leak
ceph: using hash value to compose dentry offset
ceph: don't forbid marking directory complete after forward seek
ceph: record 'offset' for each entry of readdir result
ceph: define 'end/complete' in readdir reply as bit flags
...
Highlights include:
Features:
- Add support for the NFS v4.2 COPY operation
- Add support for NFS/RDMA over IPv6
Bugfixes and cleanups:
- Avoid race that crashes nfs_init_commit()
- Fix oops in callback path
- Fix LOCK/OPEN race when unlinking an open file
- Choose correct stateids when using delegations in setattr, read and write
- Don't send empty SETATTR after OPEN_CREATE
- xprtrdma: Prevent server from writing a reply into memory client has released
- xprtrdma: Support using Read list and Reply chunk in one RPC call
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJXRu76AAoJENfLVL+wpUDrDVoQAKPKv1tEVJMRUQA3UVoKoixd
KjmmZMjl6GfpISwTZl+a8W549jyGuYH7Gl8vSbMaE9/FI+kJW6XZQniTYfFqY8/a
LbMSdNx1+yURisbkyO0vPqqwKw9r6UmsfGeUT8SpS3ff61yp4Oj436ra2qcPJsZ3
cWl/lHItzX7oKFAWmr0Nmq2X8ac/8+NFyK29+V/QGfwtp3qAPbpA8XM5HrHw3rA2
uk5uNSr3hwqz7P3+Hi7ZoO2m4nQTAbQnEunfYpxlOwz4IaM7qcGnntT6Jhwq1pGE
/1YasG7bHeiWjhynmZZ4CWuMkogau2UJ/G68Cz7ehLhPNr8rH/ZFCJZ+XX0e0CgI
1d+AwxZvgszIQVBY3S7sg8ezVSCPBXRFJ8rtzggGscqC53aP7L+rLfUFH+OKrhMg
6n7RQiq4EmGDJGviB/R2HixI9CpdOf2puNhDKSJmPOqiSS7UuHMw8QCq++vdru+1
GLGunGyO7D70yTV92KtsdzJlFlnfa/g+FIJrmaMpL3HH1h0stTctWX5xlTYmqEL3
z3aUuT8RySk2t1FTabSj6KRWqE/krK5BMZbX91kpF27WL4c/olXFaZPqBDsj0q4u
2rm1fIrc8RxLXctJan9ro092s/e9dup/1JxV5XWMq/EGS1ezvf+0XkCOtURaAWp3
2aPHlx7M8iuq2SouL6f7
=QMmY
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
"Highlights include:
Features:
- Add support for the NFS v4.2 COPY operation
- Add support for NFS/RDMA over IPv6
Bugfixes and cleanups:
- Avoid race that crashes nfs_init_commit()
- Fix oops in callback path
- Fix LOCK/OPEN race when unlinking an open file
- Choose correct stateids when using delegations in setattr, read and
write
- Don't send empty SETATTR after OPEN_CREATE
- xprtrdma: Prevent server from writing a reply into memory client
has released
- xprtrdma: Support using Read list and Reply chunk in one RPC call"
* tag 'nfs-for-4.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (61 commits)
pnfs: pnfs_update_layout needs to consider if strict iomode checking is on
nfs/flexfiles: Use the layout segment for reading unless it a IOMODE_RW and reading is disabled
nfs/flexfiles: Helper function to detect FF_FLAGS_NO_READ_IO
nfs: avoid race that crashes nfs_init_commit
NFS: checking for NULL instead of IS_ERR() in nfs_commit_file()
pnfs: make pnfs_layout_process more robust
pnfs: rework LAYOUTGET retry handling
pnfs: lift retry logic from send_layoutget to pnfs_update_layout
pnfs: fix bad error handling in send_layoutget
flexfiles: add kerneldoc header to nfs4_ff_layout_prepare_ds
flexfiles: remove pointless setting of NFS_LAYOUT_RETURN_REQUESTED
pnfs: only tear down lsegs that precede seqid in LAYOUTRETURN args
pnfs: keep track of the return sequence number in pnfs_layout_hdr
pnfs: record sequence in pnfs_layout_segment when it's created
pnfs: don't merge new ff lsegs with ones that have LAYOUTRETURN bit set
pNFS/flexfiles: When initing reads or writes, we might have to retry connecting to DSes
pNFS/flexfiles: When checking for available DSes, conditionally check for MDS io
pNFS/flexfile: Fix erroneous fall back to read/write through the MDS
NFS: Reclaim writes via writepage are opportunistic
NFSv4: Use the right stateid for delegations in setattr, read and write
...
Changes in this update:
o fixes for mount line parsing, sparse warnings, read-only compat
feature remount behaviour
o allow fast path symlink lookups for inline symlinks.
o attribute listing cleanups
o writeback goes direct to bios rather than indirecting through
bufferheads
o transaction allocation cleanup
o optimised kmem_realloc
o added configurable error handling for metadata write errors,
changed default error handling behaviour from "retry forever" to
"retry until unmount then fail"
o fixed several inode cluster writeback lookup vs reclaim race
conditions
o fixed inode cluster writeback checking wrong inode after lookup
o fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
o cleaned up inode reclaim tagging
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXRo8LAAoJEK3oKUf0dfodxLgP+wQMd46i/nCncr6jMcdoVXfL
rE6cL1LJWWVOWzax/bmdlV1VNXqqW7n0ABAVMqikzqSEd+fBQe/HOkdBeVUywu7o
bmqgNxuofMqHaiYhiTvUijHLHWLFxIgd/jNT7L5oGRzZdmP260VGf3EPipN7aA9U
Y3KVhFQCqohpeIUeSV4Z/eIDdHN5LyUI1s+7zMLquHKCWyO4aO4GBX8YlyXdRRVe
cwCZb4TBryS0PBCIra31MZ5wBRwLx8PBXqcNsnTQSR5Uu+WeuwxofXz5q3kdmNOU
XGTWiabQbcvaC4twrzqnErHEX41PAs43tWxsI/qJH49QIFdfOYM+t8ERhEa2Q3DW
Ihl+Q/2qiOuZZterG/t5MrxhybrmQhEFVJT6Ib8b/CnwpRm+K8kWTead1YJL8Xzd
F9k8e57BCgTbDA7jWxWDbp7eQ1/4KglBD4sefFPpsuFgO882mmo5GmymALGjmitw
JH1v3HL3PeTkQoHfcay8ZM/zNjX643CXHwCWYEOAgD+e77TPiOs/cHLZaXbrBkLK
PpSJNfYiBe31eeSOEGsxivMapLpus+cHZyK3uR+XU+naJhjOdaBDTTo8RsAD9jS5
C/dzxc4l7o+gYT+UjV5KtyfEeVwtGo5mtR9XozPbNDjNQor8Vo6NQMZXMXpFYDZI
2XgzVNpkEf/74kexdEzV
=0tYo
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs updates from Dave Chinner:
"A pretty average collection of fixes, cleanups and improvements in
this request.
Summary:
- fixes for mount line parsing, sparse warnings, read-only compat
feature remount behaviour
- allow fast path symlink lookups for inline symlinks.
- attribute listing cleanups
- writeback goes direct to bios rather than indirecting through
bufferheads
- transaction allocation cleanup
- optimised kmem_realloc
- added configurable error handling for metadata write errors,
changed default error handling behaviour from "retry forever" to
"retry until unmount then fail"
- fixed several inode cluster writeback lookup vs reclaim race
conditions
- fixed inode cluster writeback checking wrong inode after lookup
- fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
- cleaned up inode reclaim tagging"
* tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
xfs: fix warning in xfs_finish_page_writeback for non-debug builds
xfs: move reclaim tagging functions
xfs: simplify inode reclaim tagging interfaces
xfs: rename variables in xfs_iflush_cluster for clarity
xfs: xfs_iflush_cluster has range issues
xfs: mark reclaimed inodes invalid earlier
xfs: xfs_inode_free() isn't RCU safe
xfs: optimise xfs_iext_destroy
xfs: skip stale inodes in xfs_iflush_cluster
xfs: fix inode validity check in xfs_iflush_cluster
xfs: xfs_iflush_cluster fails to abort on error
xfs: remove xfs_fs_evict_inode()
xfs: add "fail at unmount" error handling configuration
xfs: add configuration handlers for specific errors
xfs: add configuration of error failure speed
xfs: introduce table-based init for error behaviors
xfs: add configurable error support to metadata buffers
xfs: introduce metadata IO error class
xfs: configurable error behavior via sysfs
xfs: buffer ->bi_end_io function requires irq-safe lock
...
As flexfiles has FF_FLAGS_NO_READ_IO, there is a need to generically
support enforcing that a IOMODE_RW segment will not allow READ I/O.
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We should reset i_requested_max_size before waking the waiters.
(zero i_requested_max_size make waiter re-request the max size)
Signed-off-by: Yan, Zheng <zyan@redhat.com>
truncate_pagecache() drops dirty pages, it's dangerous to use it
to invalidate read cache. Besides, we shouldn't start invalidating
read cache while there are buffer writers. Because buffer writers
may add dirty pages later.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
writepage() can be interrupted when it's called by direct memory
reclaimer (the direct memory relaimer is killed). To avoid lossing
data, we redirty the page.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
ceph_update_writeable_page() is used by ceph_write_begin(). It beaks
atomicity of write operation if it's interruptible.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
when ceph_update_writeable_page() return -EAGAIN, caller should
lock the page and call ceph_update_writeable_page() again.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Fault and page_mkwrite are supposed to be uninterruptable. But they
call ceph functions that are interruptible. So they should block
signals before calling functions that are interruptible
Signed-off-by: Yan, Zheng <zyan@redhat.com>
This patch makes serverl logical caculation functions return bool to
improve readability due to these particular functions only using 0/1
as their return value.
No functional change.
Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
Nodes in i_fragtree are sorted according to ceph_compare_frag().
It means frag node in i_fragtree always follow its direct parent
node. To check if a leaf node is valid, we just need to check if
it's child of previous split node.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
The algorithm that updates i_fragtree relies on that the frag tree
splits in mds reply are of the same order of i_fragtree. This is not
true because current MDS encodes frag tree splits in ascending order
of (unsigned)frag_t. But nodes in i_fragtree are sorted according to
ceph_frag_compare().
The fix is sort the frag tree splits first, then updates i_fragtree.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
If MDS sorts dentries in dirfrag in hash order, we use hash value to
compose dentry offset. dentry offset is:
(0xff << 52) | ((24 bits hash) << 28) |
(the nth entry hash hash collision)
This offset is stable across directory fragmentation. This alos means
there is no need to reset readdir offset if directory get fragmented
in the middle of readdir.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Forward seek within same frag does not update fi->last_name, it will
not affect contents of later readdir reply. So there is no need to
forbid marking directory complete
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Set a flag in readdir request, which indicates that client interprets
'end/complete' as bit flags. So that mds can reply additional flags in
readdir reply.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Setxattr with NULL value and XATTR_REPLACE flag should be equivalent
to removexattr. But current MDS does not support deleting vxattrs through
MDS_OP_SETXATTR request. The workaround is sending MDS_OP_RMXATTR request
if setxattr actually removs xattr.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
truncate_pagecache() may decrease inode's reference. This can cause
deadlock if inode's last reference is dropped and iput_final() wants
to evict the inode. (evict() calls inode_wait_for_writeback(), which
waits for ceph_writepages_start() to return).
The fix is use work thead to truncate dirty pages. Also add 'forced
umount' check to ceph_update_writeable_page(), which prevents new
pages getting dirty.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
When mds session gets killed, read/write operation may hang.
Client waits for Frw caps, but mds does not know what caps client
wants. To recover this, client sends an open request to mds. The
request will tell mds what caps client wants.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
To access non-default filesystem, we just need to subscribe to
mdsmap.<MDS_NAMESPACE_ID> and add a new mount option for mds
namespace id.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: switch to a new libceph API]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This is a major sync up, up to ~Jewel. The highlights are:
- per-session request trees (vs a global per-client tree)
- per-session locking (vs a global per-client rwlock)
- homeless OSD session
- no ad-hoc global per-client lists
- support for pool quotas
- foundation for watch/notify v2 support
- foundation for map check (pool deletion detection) support
The switchover is incomplete: lingering requests can be setup and
teared down but aren't ever reestablished. This functionality is
restored with the introduction of the new lingering infrastructure
(ceph_osd_linger_request, linger_work, etc) in a later commit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull vfs xattr regression fixes from Al Viro.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
make xattr_resolve_handlers() safe to use with NULL ->s_xattr
xattr: Fail with -EINVAL for NULL attribute names
If you specify ACK | ONDISK and set ->r_unsafe_callback, both
->r_callback and ->r_unsafe_callback(true) are called on ack. This is
very confusing. Redo this so that only one of them is called:
->r_unsafe_callback(true), on ack
->r_unsafe_callback(false), on commit
or
->r_callback, on ack|commit
Decode everything in decode_MOSDOpReply() to reduce clutter.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
finish_read(), its only user, uses it to get to hdr.data_len, which is
what ->r_result is set to on success. This gains us the ability to
safely call callbacks from contexts other than reply, e.g. map check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The crux of this is getting rid of ceph_osdc_build_request(), so that
MOSDOp can be encoded not before but after calc_target() calculates the
actual target. Encoding now happens within ceph_osdc_start_request().
Also nuked is the accompanying bunch of pointers into the encoded
buffer that was used to update fields on each send - instead, the
entire front is re-encoded. If we want to support target->name_len !=
base->name_len in the future, there is no other way, because oid is
surrounded by other fields in the encoded buffer.
Encoding OSD ops and adding data items to the request message were
mixed together in osd_req_encode_op(). While we want to re-encode OSD
ops, we don't want to add duplicate data items to the message when
resending, so all call to ceph_osdc_msg_data_add() are factored out
into a new setup_request_data().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Introduce ceph_osd_request_target, containing all mapping-related
fields of ceph_osd_request and calc_target() for calculating mappings
and populating it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Rename ceph_calc_pg_primary() to ceph_pg_to_acting_primary() to
emphasise that it returns acting primary.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Rename ceph_oloc_oid_to_pg() to ceph_object_locator_to_pg(). Emphasise
that returned is raw PG and return -ENOENT instead of -EIO if the pool
doesn't exist.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Given
struct foo {
u64 id;
struct rb_node bar_node;
};
generate insert_bar(), erase_bar() and lookup_bar() functions with
DEFINE_RB_FUNCS(bar, struct foo, id, bar_node)
The key is assumed to be an integer (u64, int, etc), compared with
< and >. nodefld has to be initialized with RB_CLEAR_NODE().
Start using it for MDS, MON and OSD requests and OSD sessions.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently ceph_object_id can hold object names of up to 100
(CEPH_MAX_OID_NAME_LEN) characters. This is enough for all use cases,
expect one - long rbd image names:
- a format 1 header is named "<imgname>.rbd"
- an object that points to a format 2 header is named "rbd_id.<imgname>"
We operate on these potentially long-named objects during rbd map, and,
for format 1 images, during header refresh. (A format 2 header name is
a small system-generated string.)
Lift this 100 character limit by making ceph_object_id be able to point
to an externally-allocated string. Apart from being able to work with
almost arbitrarily-long named objects, this allows us to reduce the
size of ceph_object_id from >100 bytes to 64 bytes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The size of ->r_request and ->r_reply messages depends on the size of
the object name (ceph_object_id), while the size of ceph_osd_request is
fixed. Move message allocation into a separate function that would
have to be called after ceph_object_id and ceph_object_locator (which
is also going to become variable in size with RADOS namespaces) have
been filled in:
req = ceph_osdc_alloc_request(...);
<fill in req->r_base_oid>
<fill in req->r_base_oloc>
ceph_osdc_alloc_messages(req);
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Commit 98e9cb57 improved the xattr name checks in xattr_resolve_name but
didn't update the NULL attribute name check appropriately, so NULL
attribute names lead to NULL pointer dereferences. Turn that into
-EINVAL results instead.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
fs/xattr.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We usually call btrfs_put_bbio() when btrfs_map_block() failed,
btrfs_put_bbio() works right whether bbio is a valid value, or NULL.
But there is a exception, in some case, btrfs_map_block() will return
fail without touching *bbio(keeping its original value), and if bbio
was not initialized yet, invalid memory accessing will happened.
Above case is in scrub_missing_raid56_pages(), and similar case in
scrub_raid56_parity().
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs's fiemap is supposed to return 0 on success and return < 0 on
error. however, ret becomes 1 after looking up the last file extent:
btrfs_lookup_file_extent ->
btrfs_search_slot(..., ins_len=0, cow=0)
and if the offset is beyond EOF, we'll get 'path' pointed to the place
of potentail insertion, and ret == 1.
This may confuse applications using ioctl(FIEL_IOC_FIEMAP).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While reading sys_chunk_array in superblock, btrfs creates a temporary
extent buffer. Since we don't use it after finishing reading
sys_chunk_array, we don't need to keep it in memory.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The mds can inform the client not to use the IOMODE_RW layout
segment for doing READs. I.e., it is basically a
IOMODE_WRITE layout segment.
It would do this to not interfere with the WRITEs.
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Check out-of-memory failure of the kstrdup option. Note that the argument
"arg" may be NULL (in that case kstrup returns NULL), so out of memory
condition happened if arg was non-NULL and kstrdup returned NULL.
The patch also changes the call to replace_mount_options - if we didn't
pass any filesystem-specific options, we don't call replace_mount_options
(thus we don't erase existing reported options).
Note that to properly report options after remount, the reiserfs
filesystem should implement the show_options method. Without the
show_options method, options changed with remount replace existing
options.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXRL2PAAoJECebzXlCjuG+c34P/1wnkehVxDozBJp7UEzhrsE/
U1dpwfykzVEIMh68TldBvyrt2Lb4ThLPZ7V2dVwNqA831S/VM6fWJyw8WerSgGgU
SUGOzdF04rNfy41lXQNpDiiC417Fbp4Js4O+Q5kd+8kqQbXYqCwz0ce3DVbAT571
JmJgBI8gZLhicyNRDOt0y6C+/3P+0bbXYvS8wkzY+CwbNczHJOCLhwViKzWTptm9
LCSgDGm68ckpR7mZkWfEF3WdiZ9+SxeI+pT9dcomzxNfbv8NluDplYmdLbepA2J8
uWHGprVe9WJMDnw4hJhrI2b3/rHIntpxuZYktmnb/z/ezBTyi3FXYWgAEdE1by+Y
Gf7OewKOp8XcQ/iHRZ8vwXNrheHAr9++SB49mGBZJ3qj6bO+FrISQKX9FRxo6PrJ
SDRgYjt5yUG2oD1AAs1NzuBPqZzR40mA6Yk4zuNAcxzK/S7DdRF/9Kjyk86TVv08
3E3O5i1RyVcU/A7JdnbiyeDFMQoRshdnN0HShIZcSfcfW+qFKghNlO9bFfSl904F
jlG6moNB5OBiV8FNOelY+HGAYoUdw120QxqQMv47oZGKCjv+rfK38aB4GBJ4iEuo
TrGqNmrMrs/AKdL3Sd+8LuJqSfXggrwUDc/KS6CFz/U0eBbp6k0kcd7FEyG/J8kW
JxQ0URgyJ+DHfc60E8LN
=k6RP
-----END PGP SIGNATURE-----
Merge tag 'nfsd-4.7' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"A very quiet cycle for nfsd, mainly just an RDMA update from Chuck
Lever"
* tag 'nfsd-4.7' of git://linux-nfs.org/~bfields/linux:
sunrpc: fix stripping of padded MIC tokens
svcrpc: autoload rdma module
svcrdma: Generalize svc_rdma_xdr_decode_req()
svcrdma: Eliminate code duplication in svc_rdma_recvfrom()
svcrdma: Drain QP before freeing svcrdma_xprt
svcrdma: Post Receives only for forward channel requests
svcrdma: Remove superfluous line from rdma_read_chunks()
svcrdma: svc_rdma_put_context() is invoked twice in Send error path
svcrdma: Do not add XDR padding to xdr_buf page vector
svcrdma: Support IPv6 with NFS/RDMA
nfsd: handle seqid wraparound in nfsd4_preprocess_layout_stateid
Remove unnecessary allocation
after a crash and a potential BUG_ON crash if a file has the data
journalling flag enabled while it has dirty delayed allocation blocks
that haven't been written yet. Also fix a potential crash in the new
project quota code and a maliciously corrupted file system.
In addition, fix some DAX-specific bugs, including when there is a
transient ENOSPC situation and races between writes via direct I/O and
an mmap'ed segment that could lead to lost I/O.
Finally the usual set of miscellaneous cleanups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJXQ40fAAoJEPL5WVaVDYGjnwMH+wXHASgPfzZgtRInsTG8W/2L
jsmAcMlyMAYIATWMppNtPIq0td49z1dYO0YkKhtPVMwfzu230IFWhGWp93WqP9ve
XYHMmaBorFlMAzWgMKn1K0ExWZlV+ammmcTKgU0kU4qyZp0G/NnMtlXIkSNv2amI
9Mn6R+v97c20gn8e9HWP/IVWkgPr+WBtEXaSGjC7dL6yI8hL+rJMqN82D76oU5ea
vtwzrna/ISijy+etYmQzqHNYNaBKf40+B5HxQZw/Ta3FSHofBwXAyLaeEAr260Mf
V3Eg2NDcKQxiZ3adBzIUvrRnrJV381OmHoguo8Frs8YHTTRiZ0T/s7FGr2Q0NYE=
=7yIM
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Fix a number of bugs, most notably a potential stale data exposure
after a crash and a potential BUG_ON crash if a file has the data
journalling flag enabled while it has dirty delayed allocation blocks
that haven't been written yet. Also fix a potential crash in the new
project quota code and a maliciously corrupted file system.
In addition, fix some DAX-specific bugs, including when there is a
transient ENOSPC situation and races between writes via direct I/O and
an mmap'ed segment that could lead to lost I/O.
Finally the usual set of miscellaneous cleanups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
ext4: pre-zero allocated blocks for DAX IO
ext4: refactor direct IO code
ext4: fix race in transient ENOSPC detection
ext4: handle transient ENOSPC properly for DAX
dax: call get_blocks() with create == 1 for write faults to unwritten extents
ext4: remove unmeetable inconsisteny check from ext4_find_extent()
jbd2: remove excess descriptions for handle_s
ext4: remove unnecessary bio get/put
ext4: silence UBSAN in ext4_mb_init()
ext4: address UBSAN warning in mb_find_order_for_block()
ext4: fix oops on corrupted filesystem
ext4: fix check of dqget() return value in ext4_ioctl_setproject()
ext4: clean up error handling when orphan list is corrupted
ext4: fix hang when processing corrupted orphaned inode list
ext4: remove trailing \n from ext4_warning/ext4_error calls
ext4: fix races between changing inode journal mode and ext4_writepages
ext4: handle unwritten or delalloc buffers before enabling data journaling
ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
ext4: do not ask jbd2 to write data for delalloc buffers
jbd2: add support for avoiding data writes during transaction commits
...
The wrong field (xattr) is dumped here due to a copy-and-paste error.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Merge yet more updates from Andrew Morton:
- Oleg's "wait/ptrace: assume __WALL if the child is traced". It's a
kernel-based workaround for existing userspace issues.
- A few hotfixes
- befs cleanups
- nilfs2 updates
- sys_wait() changes
- kexec updates
- kdump
- scripts/gdb updates
- the last of the MM queue
- a few other misc things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (84 commits)
kgdb: depends on VT
drm/amdgpu: make amdgpu_mn_get wait for mmap_sem killable
drm/radeon: make radeon_mn_get wait for mmap_sem killable
drm/i915: make i915_gem_mmap_ioctl wait for mmap_sem killable
uprobes: wait for mmap_sem for write killable
prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable
exec: make exec path waiting for mmap_sem killable
aio: make aio_setup_ring killable
coredump: make coredump_wait wait for mmap_sem for write killable
vdso: make arch_setup_additional_pages wait for mmap_sem for write killable
ipc, shm: make shmem attach/detach wait for mmap_sem killable
mm, fork: make dup_mmap wait for mmap_sem for write killable
mm, proc: make clear_refs killable
mm: make vm_brk killable
mm, elf: handle vm_brk error
mm, aout: handle vm_brk failures
mm: make vm_munmap killable
mm: make vm_mmap killable
mm: make mmap_sem for write waits killable for mm syscalls
MAINTAINERS: add co-maintainer for scripts/gdb
...
setup_arg_pages requires mmap_sem for write. If the waiting task gets
killed by the oom killer it would block oom_reaper from asynchronous
address space reclaim and reduce the chances of timely OOM resolving.
Wait for the lock in the killable mode and return with EINTR if the task
got killed while waiting. All the callers are already handling error
path and the fatal signal doesn't need any additional treatment.
The same applies to __bprm_mm_init.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
aio_setup_ring waits for mmap_sem in writable mode. If the waiting task
gets killed by the oom killer it would block oom_reaper from
asynchronous address space reclaim and reduce the chances of timely OOM
resolving. Wait for the lock in the killable mode and return with EINTR
if the task got killed while waiting. This will also expedite the
return to the userspace and do_exit.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Benamin LaHaise <bcrl@kvack.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
coredump_wait waits for mmap_sem for write currently which can prevent
oom_reaper to reclaim the oom victims address space asynchronously
because that requires mmap_sem for read. This might happen if the oom
victim is multi threaded and some thread(s) is holding mmap_sem for read
(e.g. page fault) and it is stuck in the page allocator while other
thread(s) reached coredump_wait already.
This patch simply uses down_write_killable and bails out with EINTR if
the lock got interrupted by the fatal signal. do_coredump will return
right away and do_group_exit will take care to zap the whole thread
group.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CLEAR_REFS_MM_HIWATER_RSS and CLEAR_REFS_SOFT_DIRTY are relying on
mmap_sem for write. If the waiting task gets killed by the oom killer
and it would operate on the current's mm it would block oom_reaper from
asynchronous address space reclaim and reduce the chances of timely OOM
resolving. Wait for the lock in the killable mode and return with EINTR
if the task got killed while waiting. This will also expedite the
return to the userspace and do_exit even if the mm is remote.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Petr Cermak <petrcermak@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
load_elf_library doesn't handle vm_brk failure although nothing really
indicates it cannot do that because the function is allowed to fail due
to vm_mmap failures already. This might be not a problem now but later
patch will make vm_brk killable (resp. mmap_sem for write waiting will
become killable) and so the failure will be more probable.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vm_brk is allowed to fail but load_aout_binary simply ignores the error
and happily continues. I haven't noticed any problem from that in real
life but later patches will make the failure more likely because vm_brk
will become killable (resp. mmap_sem for write waiting will become
killable) so we should be more careful now.
The error handling should be quite straightforward because there are
calls to vm_mmap which check the error properly already. The only
notable exception is set_brk which is called after beyond_if label. But
nothing indicates that we cannot move it above set_binfmt as the two do
not depend on each other and fail before we do set_binfmt and alter
reference counting.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
remove_arg_zero() does free_arg_page() for no reason. This was needed
before and only if CONFIG_MMU=y: see commit 4fc75ff481 ("exec: fix
remove_arg_zero"), install_arg_page() was called for every page != NULL
in bprm->page[] array. Today install_arg_page() has already gone and
free_arg_page() is nop after another commit b6a2fea393 ("mm: variable
length argument support").
CONFIG_MMU=n does free_arg_pages() in free_bprm() and thus it doesn't
need remove_arg_zero()->free_arg_page() too; apart from get_arg_page()
it never checks if the page in bprm->page[] was allocated or not, so the
"extra" non-freed page is fine. OTOH, this free_arg_page() can add the
minor pessimization, the caller is going to do copy_strings_kernel()
right after remove_arg_zero() which will likely need to re-allocate the
same page again.
And as Hujunjie pointed out, the "offset == PAGE_SIZE" check is wrong
because we are going to increment bprm->p once again before return, so
CONFIG_MMU=n "leaks" the page anyway if '0' is the final byte in this
page.
NOTE: remove_arg_zero() assumes that argv[0] is null-terminated but this
is not necessarily true. copy_strings() does "len = strnlen_user(...)",
then copy_from_user(len) but another thread or debuger can overwrite the
trailing '0' in between. Afaics nothing really bad can happen because
we must always have the null-terminated bprm->filename copied by the 1st
copy_strings_kernel(), but perhaps we should change this code to check
"bprm->p < bprm->exec" anyway, and/or change copy_strings() to ensure
that the last byte in string is always zero.
Link: http://lkml.kernel.org/r/20160517155335.GA31435@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported by: hujunjie <jj.net@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes block comments with proper formatting to eliminate the
following checkpatch.pl warnings:
"WARNING: Block comments use * on subsequent lines"
"WARNING: Block comments use a trailing */ on a separate line"
Link: http://lkml.kernel.org/r/1462886671-3521-8-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes checkpatch.pl warning "WARNING: Single statement macros
should not use a do {} while (0) loop".
Link: http://lkml.kernel.org/r/1462886671-3521-7-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes the checkpatch.pl warning that suggests else is not
generally useful after a break or return.
Link: http://lkml.kernel.org/r/1462886671-3521-6-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes checkpatch.pl warning "WARNING: Prefer 'unsigned int' to
bare use of 'unsigned'".
Link: http://lkml.kernel.org/r/1462886671-3521-5-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes the checkpatch.pl warning "WARNING: space prohibited before
semicolon" at nilfs_store_magic_and_option().
Link: http://lkml.kernel.org/r/1462886671-3521-3-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This updates call sites of nilfs_warning() and nilfs_error() so that they
don't add a duplicate newline. These output functions are already
designed to add a trailing newline to the message.
Link: http://lkml.kernel.org/r/1462886671-3521-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clean up checkpatch.pl warnings "WARNING: Missing a blank line after
declarations" from nilfs2.
Link: http://lkml.kernel.org/r/1461935747-10380-11-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Refactor nilfs_clear_inode() and nilfs_i_callback() so that cleanup
code or resource deallocation related to metadata file will be moved
out to mdt.c.
Link: http://lkml.kernel.org/r/1461935747-10380-9-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nilfs_mdt_mark_block_dirty() can be replaced with primary functions
like nilfs_mdt_get_block() and mark_buffer_dirty(), and it's used only
by nilfs_ioctl_mark_blocks_dirty().
This gets rid of the function to simplify the interface of metadata
file.
Link: http://lkml.kernel.org/r/1461935747-10380-8-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
E-mail addresses of osrg.net domain are no longer available. This
removes them from authorship notices and prevents reporters from being
confused.
Link: http://lkml.kernel.org/r/1461935747-10380-5-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes the extra paragraph which mentions FSF address in GPL
notices from source code of nilfs2 and avoids the checkpatch.pl error
related to it.
Link: http://lkml.kernel.org/r/1461935747-10380-4-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix checkpatch.pl error "ERROR: space prohibited before that ','
(ctx:WxW)" at nilfs_sufile_set_suinfo().
This also fixes checkpatch.pl warning "WARNING: Prefer 'unsigned int' to
bare use of 'unsigned'" at nilfs_sufile_set_suinfo() and
nilfs_sufile_get_suinfo().
Link: http://lkml.kernel.org/r/1461935747-10380-3-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the following checkpatch.pl error and warnings:
ERROR: code indent should use tabs where possible
#1317: FILE: super.c:1317:
+ ^I^Is_new = true;$
WARNING: please, no space before tabs
#1317: FILE: super.c:1317:
+ ^I^Is_new = true;$
WARNING: please, no spaces at the start of a line
#1317: FILE: super.c:1317:
+ ^I^Is_new = true;$
Link: http://lkml.kernel.org/r/1461935747-10380-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The nilfs_sc_operations structures are never modified, so declare them
as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CONFIG_MIPS32_N32=y but CONFIG_BINFMT_ELF disabled results in the
following linker errors:
arch/mips/built-in.o: In function `elf_core_dump':
binfmt_elfn32.c:(.text+0x23dbc): undefined reference to `elf_core_extra_phdrs'
binfmt_elfn32.c:(.text+0x246e4): undefined reference to `elf_core_extra_data_size'
binfmt_elfn32.c:(.text+0x248d0): undefined reference to `elf_core_write_extra_phdrs'
binfmt_elfn32.c:(.text+0x24ac4): undefined reference to `elf_core_write_extra_data'
CONFIG_MIPS32_O32=y but CONFIG_BINFMT_ELF disabled results in the following
linker errors:
arch/mips/built-in.o: In function `elf_core_dump':
binfmt_elfo32.c:(.text+0x28a04): undefined reference to `elf_core_extra_phdrs'
binfmt_elfo32.c:(.text+0x29330): undefined reference to `elf_core_extra_data_size'
binfmt_elfo32.c:(.text+0x2951c): undefined reference to `elf_core_write_extra_phdrs'
binfmt_elfo32.c:(.text+0x29710): undefined reference to `elf_core_write_extra_data'
This is because binfmt_elfn32 and binfmt_elfo32 are using symbols from
elfcore but for these configurations elfcore will not be built.
Fixed by making elfcore selectable by a separate config symbol which
unlike the current mechanism can also be used from other directories
than kernel/, then having each flavor of ELF that relies on elfcore.o,
select it in Kconfig, including CONFIG_MIPS32_N32 and CONFIG_MIPS32_O32
which fixes this issue.
Link: http://lkml.kernel.org/r/20160520141705.GA1913@linux-mips.org
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Reviewed-by: James Hogan <james.hogan@imgtec.com>
Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1/ Device DAX for persistent memory:
Device DAX is the device-centric analogue of Filesystem DAX
(CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped
without need of an intervening file system. Device DAX is strict,
precise and predictable. Specifically this interface:
a) Guarantees fault granularity with respect to a given page size
(pte, pmd, or pud) set at configuration time.
b) Enforces deterministic behavior by being strict about what fault
scenarios are supported.
Persistent memory is the first target, but the mechanism is also
targeted for exclusive allocations of performance/feature differentiated
memory ranges.
2/ Support for the HPE DSM (device specific method) command formats.
This enables management of these first generation devices until a
unified DSM specification materializes.
3/ Further ACPI 6.1 compliance with support for the common dimm
identifier format.
4/ Various fixes and cleanups across the subsystem.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXQhdeAAoJEB7SkWpmfYgCYP8P/RAgHkroL5lUKKU45TQUBKcY
diC9POeNSccme4tIRIQCGQUZ7+7mKM5ECv2ulF4xYOHvFBCcd/8OF6xKAXs48r3v
oguYhvX1YvIkBc9FUfBQbR1IsCOJ7uWp/UYiYCIQEXS5tS9Jv545j3ASqDt9xWoV
TWlceZn3yWSbASiV9qZ2eXhEkk75pg4yara++rsm2/7rs/TTXn5EIjBs+57BtAo+
6utI4fTy0CQvBYwVzam3m7y9dt2Z2jWXL4hgmT7pkvJ7HDoctVly0P9+bknJPUAo
g+NugKgTGeiqH5GYp5CTZ9KvL91sDF4q00pfinITVdFl0E3VE293cIHlAzSQBm5/
w58xxaRV958ZvpH7EaBmYQG82QDi/eFNqeHqVGn0xAM6MlaqO7avUMQp2lRPYMCJ
u1z/NloR5yo+sffHxsn5Luiq9KqOf6zk33PuxEkKbN74OayCSPn/SeVCO7rQR0B6
yPMJTTcTiCLnId1kOWAPaEmuK2U3BW/+ogg7hKgeCQSysuy5n6Ok5a2vEx/gJRAm
v9yF68RmIWumpHr+QB0TmB8mVbD5SY+xWTm3CqJb9MipuFIOF7AVsPyTgucBvE7s
v+i5F6MDO6tcVfiDT4AiZEt6D2TM5RbtckkUEX3ZTD6j7CGuR5D8bH0HNRrghrYk
KT1lAk6tjWBOGAHc5Ji7
=Y3Xv
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"The bulk of this update was stabilized before the merge window and
appeared in -next. The "device dax" implementation was revised this
week in response to review feedback, and to address failures detected
by the recently expanded ndctl unit test suite.
Not included in this pull request are two dax topic branches (dax
error handling, and dax radix-tree locking). These topics were
deferred to get a few more days of -next integration testing, and to
coordinate a branch baseline with Ted and the ext4 tree. Vishal and
Ross will send the error handling and locking topics respectively in
the next few days.
This branch has received a positive build result from the kbuild robot
across 226 configs.
Summary:
- Device DAX for persistent memory: Device DAX is the device-centric
analogue of Filesystem DAX (CONFIG_FS_DAX). It allows memory
ranges to be allocated and mapped without need of an intervening
file system. Device DAX is strict, precise and predictable.
Specifically this interface:
a) Guarantees fault granularity with respect to a given page size
(pte, pmd, or pud) set at configuration time.
b) Enforces deterministic behavior by being strict about what
fault scenarios are supported.
Persistent memory is the first target, but the mechanism is also
targeted for exclusive allocations of performance/feature
differentiated memory ranges.
- Support for the HPE DSM (device specific method) command formats.
This enables management of these first generation devices until a
unified DSM specification materializes.
- Further ACPI 6.1 compliance with support for the common dimm
identifier format.
- Various fixes and cleanups across the subsystem"
* tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (40 commits)
libnvdimm, dax: fix deletion
libnvdimm, dax: fix alignment validation
libnvdimm, dax: autodetect support
libnvdimm: release ida resources
Revert "block: enable dax for raw block devices"
/dev/dax, core: file operations and dax-mmap
/dev/dax, pmem: direct access to persistent memory
libnvdimm: stop requiring a driver ->remove() method
libnvdimm, dax: record the specified alignment of a dax-device instance
libnvdimm, dax: reserve space to store labels for device-dax
libnvdimm, dax: introduce device-dax infrastructure
nfit: add sysfs dimm 'family' and 'dsm_mask' attributes
tools/testing/nvdimm: ND_CMD_CALL support
nfit: disable vendor specific commands
nfit: export subsystem ids as attributes
nfit: fix format interface code byte order per ACPI6.1
nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism
nfit, libnvdimm: clarify "commands" vs "_DSMs"
libnvdimm: increase max envelope size for ioctl
acpi/nfit: Add sysfs "id" for NVDIMM ID
...
- fs-specific prefix for fscrypto
- fault injection facility
- expose validity bitmaps for user to be aware of fragmentation
- fallocate/rm/preallocation speed up
- use percpu counters
Bug fixes
- some inline_dentry/inline_data bugs
- error handling for atomic/volatile/orphan inodes
- recover broken superblock
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXQPu4AAoJEEAUqH6CSFDSILgP/1dj6fmtytr8c+55EBqXUGpo
M7rS93JTxlmU5BduIo9psJsEquTQoVEmxB/Gjd+ZnI5R6Rp1c/REaP0ba374rEhZ
ecMQh5QqzM1gRNFXrQhWFEL/KtfRqt3T80zebQP7pxFUm/m9NGMLWT43RzQ8AAhr
Y3P0NLdvxA4HAnipKptkPJcGZQlWnL9W/MR+LgsXLXqLDwJHkVu61GcF0y2ibcJM
lEtIRmyH5tg7hP5c5LTw9pKQFHkIZt5cHFLjrJ1x8FSm2TXOcJPbjOrThvcb+NKK
e0O+6R0meH2eMpak+BTkZp2YbPPyXOb1N00j//lmbPjCoJPd4ZuiJ+oRoHUlTxtU
FhO67t0brlDbMFQVRFrtv8VA8M6by+DTAAP3Ffx62I/TJkphKANCSoyQRhlWtxxO
kRU69N7ipnRNxO4WCv40FjaQjSIElCKysP1POazRmAOQm7UFTGT9Nj37+eqUcEPJ
HZ7O61DEHNemb0SMlJ8WSClstt0yUU+2cjRfTPAr2Wd3V8gYbRs0QUg5M2GLgywR
EmiJfpkXse3f/nR8W6g1hganSOXA0AZX+EUibed6VkV3oYemdFbm8OymeEmLmWpM
y2F3D7dPLW7MCoTXJqtwFWdoDwI+zkH4rJaPGTq5TVBRWVU/njX8OvoB47pOvKV1
kccL7zv2PekE1hSDO5WF
=6MSp
-----END PGP SIGNATURE-----
Merge tag 'for-f2fs-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, as Ted pointed out, fscrypto allows one more key prefix
given by filesystem to resolve backward compatibility issues. Other
than that, we've fixed several error handling cases by introducing
a fault injection facility. We've also achieved performance
improvement in some workloads as well as a bunch of bug fixes.
Summary:
Enhancements:
- fs-specific prefix for fscrypto
- fault injection facility
- expose validity bitmaps for user to be aware of fragmentation
- fallocate/rm/preallocation speed up
- use percpu counters
Bug fixes:
- some inline_dentry/inline_data bugs
- error handling for atomic/volatile/orphan inodes
- recover broken superblock"
* tag 'for-f2fs-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (73 commits)
f2fs: fix to update dirty page count correctly
f2fs: flush pending bios right away when error occurs
f2fs: avoid ENOSPC fault in the recovery process
f2fs: make exit_f2fs_fs more clear
f2fs: use percpu_counter for total_valid_inode_count
f2fs: use percpu_counter for alloc_valid_block_count
f2fs: use percpu_counter for # of dirty pages in inode
f2fs: use percpu_counter for page counters
f2fs: use bio count instead of F2FS_WRITEBACK page count
f2fs: manipulate dirty file inodes when DATA_FLUSH is set
f2fs: add fault injection to sysfs
f2fs: no need inc dirty pages under inode lock
f2fs: fix incorrect error path handling in f2fs_move_rehashed_dirents
f2fs: fix i_current_depth during inline dentry conversion
f2fs: correct return value type of f2fs_fill_super
f2fs: fix deadlock when flush inline data
f2fs: avoid f2fs_bug_on during recovery
f2fs: show # of orphan inodes
f2fs: support in batch fzero in dnode page
f2fs: support in batch multi blocks preallocation
...
Pull btrfs updates from Chris Mason:
"This has our merge window series of cleanups and fixes. These target
a wide range of issues, but do include some important fixes for
qgroups, O_DIRECT, and fsync handling. Jeff Mahoney moved around a
few definitions to make them easier for userland to consume.
Also whiteout support is included now that issues with overlayfs have
been cleared up.
I have one more fix pending for page faults during btrfs_copy_from_user,
but I wanted to get this bulk out the door first"
* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (90 commits)
btrfs: fix memory leak during RAID 5/6 device replacement
Btrfs: add semaphore to synchronize direct IO writes with fsync
Btrfs: fix race between block group relocation and nocow writes
Btrfs: fix race between fsync and direct IO writes for prealloc extents
Btrfs: fix number of transaction units for renames with whiteout
Btrfs: pin logs earlier when doing a rename exchange operation
Btrfs: unpin logs if rename exchange operation fails
Btrfs: fix inode leak on failure to setup whiteout inode in rename
btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT
Btrfs: pin log earlier when renaming
Btrfs: unpin log if rename operation fails
Btrfs: don't do unnecessary delalloc flushes when relocating
Btrfs: don't wait for unrelated IO to finish before relocation
Btrfs: fix empty symlink after creating symlink and fsync parent dir
Btrfs: fix for incorrect directory entries after fsync log replay
btrfs: build fixup for qgroup_account_snapshot
btrfs: qgroup: Fix qgroup accounting when creating snapshot
Btrfs: fix fspath error deallocation
btrfs: make find_workspace warn if there are no workspaces
btrfs: make find_workspace always succeed
...
Merge more updates from Andrew Morton:
- the rest of MM
- KASAN updates
- procfs updates
- exit, fork updates
- printk updates
- lib/ updates
- radix-tree testsuite updates
- checkpatch updates
- kprobes updates
- a few other misc bits
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
samples/kprobes: print out the symbol name for the hooks
samples/kprobes: add a new module parameter
kprobes: add the "tls" argument for j_do_fork
init/main.c: simplify initcall_blacklisted()
fs/efs/super.c: fix return value
checkpatch: improve --git <commit-count> shortcut
checkpatch: reduce number of `git log` calls with --git
checkpatch: add support to check already applied git commits
checkpatch: add --list-types to show message types to show or ignore
checkpatch: advertise the --fix and --fix-inplace options more
checkpatch: whine about ACCESS_ONCE
checkpatch: add test for keywords not starting on tabstops
checkpatch: improve CONSTANT_COMPARISON test for structure members
checkpatch: add PREFER_IS_ENABLED test
lib/GCD.c: use binary GCD algorithm instead of Euclidean
radix-tree: free up the bottom bit of exceptional entries for reuse
dax: move RADIX_DAX_ definitions to dax.c
radix-tree: make radix_tree_descend() more useful
radix-tree: introduce radix_tree_replace_clear_tags()
radix-tree: tidy up __radix_tree_create()
...
This reverts commit 5a023cdba5.
The functionality is superseded by the new "Device DAX" facility.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jan Kara <jack@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Here's the "big" driver core update for 4.7-rc1.
Mostly just debugfs changes, the long-known and messy races with removing
debugfs files should be fixed thanks to the great work of Nicolai Stange. We
also have some isa updates in here (the x86 maintainers told me to take it
through this tree), a new warning when we run out of dynamic char major
numbers, and a few other assorted changes, details in the shortlog.
All have been in linux-next for some time with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlc/0mwACgkQMUfUDdst+ynjXACgjNxR5nMUiM8ZuuD0i4Xj7VXd
hnIAoM08+XDCv41noGdAcKv+2WZVZWMC
=i+0H
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here's the "big" driver core update for 4.7-rc1.
Mostly just debugfs changes, the long-known and messy races with
removing debugfs files should be fixed thanks to the great work of
Nicolai Stange. We also have some isa updates in here (the x86
maintainers told me to take it through this tree), a new warning when
we run out of dynamic char major numbers, and a few other assorted
changes, details in the shortlog.
All have been in linux-next for some time with no reported issues"
* tag 'driver-core-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (32 commits)
Revert "base: dd: don't remove driver_data in -EPROBE_DEFER case"
gpio: ws16c48: Utilize the ISA bus driver
gpio: 104-idio-16: Utilize the ISA bus driver
gpio: 104-idi-48: Utilize the ISA bus driver
gpio: 104-dio-48e: Utilize the ISA bus driver
watchdog: ebc-c384_wdt: Utilize the ISA bus driver
iio: stx104: Utilize the module_isa_driver and max_num_isa_dev macros
iio: stx104: Add X86 dependency to STX104 Kconfig option
Documentation: Add ISA bus driver documentation
isa: Implement the max_num_isa_dev macro
isa: Implement the module_isa_driver macro
pnp: pnpbios: Add explicit X86_32 dependency to PNPBIOS
isa: Decouple X86_32 dependency from the ISA Kconfig option
driver-core: use 'dev' argument in dev_dbg_ratelimited stub
base: dd: don't remove driver_data in -EPROBE_DEFER case
kernfs: Move faulting copy_user operations outside of the mutex
devcoredump: add scatterlist support
debugfs: unproxify files created through debugfs_create_u32_array()
debugfs: unproxify files created through debugfs_create_blob()
debugfs: unproxify files created through debugfs_create_bool()
...
Pull networking fixes and more updates from David Miller:
1) Tunneling fixes from Tom Herbert and Alexander Duyck.
2) AF_UNIX updates some struct sock bit fields with the socket lock,
whereas setsockopt() sets overlapping ones with locking. Seperate
out the synchronized vs. the AF_UNIX unsynchronized ones to avoid
corruption. From Andrey Ryabinin.
3) Mount BPF filesystem with mount_nodev rather than mount_ns, from
Eric Biederman.
4) A couple kmemdup conversions, from Muhammad Falak R Wani.
5) BPF verifier fixes from Alexei Starovoitov.
6) Don't let tunneled UDP packets get stuck in socket queues, if
something goes wrong during the encapsulation just drop the packet
rather than signalling an error up the call stack. From Hannes
Frederic Sowa.
7) SKB ref after free in batman-adv, from Florian Westphal.
8) TCP iSCSI, ocfs2, rds, and tipc have to disable BH in it's TCP
callbacks since the TCP stack runs pre-emptibly now. From Eric
Dumazet.
9) Fix crash in fixed_phy_add, from Rabin Vincent.
10) Fix length checks in xen-netback, from Paul Durrant.
11) Fix mixup in KEY vs KEYID macsec attributes, from Sabrina Dubroca.
12) RDS connection spamming bug fixes from Sowmini Varadhan
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (152 commits)
net: suppress warnings on dev_alloc_skb
uapi glibc compat: fix compilation when !__USE_MISC in glibc
udp: prevent skbs lingering in tunnel socket queues
bpf: teach verifier to recognize imm += ptr pattern
bpf: support decreasing order in direct packet access
net: usb: ch9200: use kmemdup
ps3_gelic: use kmemdup
net:liquidio: use kmemdup
bpf: Use mount_nodev not mount_ns to mount the bpf filesystem
net: cdc_ncm: update datagram size after changing mtu
tuntap: correctly wake up process during uninit
intel: Add support for IPv6 IP-in-IP offload
ip6_gre: Do not allow segmentation offloads GRE_CSUM is enabled with FOU/GUE
RDS: TCP: Avoid rds connection churn from rogue SYNs
RDS: TCP: rds_tcp_accept_worker() must exit gracefully when terminating rds-tcp
net: sock: move ->sk_shutdown out of bitfields.
ipv6: Don't reset inner headers in ip6_tnl_xmit
ip4ip6: Support for GSO/GRO
ip6ip6: Support for GSO/GRO
ipv6: Set features for IPv6 tunnels
...
Pull cifs fixes from Steve French:
"Two small cifs fixes, including one spnego upcall cifs security fix
for stable"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Remove some obsolete comments
cifs: Create dedicated keyring for spnego operations
When sb_bread() fails, the return value should be -EIO, fix it.
Link: http://lkml.kernel.org/r/1463464943-4142-1-git-send-email-os@iscas.ac.cn
Signed-off-by: Heloise <os@iscas.ac.cn>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Firo Yang <firogm@gmail.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These don't belong in radix-tree.h any more than PAGECACHE_TAG_* do.
Let's try to maintain the idea that radix-tree simply implements an
abstract data type.
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The PR_DUMPABLE flag causes the pid related paths of the proc file
system to be owned by ROOT.
The implementation of pthread_set/getname_np however needs access to
/proc/<pid>/task/<tid>/comm. If PR_DUMPABLE is false this
implementation is locked out.
This patch installs a special permission function for the file "comm"
that grants read and write access to all threads of the same group
regardless of the ownership of the inode. For all other threads the
function falls back to the generic inode permission check.
[akpm@linux-foundation.org: fix spello in comment]
Signed-off-by: Janis Danisevskis <jdanis@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Minfei Huang <mnfhuang@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Jann Horn <jann@thejh.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's not possible to read the process umask without also modifying it,
which is what umask(2) does. A library cannot read umask safely,
especially if the main program might be multithreaded.
Add a new status line ("Umask") in /proc/<PID>/status. It contains the
file mode creation mask (umask) in octal. It is only shown for tasks
which have task->fs.
This patch is adapted from one originally written by Pierre Carrier.
The use case is that we have endless trouble with people setting weird
umask() values (usually on the grounds of "security"), and then
everything breaking. I'm on the hook to fix these. We'd like to add
debugging to our program so we can dump out the umask in debug reports.
Previous versions of the patch used a syscall so you could only read
your own umask. That's all I need. However there was quite a lot of
push-back from those, so this new version exports it in /proc.
See:
https://lkml.org/lkml/2016/4/13/704 [umask2]
https://lkml.org/lkml/2016/4/13/487 [getumask]
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pierre Carrier <pierre@spotify.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
userfaultfd_file_create() increments mm->mm_users; this means that the
memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY
after that can populate the orphaned mm more.
Change userfaultfd_file_create() and userfaultfd_ctx_put() to use
mm->mm_count to pin mm_struct. This means that
atomic_inc_not_zero(mm->mm_users) is needed when we are going to
actually play with this memory. Except handle_userfault() path doesn't
need this, the caller must already have a reference.
The patch adds the new trivial helper, mmget_not_zero(), it can have
more users.
Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The nommu do_mmap expects f_op->get_unmapped_area to either succeed or
return -ENOSYS for VM_MAYSHARE (e.g. private read-only) mappings.
Returning addr in the non-MAP_SHARED case was completely wrong, and only
happened to work because addr was 0. However, it prevented VM_MAYSHARE
mappings from sharing backing with the fs cache, and forced such
mappings (including shareable program text) to be copied whenever the
number of mappings transitioned from 0 to 1, impacting performance and
memory usage. Subsequent mappings beyond the first still correctly
shared memory with the first.
Instead, treat VM_MAYSHARE identically to VM_SHARED at the file ops level;
do_mmap already handles the semantic differences between them.
Signed-off-by: Rich Felker <dalias@libc.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Abhi Das has two patches that fix a GFS2 splice issue (and an adjustment).
- Ben Marzinski has a patch which allows the proper unmount of a GFS2
file system after hitting a withdraw error.
- I have a patch to fix a problem where GFS2 would dereference an error
value, plus three cosmetic / refactoring patches.
- Daniel DeFreez has a patch to fix two glock reference count problems,
where GFS2 was not properly "uninitializing" its glock holder on error
paths.
- Denys Vlasenko has a patch to change a function to not be inlined,
thus reducing the memory footprint of the GFS2 module.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXP2KLAAoJENeLYdPf93o7nScH/jlDCv+JzwBwAiFEVelgB657
lFY+Q3bPvmcTaHiOqXzco7U1EeyvAuZ+x9duaFBBEGC2/NydIbkAdCP1XjDOTAGZ
aVVAP8qov4adFiXTfBdzMYjJ7DArOy/JRnlC8WK5OiD/3yxsXJZNlAHTgjjL/ZMq
PkhjxX2Lc9ZHGF9YIFTyeoBlqf1qO6WbhkxvSNefgq3c702QPM+T9XND5T1duGDF
ThEcXS2DRRoE+6wr3h+ruvS5kcfxXRLLausAgqNpj2FYQOFCQ6XXgGK8pyXA5V/t
3Vh8mHW0VmSjAfrRWslCRjd+APXoOf+7QHqOHUAA1dvQ6CGfC9qTETH9Wp32JCI=
=8g+a
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull GFS2 updates from Bob Peterson:
"We've got nine patches this time:
- Abhi Das has two patches that fix a GFS2 splice issue (and an
adjustment).
- Ben Marzinski has a patch which allows the proper unmount of a GFS2
file system after hitting a withdraw error.
- I have a patch to fix a problem where GFS2 would dereference an
error value, plus three cosmetic / refactoring patches.
- Daniel DeFreez has a patch to fix two glock reference count
problems, where GFS2 was not properly "uninitializing" its glock
holder on error paths.
- Denys Vlasenko has a patch to change a function to not be inlined,
thus reducing the memory footprint of the GFS2 module"
* tag 'gfs2-4.7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
GFS2: Refactor gfs2_remove_from_journal
GFS2: Remove allocation parms from gfs2_rbm_find
gfs2: use inode_lock/unlock instead of accessing i_mutex directly
GFS2: Add calls to gfs2_holder_uninit in two error handlers
GFS2: Don't dereference inode in gfs2_inode_lookup until it's valid
GFS2: fs/gfs2/glock.c: Deinline do_error, save 1856 bytes
gfs2: Use gfs2 wrapper to sync inode before calling generic_file_splice_read()
GFS2: Get rid of dead code in inode_go_demote_ok
GFS2: ignore unlock failures after withdraw
Merge updates from Andrew Morton:
- fsnotify fix
- poll() timeout fix
- a few scripts/ tweaks
- debugobjects updates
- the (small) ocfs2 queue
- Minor fixes to kernel/padata.c
- Maybe half of the MM queue
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
mm, page_alloc: restore the original nodemask if the fast path allocation failed
mm, page_alloc: uninline the bad page part of check_new_page()
mm, page_alloc: don't duplicate code in free_pcp_prepare
mm, page_alloc: defer debugging checks of pages allocated from the PCP
mm, page_alloc: defer debugging checks of freed pages until a PCP drain
cpuset: use static key better and convert to new API
mm, page_alloc: inline pageblock lookup in page free fast paths
mm, page_alloc: remove unnecessary variable from free_pcppages_bulk
mm, page_alloc: pull out side effects from free_pages_check
mm, page_alloc: un-inline the bad part of free_pages_check
mm, page_alloc: check multiple page fields with a single branch
mm, page_alloc: remove field from alloc_context
mm, page_alloc: avoid looking up the first zone in a zonelist twice
mm, page_alloc: shortcut watermark checks for order-0 pages
mm, page_alloc: reduce cost of fair zone allocation policy retry
mm, page_alloc: shorten the page allocator fast path
mm, page_alloc: check once if a zone has isolated pageblocks
mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath
mm, page_alloc: simplify last cpupid reset
mm, page_alloc: remove unnecessary initialisation from __alloc_pages_nodemask()
...
Remove some obsolete comments in the cifs inode_operations
structs that were pointed out by Stephen Rothwell.
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <steve.french@primarydata.com>
The session key is the default keyring set for request_key operations.
This session key is revoked when the user owning the session logs out.
Any long running daemon processes started by this session ends up with
revoked session keyring which prevents these processes from using the
request_key mechanism from obtaining the krb5 keys.
The problem has been reported by a large number of autofs users. The
problem is also seen with multiuser mounts where the share may be used
by processes run by a user who has since logged out. A reproducer using
automount is available on the Red Hat bz.
The patch creates a new keyring which is used to cache cifs spnego
upcalls.
Red Hat bz: 1267754
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reported-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
The allocator fast path looks up the first usable zone in a zonelist and
then get_page_from_freelist does the same job in the zonelist iterator.
This patch preserves the necessary information.
4.6.0-rc2 4.6.0-rc2
fastmark-v1r20 initonce-v1r20
Min alloc-odr0-1 364.00 ( 0.00%) 359.00 ( 1.37%)
Min alloc-odr0-2 262.00 ( 0.00%) 260.00 ( 0.76%)
Min alloc-odr0-4 214.00 ( 0.00%) 214.00 ( 0.00%)
Min alloc-odr0-8 186.00 ( 0.00%) 186.00 ( 0.00%)
Min alloc-odr0-16 173.00 ( 0.00%) 173.00 ( 0.00%)
Min alloc-odr0-32 165.00 ( 0.00%) 165.00 ( 0.00%)
Min alloc-odr0-64 161.00 ( 0.00%) 162.00 ( -0.62%)
Min alloc-odr0-128 159.00 ( 0.00%) 161.00 ( -1.26%)
Min alloc-odr0-256 168.00 ( 0.00%) 170.00 ( -1.19%)
Min alloc-odr0-512 180.00 ( 0.00%) 181.00 ( -0.56%)
Min alloc-odr0-1024 190.00 ( 0.00%) 190.00 ( 0.00%)
Min alloc-odr0-2048 196.00 ( 0.00%) 196.00 ( 0.00%)
Min alloc-odr0-4096 202.00 ( 0.00%) 202.00 ( 0.00%)
Min alloc-odr0-8192 206.00 ( 0.00%) 205.00 ( 0.49%)
Min alloc-odr0-16384 206.00 ( 0.00%) 205.00 ( 0.49%)
The benefit is negligible and the results are within the noise but each
cycle counts.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many developers already know that field for reference count of the
struct page is _count and atomic type. They would try to handle it
directly and this could break the purpose of page reference count
tracepoint. To prevent direct _count modification, this patch rename it
to _refcount and add warning message on the code. After that, developer
who need to handle reference count will find that field should not be
accessed directly.
[akpm@linux-foundation.org: fix comments, per Vlastimil]
[akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
[sfr@canb.auug.org.au: sync ethernet driver changes]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Sunil Goutham <sgoutham@cavium.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Manish Chopra <manish.chopra@qlogic.com>
Cc: Yuval Mintz <yuval.mintz@qlogic.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The goto is not useful in ocfs2_put_slot(), so delete it.
Signed-off-by: Guozhonghua <guozhonghua@h3c.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clean up unused parameter 'count' in o2hb_read_block_input().
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Clean up an unused variable 'wants_rotate' in ocfs2_truncate_rec.
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The comment in ocfs2_extended_slot has the offset wrong.
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
struct timespec is not y2038 safe. Even though timespec might be
sufficient to represent timeouts, use struct timespec64 here as the plan
is to get rid of all timespec reference in the kernel.
The patch transitions the common functions: poll_select_set_timeout()
and select_estimate_accuracy() to use timespec64. And, all the syscalls
that use these functions are transitioned in the same patch.
The restart block parameters for poll uses monotonic time. Use
timespec64 here as well to assign timeout value. This parameter in the
restart block need not change because this only holds the monotonic
timestamp at which timeout should occur. And, unsigned long data type
should be big enough for this timestamp.
The system call interfaces will be handled in a separate series.
Compat interfaces need not change as timespec64 is an alias to struct
timespec on a 64 bit system.
Link: http://lkml.kernel.org/r/1461947989-21926-3-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Inotify instance is destroyed when all references to it are dropped.
That not only means that the corresponding file descriptor needs to be
closed but also that all corresponding instance marks are freed (as each
mark holds a reference to the inotify instance). However marks are
freed only after SRCU period ends which can take some time and thus if
user rapidly creates and frees inotify instances, number of existing
inotify instances can exceed max_user_instances limit although from user
point of view there is always at most one existing instance. Thus
inotify_init() returns EMFILE error which is hard to justify from user
point of view. This problem is exposed by LTP inotify06 testcase on
some machines.
We fix the problem by making sure all group marks are properly freed
while destroying inotify instance. We wait for SRCU period to end in
that path anyway since we have to make sure there is no event being
added to the instance while we are tearing down the instance. So it
takes only some plumbing to allow for marks to be destroyed in that path
as well and not from a dedicated work item.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Tested-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently faults are protected against truncate by filesystem specific
i_mmap_sem and page lock in case of hole page. Cow faults are protected
DAX radix tree entry locking. So there's no need for i_mmap_lock in DAX
code. Remove it.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
When doing cow faults, we cannot directly fill in PTE as we do for other
faults as we rely on generic code to do proper accounting of the cowed page.
We also have no page to lock to protect against races with truncate as
other faults have and we need the protection to extend until the moment
generic code inserts cowed page into PTE thus at that point we have no
protection of fs-specific i_mmap_sem. So far we relied on using
i_mmap_lock for the protection however that is completely special to cow
faults. To make fault locking more uniform use DAX entry lock instead.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Currently DAX page fault locking is racy.
CPU0 (write fault) CPU1 (read fault)
__dax_fault() __dax_fault()
get_block(inode, block, &bh, 0) -> not mapped
get_block(inode, block, &bh, 0)
-> not mapped
if (!buffer_mapped(&bh))
if (vmf->flags & FAULT_FLAG_WRITE)
get_block(inode, block, &bh, 1) -> allocates blocks
if (page) -> no
if (!buffer_mapped(&bh))
if (vmf->flags & FAULT_FLAG_WRITE) {
} else {
dax_load_hole();
}
dax_insert_mapping()
And we are in a situation where we fail in dax_radix_entry() with -EIO.
Another problem with the current DAX page fault locking is that there is
no race-free way to clear dirty tag in the radix tree. We can always
end up with clean radix tree and dirty data in CPU cache.
We fix the first problem by introducing locking of exceptional radix
tree entries in DAX mappings acting very similarly to page lock and thus
synchronizing properly faults against the same mapping index. The same
lock can later be used to avoid races when clearing radix tree dirty
tag.
Reviewed-by: NeilBrown <neilb@suse.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
We will use lowest available bit in the radix tree exceptional entry for
locking of the entry. Define it. Also clean up definitions of DAX entry
type bits in DAX exceptional entries to use defined constants instead of
hardcoding numbers and cleanup checking of these bits to not rely on how
other bits in the entry are set.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Currently the handling of huge pages for DAX is racy. For example the
following can happen:
CPU0 (THP write fault) CPU1 (normal read fault)
__dax_pmd_fault() __dax_fault()
get_block(inode, block, &bh, 0) -> not mapped
get_block(inode, block, &bh, 0)
-> not mapped
if (!buffer_mapped(&bh) && write)
get_block(inode, block, &bh, 1) -> allocates blocks
truncate_pagecache_range(inode, lstart, lend);
dax_load_hole();
This results in data corruption since process on CPU1 won't see changes
into the file done by CPU0.
The race can happen even if two normal faults race however with THP the
situation is even worse because the two faults don't operate on the same
entries in the radix tree and we want to use these entries for
serialization. So make THP support in DAX code depend on CONFIG_BROKEN
for now.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Currently dax_pmd_fault() decides to fill a PMD-sized hole only if
returned buffer has BH_Uptodate set. However that doesn't get set for
any mapping buffer so that branch is actually a dead code. The
BH_Uptodate check doesn't make any sense so just remove it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
TCP stack can now run from process context.
Use read_lock_bh() variant to restore previous assumption.
Fixes: 5413d1babe ("net: do not block BH while processing socket backlog")
Fixes: d41a69f1d3 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull MIPS updates from Ralf Baechle:
"This is the main pull request for MIPS for 4.7. Here's the summary of
the changes:
- ATH79: Support for DTB passuing using the UHI boot protocol
- ATH79: Remove support for builtin DTB.
- ATH79: Add zboot debug serial support.
- ATH79: Add initial support for Dragino MS14 (Dragine 2), Onion Omega
and DPT-Module.
- ATH79: Update devicetree clock support for AR9132 and AR9331.
- ATH79: Cleanup the DT code.
- ATH79: Support newer SOCs in ath79_ddr_ctrl_init.
- ATH79: Fix regression in PCI window initialization.
- BCM47xx: Move SPROM driver to drivers/firmware/
- BCM63xx: Enable partition parser in defconfig.
- BMIPS: BMIPS5000 has I cache filing from D cache
- BMIPS: BMIPS: Add cpu-feature-overrides.h
- BMIPS: Add Whirlwind support
- BMIPS: Adjust mips-hpt-frequency for BCM7435
- BMIPS: Remove maxcpus from BCM97435SVMB DTS
- BMIPS: Add missing 7038 L1 register cells to BCM7435
- BMIPS: Various tweaks to initialization code.
- BMIPS: Enable partition parser in defconfig.
- BMIPS: Cache tweaks.
- BMIPS: Add UART, I2C and SATA devices to DT.
- BMIPS: Add BCM6358 and BCM63268support
- BMIPS: Add device tree example for BCM6358.
- BMIPS: Improve Improve BCM6328 and BCM6368 device trees
- Lantiq: Add support for device tree file from boot loader
- Lantiq: Allow build with no built-in DT.
- Loongson 3: Reserve 32MB for RS780E integrated GPU.
- Loongson 3: Fix build error after ld-version.sh modification
- Loongson 3: Move chipset ACPI code from drivers to arch.
- Loongson 3: Speedup irq processing.
- Loongson 3: Add basic Loongson 3A support.
- Loongson 3: Set cache flush handlers to nop.
- Loongson 3: Invalidate special TLBs when needed.
- Loongson 3: Fast TLB refill handler.
- MT7620: Fallback strategy for invalid syscfg0.
- Netlogic: Fix CP0_EBASE redefinition warnings
- Octeon: Initialization fixes
- Octeon: Add DTS files for the D-Link DSR-1000N and EdgeRouter Lite
- Octeon: Enable add Octeon-drivers in cavium_octeon_defconfig
- Octeon: Correctly handle endian-swapped initramfs images.
- Octeon: Support CN73xx, CN75xx and CN78xx.
- Octeon: Remove dead code from cvmx-sysinfo.
- Octeon: Extend number of supported CPUs past 32.
- Octeon: Remove some code limiting NR_IRQS to 255.
- Octeon: Simplify octeon_irq_ciu_gpio_set_type.
- Octeon: Mark some functions __init in smp.c
- Octeon: Octeon: Add Octeon III CN7xxx interface detection
- PIC32: Add serial driver and bindings for it.
- PIC32: Add PIC32 deadman timer driver and bindings.
- PIC32: Add PIC32 clock timer driver and bindings.
- Pistachio: Determine SoC revision during boot
- Sibyte: Fix Kconfig dependencies of SIBYTE_BUS_WATCHER.
- Sibyte: Strip redundant comments from bcm1480_regs.h.
- Panic immediately if panic_on_oops is set.
- module: fix incorrect IS_ERR_VALUE macro usage.
- module: Make consistent use of pr_*
- Remove no longer needed work_on_cpu() call.
- Remove CONFIG_IPV6_PRIVACY from defconfigs.
- Fix registers of non-crashing CPUs in dumps.
- Handle MIPSisms in new vmcore_elf32_check_arch.
- Select CONFIG_HANDLE_DOMAIN_IRQ and make it work.
- Allow RIXI to be used on non-R2 or R6 cores.
- Reserve nosave data for hibernation
- Fix siginfo.h to use strict POSIX types.
- Don't unwind user mode with EVA.
- Fix watchpoint restoration
- Ptrace watchpoints for R6.
- Sync icache when it fills from dcache
- I6400 I-cache fills from dcache.
- Various MSA fixes.
- Cleanup MIPS_CPU_* definitions.
- Signal: Move generic copy_siginfo to signal.h
- Signal: Fix uapi include in exported asm/siginfo.h
- Timer fixes for sake of KVM.
- XPA TLB refill fixes.
- Treat perf counter feature
- Update John Crispin's email address
- Add PIC32 watchdog and bindings.
- Handle R10000 LL/SC bug in set_pte()
- cpufreq: Various fixes for Longson1.
- R6: Fix R2 emulation.
- mathemu: Cosmetic fix to ADDIUPC emulation, plenty of other small fixes
- ELF: ABI and FP fixes.
- Allow for relocatable kernel and use that to support KASLR.
- Fix CPC_BASE_ADDR mask
- Plenty fo smp-cps, CM, R6 and M6250 fixes.
- Make reset_control_ops const.
- Fix kernel command line handling of leading whitespace.
- Cleanups to cache handling.
- Add brcm, bcm6345-l1-intc device tree bindings.
- Use generic clkdev.h header
- Remove CLK_IS_ROOT usage.
- Misc small cleanups.
- CM: Fix compilation error when !MIPS_CM
- oprofile: Fix a preemption issue
- Detect DSP ASE v3 support:1"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus: (275 commits)
MIPS: pic32mzda: fix getting timer clock rate.
MIPS: ath79: fix regression in PCI window initialization
MIPS: ath79: make ath79_ddr_ctrl_init() compatible for newer SoCs
MIPS: Fix VZ probe gas errors with binutils <2.24
MIPS: perf: Fix I6400 event numbers
MIPS: DEC: Export `ioasic_ssr_lock' to modules
MIPS: MSA: Fix a link error on `_init_msa_upper' with older GCC
MIPS: CM: Fix compilation error when !MIPS_CM
MIPS: Fix genvdso error on rebuild
USB: ohci-jz4740: Remove obsolete driver
MIPS: JZ4740: Probe OHCI platform device via DT
MIPS: JZ4740: Qi LB60: Remove support for AVT2 variant
MIPS: pistachio: Determine SoC revision during boot
MIPS: BMIPS: Adjust mips-hpt-frequency for BCM7435
mips: mt7620: fallback to SDRAM when syscfg0 does not have a valid value for the memory type
MIPS: Prevent "restoration" of MSA context in non-MSA kernels
MIPS: cevt-r4k: Dynamically calculate min_delta_ns
MIPS: malta-time: Take seconds into account
MIPS: malta-time: Start GIC count before syncing to RTC
MIPS: Force CPUs to lose FP context during mode switches
...
Pull security subsystem updates from James Morris:
"Highlights:
- A new LSM, "LoadPin", from Kees Cook is added, which allows forcing
of modules and firmware to be loaded from a specific device (this
is from ChromeOS, where the device as a whole is verified
cryptographically via dm-verity).
This is disabled by default but can be configured to be enabled by
default (don't do this if you don't know what you're doing).
- Keys: allow authentication data to be stored in an asymmetric key.
Lots of general fixes and updates.
- SELinux: add restrictions for loading of kernel modules via
finit_module(). Distinguish non-init user namespace capability
checks. Apply execstack check on thread stacks"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (48 commits)
LSM: LoadPin: provide enablement CONFIG
Yama: use atomic allocations when reporting
seccomp: Fix comment typo
ima: add support for creating files using the mknodat syscall
ima: fix ima_inode_post_setattr
vfs: forbid write access when reading a file into memory
fs: fix over-zealous use of "const"
selinux: apply execstack check on thread stacks
selinux: distinguish non-init user namespace capability checks
LSM: LoadPin for kernel file loading restrictions
fs: define a string representation of the kernel_read_file_id enumeration
Yama: consolidate error reporting
string_helpers: add kstrdup_quotable_file
string_helpers: add kstrdup_quotable_cmdline
string_helpers: add kstrdup_quotable
selinux: check ss_initialized before revalidating an inode label
selinux: delay inode label lookup as long as possible
selinux: don't revalidate an inode's label when explicitly setting it
selinux: Change bool variable name to index.
KEYS: Add KEYCTL_DH_COMPUTE command
...
UDF/OSTA terminology is confusing. Partition Numbers (PNs) are arbitrary
16-bit values, one for each physical partition in the volume. Partition
Reference Numbers (PRNs) are indices into the the Partition Map Table
and do not necessarily equal the PN of the mapped partition.
The current metadata code mistakenly uses the PN instead of the PRN when
mapping metadata blocks to physical/sparable blocks. Windows-created
UDF 2.5 discs for some reason use large, arbitrary PNs, resulting in
mount failure and KASAN read warnings in udf_read_inode().
For example, a NetBSD UDF 2.5 partition might look like this:
PRN PN Type
--- -- ----
0 0 Sparable
1 0 Metadata
Since PRN == PN, we are fine.
But Windows could gives us:
PRN PN Type
--- ---- ----
0 8192 Sparable
1 8192 Metadata
So udf_read_inode() will start out by checking the partition length in
sbi->s_partmaps[8192], which is obviously out of bounds.
Fix this by creating a new field (s_phys_partition_ref) in struct
udf_meta_data, referencing whatever physical or sparable map has the
same partition number as the metadata partition.
[JK: Add comment about s_phys_partition_ref, change its name]
Signed-off-by: Alden Tondettar <alden.tondettar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently when udf_get_pblock_meta25() fails to map a block using the
primary metadata file, it will attempt to load the mirror file entry by
calling udf_find_metadata_inode_efe(). That function will return a ERR_PTR
if it fails, but the return value is only checked against NULL. Test the
return value using IS_ERR() and change it to NULL if needed.
Signed-off-by: Alden Tondettar <alden.tondettar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently, if a metadata partition map is missing its partition descriptor,
then udf_get_pblock_meta25() will BUG() out the first time it is called.
This is rather drastic for a corrupted filesystem, so just treat this case
as an invalid mapping instead.
Signed-off-by: Alden Tondettar <alden.tondettar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
init_f2fs_fs does:
1) f2fs_build_trace_ios
2) init_inodecache
3) create_node_manager_caches
4) create_segment_manager_caches
5) create_checkpoint_caches
6) create_extent_cache
7) kset_create_and_add
8) kobject_init_and_add
9) register_shrinker
10) register_filesystem
11) f2fs_create_root_stats
12) proc_mkdir
exit_f2fs_fs should do cleanup in the reverse order
to make the code more clear.
Signed-off-by: Tiezhu Yang <kernelpatch@126.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>