At btrfs_tree_mod_log_free_eb() we check if we are dealing with a leaf,
and if so, return immediately and do nothing. However this check can be
removed, because after it we call tree_mod_need_log(), which returns
false when given an extent buffer that corresponds to a leaf.
So just remove the leaf check and pass the extent buffer to
tree_mod_need_log().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of exposing implementation details of the tree mod log to check
if there are active tree mod log users at btrfs_free_tree_block(), use
the new bit BTRFS_FS_TREE_MOD_LOG_USERS for fs_info->flags instead. This
way extent-tree.c does not need to known about any of the internals of
the tree mod log and avoids taking a lock unnecessarily as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree modification log functions are called very frequently, basically
they are called every time a btree is modified (a pointer added or removed
to a node, a new root for a btree is set, etc). Because of that, to avoid
heavy lock contention on the lock that protects the list of tree mod log
users, we have checks that test the emptiness of the list with a full
memory barrier before the checks, so that when there are no tree mod log
users we avoid taking the lock.
Replace the memory barrier and list emptiness check with a test for a new
bit set at fs_info->flags. This bit is used to indicate when there are
tree mod log users, set whenever a user is added to the list and cleared
when the last user is removed from the list. This makes the intention a
bit more obvious and possibly more efficient (assuming test_bit() may be
cheaper than a full memory barrier on some architectures).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Several functions of the tree modification log use integers as booleans,
so change them to use booleans instead, making their use more clear.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree modification log, which records modifications done to btrees, is
quite large and currently spread all over ctree.c, which is a huge file
already.
To make things better organized, move all that code into its own separate
source and header files. Functions and definitions that are used outside
of the module (mostly by ctree.c) are renamed so that they start with a
"btrfs_" prefix. Everything else remains unchanged.
This makes it easier to go over the tree modification log code every
time I need to go read it to fix a bug.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfsic_read_block() (which calls kmap()) and
btrfsic_release_block_ctx() (which calls kunmap()) are always called
within a single thread of execution.
Therefore the mappings created within these calls can be a thread local
mapping.
Convert the kmap() of bloc_ctx->pagev to kmap_local_page(). Luckily the
unmap loops backwards through the array pointer so no adjustment needs
to be made to the unmapping order.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Again there is an array of pointers which must be unmapped in the correct
order.
Convert the kmap()'s to kmap_local_page() and adjust the unmapping
to work backwards through the unmapping loop.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These kmaps are thread local and don't need to be atomic. So they can use
the more efficient kmap_local_page(). However, the mapping of pages in
the stripes and the additional parity and qstripe pages are a bit
trickier because the unmapping must occur in the opposite order from the
mapping. Furthermore, the pointer array in __raid_recover_end_io() may
get reordered.
Convert these calls to kmap_local_page() taking care to reverse the
unmappings of any page arrays as well as being careful with the mappings
of any special pages such as the parity and qstripe pages.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use a simple coccinelle script to help convert the most common
kmap()/kunmap() patterns to kmap_local_page()/kunmap_local().
Note that some kmaps which were caught by this script needed to be
handled by hand because of the strict unmapping order of kunmap_local()
so they are not included in this patch. But this script got us started.
There's another temp variable added for the final length write to the
first page so it does not interfere with cpage_out that is used for
mapping other pages.
The development of this patch was aided by the follow script:
// <smpl>
// SPDX-License-Identifier: GPL-2.0-only
// Find kmap and replace with kmap_local_page then mark kunmap
//
// Confidence: Low
// Copyright: (C) 2021 Intel Corporation
// URL: http://coccinelle.lip6.fr/
@ catch_all @
expression e, e2;
@@
(
-kmap(e)
+kmap_local_page(e)
)
...
(
-kunmap(...)
+kunmap_local()
)
// </smpl>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The in_range() macro is defined twice in btrfs' source, once in ctree.h
and once in misc.h.
Remove the definition in ctree.h and include misc.h in the files depending
on it.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_inode_in_log() checks the list of modified extents of the
inode, and has a comment mentioning why, as it used to be necessary to
make sure if we did something like the following:
mmap write range A
mmap write range B
msync range A (ranged fsync)
msync range B (ranged fsync)
we ended up with both ranges being logged.
If we did not check it, then the second fsync would do nothing because
btrfs_inode_in_log() would return true. This was added in 125c4cf9f3
("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync") and
test case generic/325 from fstests exercises that scenario.
However, as of commit 487781796d ("btrfs: make fast fsyncs wait only
for writeback"), every ranged fsync is now turned into a full ranged fsync
(operates on the range from 0 to LLONG_MAX), so it is now pointless to
test of emptiness of the list of modified extents, and the comment is
clearly outdated.
So just remove the comment and list emptiness check, while also changing
the function's return type to be a boolean instead of an integer.
In case one day we get support for ranged fsyncs again, it will be easy
to notice the check is necessary again, because it will make generic/325
always fail.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a race between marking that an inode needs to be logged, either
at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between
btrfs_sync_log(). The following steps describe how the race happens.
1) We are at transaction N;
2) Inode I was previously fsynced in the current transaction so it has:
inode->logged_trans set to N;
3) The inode's root currently has:
root->log_transid set to 1
root->last_log_commit set to 0
Which means only one log transaction was committed to far, log
transaction 0. When a log tree is created we set ->log_transid and
->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());
4) One more range of pages is dirtied in inode I;
5) Some task A starts an fsync against some other inode J (same root), and
so it joins log transaction 1.
Before task A calls btrfs_sync_log()...
6) Task B starts an fsync against inode I, which currently has the full
sync flag set, so it starts delalloc and waits for the ordered extent
to complete before calling btrfs_inode_in_log() at btrfs_sync_file();
7) During ordered extent completion we have btrfs_update_inode() called
against inode I, which in turn calls btrfs_set_inode_last_trans(),
which does the following:
spin_lock(&inode->lock);
inode->last_trans = trans->transaction->transid;
inode->last_sub_trans = inode->root->log_transid;
inode->last_log_commit = inode->root->last_log_commit;
spin_unlock(&inode->lock);
So ->last_trans is set to N and ->last_sub_trans set to 1.
But before setting ->last_log_commit...
8) Task A is at btrfs_sync_log():
- it increments root->log_transid to 2
- starts writeback for all log tree extent buffers
- waits for the writeback to complete
- writes the super blocks
- updates root->last_log_commit to 1
It's a lot of slow steps between updating root->log_transid and
root->last_log_commit;
9) The task doing the ordered extent completion, currently at
btrfs_set_inode_last_trans(), then finally runs:
inode->last_log_commit = inode->root->last_log_commit;
spin_unlock(&inode->lock);
Which results in inode->last_log_commit being set to 1.
The ordered extent completes;
10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
true because we have all the following conditions met:
inode->logged_trans == N which matches fs_info->generation &&
inode->last_subtrans (1) <= inode->last_log_commit (1) &&
inode->last_subtrans (1) <= root->last_log_commit (1) &&
list inode->extent_tree.modified_extents is empty
And as a consequence we return without logging the inode, so the
existing logged version of the inode does not point to the extent
that was written after the previous fsync.
It should be impossible in practice for one task be able to do so much
progress in btrfs_sync_log() while another task is at
btrfs_set_inode_last_trans() right after it reads root->log_transid and
before it reads root->last_log_commit. Even if kernel preemption is enabled
we know the task at btrfs_set_inode_last_trans() can not be preempted
because it is holding the inode's spinlock.
However there is another place where we do the same without holding the
spinlock, which is in the memory mapped write path at:
vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
{
(...)
BTRFS_I(inode)->last_trans = fs_info->generation;
BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
(...)
So with preemption happening after setting ->last_sub_trans and before
setting ->last_log_commit, it is less of a stretch to have another task
do enough progress at btrfs_sync_log() such that the task doing the memory
mapped write ends up with ->last_sub_trans and ->last_log_commit set to
the same value. It is still a big stretch to get there, as the task doing
btrfs_sync_log() has to start writeback, wait for its completion and write
the super blocks.
So fix this in two different ways:
1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
value of ->last_sub_trans minus 1;
2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
like we do for buffered and direct writes at btrfs_file_write_iter(),
which is all we need to make sure multiple writes and fsyncs to an
inode in the same transaction never result in an fsync missing that
the inode changed and needs to be logged. Turn this into a helper
function and use it both at btrfs_page_mkwrite() and at
btrfs_file_write_iter() - this also fixes the problem that at
btrfs_page_mkwrite() we were setting those fields without the
protection of the inode's spinlock.
This is an extremely unlikely race to happen in practice.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an fsync we flush all delalloc, lock the inode (VFS lock), flush
any new delalloc that might have been created before taking the lock and
then wait either for the ordered extents to complete or just for the
writeback to complete (depending on whether the full sync flag is set or
not). We then start logging the inode and assume that while we are doing it
no one else is touching the inode's file extent items (or adding new ones).
That is generally true because all operations that modify an inode acquire
the inode's lock first, including buffered and direct IO writes. However
there is one exception: memory mapped writes, which do not and can not
acquire the inode's lock.
This can cause two types of issues: ending up logging file extent items
with overlapping ranges, which is detected by the tree checker and will
result in aborting the transaction when starting writeback for a log
tree's extent buffers, or a silent corruption where we log a version of
the file that never existed.
Scenario 1 - logging overlapping extents
The following steps explain how we can end up with file extents items with
overlapping ranges in a log tree due to a race between a fsync and memory
mapped writes:
1) Task A starts an fsync on inode X, which has the full sync runtime flag
set. First it starts by flushing all delalloc for the inode;
2) Task A then locks the inode and flushes any other delalloc that might
have been created after the previous flush and waits for all ordered
extents to complete;
3) In the inode's root we have the following leaf:
Leaf N, generation == current transaction id:
---------------------------------------------------------
| (...) [ file extent item, offset 640K, length 128K ] |
---------------------------------------------------------
The last file extent item in leaf N covers the file range from 640K to
768K;
4) Task B does a memory mapped write for the page corresponding to the
file range from 764K to 768K;
5) Task A starts logging the inode. At copy_inode_items_to_log() it uses
btrfs_search_forward() to search for leafs modified in the current
transaction that contain items for the inode. It finds leaf N and copies
all the inode items from that leaf into the log tree.
Now the log tree has a copy of the last file extent item from leaf N.
At the end of the while loop at copy_inode_items_to_log(), we have the
minimum key set to:
min_key.objectid = <inode X number>
min_key.type = BTRFS_EXTENT_DATA_KEY
min_key.offset = 640K
Then we increment the key's offset by 1 so that the next call to
btrfs_search_forward() leaves us at the first key greater than the key
we just processed.
But before btrfs_search_forward() is called again...
6) Dellaloc for the page at offset 764K, dirtied by task B, is started.
It can be started for several reasons:
- The async reclaim task is attempting to satisfy metadata or data
reservation requests, and it has reached a point where it decided
to flush delalloc;
- Due to memory pressure the VMM triggers writeback of dirty pages;
- The system call sync_file_range(2) is called from user space.
7) When the respective ordered extent completes, it trims the length of
the existing file extent item for file offset 640K from 128K to 124K,
and a new file extent item is added with a key offset of 764K and a
length of 4K;
8) Task A calls btrfs_search_forward(), which returns us a path pointing
to the leaf (can be leaf N or some other) containing the new file extent
item for file offset 764K.
We end up copying this item to the log tree, which overlaps with the
last copied file extent item, which covers the file range from 640K to
768K.
When writeback is triggered for log tree's extent buffers, the issue
will be detected by the tree checker which will dump a trace and an
error message on dmesg/syslog. If the writeback is triggered when
syncing the log, which typically is, then we also end up aborting the
current transaction.
This is the same type of problem fixed in 0c713cbab6 ("Btrfs: fix race
between ranged fsync and writeback of adjacent ranges").
Scenario 2 - logging a version of the file that never existed
This scenario only happens when using the NO_HOLES feature and results in
a silent corruption, in the sense that is not detectable by 'btrfs check'
or the tree checker:
1) We have an inode I with a size of 1M and two file extent items, one
covering an extent with disk_bytenr == X for the file range [0, 512K)
and another one covering another extent with disk_bytenr == Y for the
file range [512K, 1M);
2) A hole is punched for the file range [512K, 1M);
3) Task A starts an fsync of inode I, which has the full sync runtime flag
set. It starts by flushing all existing delalloc, locks the inode (VFS
lock), starts any new delalloc that might have been created before
taking the lock and waits for all ordered extents to complete;
4) Some other task does a memory mapped write for the page corresponding to
the file range [640K, 644K) for example;
5) Task A then logs all items of the inode with the call to
copy_inode_items_to_log();
6) In the meanwhile delalloc for the range [640K, 644K) is started. It can
be started for several reasons:
- The async reclaim task is attempting to satisfy metadata or data
reservation requests, and it has reached a point where it decided
to flush delalloc;
- Due to memory pressure the VMM triggers writeback of dirty pages;
- The system call sync_file_range(2) is called from user space.
7) The ordered extent for the range [640K, 644K) completes and a file
extent item for that range is added to the subvolume tree, pointing
to a 4K extent with a disk_bytenr == Z;
8) Task A then calls btrfs_log_holes(), to scan for implicit holes in
the subvolume tree. It finds two implicit holes:
- one for the file range [512K, 640K)
- one for the file range [644K, 1M)
As a result we end up neither logging a hole for the range [640K, 644K)
nor logging the file extent item with a disk_bytenr == Z.
This means that if we have a power failure and replay the log tree we
end up getting the following file extent layout:
[ disk_bytenr X ] [ hole ] [ disk_bytenr Y ] [ hole ]
0 512K 512K 640K 640K 644K 644K 1M
Which does not corresponding to any layout the file ever had before
the power failure. The only two valid layouts would be:
[ disk_bytenr X ] [ hole ]
0 512K 512K 1M
and
[ disk_bytenr X ] [ hole ] [ disk_bytenr Z ] [ hole ]
0 512K 512K 640K 640K 644K 644K 1M
This can be fixed by serializing memory mapped writes with fsync, and there
are two ways to do it:
1) Make a fsync lock the entire file range, from 0 to (u64)-1 / LLONG_MAX
in the inode's io tree. This prevents the race but also blocks any reads
during the duration of the fsync, which has a negative impact for many
common workloads;
2) Make an fsync write lock the i_mmap_lock semaphore in the inode. This
semaphore was recently added by Josef's patch set:
btrfs: add a i_mmap_lock to our inode
btrfs: cleanup inode_lock/inode_unlock uses
btrfs: exclude mmaps while doing remap
btrfs: exclude mmap from happening during all fallocate operations
and is used to solve races between memory mapped writes and
clone/dedupe/fallocate. This also makes us have the same behaviour we
have regarding other writes (buffered and direct IO) and fsync - block
them while the inode logging is in progress.
This change uses the second approach due to the performance impact of the
first one.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a small window where a deadlock can happen between fallocate and
mmap. This is described in detail by Filipe:
"""
When doing a fallocate operation we lock the inode, flush delalloc within
the target range, wait for any ordered extents to complete and then lock
the file range. Before we lock the range and after we flush delalloc,
there is a time window where another task can come in and do a memory
mapped write for a page within the fallocate range.
This means that after fallocate locks the range, there can be a dirty page
in the range. More often than not, this does not cause any problem.
The exception is when we are low on available metadata space, because an
fallocate operation needs to start a transaction while holding the file
range locked, either through btrfs_prealloc_file_range() or through the
call to btrfs_fallocate_update_isize(). If that's the case, we can end up
in a deadlock. The following list of steps explains how that happens:
1) A fallocate operation starts, locks the inode, flushes delalloc in the
range and waits for ordered extents in the range to complete;
2) Before the fallocate task locks the file range, another task does a
memory mapped write for a page in the fallocate target range. This is
possible since memory mapped writes do not (and can not) lock the
inode;
3) The fallocate task locks the file range. At this point there is one
dirty page in the range (due to the memory mapped write);
4) When the fallocate task attempts to start a transaction, it blocks when
attempting to reserve metadata space, since we are low on available
metadata space. Before blocking (wait on its reservation ticket), it
starts the async reclaim task (if not running already);
5) The async reclaim task is not able to release space through any other
means, so it decides to flush delalloc for inodes with dirty pages.
It finds that the inode used in the fallocate operation has a dirty
page and therefore queues a job (fs_info->flush_workers workqueue) to
flush delalloc for that inode and waits on that job to complete;
6) The flush job blocks when attempting to lock the file range because
it is currently locked by the fallocate task;
7) The fallocate task keeps waiting for its metadata reservation, waiting
for a wakeup on its reservation ticket. The async reclaim task is
waiting on the flush job, which in turn is waiting for locking the file
range that is currently locked by the fallocate task. So unless some
other task is able to release enough metadata space, for example an
ordered extent for some other inode completes, we end up in a deadlock
between all these tasks.
When this happens stack traces like the following show up in dmesg/syslog:
INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
schedule+0x45/0xe0
lock_extent_bits+0x1e6/0x2d0 [btrfs]
? finish_wait+0x90/0x90
btrfs_invalidatepage+0x32c/0x390 [btrfs]
? __mod_memcg_state+0x8e/0x160
__extent_writepage+0x2d4/0x400 [btrfs]
extent_write_cache_pages+0x2b2/0x500 [btrfs]
? lock_release+0x20e/0x4c0
? trace_hardirqs_on+0x1b/0xf0
extent_writepages+0x43/0x90 [btrfs]
? lock_acquire+0x1a3/0x490
do_writepages+0x43/0xe0
? __filemap_fdatawrite_range+0xa4/0x100
__filemap_fdatawrite_range+0xc5/0x100
btrfs_run_delalloc_work+0x17/0x40 [btrfs]
btrfs_work_helper+0xf1/0x600 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x50/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
? kvm_clock_read+0x14/0x30
? wait_for_completion+0x81/0x110
schedule+0x45/0xe0
schedule_timeout+0x30c/0x580
? _raw_spin_unlock_irqrestore+0x3c/0x60
? lock_acquire+0x1a3/0x490
? try_to_wake_up+0x7a/0xa20
? lock_release+0x20e/0x4c0
? lock_acquired+0x199/0x490
? wait_for_completion+0x81/0x110
wait_for_completion+0xab/0x110
start_delalloc_inodes+0x2af/0x390 [btrfs]
btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
flush_space+0x24f/0x660 [btrfs]
btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x20f/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
(...)
several tasks waiting for the inode lock held by the fallocate task below
(...)
RIP: 0033:0x7f61efe73fff
Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c
RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001
R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
task:fdm-stress state:D stack: 0 pid:2508243 ppid:2508153 flags:0x00000000
Call Trace:
__schedule+0x5d1/0xcf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
schedule+0x45/0xe0
__reserve_bytes+0x4a4/0xb10 [btrfs]
? finish_wait+0x90/0x90
btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
btrfs_block_rsv_add+0x1f/0x50 [btrfs]
start_transaction+0x2d1/0x760 [btrfs]
btrfs_replace_file_extents+0x120/0x930 [btrfs]
? btrfs_fallocate+0xdcf/0x1260 [btrfs]
btrfs_fallocate+0xdfb/0x1260 [btrfs]
? filename_lookup+0xf1/0x180
vfs_fallocate+0x14f/0x440
ioctl_preallocate+0x92/0xc0
do_vfs_ioctl+0x66b/0x750
? __do_sys_newfstat+0x53/0x60
__x64_sys_ioctl+0x62/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""
Fix this by disallowing mmaps from happening while we're doing any of
the fallocate operations on this inode.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Darrick reported a potential issue to me where we could allow mmap
writes after validating a page range matched in the case of dedupe.
Generally we rely on lock page -> lock extent with the ordered flush to
protect us, but this is done after we check the pages because we use the
generic helpers, so we could modify the page in between doing the check
and locking the range.
There also exists a deadlock, as described by Filipe
"""
When cloning a file range, we lock the inodes, flush any delalloc within
the respective file ranges, wait for any ordered extents and then lock the
file ranges in both inodes. This means that right after we flush delalloc
and before we lock the file ranges, memory mapped writes can come in and
dirty pages in the file ranges of the clone operation.
Most of the time this is harmless and causes no problems. However, if we
are low on available metadata space, we can later end up in a deadlock
when starting a transaction to replace file extent items. This happens if
when allocating metadata space for the transaction, we need to wait for
the async reclaim thread to release space and the reclaim thread needs to
flush delalloc for the inode that got the memory mapped write and has its
range locked by the clone task.
Basically what happens is the following:
1) A clone operation locks inodes A and B, flushes delalloc for both
inodes in the respective file ranges and waits for any ordered extents
in those ranges to complete;
2) Before the clone task locks the file ranges, another task does a
memory mapped write (which does not lock the inode) for one of the
inodes of the clone operation. So now we have a dirty page in one of
the ranges used by the clone operation;
3) The clone operation locks the file ranges for inodes A and B;
4) Later, when iterating over the file extents of inode A, the clone
task attempts to start a transaction. There's not enough available
free metadata space, so the async reclaim task is started (if not
running already) and we wait for someone to wake us up on our
reservation ticket;
5) The async reclaim task is not able to release space by any other
means and decides to flush delalloc for the inode of the clone
operation;
6) The workqueue job used to flush the inode blocks when starting
delalloc for the inode, since the file range is currently locked by
the clone task;
7) But the clone task is waiting on its reservation ticket and the async
reclaim task is waiting on the flush job to complete, which can't
progress since the clone task has the file range locked. So unless
some other task is able to release space, for example an ordered
extent for some other inode completes, we have a deadlock between all
these tasks;
When this happens stack traces like the following show up in dmesg/syslog:
INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
schedule+0x45/0xe0
lock_extent_bits+0x1e6/0x2d0 [btrfs]
? finish_wait+0x90/0x90
btrfs_invalidatepage+0x32c/0x390 [btrfs]
? __mod_memcg_state+0x8e/0x160
__extent_writepage+0x2d4/0x400 [btrfs]
extent_write_cache_pages+0x2b2/0x500 [btrfs]
? lock_release+0x20e/0x4c0
? trace_hardirqs_on+0x1b/0xf0
extent_writepages+0x43/0x90 [btrfs]
? lock_acquire+0x1a3/0x490
do_writepages+0x43/0xe0
? __filemap_fdatawrite_range+0xa4/0x100
__filemap_fdatawrite_range+0xc5/0x100
btrfs_run_delalloc_work+0x17/0x40 [btrfs]
btrfs_work_helper+0xf1/0x600 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x50/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
? kvm_clock_read+0x14/0x30
? wait_for_completion+0x81/0x110
schedule+0x45/0xe0
schedule_timeout+0x30c/0x580
? _raw_spin_unlock_irqrestore+0x3c/0x60
? lock_acquire+0x1a3/0x490
? try_to_wake_up+0x7a/0xa20
? lock_release+0x20e/0x4c0
? lock_acquired+0x199/0x490
? wait_for_completion+0x81/0x110
wait_for_completion+0xab/0x110
start_delalloc_inodes+0x2af/0x390 [btrfs]
btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
flush_space+0x24f/0x660 [btrfs]
btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x20f/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
(...)
several other tasks blocked on inode locks held by the clone task below
(...)
RIP: 0033:0x7f61efe73fff
Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
RDX: 00000000ffffff9c RSI: 0000560fbd604690 RDI: 00000000ffffff9c
RBP: 00007ffc3371beb0 R08: 0000000000000002 R09: 0000560fbd5d75f0
R10: 0000560fbd5d81f0 R11: 0000000000000202 R12: 0000000000000002
R13: 000000000000000b R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
task: fdm-stress state:D stack: 0 pid:2508234 ppid:2508153 flags:0x00004000
Call Trace:
__schedule+0x5d1/0xcf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
schedule+0x45/0xe0
__reserve_bytes+0x4a4/0xb10 [btrfs]
? finish_wait+0x90/0x90
btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
btrfs_block_rsv_add+0x1f/0x50 [btrfs]
start_transaction+0x2d1/0x760 [btrfs]
btrfs_replace_file_extents+0x120/0x930 [btrfs]
? lock_release+0x20e/0x4c0
btrfs_clone+0x3e4/0x7e0 [btrfs]
? btrfs_lookup_first_ordered_extent+0x8e/0x100 [btrfs]
btrfs_clone_files+0xf6/0x150 [btrfs]
btrfs_remap_file_range+0x324/0x3d0 [btrfs]
do_clone_file_range+0xd4/0x1f0
vfs_clone_file_range+0x4d/0x230
? lock_release+0x20e/0x4c0
ioctl_file_clone+0x8f/0xc0
do_vfs_ioctl+0x342/0x750
__x64_sys_ioctl+0x62/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""
Fix both of these issues by excluding mmaps from happening we are doing
any sort of remap, which prevents this race completely.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A few places we intermix btrfs_inode_lock with a inode_unlock, and some
places we just use inode_lock/inode_unlock instead of btrfs_inode_lock.
None of these places are using this incorrectly, but as we adjust some
of these callers it would be nice to keep everything consistent, so
convert everybody to use btrfs_inode_lock/btrfs_inode_unlock.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We need to be able to exclude page_mkwrite from happening concurrently
with certain operations. To facilitate this, add a i_mmap_lock to our
inode, down_read() it in our mkwrite, and add a new ILOCK flag to
indicate that we want to take the i_mmap_lock as well. I used pahole to
check the size of the btrfs_inode, the sizes are as follows
no lockdep:
before: 1120 (3 per 4k page)
after: 1160 (3 per 4k page)
lockdep:
before: 2072 (1 per 4k page)
after: 2224 (1 per 4k page)
We're slightly larger but it doesn't change how many objects we can fit
per page.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter mirror is not used and does not make sense for checksum
verification of the given bio.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
force_cow can be calculated from inode and does not need to be passed as
an argument.
This simplifies run_delalloc_nocow() call from btrfs_run_delalloc_range()
A new function, should_nocow() checks if the range should be NOCOWed or
not. The function returns true iff either BTRFS_INODE_NODATA or
BTRFS_INODE_PREALLOC, but is not a defrag extent.
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix the following coccicheck warnings:
./fs/btrfs/volumes.c:1462:10-11: WARNING: return of 0/1 in function
'dev_extent_hole_check_zoned' with return type bool.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we do not do btree read ahead when doing an incremental send,
however we know that we will read and process any node or leaf in the
send root that has a generation greater than the generation of the parent
root. So triggering read ahead for such nodes and leafs is beneficial
for an incremental send.
This change does that, triggers read ahead of any node or leaf in the
send root that has a generation greater then the generation of the
parent root. As for the parent root, no readahead is triggered because
knowing in advance which nodes/leaves are going to be read is not so
linear and there's often a large time window between visiting nodes or
leaves of the parent root. So I opted to leave out the parent root,
and triggering read ahead for its nodes/leaves seemed to have not made
significant difference.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of ram:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, incremental send duration:
with $initial_file_count == 200000: 51 seconds
with $initial_file_count == 500000: 168 seconds
After this change, incremental send duration:
with $initial_file_count == 200000: 39 seconds (-26.7%)
with $initial_file_count == 500000: 125 seconds (-29.4%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, and 77759 nodes and leaves in the btree of
the second snapshot. The root nodes were at level 2.
While for $initial_file_count == 500000 there are 152476 nodes and leaves
in the btree of the first snapshot, and 190511 nodes and leaves in the
btree of the second snapshot. The root nodes were at level 2 as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a full send we know that we are going to be reading every node
and leaf of the send root, so we benefit from enabling read ahead for the
btree.
This change enables read ahead for full send operations only, incremental
sends will have read ahead enabled in a different way by a separate patch.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of RAM:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, full send duration:
with $initial_file_count == 200000: 165 seconds
with $initial_file_count == 500000: 407 seconds
After this change, full send duration:
with $initial_file_count == 200000: 149 seconds (-10.2%)
with $initial_file_count == 500000: 353 seconds (-14.2%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, while for $initial_file_count == 500000 there
are 152476 nodes and leaves. The roots were at level 2.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_block_rsv_add can return only ENOSPC since it's called with
NO_FLUSH modifier. This so simplify the logic in
btrfs_delayed_inode_reserve_metadata to exploit this invariant.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add assert and comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's only used for tracepoint to obtain the inode number, but we already
have the ino from btrfs_delayed_node::inode_id.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's no longer expected to call this function with an open transaction
so all the workarounds concerning this can be removed. In fact it'll
constitute a bug to call this function with a transaction already held
so WARN in this case.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Drop function declarations at the beginning of the file scrub.c. These
functions are defined before they are used in the same file and don't
need forward declaration.
No functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_extent_readonly() checks if the block group is readonly, the bool
return type should be used.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_extent_readonly() is used by can_nocow_extent() in inode.c. So
move it from extent-tree.c to inode.c and declare it as static.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_inc_block_group_ro wants to ensure that the current transaction is
not running dirty block groups, if it is it waits and loops again.
That logic is currently implemented using a goto label. Actually using
a proper do {} while() construct doesn't hurt readability nor does it
introduce excessive nesting and makes the relevant code stand out by
being encompassed in the loop construct. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
No point in duplicating the functionality just use the generic helper
that has the same semantics.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is small error in comment about BTRFS_ORDERED_* flags, added in
commit 3c198fe064 ("btrfs: rework the order of
btrfs_ordered_extent::flags") but the fixup did not get merged in time.
The 4 types are for ordered extent itself, not for direct io.
Only 3 types support direct io, REGULAR/NOCOW/PREALLOC.
Fix the comment to reflect that.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Another smaller set of fixes for three of the Arm platforms:
TI OMAP:
Fix swapped mmc device order also for omap3 that got changed with the
recent PROBE_PREFER_ASYNCHRONOUS changes. While eventually the aliases
should be board specific, all the mmc device instances are all there in
the SoC, and we do probe them by default so that PM runtime can idle the
devices if left enabled from the bootloader.
Qualcomm Snapdragon:
This bypasses the, recently introduced, interconnect handling in the
GENI (serial engine) driver when running off ACPI, as this causes the
GENI probe to fail and the Lenovo Yoga C630 to boot without keyboard
and touchpad.
Allwinner:
One 32kHz clock fix for the beelink gs1, a CD polarity fix for the SoPine,
some MAINTAINERS maintainance, and a clk / reset switch to our headers.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmB8hfwACgkQmmx57+YA
GNn5QA//TBcW23bLfjYI8kPl7yJ9KSD6UdNGHXYizJry5hAoyLhvCVSq6quPmAPy
psempGKQBYiRb0Ftewc2+v00u4XdOTxqFw2MDs6UoladfiqyYfkEJxPgXG/k0msJ
gGIOT5ysDeRiqNAFND0wO6z/wPmlgJl37yTztOrbghWwYLvwlUkqsXzJ9B72FCzM
MGwrv1LZfEiljuaJAT+nVNkStKxCxSWjzIvYMgC/K9xbAjjtJNZby2tNJObMiARe
d3G2nGYmo414eQGNb+SDBx5h4aPZGR0ZxdLbzhAFrdw+uUzwlnJ1ufJQnEr6CXql
4MziHYWRYOAF90uLVeWiH8ZEh/CbxdnenmYCooOj+LAkn6IHAErRlFeZAfjWnckh
pwcdeebk4SQ9SNDPIWwwVYKVeGtnMM7q8HucDulMRxYmDL5sTprMhJVwxXbshivw
dnYWzV86FUIOgegUyFgzPKSTVqHbG68dxz2yRhR8yP56pTLnzh/lsB+0DmtiHcIx
O8chRnvtJib5/XspF6CVXqWYDrvIR5L8h7JMbSU/IetADJwQyEYz9CVh/DyNuiJQ
+oZY8Xqt3NzC9xOP/pTP6NFDYsVvKwsQRdwT3CBoV7lEM9X4wEypHOR+QmvO8q8m
AhddgIhx3P7olKnKgylPXS0kjQ3AuBarmnUMI9eaS2tHO2n5z/Y=
=f2ly
-----END PGP SIGNATURE-----
Merge tag 'arm-fixes-5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"Another smaller set of fixes for three of the Arm platforms:
TI OMAP:
Fix swapped mmc device order also for omap3 that got changed with
the recent PROBE_PREFER_ASYNCHRONOUS changes. While eventually the
aliases should be board specific, all the mmc device instances are
all there in the SoC, and we do probe them by default so that PM
runtime can idle the devices if left enabled from the bootloader.
Qualcomm Snapdragon:
This bypasses the recently introduced interconnect handling in
the GENI (serial engine) driver when running off ACPI, as this
causes the GENI probe to fail and the Lenovo Yoga C630 to boot
without keyboard and touchpad.
Allwinner:
One 32kHz clock fix for the beelink gs1, a CD polarity fix for the
SoPine, some MAINTAINERS maintainance, and a clk / reset switch to
our headers"
* tag 'arm-fixes-5.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
arm64: dts: allwinner: h6: beelink-gs1: Remove ext. 32 kHz osc reference
MAINTAINERS: Match on allwinner keyword
MAINTAINERS: Add our new mailing-list
arm64: dts: allwinner: Fix SD card CD GPIO for SOPine systems
arm64: dts: allwinner: h6: Switch to macros for RSB clock/reset indices
ARM: OMAP2+: Fix uninitialized sr_inst
ARM: dts: Fix swapped mmc order for omap3
ARM: OMAP2+: Fix warning for omap_init_time_of()
soc: qcom: geni: shield geni_icc_get() for ACPI boot
- Halve maximum number of CPUs if DEBUG_KMAP_LOCAL is enabled
- Fix conversion for_each_membock() to for_each_mem_range()
- Fix footbridge PCI mapping
- Avoid uprobes hooking on thumb instructions
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEuNNh8scc2k/wOAE+9OeQG+StrGQFAmB8eFQACgkQ9OeQG+St
rGSOeA//W7unQvu7fKda6PdewSk4S35Pz5EAPwFyisYg7HQEikfhBqlLQltM+JGk
z8/NewqFUHjOdOStyBZgx8ZCecmABeRVQQ8yG8IEb4n8GkgagTrXEtmXT7iF0dZQ
4S8PWbf/wgOQZ88xTRnpruIFqJZw8m4xbaQ64NbQM3f5rzMIFiuXdm84ZsUcGVWq
xOCzXUcvwi8xSU0dSfp7iZ//ODE/2qs4qeQ8fTz5ZYmSQRctsGWHLnI6wbtUolFn
gOu19QSkLaIh7iO+CoNvTsMQOm11hQGzyMhSix835zA9NX8K4yCvUT0ZRp6q2AI+
D7gvkkng/VD9j25govHHqkJkqXn8KNIm+SQtm+9bTcx93FpQ3Iq0627zjIOamnmR
7kGIRXL5YB8SPYpl62K8DGN05rZsb71a8lJ1sIsXrBSApB/p4H82UXEIqCWtgZGe
ivOubX+X4LfLerAzwJ0w86Uh3TqFHUH61rswhJPuUFWHUw1lYgM3XnKBzJbYpQfo
9xw2nXOGvQPdn8UgJ8fBhxqtPIOaVIIZY1olCal/yN5348r1U6YuYNbwfCQ68evl
lr3x0m0MxeQE0Q1E0IltOpgCR8FxNYg62HfdAtQ/W+71hh37nP1JCvION3OdQJNR
QNwyRRGsLwSChVPN+YBq347YBaOHr9CmYDYDaK2ucpEwd8l6ORQ=
=LKyZ
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM fixes from Russell King:
- Halve maximum number of CPUs if DEBUG_KMAP_LOCAL is enabled
- Fix conversion for_each_membock() to for_each_mem_range()
- Fix footbridge PCI mapping
- Avoid uprobes hooking on thumb instructions
* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9071/1: uprobes: Don't hook on thumb instructions
ARM: footbridge: fix PCI interrupt mapping
ARM: 9069/1: NOMMU: Fix conversion for_each_membock() to for_each_mem_range()
ARM: 9063/1: mm: reduce maximum number of CPUs if DEBUG_KMAP_LOCAL is enabled
Since uprobes is not supported for thumb, check that the thumb bit is
not set when matching the uprobes instruction hooks.
The Arm UDF instructions used for uprobes triggering
(UPROBE_SWBP_ARM_INSN and UPROBE_SS_ARM_INSN) coincidentally share the
same encoding as a pair of unallocated 32-bit thumb instructions (not
UDF) when the condition code is 0b1111 (0xf). This in effect makes it
possible to trigger the uprobes functionality from thumb, and at that
using two unallocated instructions which are not permanently undefined.
Signed-off-by: Fredrik Strupe <fredrik@strupe.net>
Cc: stable@vger.kernel.org
Fixes: c7edc9e326 ("ARM: add uprobes support")
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
This libsas fix is for a problem that occurs when trying to change the
cache type of an ATA device and the libiscsi one is a regression fix
from this merge window.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
-----BEGIN PGP SIGNATURE-----
iJwEABMIAEQWIQTnYEDbdso9F2cI+arnQslM7pishQUCYHuPRSYcamFtZXMuYm90
dG9tbGV5QGhhbnNlbnBhcnRuZXJzaGlwLmNvbQAKCRDnQslM7pisha52AQCXc8n0
6VAjfc+8aCqjX2Hpw4YCGeW5RYoNj1WXhiDv/AD+L4FVBMdQ4DE9ukH12YW7YBRS
qP03aNSHLCl8wfVon8Q=
=Btn4
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two fixes: the libsas fix is for a problem that occurs when trying to
change the cache type of an ATA device and the libiscsi one is a
regression fix from this merge window"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: libsas: Reset num_scatter if libata marks qc as NODATA
scsi: iscsi: Fix iSCSI cls conn state
vmwgfx:
- fixed unpinning before destruction
- lockdep init reordering
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJge3NBAAoJEAx081l5xIa+U28P/04l35qKLMQhK3EwBZS1cyis
bii8W0RKfAuWl7dEp53nEtHEHG+z0cDH+/RRDhXtr3NV+wCuDF7I8r+P1xn2P+Xy
siWRBsbg5Mp3sH2UegpMrDwgl0uAV7mhkpMZJ08ASCVzsgLNBMiDiDPfDrbSA6ne
yfV3n4Fv4VPdJwMhp33OYB5OYqDpIlgGsei5R0w9zrv7bKnX2193UW5qaVuAd6k2
quKd0aGBATLm/BWBRL4pe1xgR7DJRFe4x4Tihn818tHe8gGaH2xvO6V5k894QlUM
h5lk8xjLFqICj1kDuzmtZhNKU0iOK357NLcimxZYgSf5gXLLZsTwYCaFPAgEquX6
s+q48J9hfo+1TkcMBRih4zpxsRnjaPRIvNmUE6pYgUiUtmJzE9eu/Jblll/keHzr
4miyjiZHOiV8Q+WhpgLbJdPOdqNavxJj8BGCNQajMEQkzN7pZr4ilkhHYMNnslQO
ug+ZqxWM6pPXvOMs/k+M4SOcPEaGcnxVZpAOL/DbtD8iPUlj9J6DkzL943rvpo+o
pMBdEiVuVneJ03ev8IsAzKPID3UJJ0pmnlrZgJpkRpWWkoH75hHzG4Ps4zOAvUsa
ykIY+lcf5GxI2hGVDnOUAfAm/URNooiDIhVc/rf8i5/UJBaK48ORfC8DeZw6r2Wq
pLvZW1PKN/b+IYrOrl4n
=nr7Z
-----END PGP SIGNATURE-----
Merge tag 'drm-fixes-2021-04-18' of git://anongit.freedesktop.org/drm/drm
Pull vmwgfx fixes from Dave Airlie:
"This contains two regression fixes for vmwgfx, one due to a refactor
which meant locks were being used before initialisation, and the other
in fixing up some warnings from the core when destroying pinned
buffers.
vmwgfx:
- fixed unpinning before destruction
- lockdep init reordering"
* tag 'drm-fixes-2021-04-18' of git://anongit.freedesktop.org/drm/drm:
drm/vmwgfx: Make sure bo's are unpinned before putting them back
drm/vmwgfx: Fix the lockdep breakage
drm/vmwgfx: Make sure we unpin no longer needed buffers
Here's a set of 3 patches fixing ugly regressions
in the vmwgfx driver. We broke lock initialization
code and ended up using spinlocks before initialization
breaking lockdep.
Also there was a bit of a fallout from drm changes
which made the core validate that unreferenced buffers
have been unpinned. vmwgfx pinning code predates a lot
of the core drm and wasn't written to account for those
semantics. Fortunately changes required to fix it
are not too intrusive.
The changes have been validated by our internal ci.
Signed-off-by: Zack Rusin <zackr@vmware.com>
-----BEGIN PGP SIGNATURE-----
iQHFBAABCgAvFiEEQfMqp0dj5HcMqHypdmlGesw5fx8FAmB3VVoRHHphY2tyQHZt
d2FyZS5jb20ACgkQdmlGesw5fx/u9AwAhlJqpBH57xPv71rQC93GNg1P1OnIcYBE
EfF65/cynVrbCvRdFI8J7sOMIcDR1Ny6E6vpkKvdKObugu/jSh4MFZgKNmcReNwa
eK7c69LEigmZaZdeDeYgpS7GPkZ9fqBw4j+Y+hX0H/Djztf5Ghd6bGCwd6Z/clZW
b6kJQDfzvDzU0McTyVBH1ctXGQHhKAbkDKie+mbWRJtdMLmvTJQltgIaXnYMqqZq
/fknvD+IYh1pjCQz4g8/5Jn5ljmeuMB1BNhUaLhxHR2EaAZ0BvAXw+4urAezCWJf
YGvZ3rMPfd7NNTvrPbrSa9uIpH9hgZcew6ZGyhHPdfjbp+WkgG4oDQsNwDjkUqws
CQ+N/E8L2/2HTG0yOS5TSazUU2Omz9g2A50puKXhIrB/24baZlnyhbaExWH1Avi6
hMFzKIxeCOaGOUXtdizeD+mF63tlBicS+bcHyon/GgwpfIZLhTbY5nCibWiFCb7t
lGZOUAoJHqx34rI7PKGUU3xyD3+ha45H
=ETTT
-----END PGP SIGNATURE-----
Merge tag 'vmwgfx-fixes-2021-04-14' of gitlab.freedesktop.org:zack/vmwgfx into drm-fixes
vmwgfx fixes for regressions in 5.12
Here's a set of 3 patches fixing ugly regressions
in the vmwgfx driver. We broke lock initialization
code and ended up using spinlocks before initialization
breaking lockdep.
Also there was a bit of a fallout from drm changes
which made the core validate that unreferenced buffers
have been unpinned. vmwgfx pinning code predates a lot
of the core drm and wasn't written to account for those
semantics. Fortunately changes required to fix it
are not too intrusive.
The changes have been validated by our internal ci.
Signed-off-by: Zack Rusin <zackr@vmware.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Zack Rusin <zackr@vmware.com>
Link: https://patchwork.freedesktop.org/patch/msgid/f7add0a2-162e-3bd2-b1be-344a94f2acbf@vmware.com
Pull i2c fix from Wolfram Sang:
"One more driver bugfix for I2C"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: mv64xxx: Fix random system lock caused by runtime PM
This does the directory entry name verification for the legacy
"fillonedir" (and compat) interface that goes all the way back to the
dark ages before we had a proper dirent, and the readdir() system call
returned just a single entry at a time.
Nobody should use this interface unless you still have binaries from
1991, but let's do it right.
This came up during discussions about unsafe_copy_to_user() and proper
checking of all the inputs to it, as the networking layer is looking to
use it in a few new places. So let's make sure the _old_ users do it
all right and proper, before we add new ones.
See also commit 8a23eb804c ("Make filldir[64]() verify the directory
entry filename is valid") which did the proper modern interfaces that
people actually use. It had a note:
Note that I didn't bother adding the checks to any legacy interfaces
that nobody uses.
which this now corrects. Note that we really don't care about POSIX and
the presense of '/' in a directory entry, but verify_dirent_name() also
ends up doing the proper name length verification which is what the
input checking discussion was about.
[ Another option would be to remove the support for this particular very
old interface: any binaries that use it are likely a.out binaries, and
they will no longer run anyway since we removed a.out binftm support
in commit eac6165570 ("x86: Deprecate a.out support").
But I'm not sure which came first: getdents() or ELF support, so let's
pretend somebody might still have a working binary that uses the
legacy readdir() case.. ]
Link: https://lore.kernel.org/lkml/CAHk-=wjbvzCAhAtvG0d81W5o0-KT5PPTHhfJ5ieDFq+bGtgOYg@mail.gmail.com/
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
and bpf. BPF verifier changes stand out, otherwise things have
slowed down.
Current release - regressions:
- gro: ensure frag0 meets IP header alignment
- Revert "net: stmmac: re-init rx buffers when mac resume back"
- ethernet: macb: fix the restore of cmp registers
Previous releases - regressions:
- ixgbe: Fix NULL pointer dereference in ethtool loopback test
- ixgbe: fix unbalanced device enable/disable in suspend/resume
- phy: marvell: fix detection of PHY on Topaz switches
- make tcp_allowed_congestion_control readonly in non-init netns
- xen-netback: Check for hotplug-status existence before watching
Previous releases - always broken:
- bpf: mitigate a speculative oob read of up to map value size by
tightening the masking window
- sctp: fix race condition in sctp_destroy_sock
- sit, ip6_tunnel: Unregister catch-all devices
- netfilter: nftables: clone set element expression template
- netfilter: flowtable: fix NAT IPv6 offload mangling
- net: geneve: check skb is large enough for IPv4/IPv6 header
- netlink: don't call ->netlink_bind with table lock held
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmB6aBQACgkQMUZtbf5S
Iruu2BAAqHKdB5Qd1iBGaA1md8f+elErsotzjONz+eh2yqDKRaOW84+Fo9TPKgu6
se0WmAY1HMUO3TEVdFeBsrgrs+bTY1E1OdfoZ39PFNkMdKMM80Ks1rn94nrPOohy
q1uoNxe9jjT3nRQBTKHWdB3ZC3Jetwf3LP7G2b8SoA+gNd9xl+b1H/drmv7WdE/n
pY7/GND7wd4qqidLRDgAaavaiGIdqym8V0bZEpz7cZtjT/U6RhjkBLKSB8JFGUxP
PQ1NFrYKmLDM1zYTSObLOrKUmEaWzPPSsXmWqGkCE4qjJ8euX0e+5EbxF98JHdYW
O+HMtdgr4UJGWAoxyGaxk7h9w0ydVyC1+Xgi6jAFWdXP7wgvXXQrldLnO44pX/6I
dYlIM+Br/5VmnKiS1i1gBUURREBRSEy7ZYxtREjGC7dFSUn9RPm+0s0x/DCRBS9/
MtNo0lCiuWsyaZ2v57aEKLX4YvGpilzg4UU3/45RNW6OnFzQubvjMBJPfap6EUAC
Ii8uUc/vX0Jq4nZVZzDZ7vlkRcJTQgUqKrzgamUuwJmyPqzefkDcbSZub3tM8G39
eetiHS1nqe3QwuP+TYM3MaBjw0bdgNz9Wt3xmY3Ehnf3pujMR5fbAsCbcdowV5/+
OI2ZcTUZculeAW2q9DgsOCtyS/1huwMHG0zO32TgadbFv45UCS0=
=LN+J
-----END PGP SIGNATURE-----
Merge tag 'net-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.12-rc8, including fixes from netfilter, and
bpf. BPF verifier changes stand out, otherwise things have slowed
down.
Current release - regressions:
- gro: ensure frag0 meets IP header alignment
- Revert "net: stmmac: re-init rx buffers when mac resume back"
- ethernet: macb: fix the restore of cmp registers
Previous releases - regressions:
- ixgbe: Fix NULL pointer dereference in ethtool loopback test
- ixgbe: fix unbalanced device enable/disable in suspend/resume
- phy: marvell: fix detection of PHY on Topaz switches
- make tcp_allowed_congestion_control readonly in non-init netns
- xen-netback: Check for hotplug-status existence before watching
Previous releases - always broken:
- bpf: mitigate a speculative oob read of up to map value size by
tightening the masking window
- sctp: fix race condition in sctp_destroy_sock
- sit, ip6_tunnel: Unregister catch-all devices
- netfilter: nftables: clone set element expression template
- netfilter: flowtable: fix NAT IPv6 offload mangling
- net: geneve: check skb is large enough for IPv4/IPv6 header
- netlink: don't call ->netlink_bind with table lock held"
* tag 'net-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (52 commits)
netlink: don't call ->netlink_bind with table lock held
MAINTAINERS: update my email
bpf: Update selftests to reflect new error states
bpf: Tighten speculative pointer arithmetic mask
bpf: Move sanitize_val_alu out of op switch
bpf: Refactor and streamline bounds check into helper
bpf: Improve verifier error messages for users
bpf: Rework ptr_limit into alu_limit and add common error path
bpf: Ensure off_reg has no mixed signed bounds for all types
bpf: Move off_reg into sanitize_ptr_alu
bpf: Use correct permission flag for mixed signed bounds arithmetic
ch_ktls: do not send snd_una update to TCB in middle
ch_ktls: tcb close causes tls connection failure
ch_ktls: fix device connection close
ch_ktls: Fix kernel panic
i40e: fix the panic when running bpf in xdpdrv mode
net/mlx5e: fix ingress_ifindex check in mlx5e_flower_parse_meta
net/mlx5e: Fix setting of RS FEC mode
net/mlx5: Fix setting of devlink traps in switchdev mode
Revert "net: stmmac: re-init rx buffers when mac resume back"
...
- Fix a regression of read-only handling in the pmem driver.
- Fix a compile warning.
- Fix support for platform cache flush commands on powerpc/papr
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEf41QbsdZzFdA8EfZHtKRamZ9iAIFAmB6UwwACgkQHtKRamZ9
iAJRbg//a/JsvdhuqiM6y/ZbdwMqf1Iwl+OwAhX5BDdGl+IibN6BacAPrau1ba7c
Kww2lbk5E2OrNz71eVdpJZTTTI3Q0/Zrw95e8CBjyHNw+r5ROisrqKVmqnCYhhHQ
GI0JitZbQouXnHUxFpF4rEXmVOQMjlR87f2unAbQL3H6lm5NIAj4lzvBJUc4HR/6
SJVbLdRfNc2vxQm8WnUZzGBKox8fseQ0gK7a5MGGhjJSkqqtIb4LUNKh7XD4nXs+
ELD7i11TFuHiwh7aycm0lqC3EICw3OtyhOlOwBF7sVvIKSGnt+w4zk1BkkU7/zTx
CwaGJFcof9uPQLOShbV2JLaUxPGq9Pmt7gMGhVBqC4uQxLCjdoMadohpE163TCZZ
r8jerU+zZGrKw3fJxkFL5FiIMKwtkfvt8l1YAHg4L1KQuUBnjO47bu1Oe+E0WCl8
T41jw6bQkhWehnluPVFZxXDt1/uKC3xM7bUipULG/ou76QO4oHxNpZewv3cbblWR
IQVn1VVRkS+qBjF0JTFbyFqRy8Ip/AbaKlMO3exf3hKP2YBWDa34qXCTA1RDqkCS
kJVFTYhc3cxgo9g0g79K2ly6kfmrLkk3QB12ZIUvl/yjOCNowbr0nsxtSgLgVku8
XpLy6rVbI5Ds9P6tL0kiEz5FYtHJdwRfrUP3lFteVJ0VJGw0lCc=
=L0d1
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-fixes-for-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"The largest change is for a regression that landed during -rc1 for
block-device read-only handling. Vaibhav found a new use for the
ability (originally introduced by virtio_pmem) to call back to the
platform to flush data, but also found an original bug in that
implementation. Lastly, Arnd cleans up some compile warnings in dax.
This has all appeared in -next with no reported issues.
Summary:
- Fix a regression of read-only handling in the pmem driver
- Fix a compile warning
- Fix support for platform cache flush commands on powerpc/papr"
* tag 'libnvdimm-fixes-for-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm/region: Fix nvdimm_has_flush() to handle ND_REGION_ASYNC
libnvdimm: Notify disk drivers to revalidate region read-only
dax: avoid -Wempty-body warnings
- Fix support for CXL memory devices with registers offset from the BAR
base.
- Fix the reporting of device capacity.
- Fix the driver commands list definition to be disconnected from the
UAPI command list.
- Replace percpu_ref with rwsem to fix initialization error path.
- Fix leaks in the driver initialization error path.
- Drop the power/ directory from CXL device sysfs.
- Use the recommended sysfs helper for attribute 'show' implementations.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEf41QbsdZzFdA8EfZHtKRamZ9iAIFAmB6TpMACgkQHtKRamZ9
iAIrLBAAsmxYIItUvSP9OSkRTv/UHkk4swVef1nsaNpf+yOhpXMCXwmNkphlBFUL
aD4fHyCPrDDHoFfUZY7sovq+KCEqcCD47qMdaS/E1VlEAsrKfsbCyKoJk54TJ0SK
IDMB367LGN+wKAZl94hLFDcSW8bXq79swqB4AW1W2wXJKkJrzodh+IwUA7mJhV3g
05GQ3Is+brIkZ7iwho/50KEteswXu5jQXfFR3fzHXbevnKq6Aom7Iud4grEP9ztR
xqgw/exJXNrrIymxyFz3uQy5WRr53U/YzNuxPHYJPoKxOOCc++kjlk+wKBsAcvGt
ZiBA8VkBBWBHVDYrKQ/KfkHZYT/gUB+5Nj6jTx1h0VkALq17wD15NA2uokSV0oFe
sFpZsTqQCI1/PoyUMWjF4FrftrfIqCBNCbtkI5A0JOzL6d5/YZPnGu/KyxbK/FpI
qUDPzyxSfPnODKq6j359zvT6HYi4uf2AyCskJS0DDS1lZGoWlVb23RNP8lPv9rhF
UhFzdNvbwRr82Am9jZJ0R9RaF1eyTKC0GOC/KEOxZOofEPJ9fKqG9sbhoJK06tAM
+vfyw49tMN1+7fxEBrlggYlD6h2BIZD6+vN7hqOWdQWqSpS5/lwafKqM+bL1IEi+
BwhrdEsHp/0z0/Qhqos7DmdTghF0AVdxjc/TdtZ3Y8d+BNSfR0g=
=Nh/e
-----END PGP SIGNATURE-----
Merge tag 'cxl-fixes-for-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull CXL memory class fixes from Dan Williams:
"A collection of fixes for the CXL memory class driver introduced in
this release cycle.
The driver was primarily developed on a work-in-progress QEMU
emulation of the interface and we have since found a couple places
where it hid spec compliance bugs in the driver, or had a spec
implementation bug itself.
The biggest change here is replacing a percpu_ref with an rwsem to
cleanup a couple bugs in the error unwind path during ioctl device
init. Lastly there were some minor cleanups to not export the
power-management sysfs-ABI for the ioctl device, use the proper sysfs
helper for emitting values, and prevent subtle bugs as new
administration commands are added to the supported list.
The bulk of it has appeared in -next save for the top commit which was
found today and validated on a fixed-up QEMU model.
Summary:
- Fix support for CXL memory devices with registers offset from the
BAR base.
- Fix the reporting of device capacity.
- Fix the driver commands list definition to be disconnected from the
UAPI command list.
- Replace percpu_ref with rwsem to fix initialization error path.
- Fix leaks in the driver initialization error path.
- Drop the power/ directory from CXL device sysfs.
- Use the recommended sysfs helper for attribute 'show'
implementations"
* tag 'cxl-fixes-for-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/mem: Fix memory device capacity probing
cxl/mem: Fix register block offset calculation
cxl/mem: Force array size of mem_commands[] to CXL_MEM_COMMAND_ID_MAX
cxl/mem: Disable cxl device power management
cxl/mem: Do not rely on device_add() side effects for dev_set_name() failures
cxl/mem: Fix synchronization mechanism for device removal vs ioctl operations
cxl/mem: Use sysfs_emit() for attribute show routines
The CXL Identify Memory Device output payload emits capacity in 256MB
units. The driver is treating the capacity field as bytes. This was
missed because QEMU reports bytes when it should report bytes / 256MB.
Fixes: 8adaf747c9 ("cxl/mem: Find device capabilities")
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Link: https://lore.kernel.org/r/161862021044.3259705.7008520073059739760.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
When I added support to allow generic netlink multicast groups to be
restricted to subscribers with CAP_NET_ADMIN I was unaware that a
genl_bind implementation already existed in the past.
It was reverted due to ABBA deadlock:
1. ->netlink_bind gets called with the table lock held.
2. genetlink bind callback is invoked, it grabs the genl lock.
But when a new genl subsystem is (un)registered, these two locks are
taken in reverse order.
One solution would be to revert again and add a comment in genl
referring 1e82a62fec, "genetlink: remove genl_bind").
This would need a second change in mptcp to not expose the raw token
value anymore, e.g. by hashing the token with a secret key so userspace
can still associate subflow events with the correct mptcp connection.
However, Paolo Abeni reminded me to double-check why the netlink table is
locked in the first place.
I can't find one. netlink_bind() is already called without this lock
when userspace joins a group via NETLINK_ADD_MEMBERSHIP setsockopt.
Same holds for the netlink_unbind operation.
Digging through the history, commit f773608026
("netlink: access nlk groups safely in netlink bind and getname")
expanded the lock scope.
commit 3a20773bee ("net: netlink: cap max groups which will be considered in netlink_bind()")
... removed the nlk->ngroups access that the lock scope
extension was all about.
Reduce the lock scope again and always call ->netlink_bind without
the table lock.
The Fixes tag should be vs. the patch mentioned in the link below,
but that one got squash-merged into the patch that came earlier in the
series.
Fixes: 4d54cc3211 ("mptcp: avoid lock_fast usage in accept path")
Link: https://lore.kernel.org/mptcp/20210213000001.379332-8-mathew.j.martineau@linux.intel.com/T/#u
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Sean Tranchetti <stranche@codeaurora.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmB5/80QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgJWD/0YyQ7YlMKw4x2nS6TiojeD5KN8aTvVVHlf
TetYVXIOaMNPatUDREVl7vgaQVFi2cgWLvshYPIarsQK9ESNJ/UqKi0c6pJsa29w
02K8EJsEdmQmJlyC6tUv/Drvdfzinx+jAm9doz8FrJ+2v5+pYqNFhFTh/JKRebI8
kBEKH/e3eHg305TDXcWOoNxX3jD77v1IJo8JavLt1vncQKb6P3dahGwmkFRpRcHZ
sY33baI2jdO/79ybH0F0fNgD+XGnBYShKB7iOlJfT9SeI8NfpR/OCuuPl3gxeeKZ
PwHcsPHA/b0AAUKpWY0sX/86xMUUab2DPJV5nQR5SrgniJLs0Xqm0dvz1r6QM2BK
7EfwbXK48xm6Wop5WFlo+wqzW11ES/nRHm9leRomeeKo6Goe/4kyEgK93QJOkckv
/kt8t4PIxALlWKmgJgKk5kEWiP/QT0LjaZE2lYy9VcfpSoC7E7ZnK1KlvwbYOfZK
QqjNNsRr2z7zQaUCFHDc3mPe/WkV8DygRrXZYl8qfopqndOduceFaGqNe5iZV2IX
NnC7FGY66dcmjqGmOqr+4RvhUYR2+GtazaYgV14LM1cGBO5OtqlS1m+DRDbiavp4
fpHeGGQx78O97LURNJG4Ti/AR5dfSQLQwZzb60UBk43a9ZwVsLtxFquSaV8dkm3U
0hfxANWYXg==
=KEBW
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.12-2021-04-16' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Fix for a potential hang at exit with SQPOLL from Pavel"
* tag 'io_uring-5.12-2021-04-16' of git://git.kernel.dk/linux-block:
io_uring: fix early sqd_list removal sqpoll hangs