The s_id is already printed by message helpers.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Several members of compressed_bio are of type that's unnecessarily big
for the values that they'd hold:
- the size of the uncompressed and compressed data is 128K now, we can
keep is as int
- same for number of pages
- the compress type fits to a byte
- the errors is 0/1
The size of the unpatched structure is 80 bytes with several holes.
Reordering nr_pages next to the pages the hole after pending_bios is
filled and the resulting size is 56 bytes. This keeps the csums array
aligned to 8 bytes, which is nice. Further size optimizations may be
possible but right now it looks good to me:
struct compressed_bio {
refcount_t pending_bios; /* 0 4 */
unsigned int nr_pages; /* 4 4 */
struct page * * compressed_pages; /* 8 8 */
struct inode * inode; /* 16 8 */
u64 start; /* 24 8 */
unsigned int len; /* 32 4 */
unsigned int compressed_len; /* 36 4 */
u8 compress_type; /* 40 1 */
u8 errors; /* 41 1 */
/* XXX 2 bytes hole, try to pack */
int mirror_num; /* 44 4 */
struct bio * orig_bio; /* 48 8 */
u8 sums[]; /* 56 0 */
/* size: 56, cachelines: 1, members: 12 */
/* sum members: 54, holes: 1, sum holes: 2 */
/* last cacheline: 56 bytes */
};
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are common values set for the stripe constraints, some of them
are already factored out. Do that for increment and mirror_num as well.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a log recovery is in progress, lots of operations have to take that
into account, so we keep this status per tree during the operation. Long
time ago error handling revamp patch 79787eaab4 ("btrfs: replace many
BUG_ONs with proper error handling") removed clearing of the status in
an error branch. Add it back as was intended in e02119d5a7 ("Btrfs:
Add a write ahead tree log to optimize synchronous operations").
There are probably no visible effects, log replay is done only during
mount and if it fails all structures are cleared so the stale status
won't be kept.
Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The defrag loop processes leaves in batches and starting transaction for
each. The whole defragmentation on a given root is protected by a bit
but in case the transaction fails, the bit is not cleared
In case the transaction fails the bit would prevent starting
defragmentation again, so make sure it's cleared.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type of discard_bitmap_bytes and discard_extent_bytes is u64 so the
format should be %llu, though the actual values would hardly ever
overflow to negative values.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While stress testing our error handling I noticed that sometimes we
would still commit the transaction even though we had aborted the
transaction.
Currently we track if a trans handle has dirtied any metadata, and if it
hasn't we mark the filesystem as having an error (so no new transactions
can be started), but we will allow the current transaction to complete
as we do not mark the transaction itself as having been aborted.
This sounds good in theory, but we were not properly tracking IO errors
in btrfs_finish_ordered_io, and thus committing the transaction with
bogus free space data. This isn't necessarily a problem per-se with the
free space cache, as the other guards in place would have kept us from
accepting the free space cache as valid, but highlights a real world
case where we had a bug and could have corrupted the filesystem because
of it.
This "skip abort on empty trans handle" is nice in theory, but assumes
we have perfect error handling everywhere, which we clearly do not.
Also we do not allow further transactions to be started, so all this
does is save the last transaction that was happening, which doesn't
necessarily gain us anything other than the potential for real
corruption.
Remove this particular bit of code, if we decide we need to abort the
transaction then abort the current one and keep us from doing real harm
to the file system, regardless of whether this specific trans handle
dirtied anything or not.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_truncate() where we truncate the inode either to the same size
or to a smaller size, we always set the full sync flag on the inode.
This is needed in case the truncation drops or trims any file extent items
that start beyond or cross the new inode size, so that the next fsync
drops all inode items from the log and scans again the fs/subvolume tree
to find all items that must be logged.
However if the truncation does not drop or trims any file extent items, we
do not need to set the full sync flag and force the next fsync to use the
slow code path. So do not set the full sync flag in such cases.
One use case where it is frequent to do truncations that do not change
the inode size and do not drop any extents (no prealloc extents beyond
i_size) is when running Microsoft's SQL Server inside a Docker container.
One example workload is the one Philipp Fent reported recently, in the
thread with a link below. In this workload a large number of fsyncs are
preceded by such truncate operations.
After this change I constantly get the runtime for that workload from
Philipp to be reduced by about -12%, for example from 184 seconds down
to 162 seconds.
Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment at the top of btrfs_truncate() mentions that csum items are
dropped or truncated to the new i_size, but this is wrong and non sense,
as they are unrelated to the i_size and are located in the csums tree and
not on a tree with inode items (fs/subvolume tree or a log tree). Instead
that claim applies to file extent items, so fix the comment to refer to
them instead.
While at it make the whole comment for the function more descriptive and
follow the kernel doc style.
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to update the delayed inode we need to abort the transaction,
because we could leave an inode with the improper counts or some other
such corruption behind.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we get an error while looking up the inode item we'll simply bail
without cleaning up the delayed node. This results in this style of
warning happening on commit:
WARNING: CPU: 0 PID: 76403 at fs/btrfs/delayed-inode.c:1365 btrfs_assert_delayed_root_empty+0x5b/0x90
CPU: 0 PID: 76403 Comm: fsstress Tainted: G W 5.13.0-rc1+ #373
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:btrfs_assert_delayed_root_empty+0x5b/0x90
RSP: 0018:ffffb8bb815a7e50 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff95d6d07e1888 RCX: ffff95d6c0fa3000
RDX: 0000000000000002 RSI: 000000000029e91c RDI: ffff95d6c0fc8060
RBP: ffff95d6c0fc8060 R08: 00008d6d701a2c1d R09: 0000000000000000
R10: ffff95d6d1760ea0 R11: 0000000000000001 R12: ffff95d6c15a4d00
R13: ffff95d6c0fa3000 R14: 0000000000000000 R15: ffffb8bb815a7e90
FS: 00007f490e8dbb80(0000) GS:ffff95d73bc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6e75555cb0 CR3: 00000001101ce001 CR4: 0000000000370ef0
Call Trace:
btrfs_commit_transaction+0x43c/0xb00
? finish_wait+0x80/0x80
? vfs_fsync_range+0x90/0x90
iterate_supers+0x8c/0x100
ksys_sync+0x50/0x90
__do_sys_sync+0xa/0x10
do_syscall_64+0x3d/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Because the iref isn't dropped and this leaves an elevated node->count,
so any release just re-queues it onto the delayed inodes list. Fix this
by going to the out label to handle the proper cleanup of the delayed
node.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Right now we only cleanup the delayed iref if we have
BTRFS_DELAYED_NODE_DEL_IREF set on the node. However we have some error
conditions that need to cleanup the iref if it still exists, so to make
this code cleaner move the test_bit into btrfs_release_delayed_iref
itself and unconditionally call it in each of the cases instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add sysfs interface to limit io during scrub. We relied on the ionice
interface to do that, eg. the idle class let the system usable while
scrub was running. This has changed when mq-deadline got widespread and
did not implement the scheduling classes. That was a CFQ thing that got
deleted. We've got numerous complaints from users about degraded
performance.
Currently only BFQ supports that but it's not a common scheduler and we
can't ask everybody to switch to it.
Alternatively the cgroup io limiting can be used but that also a
non-trivial setup (v2 required, the controller must be enabled on the
system). This can still be used if desired.
Other ideas that have been explored: piggy-back on ionice (that is set
per-process and is accessible) and interpret the class and classdata as
bandwidth limits, but this does not have enough flexibility as there are
only 8 allowed and we'd have to map fixed limits to each value. Also
adjusting the value would need to lookup the process that currently runs
scrub on the given device, and the value is not sticky so would have to
be adjusted each time scrub runs.
Running out of options, sysfs does not look that bad:
- it's accessible from scripts, or udev rules
- the name is similar to what MD-RAID has
(/proc/sys/dev/raid/speed_limit_max or /sys/block/mdX/md/sync_speed_max)
- the value is sticky at least for filesystem mount time
- adjusting the value has immediate effect
- sysfs is available in constrained environments (eg. system rescue)
- the limit also applies to device replace
Sysfs:
- raw value is in bytes
- values written to the file accept suffixes like K, M
- file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max
- 0 means use default priority of IO
The scheduler is a simple deadline one and the accuracy is up to nearest
128K.
Signed-off-by: David Sterba <dsterba@suse.com>
To be able to construct a zone append bio we need to look up the
btrfs_device. The code doing the chunk map lookup to get the device is
present in btrfs_submit_compressed_write and submit_extent_page.
Factor out the lookup calls into a helper and use it in the submission
paths.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inode defrag is canceled, the error is set to EAGAIN but then
overwritten by number of defragmented bytes. As this would hide the
error, rather return EAGAIN. This does not harm 'btrfs fi defrag', it
will print the error and continue to next file (as it does in for any
other error).
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The io_failure_record::in_validation was introduced to handle failed bio
which cross several sectors. In such case, we still need to verify
which sectors are corrupted.
But since we've changed the way how we handle corrupted sectors, by only
submitting repair for each corrupted sector, there is no need for extra
validation any more.
This patch will cleanup all io_failure_record::in_validation related
code.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_submit_read_repair() has some extra check on whether the
failed bio needs extra validation for repair. But we can avoid all
these extra mechanisms if we submit the repair for each sector.
By this, each read repair can be easily handled without the need to
verify which sector is corrupted.
This will also benefit subpage, as one subpage bvec can contain several
sectors, making the extra verification more complex.
So this patch will:
- Introduce repair_one_sector()
The main code submitting repair, which is more or less the same as old
btrfs_submit_read_repair().
But this time, it only repairs one sector.
- Make btrfs_submit_read_repair() to handle sectors differently
There are 3 different cases:
* Good sector
We need to release the page and extent, set the range uptodate.
* Bad sector and failed to submit repair bio
We need to release the page and extent, but not set the range
uptodate.
* Bad sector but repair bio submitted
The page and extent release will be handled by the submitted repair
bio. Nothing needs to be done.
Since btrfs_submit_read_repair() will handle the page and extent
release now, we need to skip to next bvec even we hit some error.
- Change the lifespan of @uptodate in end_bio_extent_readpage()
Since now btrfs_submit_read_repair() will handle the full bvec
which contains any corruption, we don't need to bother updating
@uptodate bit anymore.
Just let @uptodate to be local variable inside the main loop,
so that any error from one bvec won't affect later bvec.
- Only export btrfs_repair_one_sector(), unexport
btrfs_submit_read_repair()
The only outside caller for read repair is DIO, which already submits
its repair for just one sector.
Only export btrfs_repair_one_sector() for DIO.
This patch will focus on the change on the repair path, the extra
validation code is still kept as is, and will be cleaned up later.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This will provide the basis for later per-sector repair for subpage,
while still keeping the existing code happy.
As if all csums match, the return value will be 0, same as now.
Only when csum mismatches, the return value is different.
The new return value will be a bitmap, for 4K sectorsize and 4K page
size, it will be either 1, instead of the -EIO (which is not used
directly by the callers, no effective change).
But for 4K sectorsize and 64K page size, aka subpage case, since the
bvec can contain multiple sectors, knowing which sectors are corrupted
will allow us to submit repair only for corrupted sectors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'check_async_write' function is a helper used in
'btrfs_submit_metadata_bio' and it checks if asynchronous writing can be
used for metadata.
Make the function return bool and get rid of the local variable async in
btrfs_submit_metadata_bio storing the result of check_async_write's
tests.
As this is touching all function call sites, also rename it to
should_async_write as this is more in line with the naming we use.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we can't read a reliable write pointer from a sequential zone fail
creating the block group with an I/O error.
Also if the read write pointer is beyond the end of the respective zone,
fail the creation of the block group on this zone with an I/O error.
While this could also happen in real world scenarios with misbehaving
drives, this issue addresses a problem uncovered by fstests' test case
generic/475.
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This extends patch 784daf2b96 ("btrfs: zoned: sanity check zone
type"), the message was supposed to be there but was lost during merge.
We want to make the error noticeable so add it.
Fixes: 784daf2b96 ("btrfs: zoned: sanity check zone type")
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we decide to flush delalloc from the preemptive flusher, we really do
not want to wait on ordered extents, as it gains us nothing. However
there was logic to go ahead and wait on ordered extents if there was
more ordered bytes than delalloc bytes. We do not want this behavior,
so pass through whether this flushing is for preemption, and do not wait
for ordered extents if that's the case. Also break out of the shrink
loop after the first flushing, as we just want to one shot shrink
delalloc.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While testing heavy delalloc workloads I noticed that sometimes we'd
just stop preemptively flushing when we had loads of delalloc available
to flush. This is because we skip preemptive flushing if delalloc <=
ordered. However if we start with say 4gib of delalloc, and we flush
2gib of that, we'll stop flushing there, when we still have 2gib of
delalloc to flush.
Instead adjust the ordered bytes down by half, this way if 2/3 of our
outstanding delalloc reservations are tied up by ordered extents we
don't bother preemptive flushing, as we're getting close to the state
where we need to wait on ordered extents.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When deciding if we should preemptively flush space, we will add in the
amount of space used by all block rsvs. However this also includes the
global block rsv, which isn't flushable so shouldn't be accounted for in
this calculation. If we decide to use ->bytes_may_use in our used
calculation we need to subtract the global rsv size from this amount so
it most closely matches the flushable space.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We calculate the amount of "free" space available for normal
reservations by taking the total space and subtracting out the hard used
space, which is readonly, used, and reserved space.
However we weren't taking into account the global block rsv, which is
essentially hard used space. Handle this by subtracting it from the
available free space, so that our threshold more closely mirrors
reality.
We need to do the check because it's possible that the global_rsv_size +
used is > total_bytes, sometimes the global reserve can end up being
calculated as larger than the available size (think small filesystems
where we only have the original 8MiB chunk of metadata). It doesn't
usually happen, but that can get us into trouble so this is safer.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Global rsv can't be used for normal allocations, and for very full file
systems we can decide to try and async flush constantly even though
there's really not a lot of space to reclaim. Deal with this by
including the global block rsv size in the "total used" calculation.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were clamping the threshold for preemptive reclaim any time we added
a ticket to wait on, which if we have a lot of threads means we'd
essentially max out the clamp the first time we start to flush.
Instead of doing this, simply do it every time we have to start
flushing, this will make us ramp up gradually instead of going to max
clamping as soon as we start needing to do flushing.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
need_preemptive_reclaim() does some calculations, which aren't heavy,
but if we're already running preemptive reclaim there's no reason to do
them at all, so re-order the checks so that we don't do the calculation
if we're already doing reclaim.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit b2598edf8b ("btrfs: remove unused argument seed from
btrfs_find_device") removed the argument seed from btrfs_find_device
but forgot the comment, so remove it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
try_lock_extent() returns 1 on success or 0 for failure and not an error
code. If try_lock_extent() fails, read_extent_buffer_subpage() returns
zero indicating subpage extent read success.
Return EAGAIN/EWOULDBLOCK if try_lock_extent() fails in locking the
extent.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDO65oACgkQEsHwGGHe
VUqEuw//R3hYrPE6luEeDb8V3er2QXJn320G0Jv2yqycmZlZYzN0tuCg0heRUWGS
Mh8jnKlNZ0GWh/9CWApxQFbqMnt/G3kkLhMZRbIfNm/dvRxdVVB8MP98+5sgcMv0
mbuLNrrANB6HANMxs0lM0IvzH24YDDVI92J17zSZEi4mKjDPmHAvtZgjfhzlNU4p
EZEMx7UZLyBsP4/cMXpq2SEmY8A3evjKkSjVIXhBMf929mpbGYzSM5RSPy+abGDt
ai/E5644RT27HWo/g7mh/szC9OZbg6TNbsF9J6msInh6kCDLBv6Awh/0NUM3bDu4
r7H2qgxTIv3oisTZNf9qjyx1uStcJJDGF66t7NhYXRVqChQbOYNBoYJ85kVSh/FD
jjiow9WtwYrbjQ8bT/+NkGu0poUL5gxnGqUfFyucofq46ct48+I36pnr+3V12OTj
BxJb0sIDAJNPxv1NODpOMtEJJeiumtROU0usFHb9wnTz8jGhU/PFyKdKk8wa41f2
kG3fQp39gI6T+D0Va3tMCY3UNOFocDmDYhsYZfRtUPp+d1T2jTPP23Yx5XCQ6gew
cxliIwttQAO6W4dJe4H5Txm5jmjzDfWJ9zrAXuABkVghRxXswiGHhhfUn2AhJK0w
9Sz+E7xYA/G5ht6pCvXpFfsJ0S0R4YFHddu7rS8S/EGNLe2729o=
=KnMk
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
"A single fix to restore fairness between control groups with equal
priority"
* tag 'sched_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Correctly insert cfs_rq's to list on unthrottle
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDO6X8ACgkQEsHwGGHe
VUpz8hAAwaEfQmTDwJ+4PKOklmEdn7sgIeuIVXFrpCn1psDB0btqkrzAPkzPAM63
ISnvKuySId3uDQ5mmgLwaQDsa3j1yciOhdyA3dGBCidtR6zcm/hCM48y3iUs1kRH
CS8Ai/MpUQzi8Y/bFDMkQ3yedQG5CMApy63xk3MVlNh+jUBZkQ3fSPynrVN+jVR0
nBbJXkKcMD7CGFgQnNO7weqnYJrcxWuQZSHALotJDBoVas0sgj97CLDDLmA5n8NW
42QLW7OUxEUkMfRWb/iCqkzZ7vrKVUHZC2d/rBzWCRIQy5SHIwwVg4FFeAr51cTt
72+MTf6lnA8aXQffVyvMPnHuhSp1ynin5NOMsu3YXMbF1lIU8ptKy5V3ttvF+Bcb
cktI5i076PjScbvxbikTrI0QmLoeb2QTEnDgErrUN7CZLcVZQFLXtYbjrQvI9ycF
8Ezw3a76tIzb4uSaxar8e4Sn1lc8VpEDUMNlhAu1/g/mFzlHF86QLuh4y4mjrReV
9P1hMqtlfbDvLpQVu8S6KlrwXv+znqpRg9utA6SgJzt6yjnywOzlcptIdjHg3fE2
gxhGs7/edp3NkN936o5Bh5vydtrCGUMtAS2KcxuRXKyEdusOgYJUgo85/QX2mI1h
xNzn1yVjQuvU/Hp91nM7V/9boeDYgGqzRCQKR9frVTDHilezn9Q=
=Ss4I
-----END PGP SIGNATURE-----
Merge tag 'irq_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Borislav Petkov:
"A single fix for GICv3 to not take an interrupt in an NMI context"
* tag 'irq_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3: Workaround inconsistent PMR setting on NMI entry
(There's a lot more in the pipe):
- Prevent corruption of the XSTATE buffer in signal handling by
validating what is being copied from userspace first.
- Invalidate other task's preserved FPU registers on XRSTOR failure
(#PF) because latter can still modify some of them.
- Restore the proper PKRU value in case userspace modified it
- Reset FPU state when signal restoring fails
Other:
- Map EFI boot services data memory as encrypted in a SEV guest so that
the guest can access it and actually boot properly
- Two SGX correctness fixes: proper resources freeing and a NUMA fix
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmDO5vQACgkQEsHwGGHe
VUrUjw//fRU8BPZ3/SWNQO188QhHdFpm3jqtjRJsZD1FfnnLdxIg2SCP4RjFxv+Y
eFyN0nYLekG8a3CMV081H9Rhr5tt3bflk0oTcGAar7m2qQiCiqaAH0wptIlQonSu
nQCSs+PeaaK4nRCtW+TUJnwG0ZU/y7fEXa3pxJ6hSMnxZjz3lj70zKhpA1nQtqRZ
OOStvBNtaWcDdTTE4r8XuFIxuMUUEuwHlQQmkAVHQYUf6vxGYfnDYEg83Wddvq1E
1leSRNFlLcCAbPUV/fax3KGvaekeJ1U411uWqXlain6m105+mk+irmrLxtur/lJ5
cWTVb5CbIHFZnJvC5jzNPv/03GbIIQaVm4jPI2qB1AZbjcVlAPKj1Ne+U1fzvmDT
wNUob/rnIXiGptvtUMNYGURxBTj65Nnom3iAJV+AdMOThDwYMvsJJjFkMnC5wO2n
ZAexumWPnUzWoxSMTraT7a6b/kilFUrcPljxSrFd9yVeU8E6a1OSW35oWoQ3itrc
xx/ne8RodLmCPC9DjecFcQR+qUuXsF+XCCj07QpfKNTAObr17e9nsKJneR6MX79C
Lpc7Ka/CiTGYcebWX7tqtjwGPfa6iqekswxYRRp7j54bQ4sHmKyordZy0Q8+c079
gmMlPdNbqQg3YwHyXW2yeJETDS1HBp61RRojAP15BsL73wyYQNE=
=AuXr
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
"A first set of urgent fixes to the FPU/XSTATE handling mess^W code.
(There's a lot more in the pipe):
- Prevent corruption of the XSTATE buffer in signal handling by
validating what is being copied from userspace first.
- Invalidate other task's preserved FPU registers on XRSTOR failure
(#PF) because latter can still modify some of them.
- Restore the proper PKRU value in case userspace modified it
- Reset FPU state when signal restoring fails
Other:
- Map EFI boot services data memory as encrypted in a SEV guest so
that the guest can access it and actually boot properly
- Two SGX correctness fixes: proper resources freeing and a NUMA fix"
* tag 'x86_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Avoid truncating memblocks for SGX memory
x86/sgx: Add missing xa_destroy() when virtual EPC is destroyed
x86/fpu: Reset state for all signal restore failures
x86/pkru: Write hardware init value to PKRU when xstate is init
x86/process: Check PF_KTHREAD and not current->mm for kernel threads
x86/fpu: Invalidate FPU state after a failed XRSTOR from a user buffer
x86/fpu: Prevent state corruption in __fpu__restore_sig()
x86/ioremap: Map EFI-reserved memory as encrypted for SEV
Fix initrd corruption caused by our recent change to use relative jump labels.
Fix a crash using perf record on systems without a hardware PMU backend.
Rework our 64-bit signal handling slighty to make it more closely match the old behaviour,
after the recent change to use unsafe user accessors.
Thanks to: Anastasia Kovaleva, Athira Rajeev, Christophe Leroy, Daniel Axtens, Greg Kurz,
Roman Bolshakov.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmDOeA0THG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgHZgD/9NbskRdhx9Vj+lWBCa8K37Cckf+aYu
bQxszcVDA65xwhASk9CotSy6NC1HgyxB7n3VO7FbCku50JNapT85/Onl07R/Aiiv
PhHOuDs5Gj8hB8rdpxYQjas3C2XW/UJR6ebogMcNxf4BN6fjHHoLbGmig1o+X2Jm
rd9l5dWiiK27o8McqCsTESW1VKVtjov7owX7xh/HW/U6hbDkuLdVyMSViaisIwi2
I1wfzmMWcN8JpBUv1G7pWFuKgatsTfr2p3bsdVFmPl3LjUXXcyJ8zQS5yoV6uD5F
laFx/BG1T06y1ny1yvEL3sTlNHE0tQPF+FVZ75hYmPKnE7tjo5rxeGFiul4RWo5q
oiSVDOrjt2urNeRVv10oSCUgs2epRosIaTqXJx1JyK/yYF2oI4FvqkTMBjPxeGot
ZHqFV8QNm19ZxlMtvzNSpagp6FX8kEOVxJaGersfmzBhSZudcR/VxX0086rWguw4
RA+0+qY6KGBi+qxylmyvzpJjsp59houykDNhhbED7tpQJACex7JMDJiiH88VP8l3
vf9h1z+NbBoiR3y0a/uv1nDTMLsTQ5PbcnNbZr9u+Oc5vwu8DP1gCq41lm6BOMbz
F6IxdcOqBHn3HGM11ZtAt5u6ep3ZDfPx8lMth1z3kaAXjm3nlDAJPC4N//p6vz+a
IzL1Iv0r2X7qvQ==
=Qrh9
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.13-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Fix initrd corruption caused by our recent change to use relative jump
labels.
Fix a crash using perf record on systems without a hardware PMU
backend.
Rework our 64-bit signal handling slighty to make it more closely
match the old behaviour, after the recent change to use unsafe user
accessors.
Thanks to Anastasia Kovaleva, Athira Rajeev, Christophe Leroy, Daniel
Axtens, Greg Kurz, and Roman Bolshakov"
* tag 'powerpc-5.13-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/perf: Fix crash in perf_instruction_pointer() when ppmu is not set
powerpc: Fix initrd corruption with relative jump labels
powerpc/signal64: Copy siginfo before changing regs->nip
powerpc/mem: Add back missing header to fix 'no previous prototype' error
* A build fix to always build modules with the medany code model, as
the module loader doesn't support medlow.
* A Kconfig warning fix for the SiFive errata.
* A pair of fixes that for regressions to the recent memory layout
changes.
* A fix for the FU740 device tree.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmDOEBITHHBhbG1lckBk
YWJiZWx0LmNvbQAKCRAuExnzX7sYiZhsD/90A+I0pI0MWsjHYbzpSIK/jWKQ6kgY
Y6HtzZrt31BX3Cq5IshOxsOlHyQkGiu8rHXH2kWPId3WX1bM0q2bSO3EoaTch5/s
OJ2KKxlp2L758VRaK41ec098QHxjs4Iv1YwnNmcKhCkYvYNV1sq/Vp2RgBXVIoS+
Jk7sWPN+5CCCcDjKuig+mwZILwHlMsLtm9w61bB4+GWCz3OHKHSIWmPkSfmQT7Sk
n/KvKrLAonVQxTuBI3syEywy5uHcBooSSNf8kMi8BGwbqmo6CI0CMO9dddeubizK
ehmColwQWnke3fR02Mt2ATurGeE0epZHTND1ZTyOSCLUSbJecPsAhmGkUkjPoFoH
wufhc+1+KHAT731/o9KQ4TMt0SQHIZHeMOxIBjnw9evg6pVx5iYistvKmfTvsf/j
YsRGdrm7//HSZ406Who2qkiTPRYnKJdKpN4wac/XZ/NXN65LL05e5UPxjrH8RShs
2PpIdyNdfBx7d9VlRkHnNNNF7EFC1S5lg7SkW6CXqOTbfVHRO4szH6OUd6QdD83Y
MqKRjT2VitnW+tFkX+iYOMEj4bAfFKlmVgjWXUMBJrgnDg6KiZEXCwNbynR/nZA+
8ssbJqQ2oDZsR/G4s+ePzrNz8Pd2R/9gfmnx1nY016p3JJ/dvX8pBzjQN/X9r8SS
3dKmB8XZ9i/LUA==
=HsNS
-----END PGP SIGNATURE-----
Merge tag 'riscv-for-linus-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A build fix to always build modules with the 'medany' code model, as
the module loader doesn't support 'medlow'.
- A Kconfig warning fix for the SiFive errata.
- A pair of fixes that for regressions to the recent memory layout
changes.
- A fix for the FU740 device tree.
* tag 'riscv-for-linus-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: dts: fu740: fix cache-controller interrupts
riscv: Ensure BPF_JIT_REGION_START aligned with PMD size
riscv: kasan: Fix MODULES_VADDR evaluation due to local variables' name
riscv: sifive: fix Kconfig errata warning
riscv32: Use medany C model for modules
pending requests are purged.
- Two fixes for the machine check handler in the entry code.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEE3QHqV+H2a8xAv27vjYWKoQLXFBgFAmDNvpgACgkQjYWKoQLX
FBiuvQf/VrYPeFQotoit8ZQm5EUUMHeryxp3DVOeHFilPdANbVh1Q++NxZvf2XSw
ejDUwTBuWaMmb2U8/7Yp+dbLYRu6XGKVKHLIArkXzsDzUNaJ+tMszAVKDK+SRHYe
zaBYMG6Lfn/ByoME1rqQZ6e/dx+4f0rzorR5A8IlslsSFFyTzt8CSLnrDcu6aj4i
bGSf8lS7dMoy49OYzuNYaPl2KcPEyrWMB//vuwj5jDru3eJcqW/m/q/7JRskTicm
YsRw0x2waDs2x+SOQJ+haSrNRGY9YKSVUISVgDPX0Yz4PF9noNq1iOCRfu5JcrcR
vlm3rJI8uFdZ8p4ZUMJHMS7T7Z5bRg==
=GRT1
-----END PGP SIGNATURE-----
Merge tag 's390-5.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- Fix zcrypt ioctl hang due to AP queue msg counter dropping below 0
when pending requests are purged.
- Two fixes for the machine check handler in the entry code.
* tag 's390-5.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/ap: Fix hanging ioctl caused by wrong msg counter
s390/mcck: fix invalid KVM guest condition check
s390/mcck: fix calculation of SIE critical section size
To pick the changes in:
3218274773 ("icmp: don't send out ICMP messages with a source address of 0.0.0.0")
That don't result in any change in tooling, as INADDR_ are not used to
generate id->string tables used by 'perf trace'.
This addresses this build warning:
Warning: Kernel ABI header at 'tools/include/uapi/linux/in.h' differs from latest version at 'include/uapi/linux/in.h'
diff -u tools/include/uapi/linux/in.h include/uapi/linux/in.h
Cc: David S. Miller <davem@davemloft.net>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick the changes in:
8b1462b67f ("quota: finish disable quotactl_path syscall")
Those headers are used in some arches to generate the syscall table used
in 'perf trace' to translate syscall numbers into strings.
This addresses this perf build warning:
Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/unistd.h' differs from latest version at 'include/uapi/asm-generic/unistd.h'
diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
Cc: Jan Kara <jack@suse.cz>
Cc: Marcin Juszkiewicz <marcin@juszkiewicz.com.pl>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick the changes in:
ea6932d70e ("net: make get_net_ns return error if NET_NS is disabled")
That don't result in any changes in the tables generated from that
header.
This silences this perf build warning:
Warning: Kernel ABI header at 'tools/perf/trace/beauty/include/linux/socket.h' differs from latest version at 'include/linux/socket.h'
diff -u tools/perf/trace/beauty/include/linux/socket.h include/linux/socket.h
Cc: Changbin Du <changbin.du@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
$(( .. )) is a bash feature but the test's interpreter is !/bin/sh,
switch the code to use expr.
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: bpf@vger.kernel.org
Link: http://lore.kernel.org/lkml/20210617184216.2075588-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
ASan reported a memory leak of BPF-related ksymbols map and dso. The
leak is caused by refount never reaching 0, due to missing __put calls
in the function machine__process_ksymbol_register.
Once the dso is inserted in the map, dso__put() should be called
(map__new2() increases the refcount to 2).
The same thing applies for the map when it's inserted into maps
(maps__insert() increases the refcount to 2).
$ sudo ./perf record -- sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.025 MB perf.data (8 samples) ]
=================================================================
==297735==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 6992 byte(s) in 19 object(s) allocated from:
#0 0x4f43c7 in calloc (/home/user/linux/tools/perf/perf+0x4f43c7)
#1 0x8e4e53 in map__new2 /home/user/linux/tools/perf/util/map.c:216:20
#2 0x8cf68c in machine__process_ksymbol_register /home/user/linux/tools/perf/util/machine.c:778:10
[...]
Indirect leak of 8702 byte(s) in 19 object(s) allocated from:
#0 0x4f43c7 in calloc (/home/user/linux/tools/perf/perf+0x4f43c7)
#1 0x8728d7 in dso__new_id /home/user/linux/tools/perf/util/dso.c:1256:20
#2 0x872015 in dso__new /home/user/linux/tools/perf/util/dso.c:1295:9
#3 0x8cf623 in machine__process_ksymbol_register /home/user/linux/tools/perf/util/machine.c:774:21
[...]
Indirect leak of 1520 byte(s) in 19 object(s) allocated from:
#0 0x4f43c7 in calloc (/home/user/linux/tools/perf/perf+0x4f43c7)
#1 0x87b3da in symbol__new /home/user/linux/tools/perf/util/symbol.c:269:23
#2 0x888954 in map__process_kallsym_symbol /home/user/linux/tools/perf/util/symbol.c:710:8
[...]
Indirect leak of 1406 byte(s) in 19 object(s) allocated from:
#0 0x4f43c7 in calloc (/home/user/linux/tools/perf/perf+0x4f43c7)
#1 0x87b3da in symbol__new /home/user/linux/tools/perf/util/symbol.c:269:23
#2 0x8cfbd8 in machine__process_ksymbol_register /home/user/linux/tools/perf/util/machine.c:803:8
[...]
Signed-off-by: Riccardo Mancini <rickyman7@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tommi Rantala <tommi.t.rantala@nokia.com>
Link: http://lore.kernel.org/lkml/20210612173751.188582-1-rickyman7@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The error code is not set at all in the sys event iter function.
This may lead to an uninitialized value of "ret" in
metricgroup__add_metric() when no CPU metric is added.
Fix by properly setting the error code.
It is not necessary to init "ret" to 0 in metricgroup__add_metric(), as
if we have no CPU or sys event metric matching, then "has_match" should
be 0 and "ret" is set to -EINVAL.
However gcc cannot detect that it may not have been set after the
map_for_each_metric() loop for CPU metrics, which is strange.
Fixes: be335ec28e ("perf metricgroup: Support adding metrics for system PMUs")
Signed-off-by: John Garry <john.garry@huawei.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/1623335580-187317-3-git-send-email-john.garry@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The following command segfaults on my x86 broadwell:
$ ./perf stat -M frontend_bound,retiring,backend_bound,bad_speculation sleep 1
WARNING: grouped events cpus do not match, disabling group:
anon group { raw 0x10e }
anon group { raw 0x10e }
perf: util/evsel.c:1596: get_group_fd: Assertion `!(!leader->core.fd)' failed.
Aborted (core dumped)
The issue shows itself as a use-after-free in evlist__check_cpu_maps(),
whereby the leader of an event selector (evsel) has been deleted (yet we
still attempt to verify for an evsel).
Fundamentally the problem comes from metricgroup__setup_events() ->
find_evsel_group(), and has developed from the previous fix attempt in
commit 9c880c24cb ("perf metricgroup: Fix for metrics containing
duration_time").
The problem now is that the logic in checking if an evsel is in the same
group is subtly broken for the "cycles" event. For the "cycles" event,
the pmu_name is NULL; however the logic in find_evsel_group() may set an
event matched against "cycles" as used, when it should not be.
This leads to a condition where an evsel is set, yet its leader is not.
Fix the check for evsel pmu_name by not matching evsels when either has a
NULL pmu_name.
There is still a pre-existing metric issue whereby the ordering of the
metrics may break the 'stat' function, as discussed at:
https://lore.kernel.org/lkml/49c6fccb-b716-1bf0-18a6-cace1cdb66b9@huawei.com/
Fixes: 9c880c24cb ("perf metricgroup: Fix for metrics containing duration_time")
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> # On a Thinkpad T450S
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/1623335580-187317-2-git-send-email-john.garry@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The order of interrupt numbers is incorrect.
The order for FU740 is: DirError, DataError, DataFail, DirFail
From SiFive FU740-C000 Manual:
19 - L2 Cache DirError
20 - L2 Cache DirFail
21 - L2 Cache DataError
22 - L2 Cache DataFail
Signed-off-by: David Abdurachmanov <david.abdurachmanov@sifive.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Andreas reported commit fc8504765e ("riscv: bpf: Avoid breaking W^X")
breaks booting with one kind of defconfig, I reproduced a kernel panic
with the defconfig:
[ 0.138553] Unable to handle kernel paging request at virtual address ffffffff81201220
[ 0.139159] Oops [#1]
[ 0.139303] Modules linked in:
[ 0.139601] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-rc5-default+ #1
[ 0.139934] Hardware name: riscv-virtio,qemu (DT)
[ 0.140193] epc : __memset+0xc4/0xfc
[ 0.140416] ra : skb_flow_dissector_init+0x1e/0x82
[ 0.140609] epc : ffffffff8029806c ra : ffffffff8033be78 sp : ffffffe001647da0
[ 0.140878] gp : ffffffff81134b08 tp : ffffffe001654380 t0 : ffffffff81201158
[ 0.141156] t1 : 0000000000000002 t2 : 0000000000000154 s0 : ffffffe001647dd0
[ 0.141424] s1 : ffffffff80a43250 a0 : ffffffff81201220 a1 : 0000000000000000
[ 0.141654] a2 : 000000000000003c a3 : ffffffff81201258 a4 : 0000000000000064
[ 0.141893] a5 : ffffffff8029806c a6 : 0000000000000040 a7 : ffffffffffffffff
[ 0.142126] s2 : ffffffff81201220 s3 : 0000000000000009 s4 : ffffffff81135088
[ 0.142353] s5 : ffffffff81135038 s6 : ffffffff8080ce80 s7 : ffffffff80800438
[ 0.142584] s8 : ffffffff80bc6578 s9 : 0000000000000008 s10: ffffffff806000ac
[ 0.142810] s11: 0000000000000000 t3 : fffffffffffffffc t4 : 0000000000000000
[ 0.143042] t5 : 0000000000000155 t6 : 00000000000003ff
[ 0.143220] status: 0000000000000120 badaddr: ffffffff81201220 cause: 000000000000000f
[ 0.143560] [<ffffffff8029806c>] __memset+0xc4/0xfc
[ 0.143859] [<ffffffff8061e984>] init_default_flow_dissectors+0x22/0x60
[ 0.144092] [<ffffffff800010fc>] do_one_initcall+0x3e/0x168
[ 0.144278] [<ffffffff80600df0>] kernel_init_freeable+0x1c8/0x224
[ 0.144479] [<ffffffff804868a8>] kernel_init+0x12/0x110
[ 0.144658] [<ffffffff800022de>] ret_from_exception+0x0/0xc
[ 0.145124] ---[ end trace f1e9643daa46d591 ]---
After some investigation, I think I found the root cause: commit
2bfc6cd81b ("move kernel mapping outside of linear mapping") moves
BPF JIT region after the kernel:
| #define BPF_JIT_REGION_START PFN_ALIGN((unsigned long)&_end)
The &_end is unlikely aligned with PMD size, so the front bpf jit
region sits with part of kernel .data section in one PMD size mapping.
But kernel is mapped in PMD SIZE, when bpf_jit_binary_lock_ro() is
called to make the first bpf jit prog ROX, we will make part of kernel
.data section RO too, so when we write to, for example memset the
.data section, MMU will trigger a store page fault.
To fix the issue, we need to ensure the BPF JIT region is PMD size
aligned. This patch acchieve this goal by restoring the BPF JIT region
to original position, I.E the 128MB before kernel .text section. The
modification to kasan_init.c is inspired by Alexandre.
Fixes: fc8504765e ("riscv: bpf: Avoid breaking W^X")
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
commit 2bfc6cd81b ("riscv: Move kernel mapping outside of linear
mapping") makes use of MODULES_VADDR to populate kernel, BPF, modules
mapping. Currently, MODULES_VADDR is defined as below for RV64:
| #define MODULES_VADDR (PFN_ALIGN((unsigned long)&_end) - SZ_2G)
But kasan_init() has two local variables which are also named as _start,
_end, so MODULES_VADDR is evaluated with the local variable _end
rather than the global "_end" as we expected. Fix this issue by
renaming the two local variables.
Fixes: 2bfc6cd81b ("riscv: Move kernel mapping outside of linear mapping")
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
bluetooth, netfilter and can.
Current release - regressions:
- mlxsw: spectrum_qdisc: Pass handle, not band number to find_class()
to fix modifying offloaded qdiscs
- lantiq: net: fix duplicated skb in rx descriptor ring
- rtnetlink: fix regression in bridge VLAN configuration, empty info
is not an error, bot-generated "fix" was not needed
- libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix
umem creation
Current release - new code bugs:
- ethtool: fix NULL pointer dereference during module EEPROM dump via
the new netlink API
- mlx5e: don't update netdev RQs with PTP-RQ, the special purpose queue
should not be visible to the stack
- mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs
- mlx5e: verify dev is present in get devlink port ndo, avoid a panic
Previous releases - regressions:
- neighbour: allow NUD_NOARP entries to be force GCed
- further fixes for fallout from reorg of WiFi locking
(staging: rtl8723bs, mac80211, cfg80211)
- skbuff: fix incorrect msg_zerocopy copy notifications
- mac80211: fix NULL ptr deref for injected rate info
- Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs
Previous releases - always broken:
- bpf: more speculative execution fixes
- netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local
- udp: fix race between close() and udp_abort() resulting in a panic
- fix out of bounds when parsing TCP options before packets
are validated (in netfilter: synproxy, tc: sch_cake and mptcp)
- mptcp: improve operation under memory pressure, add missing wake-ups
- mptcp: fix double-lock/soft lookup in subflow_error_report()
- bridge: fix races (null pointer deref and UAF) in vlan tunnel egress
- ena: fix DMA mapping function issues in XDP
- rds: fix memory leak in rds_recvmsg
Misc:
- vrf: allow larger MTUs
- icmp: don't send out ICMP messages with a source address of 0.0.0.0
- cdc_ncm: switch to eth%d interface naming
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmDNP7EACgkQMUZtbf5S
IrvTmxAAgOAM9MdRl9wnYtqXKPXJ1JJtenozwt1yX6b6OG+Ns7cm6YYafU3KoZWR
KlzpvP90vRrER3RqksbMngHzvGjZKDS4LWRur7sRlJ1TBQoLrQCIbriAh07d7wlU
0nnS4J8mczTCKx78QCUYy1QBIX5TQrUbx0JQZDPoIPBjFeILW+Gx/Ghg5tUR4mhf
6icYqwIPocTXO37ZmWOzezZNVOXJF4kaQUZeuOHNe5hOtm6EeIpZbW1Xx3DIr5bd
80a/uNU7nVyos0n7jxnfVE/oelTnYbT5scZeV/PPVqZ4U113f7uex2QP23/XhGSX
lK1EhwPqPOyaNhQoihLM6Xzd4o7aZOcmF8NY96xqjC+DqdN+juvfJU+ClCZojGIj
H4bwCSaj3y2PiimfQdBiIKvYMc5d4zBdw/Dpk/gLDp4d5N638TAtuunK4Mj+TEuT
QF1qkBLIB4HFtLS0M35/twk93md/5GUdSTij2GB3fOkAWRu2m266P5m+4DigW/TB
Xm8FgKdetvxVP0Qv/p49nPEn24Ny8wCafH1x1wVTmoda2qi6j1EXMuSa0PlCdz70
Sl5FrlxdEkOpC4p+Aoc8APSoBXnOriAlpU+z/EVb8Co4JR/+Ge5zBWpsiZDVD0/K
Ay0FW3I87iyn9tw1H1Fzr9GBlVl5vWRauZFHjzl90fWakCrCzJE=
=xxUe
-----END PGP SIGNATURE-----
Merge tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.13-rc7, including fixes from wireless, bpf,
bluetooth, netfilter and can.
Current release - regressions:
- mlxsw: spectrum_qdisc: Pass handle, not band number to find_class()
to fix modifying offloaded qdiscs
- lantiq: net: fix duplicated skb in rx descriptor ring
- rtnetlink: fix regression in bridge VLAN configuration, empty info
is not an error, bot-generated "fix" was not needed
- libbpf: s/rx/tx/ typo on umem->rx_ring_setup_done to fix umem
creation
Current release - new code bugs:
- ethtool: fix NULL pointer dereference during module EEPROM dump via
the new netlink API
- mlx5e: don't update netdev RQs with PTP-RQ, the special purpose
queue should not be visible to the stack
- mlx5e: select special PTP queue only for SKBTX_HW_TSTAMP skbs
- mlx5e: verify dev is present in get devlink port ndo, avoid a panic
Previous releases - regressions:
- neighbour: allow NUD_NOARP entries to be force GCed
- further fixes for fallout from reorg of WiFi locking (staging:
rtl8723bs, mac80211, cfg80211)
- skbuff: fix incorrect msg_zerocopy copy notifications
- mac80211: fix NULL ptr deref for injected rate info
- Revert "net/mlx5: Arm only EQs with EQEs" it may cause missed IRQs
Previous releases - always broken:
- bpf: more speculative execution fixes
- netfilter: nft_fib_ipv6: skip ipv6 packets from any to link-local
- udp: fix race between close() and udp_abort() resulting in a panic
- fix out of bounds when parsing TCP options before packets are
validated (in netfilter: synproxy, tc: sch_cake and mptcp)
- mptcp: improve operation under memory pressure, add missing
wake-ups
- mptcp: fix double-lock/soft lookup in subflow_error_report()
- bridge: fix races (null pointer deref and UAF) in vlan tunnel
egress
- ena: fix DMA mapping function issues in XDP
- rds: fix memory leak in rds_recvmsg
Misc:
- vrf: allow larger MTUs
- icmp: don't send out ICMP messages with a source address of 0.0.0.0
- cdc_ncm: switch to eth%d interface naming"
* tag 'net-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (139 commits)
net: ethernet: fix potential use-after-free in ec_bhf_remove
selftests/net: Add icmp.sh for testing ICMP dummy address responses
icmp: don't send out ICMP messages with a source address of 0.0.0.0
net: ll_temac: Avoid ndo_start_xmit returning NETDEV_TX_BUSY
net: ll_temac: Fix TX BD buffer overwrite
net: ll_temac: Add memory-barriers for TX BD access
net: ll_temac: Make sure to free skb when it is completely used
MAINTAINERS: add Guvenc as SMC maintainer
bnxt_en: Call bnxt_ethtool_free() in bnxt_init_one() error path
bnxt_en: Fix TQM fastpath ring backing store computation
bnxt_en: Rediscover PHY capabilities after firmware reset
cxgb4: fix wrong shift.
mac80211: handle various extensible elements correctly
mac80211: reset profile_periodicity/ema_ap
cfg80211: avoid double free of PMSR request
cfg80211: make certificate generation more robust
mac80211: minstrel_ht: fix sample time check
net: qed: Fix memcpy() overflow of qed_dcbx_params()
net: cdc_eem: fix tx fixup skb leak
net: hamradio: fix memory leak in mkiss_close
...