License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 17:07:57 +03:00
|
|
|
/* SPDX-License-Identifier: GPL-2.0 */
|
2013-06-05 17:21:07 +04:00
|
|
|
#ifndef _BCACHE_WRITEBACK_H
|
|
|
|
#define _BCACHE_WRITEBACK_H
|
|
|
|
|
2013-06-05 17:24:39 +04:00
|
|
|
#define CUTOFF_WRITEBACK 40
|
|
|
|
#define CUTOFF_WRITEBACK_SYNC 70
|
|
|
|
|
2018-12-13 17:53:55 +03:00
|
|
|
#define CUTOFF_WRITEBACK_MAX 70
|
|
|
|
#define CUTOFF_WRITEBACK_SYNC_MAX 90
|
|
|
|
|
2018-01-08 23:21:22 +03:00
|
|
|
#define MAX_WRITEBACKS_IN_PASS 5
|
|
|
|
#define MAX_WRITESIZE_IN_PASS 5000 /* *512b */
|
|
|
|
|
2018-02-07 22:41:44 +03:00
|
|
|
#define WRITEBACK_RATE_UPDATE_SECS_MAX 60
|
|
|
|
#define WRITEBACK_RATE_UPDATE_SECS_DEFAULT 5
|
|
|
|
|
2018-12-13 17:53:53 +03:00
|
|
|
#define BCH_AUTO_GC_DIRTY_THRESHOLD 50
|
|
|
|
|
bcache: consider the fragmentation when update the writeback rate
Current way to calculate the writeback rate only considered the
dirty sectors, this usually works fine when the fragmentation
is not high, but it will give us unreasonable small rate when
we are under a situation that very few dirty sectors consumed
a lot dirty buckets. In some case, the dirty bucekts can reached
to CUTOFF_WRITEBACK_SYNC while the dirty data(sectors) not even
reached the writeback_percent, the writeback rate will still
be the minimum value (4k), thus it will cause all the writes to be
stucked in a non-writeback mode because of the slow writeback.
We accelerate the rate in 3 stages with different aggressiveness,
the first stage starts when dirty buckets percent reach above
BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second is
BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), the third is
BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default
the first stage tries to writeback the amount of dirty data
in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) second,
the second stage tries to writeback the amount of dirty data in one bucket
in (1 / (dirty_buckets_percent - 57)) * 100 millisecond, the third
stage tries to writeback the amount of dirty data in one bucket in
(1 / (dirty_buckets_percent - 64)) millisecond.
the initial rate at each stage can be controlled by 3 configurable
parameters writeback_rate_fp_term_{low|mid|high}, they are by default
1, 10, 1000, the hint of IO throughput that these values are trying
to achieve is described by above paragraph, the reason that
I choose those value as default is based on the testing and the
production data, below is some details:
A. When it comes to the low stage, there is still a bit far from the 70
threshold, so we only want to give it a little bit push by setting the
term to 1, it means the initial rate will be 170 if the fragment is 6,
it is calculated by bucket_size/fragment, this rate is very small,
but still much reasonable than the minimum 8.
For a production bcache with unheavy workload, if the cache device
is bigger than 1 TB, it may take hours to consume 1% buckets,
so it is very possible to reclaim enough dirty buckets in this stage,
thus to avoid entering the next stage.
B. If the dirty buckets ratio didn't turn around during the first stage,
it comes to the mid stage, then it is necessary for mid stage
to be more aggressive than low stage, so i choose the initial rate
to be 10 times more than low stage, that means 1700 as the initial
rate if the fragment is 6. This is some normal rate
we usually see for a normal workload when writeback happens
because of writeback_percent.
C. If the dirty buckets ratio didn't turn around during the low and mid
stages, it comes to the third stage, and it is the last chance that
we can turn around to avoid the horrible cutoff writeback sync issue,
then we choose 100 times more aggressive than the mid stage, that
means 170000 as the initial rate if the fragment is 6. This is also
inferred from a production bcache, I've got one week's writeback rate
data from a production bcache which has quite heavy workloads,
again, the writeback is triggered by the writeback percent,
the highest rate area is around 100000 to 240000, so I believe this
kind aggressiveness at this stage is reasonable for production.
And it should be mostly enough because the hint is trying to reclaim
1000 bucket per second, and from that heavy production env,
it is consuming 50 bucket per second on average in one week's data.
Option writeback_consider_fragment is to control whether we want
this feature to be on or off, it's on by default.
Lastly, below is the performance data for all the testing result,
including the data from production env:
https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing
Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-10 08:07:23 +03:00
|
|
|
#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW 50
|
|
|
|
#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID 57
|
|
|
|
#define BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH 64
|
|
|
|
|
bcache: make bch_sectors_dirty_init() to be multithreaded
When attaching a cached device (a.k.a backing device) to a cache
device, bch_sectors_dirty_init() is called to count dirty sectors
and stripes (see what bcache_dev_sectors_dirty_add() does) on the
cache device.
The counting is done by a single thread recursive function
bch_btree_map_keys() to iterate all the bcache btree nodes.
If the btree has huge number of nodes, bch_sectors_dirty_init() will
take quite long time. In my testing, if the registering cache set has
a existed UUID which matches a already registered cached device, the
automatical attachment during the registration may take more than
55 minutes. This is too long for waiting the bcache to work in real
deployment.
Fortunately when bch_sectors_dirty_init() is called, no other thread
will access the btree yet, it is safe to do a read-only parallelized
dirty sectors counting by multiple threads.
This patch tries to create multiple threads, and each thread tries to
one-by-one count dirty sectors from the sub-tree indexed by a root
node key which the thread fetched. After the sub-tree is counted, the
counting thread will continue to fetch another root node key, until
the fetched key is NULL. How many threads in parallel depends on
the number of keys from the btree root node, and the number of online
CPU core. The thread number will be the less number but no more than
BCH_DIRTY_INIT_THRD_MAX. If there are only 2 keys in root node, it
can only be 2x times faster by this patch. But if there are 10 keys
in the root node, with this patch it can be 10x times faster.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-22 09:03:02 +03:00
|
|
|
#define BCH_DIRTY_INIT_THRD_MAX 64
|
2018-01-08 23:21:30 +03:00
|
|
|
/*
|
|
|
|
* 14 (16384ths) is chosen here as something that each backing device
|
|
|
|
* should be a reasonable fraction of the share, and not to blow up
|
|
|
|
* until individual backing devices are a petabyte.
|
|
|
|
*/
|
|
|
|
#define WRITEBACK_SHARE_SHIFT 14
|
|
|
|
|
bcache: make bch_sectors_dirty_init() to be multithreaded
When attaching a cached device (a.k.a backing device) to a cache
device, bch_sectors_dirty_init() is called to count dirty sectors
and stripes (see what bcache_dev_sectors_dirty_add() does) on the
cache device.
The counting is done by a single thread recursive function
bch_btree_map_keys() to iterate all the bcache btree nodes.
If the btree has huge number of nodes, bch_sectors_dirty_init() will
take quite long time. In my testing, if the registering cache set has
a existed UUID which matches a already registered cached device, the
automatical attachment during the registration may take more than
55 minutes. This is too long for waiting the bcache to work in real
deployment.
Fortunately when bch_sectors_dirty_init() is called, no other thread
will access the btree yet, it is safe to do a read-only parallelized
dirty sectors counting by multiple threads.
This patch tries to create multiple threads, and each thread tries to
one-by-one count dirty sectors from the sub-tree indexed by a root
node key which the thread fetched. After the sub-tree is counted, the
counting thread will continue to fetch another root node key, until
the fetched key is NULL. How many threads in parallel depends on
the number of keys from the btree root node, and the number of online
CPU core. The thread number will be the less number but no more than
BCH_DIRTY_INIT_THRD_MAX. If there are only 2 keys in root node, it
can only be 2x times faster by this patch. But if there are 10 keys
in the root node, with this patch it can be 10x times faster.
Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-22 09:03:02 +03:00
|
|
|
struct bch_dirty_init_state;
|
|
|
|
struct dirty_init_thrd_info {
|
|
|
|
struct bch_dirty_init_state *state;
|
|
|
|
struct task_struct *thread;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct bch_dirty_init_state {
|
|
|
|
struct cache_set *c;
|
|
|
|
struct bcache_device *d;
|
|
|
|
int total_threads;
|
|
|
|
int key_idx;
|
|
|
|
spinlock_t idx_lock;
|
|
|
|
atomic_t started;
|
|
|
|
atomic_t enough;
|
|
|
|
wait_queue_head_t wait;
|
|
|
|
struct dirty_init_thrd_info infos[BCH_DIRTY_INIT_THRD_MAX];
|
|
|
|
};
|
|
|
|
|
2013-06-05 17:21:07 +04:00
|
|
|
static inline uint64_t bcache_dev_sectors_dirty(struct bcache_device *d)
|
|
|
|
{
|
|
|
|
uint64_t i, ret = 0;
|
|
|
|
|
|
|
|
for (i = 0; i < d->nr_stripes; i++)
|
|
|
|
ret += atomic_read(d->stripe_sectors_dirty + i);
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
bcache: fix overflow in offset_to_stripe()
offset_to_stripe() returns the stripe number (in type unsigned int) from
an offset (in type uint64_t) by the following calculation,
do_div(offset, d->stripe_size);
For large capacity backing device (e.g. 18TB) with small stripe size
(e.g. 4KB), the result is 4831838208 and exceeds UINT_MAX. The actual
returned value which caller receives is 536870912, due to the overflow.
Indeed in bcache_device_init(), bcache_device->nr_stripes is limited in
range [1, INT_MAX]. Therefore all valid stripe numbers in bcache are
in range [0, bcache_dev->nr_stripes - 1].
This patch adds a upper limition check in offset_to_stripe(): the max
valid stripe number should be less than bcache_device->nr_stripes. If
the calculated stripe number from do_div() is equal to or larger than
bcache_device->nr_stripe, -EINVAL will be returned. (Normally nr_stripes
is less than INT_MAX, exceeding upper limitation doesn't mean overflow,
therefore -EOVERFLOW is not used as error code.)
This patch also changes nr_stripes' type of struct bcache_device from
'unsigned int' to 'int', and return value type of offset_to_stripe()
from 'unsigned int' to 'int', to match their exact data ranges.
All locations where bcache_device->nr_stripes and offset_to_stripe() are
referenced also get updated for the above type change.
Reported-and-tested-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 15:00:22 +03:00
|
|
|
static inline int offset_to_stripe(struct bcache_device *d,
|
2013-11-01 02:43:22 +04:00
|
|
|
uint64_t offset)
|
|
|
|
{
|
|
|
|
do_div(offset, d->stripe_size);
|
bcache: fix overflow in offset_to_stripe()
offset_to_stripe() returns the stripe number (in type unsigned int) from
an offset (in type uint64_t) by the following calculation,
do_div(offset, d->stripe_size);
For large capacity backing device (e.g. 18TB) with small stripe size
(e.g. 4KB), the result is 4831838208 and exceeds UINT_MAX. The actual
returned value which caller receives is 536870912, due to the overflow.
Indeed in bcache_device_init(), bcache_device->nr_stripes is limited in
range [1, INT_MAX]. Therefore all valid stripe numbers in bcache are
in range [0, bcache_dev->nr_stripes - 1].
This patch adds a upper limition check in offset_to_stripe(): the max
valid stripe number should be less than bcache_device->nr_stripes. If
the calculated stripe number from do_div() is equal to or larger than
bcache_device->nr_stripe, -EINVAL will be returned. (Normally nr_stripes
is less than INT_MAX, exceeding upper limitation doesn't mean overflow,
therefore -EOVERFLOW is not used as error code.)
This patch also changes nr_stripes' type of struct bcache_device from
'unsigned int' to 'int', and return value type of offset_to_stripe()
from 'unsigned int' to 'int', to match their exact data ranges.
All locations where bcache_device->nr_stripes and offset_to_stripe() are
referenced also get updated for the above type change.
Reported-and-tested-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 15:00:22 +03:00
|
|
|
|
|
|
|
/* d->nr_stripes is in range [1, INT_MAX] */
|
|
|
|
if (unlikely(offset >= d->nr_stripes)) {
|
|
|
|
pr_err("Invalid stripe %llu (>= nr_stripes %d).\n",
|
|
|
|
offset, d->nr_stripes);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Here offset is definitly smaller than INT_MAX,
|
|
|
|
* return it as int will never overflow.
|
|
|
|
*/
|
2013-11-01 02:43:22 +04:00
|
|
|
return offset;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline bool bcache_dev_stripe_dirty(struct cached_dev *dc,
|
2013-06-05 17:24:39 +04:00
|
|
|
uint64_t offset,
|
2018-08-11 08:19:44 +03:00
|
|
|
unsigned int nr_sectors)
|
2013-06-05 17:24:39 +04:00
|
|
|
{
|
bcache: fix overflow in offset_to_stripe()
offset_to_stripe() returns the stripe number (in type unsigned int) from
an offset (in type uint64_t) by the following calculation,
do_div(offset, d->stripe_size);
For large capacity backing device (e.g. 18TB) with small stripe size
(e.g. 4KB), the result is 4831838208 and exceeds UINT_MAX. The actual
returned value which caller receives is 536870912, due to the overflow.
Indeed in bcache_device_init(), bcache_device->nr_stripes is limited in
range [1, INT_MAX]. Therefore all valid stripe numbers in bcache are
in range [0, bcache_dev->nr_stripes - 1].
This patch adds a upper limition check in offset_to_stripe(): the max
valid stripe number should be less than bcache_device->nr_stripes. If
the calculated stripe number from do_div() is equal to or larger than
bcache_device->nr_stripe, -EINVAL will be returned. (Normally nr_stripes
is less than INT_MAX, exceeding upper limitation doesn't mean overflow,
therefore -EOVERFLOW is not used as error code.)
This patch also changes nr_stripes' type of struct bcache_device from
'unsigned int' to 'int', and return value type of offset_to_stripe()
from 'unsigned int' to 'int', to match their exact data ranges.
All locations where bcache_device->nr_stripes and offset_to_stripe() are
referenced also get updated for the above type change.
Reported-and-tested-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1783075
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-25 15:00:22 +03:00
|
|
|
int stripe = offset_to_stripe(&dc->disk, offset);
|
|
|
|
|
|
|
|
if (stripe < 0)
|
|
|
|
return false;
|
2013-06-05 17:24:39 +04:00
|
|
|
|
|
|
|
while (1) {
|
2013-11-01 02:43:22 +04:00
|
|
|
if (atomic_read(dc->disk.stripe_sectors_dirty + stripe))
|
2013-06-05 17:24:39 +04:00
|
|
|
return true;
|
|
|
|
|
2013-11-01 02:43:22 +04:00
|
|
|
if (nr_sectors <= dc->disk.stripe_size)
|
2013-06-05 17:24:39 +04:00
|
|
|
return false;
|
|
|
|
|
2013-11-01 02:43:22 +04:00
|
|
|
nr_sectors -= dc->disk.stripe_size;
|
2013-06-05 17:24:39 +04:00
|
|
|
stripe++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-12-13 17:53:55 +03:00
|
|
|
extern unsigned int bch_cutoff_writeback;
|
|
|
|
extern unsigned int bch_cutoff_writeback_sync;
|
|
|
|
|
2013-06-05 17:24:39 +04:00
|
|
|
static inline bool should_writeback(struct cached_dev *dc, struct bio *bio,
|
2018-08-11 08:19:44 +03:00
|
|
|
unsigned int cache_mode, bool would_skip)
|
2013-06-05 17:24:39 +04:00
|
|
|
{
|
2018-08-11 08:19:44 +03:00
|
|
|
unsigned int in_use = dc->disk.c->gc_stats.in_use;
|
2013-06-05 17:24:39 +04:00
|
|
|
|
|
|
|
if (cache_mode != CACHE_MODE_WRITEBACK ||
|
2013-08-22 04:49:09 +04:00
|
|
|
test_bit(BCACHE_DEV_DETACHING, &dc->disk.flags) ||
|
2018-12-13 17:53:55 +03:00
|
|
|
in_use > bch_cutoff_writeback_sync)
|
2013-06-05 17:24:39 +04:00
|
|
|
return false;
|
|
|
|
|
bcache: never writeback a discard operation
Some users see panics like the following when performing fstrim on a
bcached volume:
[ 529.803060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 530.183928] #PF error: [normal kernel read fault]
[ 530.412392] PGD 8000001f42163067 P4D 8000001f42163067 PUD 1f42168067 PMD 0
[ 530.750887] Oops: 0000 [#1] SMP PTI
[ 530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3
[ 531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
[ 531.693137] RIP: 0010:blk_queue_split+0x148/0x620
[ 531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48
8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3
[ 532.838634] RSP: 0018:ffffb9b708df39b0 EFLAGS: 00010246
[ 533.093571] RAX: 00000000ffffffff RBX: 0000000000046000 RCX: 0000000000000000
[ 533.441865] RDX: 0000000000000200 RSI: 0000000000000000 RDI: 0000000000000000
[ 533.789922] RBP: ffffb9b708df3a48 R08: ffff940d3b3fdd20 R09: 0000000000000000
[ 534.137512] R10: ffffb9b708df3958 R11: 0000000000000000 R12: 0000000000000000
[ 534.485329] R13: 0000000000000000 R14: 0000000000000000 R15: ffff940d39212020
[ 534.833319] FS: 00007efec26e3840(0000) GS:ffff940d1f480000(0000) knlGS:0000000000000000
[ 535.224098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 535.504318] CR2: 0000000000000008 CR3: 0000001f4e256004 CR4: 00000000001606e0
[ 535.851759] Call Trace:
[ 535.970308] ? mempool_alloc_slab+0x15/0x20
[ 536.174152] ? bch_data_insert+0x42/0xd0 [bcache]
[ 536.403399] blk_mq_make_request+0x97/0x4f0
[ 536.607036] generic_make_request+0x1e2/0x410
[ 536.819164] submit_bio+0x73/0x150
[ 536.980168] ? submit_bio+0x73/0x150
[ 537.149731] ? bio_associate_blkg_from_css+0x3b/0x60
[ 537.391595] ? _cond_resched+0x1a/0x50
[ 537.573774] submit_bio_wait+0x59/0x90
[ 537.756105] blkdev_issue_discard+0x80/0xd0
[ 537.959590] ext4_trim_fs+0x4a9/0x9e0
[ 538.137636] ? ext4_trim_fs+0x4a9/0x9e0
[ 538.324087] ext4_ioctl+0xea4/0x1530
[ 538.497712] ? _copy_to_user+0x2a/0x40
[ 538.679632] do_vfs_ioctl+0xa6/0x600
[ 538.853127] ? __do_sys_newfstat+0x44/0x70
[ 539.051951] ksys_ioctl+0x6d/0x80
[ 539.212785] __x64_sys_ioctl+0x1a/0x20
[ 539.394918] do_syscall_64+0x5a/0x110
[ 539.568674] entry_SYSCALL_64_after_hwframe+0x44/0xa9
We have observed it where both:
1) LVM/devmapper is involved (bcache backing device is LVM volume) and
2) writeback cache is involved (bcache cache_mode is writeback)
On one machine, we can reliably reproduce it with:
# echo writeback > /sys/block/bcache0/bcache/cache_mode
(not sure whether above line is required)
# mount /dev/bcache0 /test
# for i in {0..10}; do
file="$(mktemp /test/zero.XXX)"
dd if=/dev/zero of="$file" bs=1M count=256
sync
rm $file
done
# fstrim -v /test
Observing this with tracepoints on, we see the following writes:
fstrim-18019 [022] .... 91107.302026: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 4260112 + 196352 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302050: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 4456464 + 262144 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302075: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 4718608 + 81920 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302094: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 5324816 + 180224 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302121: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 5505040 + 262144 hit 0 bypass 1
fstrim-18019 [022] .... 91107.302145: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 5767184 + 81920 hit 0 bypass 1
fstrim-18019 [022] .... 91107.308777: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 6373392 + 180224 hit 1 bypass 0
<crash>
Note the final one has different hit/bypass flags.
This is because in should_writeback(), we were hitting a case where
the partial stripe condition was returning true and so
should_writeback() was returning true early.
If that hadn't been the case, it would have hit the would_skip test, and
as would_skip == s->iop.bypass == true, should_writeback() would have
returned false.
Looking at the git history from 'commit 72c270612bd3 ("bcache: Write out
full stripes")', it looks like the idea was to optimise for raid5/6:
* If a stripe is already dirty, force writes to that stripe to
writeback mode - to help build up full stripes of dirty data
To fix this issue, make sure that should_writeback() on a discard op
never returns true.
More details of debugging:
https://www.spinics.net/lists/linux-bcache/msg06996.html
Previous reports:
- https://bugzilla.kernel.org/show_bug.cgi?id=201051
- https://bugzilla.kernel.org/show_bug.cgi?id=196103
- https://www.spinics.net/lists/linux-bcache/msg06885.html
(Coly Li: minor modification to follow maximum 75 chars per line rule)
Cc: Kent Overstreet <koverstreet@google.com>
Cc: stable@vger.kernel.org
Fixes: 72c270612bd3 ("bcache: Write out full stripes")
Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 07:52:53 +03:00
|
|
|
if (bio_op(bio) == REQ_OP_DISCARD)
|
|
|
|
return false;
|
|
|
|
|
2013-06-05 17:24:39 +04:00
|
|
|
if (dc->partial_stripes_expensive &&
|
2013-10-12 02:44:27 +04:00
|
|
|
bcache_dev_stripe_dirty(dc, bio->bi_iter.bi_sector,
|
2013-06-05 17:24:39 +04:00
|
|
|
bio_sectors(bio)))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
if (would_skip)
|
|
|
|
return false;
|
|
|
|
|
2017-10-14 02:35:33 +03:00
|
|
|
return (op_is_sync(bio->bi_opf) ||
|
|
|
|
bio->bi_opf & (REQ_META|REQ_PRIO) ||
|
2018-12-13 17:53:55 +03:00
|
|
|
in_use <= bch_cutoff_writeback);
|
2013-06-05 17:24:39 +04:00
|
|
|
}
|
|
|
|
|
2013-07-25 04:50:06 +04:00
|
|
|
static inline void bch_writeback_queue(struct cached_dev *dc)
|
|
|
|
{
|
2015-11-30 05:44:49 +03:00
|
|
|
if (!IS_ERR_OR_NULL(dc->writeback_thread))
|
|
|
|
wake_up_process(dc->writeback_thread);
|
2013-07-25 04:50:06 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void bch_writeback_add(struct cached_dev *dc)
|
|
|
|
{
|
|
|
|
if (!atomic_read(&dc->has_dirty) &&
|
|
|
|
!atomic_xchg(&dc->has_dirty, 1)) {
|
|
|
|
if (BDEV_STATE(&dc->sb) != BDEV_STATE_DIRTY) {
|
|
|
|
SET_BDEV_STATE(&dc->sb, BDEV_STATE_DIRTY);
|
|
|
|
/* XXX: should do this synchronously */
|
|
|
|
bch_write_bdev_super(dc, NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
bch_writeback_queue(dc);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-08-11 08:19:46 +03:00
|
|
|
void bcache_dev_sectors_dirty_add(struct cache_set *c, unsigned int inode,
|
|
|
|
uint64_t offset, int nr_sectors);
|
2013-06-05 17:21:07 +04:00
|
|
|
|
2018-08-11 08:19:46 +03:00
|
|
|
void bch_sectors_dirty_init(struct bcache_device *d);
|
|
|
|
void bch_cached_dev_writeback_init(struct cached_dev *dc);
|
|
|
|
int bch_cached_dev_writeback_start(struct cached_dev *dc);
|
2013-06-05 17:21:07 +04:00
|
|
|
|
|
|
|
#endif
|