The attribute_group structs are never modified, they're only passed to
sysfs_create_group() and sysfs_remove_group(). Make them const to allow
the compiler to put them in read-only memory.
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: Song Liu <song@kernel.org>
NULL pointer dereference was observed in super_written() when it tries
to access the mddev structure.
[The below stack trace is from an older kernel, but the problem described
in this patch applies to the mainline kernel.]
[ 1194.474861] task: ffff8fdd20858000 task.stack: ffffb99d40790000
[ 1194.488000] RIP: 0010:super_written+0x29/0xe1
[ 1194.499688] RSP: 0018:ffff8ffb7fcc3c78 EFLAGS: 00010046
[ 1194.512477] RAX: 0000000000000000 RBX: ffff8ffb7bf4a000 RCX: ffff8ffb78991048
[ 1194.527325] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ffb56b8a200
[ 1194.542576] RBP: ffff8ffb7fcc3c90 R08: 000000000000000b R09: 0000000000000000
[ 1194.558001] R10: ffff8ffb56b8a298 R11: 0000000000000000 R12: ffff8ffb56b8a200
[ 1194.573070] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1194.588117] FS: 0000000000000000(0000) GS:ffff8ffb7fcc0000(0000) knlGS:0000000000000000
[ 1194.604264] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1194.617375] CR2: 00000000000002b8 CR3: 00000021e040a002 CR4: 00000000007606e0
[ 1194.632327] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1194.647865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1194.663316] PKRU: 55555554
[ 1194.674090] Call Trace:
[ 1194.683735] <IRQ>
[ 1194.692948] bio_endio+0xae/0x135
[ 1194.703580] blk_update_request+0xad/0x2fa
[ 1194.714990] blk_update_bidi_request+0x20/0x72
[ 1194.726578] __blk_end_bidi_request+0x2c/0x4d
[ 1194.738373] __blk_end_request_all+0x31/0x49
[ 1194.749344] blk_flush_complete_seq+0x377/0x383
[ 1194.761550] flush_end_io+0x1dd/0x2a7
[ 1194.772910] blk_finish_request+0x9f/0x13c
[ 1194.784544] scsi_end_request+0x180/0x25c
[ 1194.796149] scsi_io_completion+0xc8/0x610
[ 1194.807503] scsi_finish_command+0xdc/0x125
[ 1194.818897] scsi_softirq_done+0x81/0xde
[ 1194.830062] blk_done_softirq+0xa4/0xcc
[ 1194.841008] __do_softirq+0xd9/0x29f
[ 1194.851257] irq_exit+0xe6/0xeb
[ 1194.861290] do_IRQ+0x59/0xe3
[ 1194.871060] common_interrupt+0x1c6/0x382
[ 1194.881988] </IRQ>
[ 1194.890646] RIP: 0010:cpuidle_enter_state+0xdd/0x2a5
[ 1194.902532] RSP: 0018:ffffb99d40793e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff43
[ 1194.917317] RAX: ffff8ffb7fce27c0 RBX: ffff8ffb7fced800 RCX: 000000000000001f
[ 1194.932056] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000000
[ 1194.946428] RBP: ffffb99d40793ea0 R08: 0000000000000004 R09: 0000000000002ed2
[ 1194.960508] R10: 0000000000002664 R11: 0000000000000018 R12: 0000000000000003
[ 1194.974454] R13: 000000000000000b R14: ffffffff925715a0 R15: 0000011610120d5a
[ 1194.988607] ? cpuidle_enter_state+0xcc/0x2a5
[ 1194.999077] cpuidle_enter+0x17/0x19
[ 1195.008395] call_cpuidle+0x23/0x3a
[ 1195.017718] do_idle+0x172/0x1d5
[ 1195.026358] cpu_startup_entry+0x73/0x75
[ 1195.035769] start_secondary+0x1b9/0x20b
[ 1195.044894] secondary_startup_64+0xa5/0xa5
[ 1195.084921] RIP: super_written+0x29/0xe1 RSP: ffff8ffb7fcc3c78
[ 1195.096354] CR2: 00000000000002b8
bio in the above stack is a bitmap write whose completion is invoked after
the tear down sequence sets the mddev structure to NULL in rdev.
During tear down, there is an attempt to flush the bitmap writes, but for
external bitmaps, there is no explicit wait for all the bitmap writes to
complete. For instance, md_bitmap_flush() is called to flush the bitmap
writes, but the last call to md_bitmap_daemon_work() in md_bitmap_flush()
could generate new bitmap writes for which there is no explicit wait to
complete those writes. The call to md_bitmap_update_sb() will return
simply for external bitmaps and the follow-up call to md_update_sb() is
conditional and may not get called for external bitmaps. This results in a
kernel panic when the completion routine, super_written() is called which
tries to reference mddev in the rdev that has been set to
NULL(in unbind_rdev_from_array() by tear down sequence).
The solution is to call md_super_wait() for external bitmaps after the
last call to md_bitmap_daemon_work() in md_bitmap_flush() to ensure there
are no pending bitmap writes before proceeding with the tear down.
Cc: stable@vger.kernel.org
Signed-off-by: Sudhakar Panneerselvam <sudhakar.panneerselvam@oracle.com>
Reviewed-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <song@kernel.org>
Callers of get_bitmap_from_slot() are responsible to free the bitmap.
Suggested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
md_bitmap_get_counter() has code:
```
if (bitmap->bp[page].hijacked ||
bitmap->bp[page].map == NULL)
csize = ((sector_t)1) << (bitmap->chunkshift +
PAGE_COUNTER_SHIFT - 1);
```
The minus 1 is wrong, this branch should report 2048 bits of space.
With "-1" action, this only report 1024 bit of space.
This bug code returns wrong blocks, but it doesn't inflence bitmap logic:
1. Most callers focus this function return value (the counter of offset),
not the parameter blocks.
2. The bug is only triggered when hijacked is true or map is NULL.
the hijacked true condition is very rare.
the "map == null" only true when array is creating or resizing.
3. Even the caller gets wrong blocks, current code makes caller just to
call md_bitmap_get_counter() one more time.
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
The patched code is used to get chunks number, should use round-up div
to replace current sector_div. The same code is in md_bitmap_resize():
```
chunks = DIV_ROUND_UP_SECTOR_T(blocks, 1 << chunkshift);
```
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
We alreday has the interface i_blocksize(), which can be used
to get blocksize, so use it.
Only calculate blocksize once and use it within read_page().
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
The following deadlock was captured. The first process is holding 'kernfs_mutex'
and hung by io. The io was staging in 'r1conf.pending_bio_list' of raid1 device,
this pending bio list would be flushed by second process 'md127_raid1', but
it was hung by 'kernfs_mutex'. Using sysfs_notify_dirent_safe() to replace
sysfs_notify() can fix it. There were other sysfs_notify() invoked from io
path, removed all of them.
PID: 40430 TASK: ffff8ee9c8c65c40 CPU: 29 COMMAND: "probe_file"
#0 [ffffb87c4df37260] __schedule at ffffffff9a8678ec
#1 [ffffb87c4df372f8] schedule at ffffffff9a867f06
#2 [ffffb87c4df37310] io_schedule at ffffffff9a0c73e6
#3 [ffffb87c4df37328] __dta___xfs_iunpin_wait_3443 at ffffffffc03a4057 [xfs]
#4 [ffffb87c4df373a0] xfs_iunpin_wait at ffffffffc03a6c79 [xfs]
#5 [ffffb87c4df373b0] __dta_xfs_reclaim_inode_3357 at ffffffffc039a46c [xfs]
#6 [ffffb87c4df37400] xfs_reclaim_inodes_ag at ffffffffc039a8b6 [xfs]
#7 [ffffb87c4df37590] xfs_reclaim_inodes_nr at ffffffffc039bb33 [xfs]
#8 [ffffb87c4df375b0] xfs_fs_free_cached_objects at ffffffffc03af0e9 [xfs]
#9 [ffffb87c4df375c0] super_cache_scan at ffffffff9a287ec7
#10 [ffffb87c4df37618] shrink_slab at ffffffff9a1efd93
#11 [ffffb87c4df37700] shrink_node at ffffffff9a1f5968
#12 [ffffb87c4df37788] do_try_to_free_pages at ffffffff9a1f5ea2
#13 [ffffb87c4df377f0] try_to_free_mem_cgroup_pages at ffffffff9a1f6445
#14 [ffffb87c4df37880] try_charge at ffffffff9a26cc5f
#15 [ffffb87c4df37920] memcg_kmem_charge_memcg at ffffffff9a270f6a
#16 [ffffb87c4df37958] new_slab at ffffffff9a251430
#17 [ffffb87c4df379c0] ___slab_alloc at ffffffff9a251c85
#18 [ffffb87c4df37a80] __slab_alloc at ffffffff9a25635d
#19 [ffffb87c4df37ac0] kmem_cache_alloc at ffffffff9a251f89
#20 [ffffb87c4df37b00] alloc_inode at ffffffff9a2a2b10
#21 [ffffb87c4df37b20] iget_locked at ffffffff9a2a4854
#22 [ffffb87c4df37b60] kernfs_get_inode at ffffffff9a311377
#23 [ffffb87c4df37b80] kernfs_iop_lookup at ffffffff9a311e2b
#24 [ffffb87c4df37ba8] lookup_slow at ffffffff9a290118
#25 [ffffb87c4df37c10] walk_component at ffffffff9a291e83
#26 [ffffb87c4df37c78] path_lookupat at ffffffff9a293619
#27 [ffffb87c4df37cd8] filename_lookup at ffffffff9a2953af
#28 [ffffb87c4df37de8] user_path_at_empty at ffffffff9a295566
#29 [ffffb87c4df37e10] vfs_statx at ffffffff9a289787
#30 [ffffb87c4df37e70] SYSC_newlstat at ffffffff9a289d5d
#31 [ffffb87c4df37f18] sys_newlstat at ffffffff9a28a60e
#32 [ffffb87c4df37f28] do_syscall_64 at ffffffff9a003949
#33 [ffffb87c4df37f50] entry_SYSCALL_64_after_hwframe at ffffffff9aa001ad
RIP: 00007f617a5f2905 RSP: 00007f607334f838 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00007f6064044b20 RCX: 00007f617a5f2905
RDX: 00007f6064044b20 RSI: 00007f6064044b20 RDI: 00007f6064005890
RBP: 00007f6064044aa0 R8: 0000000000000030 R9: 000000000000011c
R10: 0000000000000013 R11: 0000000000000246 R12: 00007f606417e6d0
R13: 00007f6064044aa0 R14: 00007f6064044b10 R15: 00000000ffffffff
ORIG_RAX: 0000000000000006 CS: 0033 SS: 002b
PID: 927 TASK: ffff8f15ac5dbd80 CPU: 42 COMMAND: "md127_raid1"
#0 [ffffb87c4df07b28] __schedule at ffffffff9a8678ec
#1 [ffffb87c4df07bc0] schedule at ffffffff9a867f06
#2 [ffffb87c4df07bd8] schedule_preempt_disabled at ffffffff9a86825e
#3 [ffffb87c4df07be8] __mutex_lock at ffffffff9a869bcc
#4 [ffffb87c4df07ca0] __mutex_lock_slowpath at ffffffff9a86a013
#5 [ffffb87c4df07cb0] mutex_lock at ffffffff9a86a04f
#6 [ffffb87c4df07cc8] kernfs_find_and_get_ns at ffffffff9a311d83
#7 [ffffb87c4df07cf0] sysfs_notify at ffffffff9a314b3a
#8 [ffffb87c4df07d18] md_update_sb at ffffffff9a688696
#9 [ffffb87c4df07d98] md_update_sb at ffffffff9a6886d5
#10 [ffffb87c4df07da8] md_check_recovery at ffffffff9a68ad9c
#11 [ffffb87c4df07dd0] raid1d at ffffffffc01f0375 [raid1]
#12 [ffffb87c4df07ea0] md_thread at ffffffff9a680348
#13 [ffffb87c4df07f08] kthread at ffffffff9a0b8005
#14 [ffffb87c4df07f50] ret_from_fork at ffffffff9aa00344
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
After introduction attach/detach_page_private in pagemap.h, we can remove
the duplicated code and call the new functions.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Song Liu <song@kernel.org>
Link: http://lkml.kernel.org/r/20200517214718.468-3-guoqing.jiang@cloud.ionos.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull misc vfs updates from Al Viro:
- bmap series from cmaiolino
- getting rid of convolutions in copy_mount_options() (use a couple of
copy_from_user() instead of the __get_user() crap)
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
saner copy_mount_options()
fibmap: Reject negative block numbers
fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
ecryptfs: drop direct calls to ->bmap
cachefiles: drop direct usage of ->bmap method.
fs: Enable bmap() function to properly return errors
By now, bmap() will either return the physical block number related to
the requested file offset or 0 in case of error or the requested offset
maps into a hole.
This patch makes the needed changes to enable bmap() to proper return
errors, using the return value as an error return, and now, a pointer
must be passed to bmap() to be filled with the mapped physical block.
It will change the behavior of bmap() on return:
- negative value in case of error
- zero on success or map fell into a hole
In case of a hole, the *block will be zero too
Since this is a prep patch, by now, the only error return is -EINVAL if
->bmap doesn't exist.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Obviously, IO serialization could cause the degradation of
performance a lot. In order to reduce the degradation, so a
rb interval tree is added in raid1 to speed up the check of
collision.
So, a rb root is needed in md_rdev, then abstract all the
serialize related members to a new struct (serial_in_rdev),
embed it into md_rdev.
Of course, we need to free the struct if it is not needed
anymore, so rdev/rdevs_uninit_serial are added accordingly.
And they should be called when destroty memory pool or can't
alloc memory.
And we need to consider to call mddev_destroy_serial_pool
in case serialize_policy/write-behind is disabled, bitmap
is destroyed or in __md_stop_writes.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
The serial_info_pool is needed if array sets serialize_policy to
true, so don't destroy it.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Previously, wb_info_pool and wb_list stuffs are introduced
to address potential data inconsistence issue for write
behind device.
Now rename them to serial related name, since the same
mechanism will be used to address reorder overlap write
issue for raid1.
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
In md_bitmap_unplug, bitmap->storage.filemap is double checked.
In md_bitmap_daemon_work, bitmap->storage.filemap should be checked
before reference.
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
The write-behind attribute is part of bitmap, since bitmap
can be added/removed dynamically with the following.
1. mdadm --grow /dev/md0 --bitmap=none
2. mdadm --grow /dev/md0 --bitmap=internal --write-behind
So we need to destroy wb_info_pool in md_bitmap_destroy,
and create the pool before load bitmap.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Since we can enable write-behind mode by write backlog node,
so create wb_info_pool if the mode is just enabled, also call
call md_bitmap_update_sb to make user aware the write-behind
mode is enabled. Conversely, wb_info_pool should be destroyed
when write-behind mode is disabled.
Beside above, it is better to update bitmap sb if we change
the number of max_write_behind.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the
initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If we want to convert from a little endian format we need to cast
to a little endian type, otherwise sparse will be unhappy.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <songliubraving@fb.com>
After 9e1cc0a545 ("md: use mddev_suspend/resume instead of ->quiesce()")
We still have similar left in bitmap functions.
Replace quiesce() with mddev_suspend/resume.
Also move md_bitmap_create out of mddev_suspend. and move mddev_resume
after md_bitmap_destroy. as we did in set_bitmap_file.
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
Reviewed-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Pull input updates from Dmitry Torokhov:
- a new driver for Rohm BU21029 touch controller
- new bitmap APIs: bitmap_alloc, bitmap_zalloc and bitmap_free
- updates to Atmel, eeti. pxrc and iforce drivers
- assorted driver cleanups and fixes.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (57 commits)
MAINTAINERS: Add PhoenixRC Flight Controller Adapter
Input: do not use WARN() in input_alloc_absinfo()
Input: mark expected switch fall-throughs
Input: raydium_i2c_ts - use true and false for boolean values
Input: evdev - switch to bitmap API
Input: gpio-keys - switch to bitmap_zalloc()
Input: elan_i2c_smbus - cast sizeof to int for comparison
bitmap: Add bitmap_alloc(), bitmap_zalloc() and bitmap_free()
md: Avoid namespace collision with bitmap API
dm: Avoid namespace collision with bitmap API
Input: pm8941-pwrkey - add resin entry
Input: pm8941-pwrkey - abstract register offsets and event code
Input: iforce - reorganize joystick configuration lists
Input: atmel_mxt_ts - move completion to after config crc is updated
Input: atmel_mxt_ts - don't report zero pressure from T9
Input: atmel_mxt_ts - zero terminate config firmware file
Input: atmel_mxt_ts - refactor config update code to add context struct
Input: atmel_mxt_ts - config CRC may start at T71
Input: atmel_mxt_ts - remove unnecessary debug on ENOMEM
Input: atmel_mxt_ts - remove duplicate setup of ABS_MT_PRESSURE
...
bitmap API (include/linux/bitmap.h) has 'bitmap' prefix for its methods.
On the other hand MD bitmap API is special case.
Adding 'md' prefix to it to avoid name space collision.
No functional changes intended.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Pull MD update from Shaohua Li:
"This update mostly includes bug fixes:
- md-cluster now supports raid10 from Guoqing
- raid5 PPL fixes from Artur
- badblock regression fix from Bo
- suspend hang related fixes from Neil
- raid5 reshape fixes from Neil
- raid1 freeze deadlock fix from Nate
- memleak fixes from Zdenek
- bitmap related fixes from Me and Tao
- other fixes and cleanups"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (33 commits)
md: free unused memory after bitmap resize
md: release allocated bitset sync_set
md/bitmap: clear BITMAP_WRITE_ERROR bit before writing it to sb
md: be cautious about using ->curr_resync_completed for ->recovery_offset
badblocks: fix wrong return value in badblocks_set if badblocks are disabled
md: don't check MD_SB_CHANGE_CLEAN in md_allow_write
md-cluster: update document for raid10
md: remove redundant variable q
raid1: remove obsolete code in raid1_write_request
md-cluster: Use a small window for raid10 resync
md-cluster: Suspend writes in RAID10 if within range
md-cluster/raid10: set "do_balance = 0" if area is resyncing
md: use lockdep_assert_held
raid1: prevent freeze_array/wait_all_barriers deadlock
md: use TASK_IDLE instead of blocking signals
md: remove special meaning of ->quiesce(.., 2)
md: allow metadata update while suspending.
md: use mddev_suspend/resume instead of ->quiesce()
md: move suspend_hi/lo handling into core md code
md: don't call bitmap_create() while array is quiesced.
...
For a RAID1 device using a file-based bitmap, if a bitmap write error
occurs but the later writes succeed, it's possible both BITMAP_STALE
and BITMAP_WRITE_ERROR bits will be written to the bitmap super block,
the BITMAP_STALE bit will be handled properly and be cleared, but the
BITMAP_WRITE_ERROR bit in sb->flags will make bitmap_create() to fail.
So clear it to protect against the write failure-and-then-recovery case.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Having both a bitmap and a journal is pointless.
Attempting to do so can corrupt the bitmap if the journal
replay happens before the bitmap is initialized.
Rather than try to avoid this corruption, simply
refuse to allow arrays with both a bitmap and a journal.
So:
- if raid5_run sees both are present, fail.
- if adding a bitmap finds a journal is present, fail
- if adding a journal finds a bitmap is present, fail.
Cc: stable@vger.kernel.org (4.10+)
Signed-off-by: NeilBrown <neilb@suse.com>
Tested-by: Joshua Kinard <kumba@gentoo.org>
Acked-by: Joshua Kinard <kumba@gentoo.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Motivated by the desire to illiminate the imprecise nature of
DM-specific patches being unnecessarily sent to both the MD maintainer
and mailing-list. Which is born out of the fact that DM files also
reside in drivers/md/
Now all MD-specific files in drivers/md/ start with either "raid" or
"md-" and the MAINTAINERS file has been updated accordingly.
Shaohua: don't change module name
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>