WSL2-Linux-Kernel

Граф коммитов

Автор	SHA1	Сообщение	Дата
Pavel Begunkov	9835d6fafb	io_uring: add likely/unlikely in io_get_sqring() The number of SQEs to submit is specified by a user, so io_get_sqring() in most of the cases succeeds. Hint compilers about that. Checking ASM genereted by gcc 9.2.0 for x64, there is one branch misprediction. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:10 -07:00
Pavel Begunkov	d732447fed	io_uring: rename __io_submit_sqe() __io_submit_sqe() is issuing requests, so call it as such. Moreover, it ends by calling io_iopoll_req_issued(). Rename it and make terminology clearer. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:10 -07:00
Jens Axboe	915967f69c	io_uring: improve trace_io_uring_defer() trace point We don't have shadow requests anymore, so get rid of the shadow argument. Add the user_data argument, as that's often useful to easily match up requests, instead of having to look at request pointers. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:10 -07:00
Pavel Begunkov	1b4a51b6d0	io_uring: drain next sqe instead of shadowing There's an issue with the shadow drain logic in that we drop the completion lock after deciding to defer a request, then re-grab it later and assume that the state is still the same. In the mean time, someone else completing a request could have found and issued it. This can cause a stall in the queue, by having a shadow request inserted that nobody is going to drain. Additionally, if we fail allocating the shadow request, we simply ignore the drain. Instead of using a shadow request, defer the next request/link instead. This also has the following advantages: - removes semi-duplicated code - doesn't allocate memory for shadows - works better if only the head marked for drain - doesn't need complex synchronisation On the flip side, it removes the shadow->seq == last_drain_in_in_link->seq optimization. That shouldn't be a common case, and can always be added back, if needed. Fixes: `4fe2c96315` ("io_uring: add support for link with drain") Cc: Jackie Liu <liuyun01@kylinos.cn> Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:10 -07:00
Jens Axboe	b76da70fc3	io_uring: close lookup gap for dependent next work When we find new work to process within the work handler, we queue the linked timeout before we have issued the new work. This can be problematic for very short timeouts, as we have a window where the new work isn't visible. Allow the work handler to store a callback function for this in the work item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that is set, then io-wq will call the callback when it has setup the new work item. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:10 -07:00
Jens Axboe	4d7dd46297	io_uring: allow finding next link independent of req reference count We currently try and start the next link when we put the request, and only if we were going to free it. This means that the optimization to continue executing requests from the same context often fails, as we're not putting the final reference. Add REQ_F_LINK_NEXT to keep track of this, and allow io_uring to find the next request more efficiently. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Jens Axboe	eb065d301e	io_uring: io_allocate_scq_urings() should return a sane state We currently rely on the ring destroy on cleaning things up in case of failure, but io_allocate_scq_urings() can leave things half initialized if only parts of it fails. Be nice and return with either everything setup in success, or return an error with things nicely cleaned up. Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Pavel Begunkov	bbad27b2f6	io_uring: Always REQ_F_FREE_SQE for allocated sqe Always mark requests with allocated sqe and deallocate it in __io_free_req(). It's easier to follow and doesn't add edge cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Jens Axboe	5d960724b0	io_uring: io_fail_links() should only consider first linked timeout We currently clear the linked timeout field if we cancel such a timeout, but we should only attempt to cancel if it's the first one we see. Others should simply be freed like other requests, as they haven't been started yet. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Pavel Begunkov	09fbb0a83e	io_uring: Fix leaking linked timeouts let have a dependant link: REQ -> LINK_TIMEOUT -> LINK_TIMEOUT 1. submission stage: submission references for REQ and LINK_TIMEOUT are dropped. So, references respectively (1,1,2) 2. io_put(REQ) + FAIL_LINKS stage: calls io_fail_links(), which for all linked timeouts will call cancel_timeout() and drop 1 reference. So, references after: (0,0,1). That's a leak. Make it treat only the first linked timeout as such, and pass others through __io_double_put_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Pavel Begunkov	f70193d6d8	io_uring: remove redundant check Pass any IORING_OP_LINK_TIMEOUT request further, where it will eventually fail in io_issue_sqe(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Pavel Begunkov	d3b35796b1	io_uring: break links for failed defer If io_req_defer() failed, it needs to cancel a dependant link. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:06 -07:00
Dan Carpenter	b2e9c7d64b	io-wq: remove extra space characters These lines are indented an extra space character. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:05 -07:00
Jens Axboe	b60fda6000	io-wq: wait for io_wq_create() to setup necessary workers We currently have a race where if setup is really slow, we can be calling io_wq_destroy() before we're done setting up. This will cause the caller to get stuck waiting for the manager to set things up, but the manager already exited. Fix this by doing a sync setup of the manager. This also fixes the case where if we failed creating workers, we'd also get stuck. In practice this race window was really small, as we already wait for the manager to start. Hence someone would have to call io_wq_destroy() after the task has started, but before it started the first loop. The reported test case forked tons of these, which is why it became an issue. Reported-by: syzbot+0f1cc17f85154f400465@syzkaller.appspotmail.com Fixes: `771b53d033` ("io-wq: small threadpool implementation for io_uring") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:05 -07:00
Jens Axboe	fba38c272a	io_uring: request cancellations should break links We currently don't explicitly break links if a request is cancelled, but we should. Add explicitly link breakage for all types of request cancellations that we support. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:05 -07:00
Jens Axboe	b0dd8a4126	io_uring: correct poll cancel and linked timeout expiration completion Currently a poll request fills a completion entry of 0, even if it got cancelled. This is odd, and it makes it harder to support with chains. Ensure that it returns -ECANCELED in the completions events if it got cancelled, and furthermore ensure that the linked timeout that triggered it completes with -ETIME if we did indeed trigger the completions through a timeout. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:05 -07:00
Jens Axboe	e0e328c4b3	io_uring: remove dead REQ_F_SEQ_PREV flag With the conversion to io-wq, we no longer use that flag. Kill it. Fixes: `561fb04a6a` ("io_uring: replace workqueue usage with io-wq") Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:05 -07:00
Jens Axboe	94ae5e77a9	io_uring: fix sequencing issues with linked timeouts We have an issue with timeout links that are deeper in the submit chain, because we only handle it upfront, not from later submissions. Move the prep + issue of the timeout link to the async work prep handler, and do it normally for non-async queue. If we validate and prepare the timeout links upfront when we first see them, there's nothing stopping us from supporting any sort of nesting. Fixes: `2665abfd75` ("io_uring: add support for linked SQE timeouts") Reported-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:05 -07:00
Jens Axboe	ad8a48acc2	io_uring: make req->timeout be dynamically allocated There are a few reasons for this: - As a prep to improving the linked timeout logic - io_timeout is the biggest member in the io_kiocb opcode union This also enables a few cleanups, like unifying the timer setup between IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple arguments to the link/prep helpers. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:56:01 -07:00
Jens Axboe	978db57e2c	io_uring: make io_double_put_req() use normal completion path If we don't use the normal completion path, we may skip killing links that should be errored and freed. Add __io_double_put_req() for use within the completion path itself, other calls should just use io_double_put_req(). Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:48:31 -07:00
Jens Axboe	0e0702dac2	io_uring: cleanup return values from the queueing functions __io_queue_sqe(), io_queue_sqe(), io_queue_link_head() all return 0/err, but the caller doesn't care since the errors are handled inline. Clean these up and just make them void. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:48:31 -07:00
Jens Axboe	95a5bbae05	io_uring: io_async_cancel() should pass in 'nxt' request pointer If we have a linked request, this enables us to pass it back directly without having to go through async context. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-11-25 19:48:31 -07:00
Linus Torvalds	e25645b181	linux-kselftest-5.5-rc1-kunit This kselftest update for Linux 5.5-rc1 adds KUnit, a lightweight unit testing and mocking framework for the Linux kernel from Brendan Higgins. KUnit is not an end-to-end testing framework. It is currently supported on UML and sub-systems can write unit tests and run them in UML env. KUnit documentation is included in this update. In addition, this Kunit update adds 3 new kunit tests: - kunit test for proc sysctl from Iurii Zaikin - kunit test for the 'list' doubly linked list from David Gow - ext4 kunit test for decoding extended timestamps from Iurii Zaikin In the future KUnit will be linked to Kselftest framework to provide a way to trigger KUnit tests from user-space. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAl3YfBQACgkQCwJExA0N QxzTjxAAiFhaDMhlpLhn1DpIUNvfKrIgDjJgajQAyMMs6TJK3OrD6J4WbpVD7wGo aqF9l6o64sY18JAo3s00j6AcAmVwNH7qzEEuzIQPjJvQ8C4sCWL3esEP4JHgFb2F snlSn5KjSsdC1D9N7uQIhgW76xPSyDrTwWQpglvmB9TwmJVBIl9zhu+unp73ufFJ N+ieDg8A6W/wDGYSq5JBSkJbuI0gL+daNwUYzxEEZIskndhpovOc82WAldECRm6x TfI0u39zTbrEO0DHgmYpyGbTN8TB2mXjH5HMjwg+KbHfKVTKKGvTK7XFs8mWGQpO n2meypZuwuIsRPOPcAVs+Gt2dc0jFODJVIV1EzA0WSv6TEdPqyhM/d13tHdCqjm9 ic5wQ/hQQNEB1Dvg5ereXBaGGaoqP95y61ZpCS9vCXFXH+28E/B63Ebfs2IBIuqS Jv2KcoxabyZq3uGdjnn+mD7IM8rkvscRP4Ba31nXRgJIYDHAzqe7APN7y3on4NGx 1q7lBlA3XZZ8qgo0zpLST20ck/qaL3tk4k8E1f8emh6CuyrCWtazgrWkMIlyEX0O 8nre3uEAF9xUzB4+gZK4YmelN9Bld3Uv7Ippt1zTCiQ0FkEABQIMUrTZygy7Wfg6 6qi4dk8frWW4Kt63gOXsMxr9FWTqDk+Ys4GPAVDVm1d0dzERn8k= =0zEe -----END PGP SIGNATURE----- Merge tag 'linux-kselftest-5.5-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kselftest KUnit support gtom Shuah Khan: "This adds KUnit, a lightweight unit testing and mocking framework for the Linux kernel from Brendan Higgins. KUnit is not an end-to-end testing framework. It is currently supported on UML and sub-systems can write unit tests and run them in UML env. KUnit documentation is included in this update. In addition, this Kunit update adds 3 new kunit tests: - proc sysctl test from Iurii Zaikin - the 'list' doubly linked list test from David Gow - ext4 tests for decoding extended timestamps from Iurii Zaikin In the future KUnit will be linked to Kselftest framework to provide a way to trigger KUnit tests from user-space" * tag 'linux-kselftest-5.5-rc1-kunit' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (23 commits) lib/list-test: add a test for the 'list' doubly linked list ext4: add kunit test for decoding extended timestamps Documentation: kunit: Fix verification command kunit: Fix '--build_dir' option kunit: fix failure to build without printk MAINTAINERS: add proc sysctl KUnit test to PROC SYSCTL section kernel/sysctl-test: Add null pointer test for sysctl.c:proc_dointvec() MAINTAINERS: add entry for KUnit the unit testing framework Documentation: kunit: add documentation for KUnit kunit: defconfig: add defconfigs for building KUnit tests kunit: tool: add Python wrappers for running KUnit tests kunit: test: add tests for KUnit managed resources kunit: test: add the concept of assertions kunit: test: add tests for kunit test abort kunit: test: add support for test abort objtool: add kunit_try_catch_throw to the noreturn list kunit: test: add initial tests lib: enable building KUnit in lib/ kunit: test: add the concept of expectations kunit: test: add assertion printing library ...	2019-11-25 15:01:30 -08:00
Linus Torvalds	1c1ff4836f	fsverity updates for 5.5 Expose the fs-verity bit through statx(). -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXdtWqhQcZWJpZ2dlcnNA Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK+C9AQCCf8C2KP6DynoGQb9KRYYreJk8js8G IgtlhazJ3j1RJAD/VijFbdwbxGCmiR1Y6BhKq5eaCYD1El68wSwkKuNO3ww= =7WpU -----END PGP SIGNATURE----- Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fsverity updates from Eric Biggers: "Expose the fs-verity bit through statx()" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: docs: fs-verity: mention statx() support f2fs: support STATX_ATTR_VERITY ext4: support STATX_ATTR_VERITY statx: define STATX_ATTR_VERITY docs: fs-verity: document first supported kernel version	2019-11-25 12:21:23 -08:00
Linus Torvalds	ea4b71bc0b	fscrypt updates for 5.5 - Add the IV_INO_LBLK_64 encryption policy flag which modifies the encryption to be optimized for UFS inline encryption hardware. - For AES-128-CBC, use the crypto API's implementation of ESSIV (which was added in 5.4) rather than doing ESSIV manually. - A few other cleanups. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCXdtVMxQcZWJpZ2dlcnNA Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK8MVAP44iRzj8ZXu62BhqNOYYcF60s/58QfZ Jo1VdmvO/8MNrAD+P/jW5sqzcB5BLdNzS7pLKGIzsC55uMyp/79xyKK8wQc= =XKWV -----END PGP SIGNATURE----- Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fscrypt updates from Eric Biggers: - Add the IV_INO_LBLK_64 encryption policy flag which modifies the encryption to be optimized for UFS inline encryption hardware. - For AES-128-CBC, use the crypto API's implementation of ESSIV (which was added in 5.4) rather than doing ESSIV manually. - A few other cleanups. * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: f2fs: add support for IV_INO_LBLK_64 encryption policies ext4: add support for IV_INO_LBLK_64 encryption policies fscrypt: add support for IV_INO_LBLK_64 policies fscrypt: avoid data race on fscrypt_mode::logged_impl_name docs: ioctl-number: document fscrypt ioctl numbers fscrypt: zeroize fscrypt_info before freeing fscrypt: remove struct fscrypt_ctx fscrypt: invoke crypto API for ESSIV handling	2019-11-25 12:19:28 -08:00
Linus Torvalds	ae36607b66	affs-for-5.5-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3YC+sACgkQxWXV+ddt WDtcJg/+JESwug/t+8+JWoajhwcMtLcQqqdPIWcfGq+9S7kIcKeoNtboO9Axj8d7 rxubmkk9zHEVgQtA5TWW9kdnNBjTnslHvCvD6m7g6AJGtiNdOofbc/2EaEmDP05f 6N7ZBOQ+l8jwgeZqRO2H2/Xh+Le9pjchhljcRgJ0eVVgPklSh5e4KrWkqOb4UwCm iZCHXPB8AeLxTfGL1yy5wklh4nKZF3DMDaT7+20YyKzNtJjmbS/2WPivr9uWd25D QHUeHk+yOJoRICAX4/8xFk+x7FLJTutfAsvwJM90a5+HR2/msyC7ahzIYguPBtaX Aa5hlP79ug6UhtB9RIUULjZiBYQS38qhlEmAE7RA4/sYFzbm6AkSQTnOxBennivN 2+ev64xeDEKNx//pWM52OHyWuDJbGBeBMdr4I8LtYEoyBgQC5kDnecodV81q9c79 ROTBmjP9gQFxkk3OoP44haky6t3HPqN/jCfFOvo0VxbEyVluRS1BZQ5+IjZIgSFU Zg5L9OHA93+PsZsNnas9Zd07YuMBXK3X4g//veWmY/eWursC6cn3IGhsGlxl0X1a g07jiZCLElmKV66UTdKqn+ByV1n0TreAjSn1Wr3NtYuBiqTAKh0xrbMlbE6n1db7 z+yKBRlcGBVvQQpI0AZB29aeeleNvDGDLuz+hE8K4rFFUCUb1zs= =NomJ -----END PGP SIGNATURE----- Merge tag 'affs-for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull AFFS updates from David Sterba: "A minor bugfix and cleanup for AFFS" * tag 'affs-for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: affs: fix a memory leak in affs_remount affs: Replace binary semaphores with mutexes	2019-11-25 12:17:58 -08:00
Linus Torvalds	97d0bf96a0	for-5.5-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3YCRAACgkQxWXV+ddt WDuTuQ/7BOibDKqInm2SsL8xMuZqXjxGXUcHDPio5MbzNJ3wpV0j1KqWWsuK8hi0 HAhSI3fu3NG7RQYh3nuRO0CaZy3ENiqKoffrSpg7k5DJG0B7Lm/G/970fmOYUp6a j6PMNcrKaw+1J3yuljSd20+n6j/hdmfn847ZsSY+7JmZ4zGMJ5GMv3IRipdLFUzR tmjWmmCI05sF4/8cI6jzUVq588uSFTO1bGXFugmoO0ztpameudCnYniJI0tDBFSV pqk6lqoOPNcaC9nATuA5KKOpUJ9nSscP/St3DV4D6LaZKkT/M5zs12lXPMJx/pKn oCHt/A/wBdbDOoy1uHVMWQ9cz9PyVFtU7eSKizcFjoqnHO6fzlnRr9fsmIZKtTw9 H6nXVmP1S+xJg/zTBxCXHVfZR2dqADUsHWztN1LM8Pen/l9+UMwBeMhq9f9Jz68I kF7zWlfLEtNh8naEYf34LkGVMtCNY4PHFsSztPg/jbfsH34xMvetKvPR2s8lejhp 42YqPHgEh2+8mmVcq65+jl+bPOp/5bdToRtuPiszWiKZSXt/5xplP+5lkSEet0J6 4aNZ8NRAiZ98br45jdTMUVSo6YtI27SS+GdVOUHPQtPI/kWi9XHx+l3E9MVOUtrd lQ1Z9tPinEnJH4kntiCz2sKdzNKE01IagV4wFylz1Ct+ZqF9jNs= =JzIp -----END PGP SIGNATURE----- Merge tag 'for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "User visible changes: - new block group profiles: RAID1 with 3- and 4- copies - RAID1 in btrfs has always 2 copies, now add support for 3 and 4 - this is an incompat feature (named RAID1C34) - recommended use of RAID1C3 is replacement of RAID6 profile on metadata, this brings a more reliable resiliency against 2 device loss/damage - support for new checksums - per-filesystem, set at mkfs time - fast hash (crc32c successor): xxhash, 64bit digest - strong hashes (both 256bit): sha256 (slower, FIPS), blake2b (faster) - the blake2b module goes via the crypto tree, btrfs.ko has a soft dependency - speed up lseek, don't take inode locks unnecessarily, this can speed up parallel SEEK_CUR/SEEK_SET/SEEK_END by 80% - send: - allow clone operations within the same file - limit maximum number of sent clone references to avoid slow backref walking - error message improvements: device scan prints process name and PID Core changes: - cleanups - remove unique workqueue helpers, used to provide a way to avoid deadlocks in the workqueue code, now done in a simpler way - remove lots of indirect function calls in compression code - extent IO tree code moved out of extent_io.c - cleanup backup superblock handling at mount time - transaction life cycle documentation and cleanups - locking code cleanups, annotations and documentation - add more cold, const, pure function attributes - removal of unused or redundant struct members or variables - new tree-checker sanity tests - try to detect missing INODE_ITEM, cross-reference checks of DIR_ITEM, DIR_INDEX, INODE_REF, and XATTR_* items - remove own bio scheduling code (used to avoid checksum submissions being stuck behind other IO), replaced by cgroup controller-based code to allow better control and avoid priority inversions in cases where the custom and cgroup scheduling disagreed Fixes: - avoid getting stuck during cyclic writebacks - fix trimming of ranges crossing block group boundaries - fix rename exchange on subvolumes, all involved subvolumes need to be recorded in the transaction" * tag 'for-5.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (137 commits) btrfs: drop bdev argument from submit_extent_page btrfs: remove extent_map::bdev btrfs: drop bio_set_dev where not needed btrfs: get bdev directly from fs_devices in submit_extent_page btrfs: record all roots for rename exchange on a subvol Btrfs: fix block group remaining RO forever after error during device replace btrfs: scrub: Don't check free space before marking a block group RO btrfs: change btrfs_fs_devices::rotating to bool btrfs: change btrfs_fs_devices::seeding to bool btrfs: rename btrfs_block_group_cache btrfs: block-group: Reuse the item key from caller of read_one_block_group() btrfs: block-group: Refactor btrfs_read_block_groups() btrfs: document extent buffer locking btrfs: access eb::blocking_writers according to ACCESS_ONCE policies btrfs: set blocking_writers directly, no increment or decrement btrfs: merge blocking_writers branches in btrfs_tree_read_lock btrfs: drop incompat bit for raid1c34 after last block group is gone btrfs: add incompat for raid1 with 3, 4 copies btrfs: add support for 4-copy replication (raid1c4) btrfs: add support for 3-copy replication (raid1c3) ...	2019-11-25 12:01:49 -08:00
Linus Torvalds	7e5192b93c	for-5.5/disk-revalidate-20191122 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3YA5sQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplFxEACM7CwrWsullPX6b3j62NW6VepU5JQdzwVW S+bmLpb8Z2I4wzEnaVuWAY5hEhGaS9NFtQLdBG0W0YOzH7sweNmL38dZfCE4+oFj ZwytpXQQhAQUwkgANJCpNfzDymHduPsTz7RYqRr1plmhna1KC/dnhuMwg8lVOBf5 myWjqcCHxxoQn6KFqcX9/Azz29ZrgzV28lOnZdiw9yoTjraBmS/ymx4woaa3pc2v UNw0Cgx53vHENJzEL9FNSxc0ENZq/bQhpDolnc2AlPGy9+vPg4afMitJb60KTT7r HpDcLGkYAIKLrfk8DUmFW8lZhWsxTchXvK2+zwQV7nXMcdUgGN/G3HTIdvWEHFv8 oGbPB8cfdA2vNC9QAybwWEum/S0H/GfYsBVplNCUCdFXE7yj1cbKD5dPfCyIvmPz BjgMae5vH/KoH+vNdZ8NL5oFz2eFC3rLxa/Ss78pcEoBdiiV3WQHPv9MBmn/OQ/v CeUAM7omyWpbv3lcByNzIOkeeO3m6Ne28EpEMc2pzLnDPu2btvSyetdO488DE+7O MNfApZULVX91W7jWnhM5GR+1SJTdEXZnoxnFV+J/j4deog5vUR7Dt1VkujpUILfL 7jMl3erF6C53wNrc465z8iLRp1ZM+aTpwatXXRfucNXeomExKK9zF+/+O1ACckUB jWDCR9NTcw== =e5Lx -----END PGP SIGNATURE----- Merge tag 'for-5.5/disk-revalidate-20191122' of git://git.kernel.dk/linux-block Pull disk revalidation updates from Jens Axboe: "This continues the work that Jan Kara started to thoroughly cleanup and consolidate how we handle rescans and revalidations" * tag 'for-5.5/disk-revalidate-20191122' of git://git.kernel.dk/linux-block: block: move clearing bd_invalidated into check_disk_size_change block: remove (__)blkdev_reread_part as an exported API block: fix bdev_disk_changed for non-partitioned devices block: move rescan_partitions to fs/block_dev.c block: merge invalidate_partitions into rescan_partitions block: refactor rescan_partitions	2019-11-25 11:37:01 -08:00
Linus Torvalds	464a47f45d	for-5.5/zoned-20191122 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3YAiAQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpsJRD/wNfUGWVdIckw7iiFNuuipKBEy0Nd2VLt0B I+pVW/YjDsG2oxWXWPs5Nxc7ca2A8EzRXcWP0xEjBfOCcBh/9mULi1flkLRoWKcq v/OuTVif3ATvgJcwNkbMcoi0bYA/VwKi2dWC6ALhDDmZhyMTLeE362oIeOUNNnl6 GM8CGZHaRfmBzcH5t+WnxiS6rBlt5iwFJ35EvZo3GMXGGiLGlryxEXPAwZrf4haA Z4atNinKcNXhb80LWHo23aK3bpnaumwKP4BPuLEyvnjS4iU8SeYTXy+w5yq1BE+h HBP5s3no/mPiBAG8b6EZXqOJUGlN596AQfNLu7vCR78tmImZF0jKRFsHEAaKXf+B 1yRgZi7J+gV0qzK/Ufulg43vItk5/sTzEuV9YLfCpKTr14MFcWw908BAqaI5Kk1K e8uGqnb2KbZOLTW4QdPvpWg3eYtqEoluSoZUQ5elHxqQZ4MSZ1lK78FF1TeaW/pw sYH+v6rsWoVjEcFSwGoaaOMravzU4MKtavNAZrTJwKZx7qCqkwmi3R1k8WF6KsSV rTRAzUC1wpTdSOm1MYPMMKM/h5+BJRSJ/RjljOF4fXLnvpD5q0lequCWjrrEzc6c HPRKIgSBq7S620A19QD8UxwvZJ8bOivESqr0bux29v1Vpf7vJBrRMng8nLUrXfJs jdma5mK1UA== =/G9l -----END PGP SIGNATURE----- Merge tag 'for-5.5/zoned-20191122' of git://git.kernel.dk/linux-block Pull zoned block device update from Jens Axboe: "Enhancements and improvements to the zoned device support" * tag 'for-5.5/zoned-20191122' of git://git.kernel.dk/linux-block: scsi: sd_zbc: Remove set but not used variable 'buflen' block: rework zone reporting scsi: sd_zbc: Cleanup sd_zbc_alloc_report_buffer() null_blk: Add zone_nr_conv to features null_blk: clean up report zones null_blk: clean up the block device operations block: Remove partition support for zoned block devices block: Simplify report zones execution block: cleanup the !zoned case in blk_revalidate_disk_zones block: Enhance blk_revalidate_disk_zones()	2019-11-25 11:22:37 -08:00
Linus Torvalds	ff6814b078	for-5.5/block-20191121 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WxrEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpuH5D/9qQKfIIuQDUNO4Xx+dIHimTDCrfiEOeO9e CRaMuSj+yMxLDMwfX8RnDmR17H3ZVoiIY1CT24U9ZkA5iDjeAH4xmzkH30US7LR7 /64YVZTxB0OrWppRK8RiIhaJJZDQ6+HPUQsn6PRaLVuFHi2unMoTQnj/ZQKz03QA Pl8Xx7qBtH1JwYCzQ21f/uryAcNg9eWabRLN2f1uiOXLmvRxOfh6Z/iaezlaZlmL qeJdcdLjjvOgOPwEOfNjfS6pd+XBz3gdEhn0l+11nHITxWZmVBwsWTKyUQlCmKnl yuCWDVyx5d6zCnlrLYG0l2Fn2lr9SwAkdkq3YAKV03hA/6s6P9q9bm31VvOf828x 7gmr4YVz68y7H9bM0QAHCvDpjll0aIEUw6XFzSOCDtZ9B6/pppYQWzMU71J05eyF 8DOKv2M2EVNLUjf6u0RDyolnWGU0kIjt5ryWE3OsGcezAVa2wYstgUJTKbrn1YgT j+4KTpaI+sg8GKDFauvxcSa6gwoRp6jweFNW+7vC090/shXmrGmVLOnQZKRuHho/ O4W8y/1/deM8CCIAETpiNxA8RV5U/EZygrFGDFc7yzTtVDGHY356M/B4Bmm2qkVu K3WgeZp8Fc0lH0QF6Pp9ZlBkZEpGNCAPVsPkXIsxQXbctftkn3KY//uIubfpFEB1 PpHSicvkww== =HYYq -----END PGP SIGNATURE----- Merge tag 'for-5.5/block-20191121' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "Due to more granular branches, this one is small and will be followed with other core branches that add specific features. I meant to just have a core and drivers branch, but external dependencies we ended up adding a few more that are also core. The changes are: - Fixes and improvements for the zoned device support (Ajay, Damien) - sed-opal table writing and datastore UID (Revanth) - blk-cgroup (and bfq) blk-cgroup stat fixes (Tejun) - Improvements to the block stats tracking (Pavel) - Fix for overruning sysfs buffer for large number of CPUs (Ming) - Optimization for small IO (Ming, Christoph) - Fix typo in RWH lifetime hint (Eugene) - Dead code removal and documentation (Bart) - Reduction in memory usage for queue and tag set (Bart) - Kerneldoc header documentation (André) - Device/partition revalidation fixes (Jan) - Stats tracking for flush requests (Konstantin) - Various other little fixes here and there (et al)" * tag 'for-5.5/block-20191121' of git://git.kernel.dk/linux-block: (48 commits) Revert "block: split bio if the only bvec's length is > SZ_4K" block: add iostat counters for flush requests block,bfq: Skip tracing hooks if possible block: sed-opal: Introduce SUM_SET_LIST parameter and append it using 'add_token_u64' blk-cgroup: cgroup_rstat_updated() shouldn't be called on cgroup1 block: Don't disable interrupts in trigger_softirq() sbitmap: Delete sbitmap_any_bit_clear() blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue() block: split bio if the only bvec's length is > SZ_4K block: still try to split bio if the bvec crosses pages blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT blk-cgroup: reimplement basic IO stats using cgroup rstat blk-cgroup: remove now unused blkg_print_stat_{bytes\|ios}_recursive() blk-throtl: stop using blkg->stat_bytes and ->stat_ios bfq-iosched: stop using blkg->stat_bytes and ->stat_ios bfq-iosched: relocate bfqg_rwstat() helpers block: add zone open, close and finish ioctl support block: add zone open, close and finish operations block: Simplify REQ_OP_ZONE_RESET_ALL handling block: Remove REQ_OP_ZONE_RESET plugging ...	2019-11-25 10:59:41 -08:00
Linus Torvalds	fb4b3d3fd0	for-5.5/io_uring-20191121 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WxNwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgps4kD/9SIDXhYhhE8fNqeAF7Uouu8fxgwnkY3hSI 43vJwCziiDxWWJH5mYW7/83VNOMZKHIbiYMnU6iEUsRQ/sG/wI0wEfAQZDHLzCKt cko2q7zAC1/4rtoslwJ3q04hE2Ap/nb93ELZBVr7fOAuODBNFUp/vifAojvsMPKz hNMNPq/vYg7c/iYMZKSBdtjE3tqceFNBjAVNMB9dHKQLeexEy4ve7AjBeawWsSi7 GesnQ5w5u5LqkMYwLslpv/oVjHiiFWgGnDAvBNvykQvVy+DfB54KSqMV11W1aqdU l6L+ENfZasEvlk1yMAth2Foq4vlscm5MKEb6VdJhXWHHXtXkcBmz7RBqPmjSvXCY wS5GZRw8oYtTcid0aQf+t/wgRNTDJsGsnsT32qto41No3Z7vlIDHUDxHZGTA+gEL E8j9rDx6EXMTo3EFbC8XZcfsorhPJ1HKAyw1YFczHtYzJEQUR9jJe3f/Q9u6K2Vy s/EhkVeHa/lEd7kb6mI+6lQjGe1FXl7AHauDuaaEfIOZA/xJB3Bad5Wjq1va1cUO TX+37zjzFzJghhSIBGYq7G7iT4AMecPQgxHzCdCyYfW5S4Uur9tMmIElwVPI/Pjl kDZ9gdg9lm6JifZ9Ab8QcGhuQQTF3frwX9VfgrVgcqyvm38AiYzVgL9ZJnxRS/Cy ZfLNkACXqQ== =YZ9s -----END PGP SIGNATURE----- Merge tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: "A lot of stuff has been going on this cycle, with improving the support for networked IO (and hence unbounded request completion times) being one of the major themes. There's been a set of fixes done this week, I'll send those out as well once we're certain we're fully happy with them. This contains: - Unification of the "normal" submit path and the SQPOLL path (Pavel) - Support for sparse (and bigger) file sets, and updating of those file sets without needing to unregister/register again. - Independently sized CQ ring, instead of just making it always 2x the SQ ring size. This makes it more flexible for networked applications. - Support for overflowed CQ ring, never dropping events but providing backpressure on submits. - Add support for absolute timeouts, not just relative ones. - Support for generic cancellations. This divorces io_uring from workqueues as well, which additionally gets us one step closer to generic async system call support. - With cancellations, we can support grabbing the process file table as well, just like we do mm context. This allows support for system calls that create file descriptors, like accept4() support that's built on top of that. - Support for io_uring tracing (Dmitrii) - Support for linked timeouts. These abort an operation if it isn't completed by the time noted in the linke timeout. - Speedup tracking of poll requests - Various cleanups making the coder easier to follow (Jackie, Pavel, Bob, YueHaibing, me) - Update MAINTAINERS with new io_uring list" * tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block: (64 commits) io_uring: make POLL_ADD/POLL_REMOVE scale better io-wq: remove now redundant struct io_wq_nulls_list io_uring: Fix getting file for non-fd opcodes io_uring: introduce req_need_defer() io_uring: clean up io_uring_cancel_files() io-wq: ensure free/busy list browsing see all items io-wq: ensure we have a stable view of ->cur_work for cancellations io_wq: add get/put_work handlers to io_wq_create() io_uring: check for validity of ->rings in teardown io_uring: fix potential deadlock in io_poll_wake() io_uring: use correct "is IO worker" helper io_uring: fix -ENOENT issue with linked timer with short timeout io_uring: don't do flush cancel under inflight_lock io_uring: flag SQPOLL busy condition to userspace io_uring: make ASYNC_CANCEL work with poll and timeout io_uring: provide fallback request for OOM situations io_uring: convert accept4() -ERESTARTSYS into -EINTR io_uring: fix error clear of ->file_table in io_sqe_files_register() io_uring: separate the io_free_req and io_free_req_find_next interface io_uring: keep io_put_req only responsible for release and put req ...	2019-11-25 10:40:27 -08:00
Linus Torvalds	0be0ee7181	vfs: properly and reliably lock f_pos in fdget_pos() fdget_pos() is used by file operations that will read and update f_pos: things like "read()", "write()" and "lseek()" (but not, for example, "pread()/pwrite" that get their file positions elsewhere). However, it had two separate escape clauses for this, because not everybody wants or needs serialization of the file position. The first and most obvious case is the "file descriptor doesn't have a position at all", ie a stream-like file. Except we didn't actually use FMODE_STREAM, but instead used FMODE_ATOMIC_POS. The reason for that was that FMODE_STREAM didn't exist back in the days, but also that we didn't want to mark all the special cases, so we only marked the ones that _required_ position atomicity according to POSIX - regular files and directories. The case one was intentionally lazy, but now that we _do_ have FMODE_STREAM we could and should just use it. With the change to use FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all the code to set it is deleted. Any cases where we don't want the serialization because the driver (or subsystem) doesn't use the file position should just be updated to do "stream_open()". We've done that for all the obvious and common situations, we may need a few more. Quoting Kirill Smelkov in the original FMODE_STREAM thread (see link below for full email): "And I appreciate if people could help at least somehow with "getting rid of mixed case entirely" (i.e. always lock f_pos_lock on !FMODE_STREAM), because this transition starts to diverge from my particular use-case too far. To me it makes sense to do that transition as follows: - convert nonseekable_open -> stream_open via stream_open.cocci; - audit other nonseekable_open calls and convert left users that truly don't depend on position to stream_open; - extend stream_open.cocci to analyze alloc_file_pseudo as well (this will cover pipes and sockets), or maybe convert pipes and sockets to FMODE_STREAM manually; - extend stream_open.cocci to analyze file_operations that use no_llseek or noop_llseek, but do not use nonseekable_open or alloc_file_pseudo. This might find files that have stream semantic but are opened differently; - extend stream_open.cocci to analyze file_operations whose .read/.write do not use ppos at all (independently of how file was opened); - ... - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if !FMODE_STREAM; - gather bug reports for deadlocked read/write and convert missed cases to FMODE_STREAM, probably extending stream_open.cocci along the road to catch similar cases i.e. always take f_pos_lock unless a file is explicitly marked as being stream, and try to find and cover all files that are streams" We have not done the "extend stream_open.cocci to analyze alloc_file_pseudo" as well, but the previous commit did manually handle the case of pipes and sockets. The other case where we can avoid locking f_pos is the "this file descriptor only has a single user and it is us, and thus there is no need to lock it". The second test was correct, although a bit subtle and worth just re-iterating here. There are two kinds of other sources of references to the same file descriptor: file descriptors that have been explicitly shared across fork() or with dup(), and file tables having elevated reference counts due to threading (or explicit file sharing with clone()). The first case would have incremented the file count explicitly, and in the second case the previous __fdget() would have incremented it for us and set the FDPUT_FPUT flag. But in both cases the file count would be greater than one, so the "file_count(file) > 1" test catches both situations. Also note that if file_count is 1, that also means that no other thread can have access to the file table, so there also cannot be races with concurrent calls to dup()/fork()/clone() that would increment the file count any other way. Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru Cc: Kirill Smelkov <kirr@nexedi.com> Cc: Eic Dumazet <edumazet@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Marco Elver <elver@google.com> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Paul McKenney <paulmck@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-11-25 10:01:53 -08:00
Linus Torvalds	d8e464ecc1	vfs: mark pipes and sockets as stream-like file descriptors In commit `3975b097e5` ("convert stream-like files -> stream_open, even if they use noop_llseek") Kirill used a coccinelle script to change "nonseekable_open()" to "stream_open()", which changed the trivial cases of stream-like file descriptors to the new model with FMODE_STREAM. However, the two big cases - sockets and pipes - don't actually have that trivial pattern at all, and were thus never converted to FMODE_STREAM even though it makes lots of sense to do so. That's particularly true when looking forward to the next change: getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to decide whether f_pos updates are needed or not. And if they are, we'll always do them atomically. This came up because KCSAN (correctly) noted that the non-locked f_pos updates are data races: they are clearly benign for the case where we don't care, but it would be good to just not have that issue exist at all. Note that the reason we used FMODE_ATOMIC_POS originally is that only doing it for the minimal required case is "safer" in that it's possible that the f_pos locking can cause unnecessary serialization across the whole write() call. And in the worst case, that kind of serialization can cause deadlock issues: think writers that need readers to empty the state using the same file descriptor. [ Note that the locking is per-file descriptor - because it protects "f_pos", which is obviously per-file descriptor - so it only affects cases where you literally use the same file descriptor to both read and write. So a regular pipe that has separate reading and writing file descriptors doesn't really have this situation even though it's the obvious case of "reader empties what a bit writer concurrently fills" But we want to make pipes as being stream-line anyway, because we don't want the unnecessary overhead of locking, and because a named pipe can be (ab-)used by reading and writing to the same file descriptor. ] There are likely a lot of other cases that might want FMODE_STREAM, and looking for ".llseek = no_llseek" users and other cases that don't have an lseek file operation at all and making them use "stream_open()" might be a good idea. But pipes and sockets are likely to be the two main cases. Cc: Kirill Smelkov <kirr@nexedi.com> Cc: Eic Dumazet <edumazet@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Marco Elver <elver@google.com> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Paul McKenney <paulmck@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-11-25 09:12:11 -08:00
Linus Torvalds	b8387f6f34	Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull cramfs fix from Al Viro: "Regression fix, fallen through the cracks" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: cramfs: fix usage on non-MTD device	2019-11-24 12:36:39 -08:00
Maxime Bizon	3e5aeec0e2	cramfs: fix usage on non-MTD device When both CONFIG_CRAMFS_MTD and CONFIG_CRAMFS_BLOCKDEV are enabled, if we fail to mount on MTD, we don't try on block device. Note: this relies upon cramfs_mtd_fill_super() leaving no side effects on fc state in case of failure; in general, failing get_tree_...() does not mean "fine to try again"; e.g. parsed options might've been consumed by fill_super callback and freed on failure. Fixes: `74f78fc5ef` ("vfs: Convert cramfs to use the new mount API") Signed-off-by: Maxime Bizon <mbizon@freebox.fr> Signed-off-by: Nicolas Pitre <nico@fluxnic.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2019-11-23 21:44:49 -05:00
Marc Dionne	b485275f1a	afs: Fix large file support By default s_maxbytes is set to MAX_NON_LFS, which limits the usable file size to 2GB, enforced by the vfs. Commit `b9b1f8d593` ("AFS: write support fixes") added support for the 64-bit fetch and store server operations, but did not change this value. As a result, attempts to write past the 2G mark result in EFBIG errors: $ dd if=/dev/zero of=foo bs=1M count=1 seek=2048 dd: error writing 'foo': File too large Set s_maxbytes to MAX_LFS_FILESIZE. Fixes: `b9b1f8d593` ("AFS: write support fixes") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-11-22 14:19:26 -08:00
Marc Dionne	cd340703c2	afs: Fix possible assert with callbacks from yfs servers Servers sending callback breaks to the YFS_CM_SERVICE service may send up to YFSCBMAX (1024) fids in a single RPC. Anything over AFSCBMAX (50) will cause the assert in afs_break_callbacks to trigger. Remove the assert, as the count has already been checked against the appropriate max values in afs_deliver_cb_callback and afs_deliver_yfs_cb_callback. Fixes: `35dbfba311` ("afs: Implement the YFS cache manager service") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-11-22 14:19:26 -08:00
Joseph Qi	94b07b6f9e	Revert "fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()" This reverts commit `56e94ea132`. Commit `56e94ea132` ("fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()") introduces a regression that fail to create directory with mount option user_xattr and acl. Actually the reported NULL pointer dereference case can be correctly handled by loc->xl_ops->xlo_add_entry(), so revert it. Link: http://lkml.kernel.org/r/1573624916-83825-1-git-send-email-joseph.qi@linux.alibaba.com Fixes: `56e94ea132` ("fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()") Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reported-by: Thomas Voegtle <tv@lio96.de> Acked-by: Changwei Ge <gechangwei@live.cn> Cc: Jia-Ju Bai <baijiaju1990@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-11-22 09:11:18 -08:00
David Howells	c74386d50f	afs: Fix missing timeout reset In afs_wait_for_call_to_complete(), rather than immediately aborting an operation if a signal occurs, the code attempts to wait for it to complete, using a schedule timeout of 2*RTT (or min 2 jiffies) and a check that we're still receiving relevant packets from the server before we consider aborting the call. We may even ping the server to check on the status of the call. However, there's a missing timeout reset in the event that we do actually get a packet to process, such that if we then get a couple of short stalls, we then time out when progress is actually being made. Fix this by resetting the timeout any time we get something to process. If it's the failure of the call then the call state will get changed and we'll exit the loop shortly thereafter. A symptom of this is data fetches and stores failing with EINTR when they really shouldn't. Fixes: `bc5e3a546d` ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2019-11-19 14:36:38 -08:00
David Sterba	fa17ed069c	btrfs: drop bdev argument from submit_extent_page After previous patches removing bdev being passed around to set it to bio, it has become unused in submit_extent_page. So it now has "only" 13 parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 23:43:58 +01:00
David Sterba	a019e9e197	btrfs: remove extent_map::bdev We can now remove the bdev from extent_map. Previous patches made sure that bio_set_dev is correctly in all places and that we don't need to grab it from latest_bdev or pass it around inside the extent map. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 23:43:44 +01:00
David Sterba	1a41802701	btrfs: drop bio_set_dev where not needed bio_set_dev sets a bdev to a bio and is not only setting a pointer bug also changing some state bits if there was a different bdev set before. This is one thing that's not needed. Another thing is that setting a bdev at bio allocation time is too early and actually does not work with plain redundancy profiles, where each time we submit a bio to a device, the bdev is set correctly. In many places the bio bdev is set to latest_bdev that seems to serve as a stub pointer "just to put something to bio". But we don't have to do that. Where do we know which bdev to set: * for regular IO: submit_stripe_bio that's called by btrfs_map_bio * repair IO: repair_io_failure, read or write from specific device * super block write (using buffer_heads but uses raw bdev) and barriers * scrub: this does not use all regular IO paths as it needs to reach all copies, verify and fixup eventually, and for that all bdev management is independent * raid56: rbio_add_io_page, for the RMW write * integrity-checker: does it's own low-level block tracking Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 23:39:30 +01:00
David Sterba	429aebc0a9	btrfs: get bdev directly from fs_devices in submit_extent_page This is preparatory patch to remove @bdev parameter from submit_extent_page. It can't be removed completely, because the cgroups need it for wbc when initializing the bio wbc_init_bio bio_associate_blkg_from_css dereference bdev->bi_disk->queue The bdev pointer is the same as latest_bdev, thus no functional change. We can retrieve it from fs_devices that's reachable through several dereferences. The local variable shadows the parameter, but that's only temporary. Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 23:38:46 +01:00
Josef Bacik	3e1740993e	btrfs: record all roots for rename exchange on a subvol Testing with the new fsstress support for subvolumes uncovered a pretty bad problem with rename exchange on subvolumes. We're modifying two different subvolumes, but we only start the transaction on one of them, so the other one is not added to the dirty root list. This is caught by btrfs_cow_block() with a warning because the root has not been updated, however if we do not modify this root again we'll end up pointing at an invalid root because the root item is never updated. Fix this by making sure we add the destination root to the trans list, the same as we do with normal renames. This fixes the corruption. Fixes: `cdd1fedf82` ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 20:08:31 +01:00
Filipe Manana	042528f8d8	Btrfs: fix block group remaining RO forever after error during device replace When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we set the block group to RO mode and then wait for any ongoing writes into extents of the block group to complete. While doing that wait we overwrite the value of the variable 'ret' and can break out of the loop if an error happens without turning the block group back into RW mode. So what happens is the following: 1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group to RO mode (its ->ro field set to 1 or incremented to some value > 1); 2) Then btrfs_wait_ordered_roots() returns a value > 0; 3) Then if either joining or committing the transaction fails, we break out of the loop wihtout calling btrfs_dec_block_group_ro(), leaving the block group in RO mode forever. To fix this, just remove the code that waits for ongoing writes to extents of the block group, since it's not needed because in the initial setup phase of a device replace operation, before starting to find all chunks and their extents, we set the target device for replace while holding fs_info->dev_replace->rwsem, which ensures that after releasing that semaphore, any writes into the source device are made to the target device as well (__btrfs_map_block() guarantees that). So while at scrub_enumerate_chunks() we only need to worry about finding and copying extents (from the source device to the target device) that were written before we started the device replace operation. Fixes: `f0e9b7d640` ("Btrfs: fix race setting block group readonly during device replace") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 18:07:55 +01:00
Qu Wenruo	b12de52896	btrfs: scrub: Don't check free space before marking a block group RO [BUG] When running btrfs/072 with only one online CPU, it has a pretty high chance to fail: btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg) - output mismatch (see xfstests-dev/results//btrfs/072.out.bad) --- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800 +++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800 @@ -1,2 +1,3 @@ QA output created by 072 Silence is golden +Scrub find errors in "-m dup -d single" test ... And with the following call trace: BTRFS info (device dm-5): scrub: started on devid 1 ------------[ cut here ]------------ BTRFS: Transaction aborted (error -27) WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs] CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs] Call Trace: __btrfs_end_transaction+0xdb/0x310 [btrfs] btrfs_end_transaction+0x10/0x20 [btrfs] btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs] scrub_enumerate_chunks+0x264/0x940 [btrfs] btrfs_scrub_dev+0x45c/0x8f0 [btrfs] btrfs_ioctl+0x31a1/0x3fb0 [btrfs] do_vfs_ioctl+0x636/0xaa0 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x43/0x50 do_syscall_64+0x79/0xe0 entry_SYSCALL_64_after_hwframe+0x49/0xbe ---[ end trace 166c865cec7688e7 ]--- [CAUSE] The error number -27 is -EFBIG, returned from the following call chain: btrfs_end_transaction() \|- __btrfs_end_transaction() \|- btrfs_create_pending_block_groups() \|- btrfs_finish_chunk_alloc() \|- btrfs_add_system_chunk() This happens because we have used up all space of btrfs_super_block::sys_chunk_array. The root cause is, we have the following bad loop of creating tons of system chunks: 1. The only SYSTEM chunk is being scrubbed It's very common to have only one SYSTEM chunk. 2. New SYSTEM bg will be allocated As btrfs_inc_block_group_ro() will check if we have enough space after marking current bg RO. If not, then allocate a new chunk. 3. New SYSTEM bg is still empty, will be reclaimed During the reclaim, we will mark it RO again. 4. That newly allocated empty SYSTEM bg get scrubbed We go back to step 2, as the bg is already mark RO but still not cleaned up yet. If the cleaner kthread doesn't get executed fast enough (e.g. only one CPU), then we will get more and more empty SYSTEM chunks, using up all the space of btrfs_super_block::sys_chunk_array. [FIX] Since scrub/dev-replace doesn't always need to allocate new extent, especially chunk tree extent, so we don't really need to do chunk pre-allocation. To break above spiral, here we introduce a new parameter to btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we need extra chunk pre-allocation. For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass @do_chunk_alloc=false. This should keep unnecessary empty chunks from popping up for scrub. Also, since there are two parameters for btrfs_inc_block_group_ro(), add more comment for it. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 18:07:55 +01:00
Johannes Thumshirn	7f0432d0d8	btrfs: change btrfs_fs_devices::rotating to bool struct btrfs_fs_devices::rotating currently is declared as an integer variable but only used as a boolean. Change the variable definition to bool and update to code touching it to set 'true' and 'false'. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:51 +01:00
Johannes Thumshirn	0395d84f8e	btrfs: change btrfs_fs_devices::seeding to bool struct btrfs_fs_devices::seeding currently is declared as an integer variable but only used as a boolean. Change the variable definition to bool and update to code touching it to set 'true' and 'false'. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:51 +01:00
David Sterba	32da5386d9	btrfs: rename btrfs_block_group_cache The type name is misleading, a single entry is named 'cache' while this normally means a collection of objects. Rename that everywhere. Also the identifier was quite long, making function prototypes harder to format. Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:51 +01:00
Qu Wenruo	d49a2ddb15	btrfs: block-group: Reuse the item key from caller of read_one_block_group() For read_one_block_group(), its only caller has already got the item key to search next block group item. So we can use that key directly without doing our own convertion on stack. Also, since that key used in btrfs_read_block_groups() is vital for block group item search, add 'const' keyword for that parameter to prevent read_one_block_group() to modify it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2019-11-18 17:51:50 +01:00

1 2 3 4 5 ...

61072 Коммитов