WSL2-Linux-Kernel

Граф коммитов

Автор	SHA1	Сообщение	Дата
Mike Kravetz	b6305049f3	ipc/shm: call underlying open/close vm_ops Shared memory segments can be created that are backed by hugetlb pages. When this happens, the vmas associated with any mappings (shmat) are marked VM_HUGETLB, yet the vm_ops for such mappings are provided by ipc/shm (shm_vm_ops). There is a mechanism to call the underlying hugetlb vm_ops, and this is done for most operations. However, it is not done for open and close. This was not an issue until the introduction of the hugetlb vma_lock. This lock structure is pointed to by vm_private_data and the open/close vm_ops help maintain this structure. The special hugetlb routine called at fork took care of structure updates at fork time. However, vma_splitting is not properly handled for ipc shared memory mappings backed by hugetlb pages. This can result in a "kernel NULL pointer dereference" BUG or use after free as two vmas point to the same lock structure. Update the shm open and close routines to always call the underlying open and close routines. Link: https://lkml.kernel.org/r/20221114210018.49346-1-mike.kravetz@oracle.com Fixes: `8d9bfb2608` ("hugetlb: add vma based lock for pmd sharing") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Doug Nelson <doug.nelson@intel.com> Reported-by: <syzbot+83b4134621b7c326d950@syzkaller.appspotmail.com> Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-11-22 18:50:42 -08:00
Andrew Morton	64b4c411a6	ipc/msg.c: fix percpu_counter use after free These percpu counters are referenced in free_ipcs->freeque, so destroy them later. Fixes: `72d1e61108` ("ipc/msg: mitigate the lock contention with percpu counter") Reported-by: syzbot+96e659d35b9d6b541152@syzkaller.appspotmail.com Tested-by: Mark Rutland <mark.rutland@arm.com> Cc: Jiebin Sun <jiebin.sun@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-10-28 13:37:22 -07:00
Linus Torvalds	676cb49573	- hfs and hfsplus kmap API modernization from Fabio Francesco - Valentin Schneider makes crash-kexec work properly when invoked from an NMI-time panic. - ntfs bugfixes from Hawkins Jiawei - Jiebin Sun improves IPC msg scalability by replacing atomic_t's with percpu counters. - nilfs2 cleanups from Minghao Chi - lots of other single patches all over the tree! -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0Yf0gAKCRDdBJ7gKXxA joapAQDT1d1zu7T8yf9cQXkYnZVuBKCjxKE/IsYvqaq1a42MjQD/SeWZg0wV05B8 DhJPj9nkEp6R3Rj3Mssip+3vNuceAQM= =lUQY -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - hfs and hfsplus kmap API modernization (Fabio Francesco) - make crash-kexec work properly when invoked from an NMI-time panic (Valentin Schneider) - ntfs bugfixes (Hawkins Jiawei) - improve IPC msg scalability by replacing atomic_t's with percpu counters (Jiebin Sun) - nilfs2 cleanups (Minghao Chi) - lots of other single patches all over the tree! * tag 'mm-nonmm-stable-2022-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits) include/linux/entry-common.h: remove has_signal comment of arch_do_signal_or_restart() prototype proc: test how it holds up with mapping'less process mailmap: update Frank Rowand email address ia64: mca: use strscpy() is more robust and safer init/Kconfig: fix unmet direct dependencies ia64: update config files nilfs2: replace WARN_ONs by nilfs_error for checkpoint acquisition failure fork: remove duplicate included header files init/main.c: remove unnecessary (void*) conversions proc: mark more files as permanent nilfs2: remove the unneeded result variable nilfs2: delete unnecessary checks before brelse() checkpatch: warn for non-standard fixes tag style usr/gen_init_cpio.c: remove unnecessary -1 values from int file ipc/msg: mitigate the lock contention with percpu counter percpu: add percpu_counter_add_local and percpu_counter_sub_local fs/ocfs2: fix repeated words in comments relay: use kvcalloc to alloc page array in relay_alloc_page_array proc: make config PROC_CHILDREN depend on PROC_FS fs: uninline inode_maybe_inc_iversion() ...	2022-10-12 11:00:22 -07:00
Linus Torvalds	27bc50fc90	- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam R. Howlett. An overlapping range-based tree for vmas. It it apparently slight more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat (https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com). This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU= =xfWx -----END PGP SIGNATURE----- Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in linux-next for a couple of months without, to my knowledge, any negative reports (or any positive ones, come to that). - Also the Maple Tree from Liam Howlett. An overlapping range-based tree for vmas. It it apparently slightly more efficient in its own right, but is mainly targeted at enabling work to reduce mmap_lock contention. Liam has identified a number of other tree users in the kernel which could be beneficially onverted to mapletrees. Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat at [1]. This has yet to be addressed due to Liam's unfortunately timed vacation. He is now back and we'll get this fixed up. - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses clang-generated instrumentation to detect used-unintialized bugs down to the single bit level. KMSAN keeps finding bugs. New ones, as well as the legacy ones. - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of memory into THPs. - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support file/shmem-backed pages. - userfaultfd updates from Axel Rasmussen - zsmalloc cleanups from Alexey Romanov - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure - Huang Ying adds enhancements to NUMA balancing memory tiering mode's page promotion, with a new way of detecting hot pages. - memcg updates from Shakeel Butt: charging optimizations and reduced memory consumption. - memcg cleanups from Kairui Song. - memcg fixes and cleanups from Johannes Weiner. - Vishal Moola provides more folio conversions - Zhang Yi removed ll_rw_block() :( - migration enhancements from Peter Xu - migration error-path bugfixes from Huang Ying - Aneesh Kumar added ability for a device driver to alter the memory tiering promotion paths. For optimizations by PMEM drivers, DRM drivers, etc. - vma merging improvements from Jakub Matěn. - NUMA hinting cleanups from David Hildenbrand. - xu xin added aditional userspace visibility into KSM merging activity. - THP & KSM code consolidation from Qi Zheng. - more folio work from Matthew Wilcox. - KASAN updates from Andrey Konovalov. - DAMON cleanups from Kaixu Xia. - DAMON work from SeongJae Park: fixes, cleanups. - hugetlb sysfs cleanups from Muchun Song. - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core. Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1] * tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits) hugetlb: allocate vma lock for all sharable vmas hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer hugetlb: fix vma lock handling during split vma and range unmapping mglru: mm/vmscan.c: fix imprecise comments mm/mglru: don't sync disk for each aging cycle mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol mm: memcontrol: use do_memsw_account() in a few more places mm: memcontrol: deprecate swapaccounting=0 mode mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled mm/secretmem: remove reduntant return value mm/hugetlb: add available_huge_pages() func mm: remove unused inline functions from include/linux/mm_inline.h selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd selftests/vm: add thp collapse shmem testing selftests/vm: add thp collapse file and tmpfs testing selftests/vm: modularize thp collapse memory operations selftests/vm: dedup THP helpers mm/khugepaged: add tracepoint to hpage_collapse_scan_file() mm/madvise: add file and shmem support to MADV_COLLAPSE ...	2022-10-10 17:53:04 -07:00
Linus Torvalds	86fb9c53d8	ipc: mqueue: fix possible memory leak in init_mqueue_fs() A fix for a unlikely but possible memory leak. Hangyu Hua (1): ipc: mqueue: fix possible memory leak in init_mqueue_fs() ipc/mqueue.c \| 1 + 1 file changed, 1 insertion(+) -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmM7U48ACgkQC/v6Eiaj j0DMEw//R1u/xUW8ytS3/G3AHXsJ0df4GgcBr3MsYsVz/HXnVTbVFyz8BodX6mdD q6cPFlzBf1wYwh7DtDqAVOEvEZg8yiAyBX4RKkVvfD35J+8GCROQ2mnkeo3AHdNp jIGw8IimhFjiQr/gbc5WQxox+sH/l1O/6ksDl11JYEnWs5lJgsh5Wyg6cp5d7Wd3 p9B9AKzhqby+6JMLuucsaLqS/g+dZ218Y1ghhLIMrK/UEZ9+DAX9GWK3ebr+1Xxw mQnaGZtPWXkMwVTr69fED56/Q0HC/KhIMLhuGygX/zfTmU3iZjeChUCOgM/18kFZ vUvy7G0R9uMgm1lgjCvmQW9gO9jnIDKHeB1qPVgmUNpvpcFCjzJ2kRdnLfGkGp1j Ot6ZQMJOSMvLPUT8M4LZrMfFJQZYy8Jy2Oh97Ryf80ijGdFigS1oQ4BD9q7hMCdM DDBP4I4a2eB6Z2tIyMEQ9fx74lhOLI9pO22gG5JX4Fxhd9i1tBfJrjzKXPOQyO+y A7bSL1NyVjvxA6IkeSHyBsL8jl8ntcg6kwoxIfjfcmmMRTrK2rnfbXhSrmTflfP7 WUv5p+sd6NT4DAZ2rqvNB8BYzSelLGbDpXg77nqDJk2UjQDcA0MOl9c91vJsGaIn paoiCm28yIRxFkCHY0RIZSlLp2t32ZwIzhvcC5YOQW5W3VwyjXY= =tjwH -----END PGP SIGNATURE----- Merge tag 'retire_mq_sysctls-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull mqueue fix from Eric Biederman: "A fix for an unlikely but possible memory leak" * tag 'retire_mq_sysctls-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ipc: mqueue: fix possible memory leak in init_mqueue_fs()	2022-10-09 16:10:22 -07:00
Jiebin Sun	72d1e61108	ipc/msg: mitigate the lock contention with percpu counter The msg_bytes and msg_hdrs atomic counters are frequently updated when IPC msg queue is in heavy use, causing heavy cache bounce and overhead. Change them to percpu_counter greatly improve the performance. Since there is one percpu struct per namespace, additional memory cost is minimal. Reading of the count done in msgctl call, which is infrequent. So the need to sum up the counts in each CPU is infrequent. Apply the patch and test the pts/stress-ng-1.4.0 -- system v message passing (160 threads). Score gain: 3.99x CPU: ICX 8380 x 2 sockets Core number: 40 x 2 physical cores Benchmark: pts/stress-ng-1.4.0 -- system v message passing (160 threads) [akpm@linux-foundation.org: coding-style cleanups] [jiebin.sun@intel.com: avoid negative value by overflow in msginfo] Link: https://lkml.kernel.org/r/20220920150809.4014944-1-jiebin.sun@intel.com [akpm@linux-foundation.org: fix min() warnings] Link: https://lkml.kernel.org/r/20220913192538.3023708-3-jiebin.sun@intel.com Signed-off-by: Jiebin Sun <jiebin.sun@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> Cc: Alexey Gladkov <legion@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dennis Zhou <dennis@kernel.org> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vasily Averin <vasily.averin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-10-03 14:21:44 -07:00
Jingyu Wang	5758478a3d	ipc: mqueue: remove unnecessary conditionals iput() already handles null and non-null parameters, so there is no need to use if(). Link: https://lkml.kernel.org/r/20220908185452.76590-1-jingyuwang_vip@163.com Signed-off-by: Jingyu Wang <jingyuwang_vip@163.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-10-03 14:21:42 -07:00
Liam R. Howlett	01293a62ba	ipc/shm: use VMA iterator instead of linked list The VMA iterator is faster than the linked llist, and it can be walked even when VMAs are being removed from the address space, so there's no need to keep track of 'next'. Link: https://lkml.kernel.org/r/20220906194824.2110408-46-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-26 19:46:21 -07:00
Manfred Spraul	58b5c20336	ipc/util.c: cleanup and improve sysvipc_find_ipc() sysvipc_find_ipc() can be simplified further: - It uses a for() loop to locate the next entry in the idr. This can be replaced with idr_get_next(). - It receives two parameters (pos - which is actually an idr index and not a position, and new_pos, which is really a position). One parameter is sufficient. Link: https://lore.kernel.org/all/20210903052020.3265-3-manfred@colorfullife.com/ Link: https://lkml.kernel.org/r/20220805115733.104763-1-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Waiman Long <longman@redhat.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: <1vier1@web.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-09-11 21:55:05 -07:00
Linus Torvalds	eb5699ba31	Updates to various subsystems which I help look after. lib, ocfs2, fatfs, autofs, squashfs, procfs, etc. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYu9BeQAKCRDdBJ7gKXxA jp1DAP4mjCSvAwYzXklrIt+Knv3CEY5oVVdS+pWOAOGiJpldTAD9E5/0NV+VmlD9 kwS/13j38guulSlXRzDLmitbg81zAAI= =Zfum -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2022-08-06-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc updates from Andrew Morton: "Updates to various subsystems which I help look after. lib, ocfs2, fatfs, autofs, squashfs, procfs, etc. A relatively small amount of material this time" * tag 'mm-nonmm-stable-2022-08-06-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits) scripts/gdb: ensure the absolute path is generated on initial source MAINTAINERS: kunit: add David Gow as a maintainer of KUnit mailmap: add linux.dev alias for Brendan Higgins mailmap: update Kirill's email profile: setup_profiling_timer() is moslty not implemented ocfs2: fix a typo in a comment ocfs2: use the bitmap API to simplify code ocfs2: remove some useless functions lib/mpi: fix typo 'the the' in comment proc: add some (hopefully) insightful comments bdi: remove enum wb_congested_state kernel/hung_task: fix address space of proc_dohung_task_timeout_secs lib/lzo/lzo1x_compress.c: replace ternary operator with min() and min_t() squashfs: support reading fragments in readahead call squashfs: implement readahead squashfs: always build "file direct" version of page actor Revert "squashfs: provide backing_dev_info in order to disable read-ahead" fs/ocfs2: Fix spelling typo in comment ia64: old_rr4 added under CONFIG_HUGETLB_PAGE proc: fix test for "vsyscall=xonly" boot option ...	2022-08-07 10:03:24 -07:00
Hangyu Hua	c579d60f0d	ipc: mqueue: fix possible memory leak in init_mqueue_fs() commit `db7cfc3809` ("ipc: Free mq_sysctls if ipc namespace creation failed") Here's a similar memory leak to the one fixed by the patch above. retire_mq_sysctls need to be called when init_mqueue_fs fails after setup_mq_sysctls. Fixes: `dc55e35f9e` ("ipc: Store mqueue sysctls in the ipc namespace") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Link: https://lkml.kernel.org/r/20220715062301.19311-1-hbh25y@gmail.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2022-07-19 17:17:35 -05:00
Yu Zhe	2c795fb03f	ipc/mqueue: remove unnecessary (void) conversion Remove unnecessary void type casting. Link: https://lkml.kernel.org/r/20220628021251.17197-1-yuzhe@nfschina.com Signed-off-by: Yu Zhe <yuzhe@nfschina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-07-17 17:31:40 -07:00
Alexey Gladkov	db7cfc3809	ipc: Free mq_sysctls if ipc namespace creation failed The problem that Dmitry Vyukov pointed out is that if setup_ipc_sysctls fails, mq_sysctls must be freed before return. executing program BUG: memory leak unreferenced object 0xffff888112fc9200 (size 512): comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s) hex dump (first 32 bytes): ef d3 60 85 ff ff ff ff 0c 9b d2 12 81 88 ff ff ..`............. 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff814b6eb3>] kmemdup+0x23/0x50 mm/util.c:129 [<ffffffff82219a9b>] kmemdup include/linux/fortify-string.h:456 [inline] [<ffffffff82219a9b>] setup_mq_sysctls+0x4b/0x1c0 ipc/mq_sysctl.c:89 [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline] [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91 [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90 [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226 [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165 [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline] [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline] [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234 [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 BUG: memory leak unreferenced object 0xffff888112fd5f00 (size 256): comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s) hex dump (first 32 bytes): 00 92 fc 12 81 88 ff ff 00 00 00 00 01 00 00 00 ................ 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff816fea1b>] kmalloc include/linux/slab.h:605 [inline] [<ffffffff816fea1b>] kzalloc include/linux/slab.h:733 [inline] [<ffffffff816fea1b>] __register_sysctl_table+0x7b/0x7f0 fs/proc/proc_sysctl.c:1344 [<ffffffff82219b7a>] setup_mq_sysctls+0x12a/0x1c0 ipc/mq_sysctl.c:112 [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline] [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91 [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90 [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226 [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165 [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline] [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline] [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234 [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 BUG: memory leak unreferenced object 0xffff888112fbba00 (size 256): comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s) hex dump (first 32 bytes): 78 ba fb 12 81 88 ff ff 00 00 00 00 01 00 00 00 x............... 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff816fef49>] kmalloc include/linux/slab.h:605 [inline] [<ffffffff816fef49>] kzalloc include/linux/slab.h:733 [inline] [<ffffffff816fef49>] new_dir fs/proc/proc_sysctl.c:978 [inline] [<ffffffff816fef49>] get_subdir fs/proc/proc_sysctl.c:1022 [inline] [<ffffffff816fef49>] __register_sysctl_table+0x5a9/0x7f0 fs/proc/proc_sysctl.c:1373 [<ffffffff82219b7a>] setup_mq_sysctls+0x12a/0x1c0 ipc/mq_sysctl.c:112 [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline] [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91 [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90 [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226 [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165 [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline] [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline] [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234 [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 BUG: memory leak unreferenced object 0xffff888112fbb900 (size 256): comm "syz-executor237", pid 3648, jiffies 4294970469 (age 12.270s) hex dump (first 32 bytes): 78 b9 fb 12 81 88 ff ff 00 00 00 00 01 00 00 00 x............... 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff816fef49>] kmalloc include/linux/slab.h:605 [inline] [<ffffffff816fef49>] kzalloc include/linux/slab.h:733 [inline] [<ffffffff816fef49>] new_dir fs/proc/proc_sysctl.c:978 [inline] [<ffffffff816fef49>] get_subdir fs/proc/proc_sysctl.c:1022 [inline] [<ffffffff816fef49>] __register_sysctl_table+0x5a9/0x7f0 fs/proc/proc_sysctl.c:1373 [<ffffffff82219b7a>] setup_mq_sysctls+0x12a/0x1c0 ipc/mq_sysctl.c:112 [<ffffffff822197f2>] create_ipc_ns ipc/namespace.c:63 [inline] [<ffffffff822197f2>] copy_ipcs+0x292/0x390 ipc/namespace.c:91 [<ffffffff8127de7c>] create_new_namespaces+0xdc/0x4f0 kernel/nsproxy.c:90 [<ffffffff8127e89b>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226 [<ffffffff8123f92e>] ksys_unshare+0x2fe/0x600 kernel/fork.c:3165 [<ffffffff8123fc42>] __do_sys_unshare kernel/fork.c:3236 [inline] [<ffffffff8123fc42>] __se_sys_unshare kernel/fork.c:3234 [inline] [<ffffffff8123fc42>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3234 [<ffffffff845aab45>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff845aab45>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 [<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0 Reported-by: syzbot+b4b0d1b35442afbf6fd2@syzkaller.appspotmail.com Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/000000000000f5004705e1db8bad@google.com Link: https://lkml.kernel.org/r/20220622200729.2639663-1-legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2022-06-22 17:47:41 -05:00
Linus Torvalds	1888e9b4bb	These changes update the ipc sysctls so that they are fundamentally per ipc namespace. Previously these sysctls depended upon a hack to simulate being per ipc namespace by looking up the ipc namespace in read or write. With this set of changes the ipc sysctls are registered per ipc namespace and open looks up the ipc namespace. Not only does this series of changes ensure the traditional binding at open time happens, but it sets a foundation for being able to relax the permission checks to allow a user namspace root to change the ipc sysctls for an ipc namespace that the user namespace root requires. To do this requires the ipc namespace to be known at open time. These changes were sent for v5.18[1] but were dropped because some additional cleanups were requested. Linus has given his nod[2] to the cleanups so I hope enough cleanups are present this time. [1] https://lkml.kernel.org/r/877d8kfmdp.fsf@email.froward.int.ebiederm.org [2] https://lkml.kernel.org/r/CAHk-=whi2SzU4XT_FsdTCAuK2qtYmH+-hwi1cbSdG8zu0KXL=g@mail.gmail.com Alexey Gladkov (6): ipc: Store mqueue sysctls in the ipc namespace ipc: Store ipc sysctls in the ipc namespace ipc: Use the same namespace to modify and validate ipc: Remove extra1 field abuse to pass ipc namespace ipc: Check permissions for checkpoint_restart sysctls at open time ipc: Remove extra braces include/linux/ipc_namespace.h \| 37 +++++++- ipc/ipc_sysctl.c \| 205 +++++++++++++++++++++++++----------------- ipc/mq_sysctl.c \| 121 +++++++++++++------------ ipc/mqueue.c \| 10 +-- ipc/namespace.c \| 10 +++ 5 files changed, 238 insertions(+), 145 deletions(-) Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEgjlraLDcwBA2B+6cC/v6Eiajj0AFAmKaP5QACgkQC/v6Eiaj j0Cd6A//fBb7GWeZpEsQXN0LLJZwfQqd5HYKBZ1yB0bclj4K4rg//goMqwvAb8YC x4h8Mny9yt3SHYWHqFMQvXGi5oMOodlZ3dxz5RAUoGG7c2oqF4mUhD5ugUJ07ElT z2DImq+oZ6NZcsVcW8n9WmaLiGFdZ6N1Ftr4w+lfH4bioON/jsBKa/v9ftXCgzyJ cqZ7Q7JCpD4qKDw7q6zEx5Y2ZqCciMWdmJOZ/X77D1vyNia1EJmsi26NgsH0uLTV mYz/L2BgHUiCmvPbdtD2hKs3OlkX38zkvVyyLxHVAIcCKIWE4O8vA6xsz+I+5kMB V3anYjf+PNeI9ASXGTJ56QlTj9I0Z7Dti8Sq6fCUa99rJtG4tcwgRHOZyL/Z3l48 8Dx//op/OTf5C3PLPhYqngpnMaXOQo++XEHqCN5c0j8UyaFLDbfs7H+JDuKZDp3d HQBdqaeyxxGaO87JqKt+K4wHkr+B0genTRfW3zliGVBmZC9KLXHoJ53ENRo1RyMs DcTZXzPdYx+yFJaYk5GAiP/S81eTjbznsQ0ATTEDGZPQcX+LeiFaeZ9aqObIx3UL krX9bohwWzL7bI9hwSp0waoLZGx5TEd9UXClMlct8GXBZUBpYfeaA7BQ8CJoH6+z IsQH/z5eOvfux5LsPUVI3PSd/IJWdh+uz0vDpNWkoKMabiZT1fg= =cSFI -----END PGP SIGNATURE----- Merge tag 'per-namespace-ipc-sysctls-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ipc sysctl namespace updates from Eric Biederman: "This updates the ipc sysctls so that they are fundamentally per ipc namespace. Previously these sysctls depended upon a hack to simulate being per ipc namespace by looking up the ipc namespace in read or write. With this set of changes the ipc sysctls are registered per ipc namespace and open looks up the ipc namespace. Not only does this series of changes ensure the traditional binding at open time happens, but it sets a foundation for being able to relax the permission checks to allow a user namspace root to change the ipc sysctls for an ipc namespace that the user namespace root requires. To do this requires the ipc namespace to be known at open time" * tag 'per-namespace-ipc-sysctls-for-v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ipc: Remove extra braces ipc: Check permissions for checkpoint_restart sysctls at open time ipc: Remove extra1 field abuse to pass ipc namespace ipc: Use the same namespace to modify and validate ipc: Store ipc sysctls in the ipc namespace ipc: Store mqueue sysctls in the ipc namespace	2022-06-03 15:54:57 -07:00
Waiman Long	d60c4d01a9	ipc/mqueue: use get_tree_nodev() in mqueue_get_tree() When running the stress-ng clone benchmark with multiple testing threads, it was found that there were significant spinlock contention in sget_fc(). The contended spinlock was the sb_lock. It is under heavy contention because the following code in the critcal section of sget_fc(): hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) { if (test(old, fc)) goto share_extant_sb; } After testing with added instrumentation code, it was found that the benchmark could generate thousands of ipc namespaces with the corresponding number of entries in the mqueue's fs_supers list where the namespaces are the key for the search. This leads to excessive time in scanning the list for a match. Looking back at the mqueue calling sequence leading to sget_fc(): mq_init_ns() => mq_create_mount() => fc_mount() => vfs_get_tree() => mqueue_get_tree() => get_tree_keyed() => vfs_get_super() => sget_fc() Currently, mq_init_ns() is the only mqueue function that will indirectly call mqueue_get_tree() with a newly allocated ipc namespace as the key for searching. As a result, there will never be a match with the exising ipc namespaces stored in the mqueue's fs_supers list. So using get_tree_keyed() to do an existing ipc namespace search is just a waste of time. Instead, we could use get_tree_nodev() to eliminate the useless search. By doing so, we can greatly reduce the sb_lock hold time and avoid the spinlock contention problem in case a large number of ipc namespaces are present. Of course, if the code is modified in the future to allow mqueue_get_tree() to be called with an existing ipc namespace instead of a new one, we will have to use get_tree_keyed() in this case. The following stress-ng clone benchmark command was run on a 2-socket 48-core Intel system: ./stress-ng --clone 32 --verbose --oomable --metrics-brief -t 20 The "bogo ops/s" increased from 5948.45 before patch to 9137.06 after patch. This is an increase of 54% in performance. Link: https://lkml.kernel.org/r/20220121172315.19652-1-longman@redhat.com Fixes: `935c6912b1` ("ipc: Convert mqueue fs to fs_context") Signed-off-by: Waiman Long <longman@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-05-09 18:29:21 -07:00
Prakash Sangappa	49c9dd0df6	ipc: update semtimedop() to use hrtimer semtimedop() should be converted to use hrtimer like it has been done for most of the system calls with timeouts. This system call already takes a struct timespec as an argument and can therefore provide finer granularity timed wait. Link: https://lkml.kernel.org/r/1651187881-2858-1-git-send-email-prakash.sangappa@oracle.com Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-05-09 18:29:20 -07:00
Michal Orzel	0e90002965	ipc/sem: remove redundant assignments Get rid of redundant assignments which end up in values not being read either because they are overwritten or the function ends. Reported by clang-tidy [deadcode.DeadStores] Link: https://lkml.kernel.org/r/20220409101933.207157-1-michalorzel.eng@gmail.com Signed-off-by: Michal Orzel <michalorzel.eng@gmail.com> Reviewed-by: Tom Rix <trix@redhat.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2022-05-09 18:29:20 -07:00
Alexey Gladkov	38cd5b12b7	ipc: Remove extra braces Fix coding style. In the previous commit, I added braces because, in addition to changing .data, .extra1 also changed. Now this is not needed. Fixes: `1f5c135ee5` ("ipc: Store ipc sysctls in the ipc namespace") Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/37687827f630bc150210f5b8abeeb00f1336814e.1651584847.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2022-05-03 17:25:58 -05:00
Alexey Gladkov	0889f44e28	ipc: Check permissions for checkpoint_restart sysctls at open time As Eric Biederman pointed out, it is possible not to use a custom proc_handler and check permissions for every write, but to use a .permission handler. That will allow the checkpoint_restart sysctls to perform all of their permission checks at open time, and not need any other special code. Link: https://lore.kernel.org/lkml/87czib9g38.fsf@email.froward.int.ebiederm.org/ Fixes: `1f5c135ee5` ("ipc: Store ipc sysctls in the ipc namespace") Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/65fa8459803830608da4610a39f33c76aa933eb9.1651584847.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2022-05-03 17:25:58 -05:00
Alexey Gladkov	dd141a4955	ipc: Remove extra1 field abuse to pass ipc namespace Eric Biederman pointed out that using .extra1 to pass ipc namespace looks like an ugly hack and there is a better solution. We can get the ipc_namespace using the .data field. Link: https://lore.kernel.org/lkml/87czib9g38.fsf@email.froward.int.ebiederm.org/ Fixes: `1f5c135ee5` ("ipc: Store ipc sysctls in the ipc namespace") Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/93df64a8fe93ba20ebbe1d9f8eda484b2f325426.1651584847.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2022-05-03 17:25:58 -05:00
Alexey Gladkov	def7343ff0	ipc: Use the same namespace to modify and validate In the `1f5c135ee5` ("ipc: Store ipc sysctls in the ipc namespace") I missed that in addition to the modification of sem_ctls[3], the change is validated. This validation must occur in the same namespace. Link: https://lore.kernel.org/lkml/875ymnvryb.fsf@email.froward.int.ebiederm.org/ Fixes: `1f5c135ee5` ("ipc: Store ipc sysctls in the ipc namespace") Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/b3cb9a25cce6becbef77186bc1216071a08a969b.1651584847.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2022-05-03 17:25:58 -05:00
Muchun Song	fd60b28842	fs: allocate inode by using alloc_inode_sb() The inode allocation is supposed to use alloc_inode_sb(), so convert kmem_cache_alloc() of all filesystems to alloc_inode_sb(). Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4] Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-03-22 15:57:03 -07:00
Alexey Gladkov	1f5c135ee5	ipc: Store ipc sysctls in the ipc namespace The ipc sysctls are not available for modification inside the user namespace. Following the mqueue sysctls, we changed the implementation to be more userns friendly. So far, the changes do not provide additional access to files. This will be done in a future patch. Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/be6f9d014276f4dddd0c3aa05a86052856c1c555.1644862280.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2022-03-08 13:39:40 -06:00
Alexey Gladkov	dc55e35f9e	ipc: Store mqueue sysctls in the ipc namespace Right now, the mqueue sysctls take ipc namespaces into account in a rather hacky way. This works in most cases, but does not respect the user namespace. Within the user namespace, the user cannot change the /proc/sys/fs/mqueue/* parametres. This poses a problem in the rootless containers. To solve this I changed the implementation of the mqueue sysctls just like some other sysctls. So far, the changes do not provide additional access to files. This will be done in a future patch. v3: * Don't implemenet set_permissions to keep the current behavior. v2: * Fixed compilation problem if CONFIG_POSIX_MQUEUE_SYSCTL is not specified. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/b0ccbb2489119f1f20c737cf1930c3a9c4e4243a.1644862280.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2022-03-08 13:39:40 -06:00
Minghao Chi	520ba72406	ipc/sem: do not sleep with a spin lock held We can't call kvfree() with a spin lock held, so defer it. Link: https://lkml.kernel.org/r/20211223031207.556189-1-chi.minghao@zte.com.cn Fixes: `fc37a3b8b4` ("[PATCH] ipc sem: use kvmalloc for sem_undo allocation") Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Manfred Spraul <manfred@colorfullife.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Yang Guang <cgel.zte@gmail.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-02-04 09:25:05 -08:00
Muchun Song	359745d783	proc: remove PDE_DATA() completely Remove PDE_DATA() completely and replace it with pde_data(). [akpm@linux-foundation.org: fix naming clash in drivers/nubus/proc.c] [akpm@linux-foundation.org: now fix it properly] Link: https://lkml.kernel.org/r/20211124081956.87711-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alexey Gladkov <gladkov.alexey@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2022-01-22 08:33:37 +02:00
Alexander Mikhalitsyn	85b6d24646	shm: extend forced shm destroy to support objects from several IPC nses Currently, the exit_shm() function not designed to work properly when task->sysvshm.shm_clist holds shm objects from different IPC namespaces. This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it leads to use-after-free (reproducer exists). This is an attempt to fix the problem by extending exit_shm mechanism to handle shm's destroy from several IPC ns'es. To achieve that we do several things: 1. add a namespace (non-refcounted) pointer to the struct shmid_kernel 2. during new shm object creation (newseg()/shmget syscall) we initialize this pointer by current task IPC ns 3. exit_shm() fully reworked such that it traverses over all shp's in task->sysvshm.shm_clist and gets IPC namespace not from current task as it was before but from shp's object itself, then call shm_destroy(shp, ns). Note: We need to be really careful here, because as it was said before (1), our pointer to IPC ns non-refcnt'ed. To be on the safe side we using special helper get_ipc_ns_not_zero() which allows to get IPC ns refcounter only if IPC ns not in the "state of destruction". Q/A Q: Why can we access shp->ns memory using non-refcounted pointer? A: Because shp object lifetime is always shorther than IPC namespace lifetime, so, if we get shp object from the task->sysvshm.shm_clist while holding task_lock(task) nobody can steal our namespace. Q: Does this patch change semantics of unshare/setns/clone syscalls? A: No. It's just fixes non-covered case when process may leave IPC namespace without getting task->sysvshm.shm_clist list cleaned up. Link: https://lkml.kernel.org/r/67bb03e5-f79c-1815-e2bf-949c67047418@colorfullife.com Link: https://lkml.kernel.org/r/20211109151501.4921-1-manfred@colorfullife.com Fixes: `ab602f7991` ("shm: make exit_shm work proportional to task activity") Co-developed-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Andrei Vagin <avagin@gmail.com> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-20 10:35:54 -08:00
Alexander Mikhalitsyn	126e8bee94	ipc: WARN if trying to remove ipc object which is absent Patch series "shm: shm_rmid_forced feature fixes". Some time ago I met kernel crash after CRIU restore procedure, fortunately, it was CRIU restore, so, I had dump files and could do restore many times and crash reproduced easily. After some investigation I've constructed the minimal reproducer. It was found that it's use-after-free and it happens only if sysctl kernel.shm_rmid_forced = 1. The key of the problem is that the exit_shm() function not handles shp's object destroy when task->sysvshm.shm_clist contains items from different IPC namespaces. In most cases this list will contain only items from one IPC namespace. How can this list contain object from different namespaces? The exit_shm() function is designed to clean up this list always when process leaves IPC namespace. But we made a mistake a long time ago and did not add a exit_shm() call into the setns() syscall procedures. The first idea was just to add this call to setns() syscall but it obviously changes semantics of setns() syscall and that's userspace-visible change. So, I gave up on this idea. The first real attempt to address the issue was just to omit forced destroy if we meet shp object not from current task IPC namespace [1]. But that was not the best idea because task->sysvshm.shm_clist was protected by rwsem which belongs to current task IPC namespace. It means that list corruption may occur. Second approach is just extend exit_shm() to properly handle shp's from different IPC namespaces [2]. This is really non-trivial thing, I've put a lot of effort into that but not believed that it's possible to make it fully safe, clean and clear. Thanks to the efforts of Manfred Spraul working an elegant solution was designed. Thanks a lot, Manfred! Eric also suggested the way to address the issue in ("[RFC][PATCH] shm: In shm_exit destroy all created and never attached segments") Eric's idea was to maintain a list of shm_clists one per IPC namespace, use lock-less lists. But there is some extra memory consumption-related concerns. An alternative solution which was suggested by me was implemented in ("shm: reset shm_clist on setns but omit forced shm destroy"). The idea is pretty simple, we add exit_shm() syscall to setns() but DO NOT destroy shm segments even if sysctl kernel.shm_rmid_forced = 1, we just clean up the task->sysvshm.shm_clist list. This chages semantics of setns() syscall a little bit but in comparision to the "naive" solution when we just add exit_shm() without any special exclusions this looks like a safer option. [1] https://lkml.org/lkml/2021/7/6/1108 [2] https://lkml.org/lkml/2021/7/14/736 This patch (of 2): Let's produce a warning if we trying to remove non-existing IPC object from IPC namespace kht/idr structures. This allows us to catch possible bugs when the ipc_rmid() function was called with inconsistent struct ipc_ids, struct kern_ipc_perm arguments. Link: https://lkml.kernel.org/r/20211027224348.611025-1-alexander.mikhalitsyn@virtuozzo.com Link: https://lkml.kernel.org/r/20211027224348.611025-2-alexander.mikhalitsyn@virtuozzo.com Co-developed-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Andrei Vagin <avagin@gmail.com> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-20 10:35:54 -08:00
Manfred Spraul	0e9beb8a96	ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL Compilation of ipc/ipc_sysctl.c is controlled by obj-$(CONFIG_SYSVIPC_SYSCTL) [see ipc/Makefile] And CONFIG_SYSVIPC_SYSCTL depends on SYSCTL [see init/Kconfig] An SYSCTL is selected by PROC_SYSCTL. [see fs/proc/Kconfig] Thus: #ifndef CONFIG_PROC_SYSCTL in ipc/ipc_sysctl.c is impossible, the fallback can be removed. Link: https://lkml.kernel.org/r/20210918145337.3369-1-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-09 10:02:53 -08:00
Michal Clapinski	5563cabdde	ipc: check checkpoint_restore_ns_capable() to modify C/R proc files This commit removes the requirement to be root to modify sem_next_id, msg_next_id and shm_next_id and checks checkpoint_restore_ns_capable instead. Since those files are specific to the IPC namespace, there is no reason they should require root privileges. This is similar to ns_last_pid, which also only checks checkpoint_restore_ns_capable. [akpm@linux-foundation.org: ipc/ipc_sysctl.c needs capability.h for checkpoint_restore_ns_capable()] Link: https://lkml.kernel.org/r/20210916163717.3179496-1-mclapinski@google.com Signed-off-by: Michal Clapinski <mclapinski@google.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Manfred Spraul <manfred@colorfullife.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-09 10:02:53 -08:00
zhangyiru	83c1fd763b	mm,hugetlb: remove mlock ulimit for SHM_HUGETLB Commit `21a3c273f8` ("mm, hugetlb: add thread name and pid to SHM_HUGETLB mlock rlimit warning") marked this as deprecated in 2012, but it is not deleted yet. Mike says he still sees that message in log files on occasion, so maybe we should preserve this warning. Also remove hugetlbfs related user_shm_unlock in ipc/shm.c and remove the user_shm_unlock after out. Link: https://lkml.kernel.org/r/20211103105857.25041-1-zhangyiru3@huawei.com Signed-off-by: zhangyiru <zhangyiru3@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Liu Zixian <liuzixian4@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: wuxu.wu <wuxu.wu@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-09 10:02:48 -08:00
Vasily Averin	6a4746ba06	ipc: remove memcg accounting for sops objects in do_semtimedop() Linus proposes to revert an accounting for sops objects in do_semtimedop() because it's really just a temporary buffer for a single semtimedop() system call. This object can consume up to 2 pages, syscall is sleeping one, size and duration can be controlled by user, and this allocation can be repeated by many thread at the same time. However Shakeel Butt pointed that there are much more popular objects with the same life time and similar memory consumption, the accounting of which was decided to be rejected for performance reasons. Considering at least 2 pages for task_struct and 2 pages for the kernel stack, a back of the envelope calculation gives a footprint amplification of <1.5 so this temporal buffer can be safely ignored. The factor would IMO be interesting if it was >> 2 (from the PoV of excessive (ab)use, fine-grained accounting seems to be currently unfeasible due to performance impact). Link: https://lore.kernel.org/lkml/90e254df-0dfe-f080-011e-b7c53ee7fd20@virtuozzo.com/ Fixes: `18319498fd` ("memcg: enable accounting of ipc resources") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-14 10:22:11 -07:00
Linus Torvalds	35776f1051	ARM development updates for 5.15: - Rename "mod_init" and "mod_exit" so that initcall debug output is actually useful (Randy Dunlap) - Update maintainers entries for linux-arm-kernel to indicate it is moderated for non-subscribers (Randy Dunlap) - Move install rules to arch/arm/Makefile (Masahiro Yamada) - Drop unnecessary ARCH_NR_GPIOS definition (Linus Walleij) - Don't warn about atags_to_fdt() stack size (David Heidelberg) - Speed up unaligned copy_{from,to}_kernel_nofault (Arnd Bergmann) - Get rid of set_fs() usage (Arnd Bergmann) - Remove checks for GCC prior to v4.6 (Geert Uytterhoeven) -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEuNNh8scc2k/wOAE+9OeQG+StrGQFAmE6GkAACgkQ9OeQG+St rGS7HhAAokcdC80ZOJJ+vT/J4sqpTdfTnJmImhkKOKgcw9yBFt7JBuA/6mp6/EV0 2Jd2RpeKG3S8PRlMWE4hGmyIla94r0olDvdh57+4AB/xrSfPO7l7EiaW2xLi0i3F KMysXxxKgbfckoNqPtiYF71cKkUKbZa169t8PyiiW5XYVQncnVGIbmEy69MJCg9n 08NUtkKoDgHkS6hXDVDLoFsGJX5P7X5IDPx6og233qBWRzWgcn1NURfJKD0F7/l+ UPnftUAF8JZp0rhtF2RH1IOu2v2MOVUsrK7D5OjzUEdMSleTN2oX3hmF4HPsG8eJ LeTKJfxoiX3JdWRlmUjomRU6eDqLAIMKsZ0wWoupQTaCq3WHs/mnxEOKY9n/UYGk eQdgb/EQQ5gDUok2WQOxG+Q85s29d14isQnoNa1D0O2YzTK7JiQ6YrASkZWVNLnT Zuw5vDtKk+7NV7QczTl9nHnPWIsRaZr40MXbTIROUO+aPoTxt6lPkv/dqUltrbEg 6Ix/8XsbtAgz8/UEDNz69RYA2DyzDBTO5VLdJutDsXliTAkY+HkqcORTFd72BvWX JEO/xg037a8x5vGpu/t0s+nmDgfy79Yi21u7i3MSjf2FiH09bOUhf7tiuhHVzb97 3po8S/YRiIsJWC1NpMpYFBYeCtJonMJycM05ff6MrLyvLYU2xbs= =Tx+y -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm Pull ARM development updates from Russell King: - Rename "mod_init" and "mod_exit" so that initcall debug output is actually useful (Randy Dunlap) - Update maintainers entries for linux-arm-kernel to indicate it is moderated for non-subscribers (Randy Dunlap) - Move install rules to arch/arm/Makefile (Masahiro Yamada) - Drop unnecessary ARCH_NR_GPIOS definition (Linus Walleij) - Don't warn about atags_to_fdt() stack size (David Heidelberg) - Speed up unaligned copy_{from,to}_kernel_nofault (Arnd Bergmann) - Get rid of set_fs() usage (Arnd Bergmann) - Remove checks for GCC prior to v4.6 (Geert Uytterhoeven) * tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: ARM: 9118/1: div64: Remove always-true __div64_const32_is_OK() duplicate ARM: 9117/1: asm-generic: div64: Remove always-true __div64_const32_is_OK() ARM: 9116/1: unified: Remove check for gcc < 4 ARM: 9110/1: oabi-compat: fix oabi epoll sparse warning ARM: 9113/1: uaccess: remove set_fs() implementation ARM: 9112/1: uaccess: add __{get,put}_kernel_nofault ARM: 9111/1: oabi-compat: rework fcntl64() emulation ARM: 9114/1: oabi-compat: rework sys_semtimedop emulation ARM: 9108/1: oabi-compat: rework epoll_wait/epoll_pwait emulation ARM: 9107/1: syscall: always store thread_info->abi_syscall ARM: 9109/1: oabi-compat: add epoll_pwait handler ARM: 9106/1: traps: use get_kernel_nofault instead of set_fs() ARM: 9115/1: mm/maccess: fix unaligned copy_{from,to}_kernel_nofault ARM: 9105/1: atags_to_fdt: don't warn about stack size ARM: 9103/1: Drop ARCH_NR_GPIOS definition ARM: 9102/1: move theinstall rules to arch/arm/Makefile ARM: 9100/1: MAINTAINERS: mark all linux-arm-kernel@infradead list as moderated ARM: 9099/1: crypto: rename 'mod_init' & 'mod_exit' functions to be module-specific	2021-09-09 13:25:49 -07:00
Linus Torvalds	2d338201d5	Merge branch 'akpm' (patches from Andrew) Merge more updates from Andrew Morton: "147 patches, based on `7d2a07b769`. Subsystems affected by this patch series: mm (memory-hotplug, rmap, ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan), alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib, checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig, selftests, ipc, and scripts" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits) scripts: check_extable: fix typo in user error message mm/workingset: correct kernel-doc notations ipc: replace costly bailout check in sysvipc_find_ipc() selftests/memfd: remove unused variable Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH configs: remove the obsolete CONFIG_INPUT_POLLDEV prctl: allow to setup brk for et_dyn executables pid: cleanup the stale comment mentioning pidmap_init(). kernel/fork.c: unexport get_{mm,task}_exe_file coredump: fix memleak in dump_vma_snapshot() fs/coredump.c: log if a core dump is aborted due to changed file permissions nilfs2: use refcount_dec_and_lock() to fix potential UAF nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group nilfs2: fix NULL pointer in nilfs_##name##_attr_release nilfs2: fix memory leak in nilfs_sysfs_create_device_group trap: cleanup trap_init() init: move usermodehelper_enable() to populate_rootfs() ...	2021-09-08 12:55:35 -07:00
Rafael Aquini	20401d1058	ipc: replace costly bailout check in sysvipc_find_ipc() sysvipc_find_ipc() was left with a costly way to check if the offset position fed to it is bigger than the total number of IPC IDs in use. So much so that the time it takes to iterate over /proc/sysvipc/* files grows exponentially for a custom benchmark that creates "N" SYSV shm segments and then times the read of /proc/sysvipc/shm (milliseconds): 12 msecs to read 1024 segs from /proc/sysvipc/shm 18 msecs to read 2048 segs from /proc/sysvipc/shm 65 msecs to read 4096 segs from /proc/sysvipc/shm 325 msecs to read 8192 segs from /proc/sysvipc/shm 1303 msecs to read 16384 segs from /proc/sysvipc/shm 5182 msecs to read 32768 segs from /proc/sysvipc/shm The root problem lies with the loop that computes the total amount of ids in use to check if the "pos" feeded to sysvipc_find_ipc() grew bigger than "ids->in_use". That is a quite inneficient way to get to the maximum index in the id lookup table, specially when that value is already provided by struct ipc_ids.max_idx. This patch follows up on the optimization introduced via commit `15df03c879` ("sysvipc: make get_maxid O(1) again") and gets rid of the aforementioned costly loop replacing it by a simpler checkpoint based on ipc_get_maxidx() returned value, which allows for a smooth linear increase in time complexity for the same custom benchmark: 2 msecs to read 1024 segs from /proc/sysvipc/shm 2 msecs to read 2048 segs from /proc/sysvipc/shm 4 msecs to read 4096 segs from /proc/sysvipc/shm 9 msecs to read 8192 segs from /proc/sysvipc/shm 19 msecs to read 16384 segs from /proc/sysvipc/shm 39 msecs to read 32768 segs from /proc/sysvipc/shm Link: https://lkml.kernel.org/r/20210809203554.1562989-1-aquini@redhat.com Signed-off-by: Rafael Aquini <aquini@redhat.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Waiman Long <llong@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-08 11:50:28 -07:00
Vasily Averin	18319498fd	memcg: enable accounting of ipc resources When user creates IPC objects it forces kernel to allocate memory for these long-living objects. It makes sense to account them to restrict the host's memory consumption from inside the memcg-limited container. This patch enables accounting for IPC shared memory segments, messages semaphores and semaphore's undo lists. Link: https://lkml.kernel.org/r/d6507b06-4df6-78f8-6c54-3ae86e3b5339@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Yutian Yang <nglaive@gmail.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:12 -07:00
Vasily Averin	30acd0bdfb	memcg: enable accounting for new namesapces and struct nsproxy Container admin can create new namespaces and force kernel to allocate up to several pages of memory for the namespaces and its associated structures. Net and uts namespaces have enabled accounting for such allocations. It makes sense to account for rest ones to restrict the host's memory consumption from inside the memcg-limited container. Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Serge Hallyn <serge@hallyn.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Yutian Yang <nglaive@gmail.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-09-03 09:58:12 -07:00
Arnd Bergmann	bdec014528	ARM: 9114/1: oabi-compat: rework sys_semtimedop emulation sys_oabi_semtimedop() is one of the last users of set_fs() on Arm. To remove this one, expose the internal code of the actual implementation that operates on a kernel pointer and call it directly after copying. There should be no measurable impact on the normal execution of this function, and it makes the overly long function a little shorter, which may help readability. While reworking the oabi version, make it behave a little more like the native one, using kvmalloc_array() and restructure the code flow in a similar way. The naming of __do_semtimedop() is not very good, I hope someone can come up with a better name. One regression was spotted by kernel test robot <rong.a.chen@intel.com> and fixed before the first mailing list submission. Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>	2021-08-20 11:39:26 +01:00
Linus Torvalds	71bd934101	Merge branch 'akpm' (patches from Andrew) Merge more updates from Andrew Morton: "190 patches. Subsystems affected by this patch series: mm (hugetlb, userfaultfd, vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock, migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap, zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc, core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs, signals, exec, kcov, selftests, compress/decompress, and ipc" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) ipc/util.c: use binary search for max_idx ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock ipc: use kmalloc for msg_queue and shmid_kernel ipc sem: use kvmalloc for sem_undo allocation lib/decompressors: remove set but not used variabled 'level' selftests/vm/pkeys: exercise x86 XSAVE init state selftests/vm/pkeys: refill shadow register after implicit kernel write selftests/vm/pkeys: handle negative sys_pkey_alloc() return code selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random kcov: add __no_sanitize_coverage to fix noinstr for all architectures exec: remove checks in __register_bimfmt() x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned hfsplus: report create_date to kstat.btime hfsplus: remove unnecessary oom message nilfs2: remove redundant continue statement in a while-loop kprobes: remove duplicated strong free_insn_page in x86 and s390 init: print out unknown kernel parameters checkpatch: do not complain about positive return values starting with EPOLL checkpatch: improve the indented label test checkpatch: scripts/spdxcheck.py now requires python3 ...	2021-07-02 12:08:10 -07:00
Manfred Spraul	b869d5be0a	ipc/util.c: use binary search for max_idx If semctl(), msgctl() and shmctl() are called with IPC_INFO, SEM_INFO, MSG_INFO or SHM_INFO, then the return value is the index of the highest used index in the kernel's internal array recording information about all SysV objects of the requested type for the current namespace. (This information can be used with repeated ..._STAT or ..._STAT_ANY operations to obtain information about all SysV objects on the system.) There is a cache for this value. But when the cache needs up be updated, then the highest used index is determined by looping over all possible values. With the introduction of IPCMNI_EXTEND_SHIFT, this could be a loop over 16 million entries. And due to /proc/sys/kernel/*next_id, the index values do not need to be consecutive. With <write 16000000 to msg_next_id>, msgget(), msgctl(,IPC_RMID) in a loop, I have observed a performance increase of around factor 13000. As there is no get_last() function for idr structures: Implement a "get_last()" using a binary search. As far as I see, ipc is the only user that needs get_last(), thus implement it in ipc/util.c and not in a central location. [akpm@linux-foundation.org: tweak comment, fix typo] Link: https://lkml.kernel.org/r/20210425075208.11777-2-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: <1vier1@web.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-07-01 11:06:07 -07:00
Manfred Spraul	17d056e0bd	ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock The patch solves three weaknesses in ipc/sem.c: 1) The initial read of use_global_lock in sem_lock() is an intentional race. KCSAN detects these accesses and prints a warning. 2) The code assumes that plain C read/writes are not mangled by the CPU or the compiler. 3) The comment it sysvipc_sem_proc_show() was hard to understand: The rest of the comments in ipc/sem.c speaks about sem_perm.lock, and suddenly this function speaks about ipc_lock_object(). To solve 1) and 2), use READ_ONCE()/WRITE_ONCE(). Plain C reads are used in code that owns sma->sem_perm.lock. The comment is updated to solve 3) [manfred@colorfullife.com: use READ_ONCE()/WRITE_ONCE() for use_global_lock] Link: https://lkml.kernel.org/r/20210627161919.3196-3-manfred@colorfullife.com Link: https://lkml.kernel.org/r/20210514175319.12195-1-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: <1vier1@web.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-07-01 11:06:07 -07:00
Vasily Averin	bc8136a543	ipc: use kmalloc for msg_queue and shmid_kernel msg_queue and shmid_kernel are quite small objects, no need to use kvmalloc for them. mhocko@: "Both of them are 256B on most 64b systems." Previously these objects was allocated via ipc_alloc/ipc_rcu_alloc(), common function for several ipc objects. It had kvmalloc call inside(). Later, this function went away and was finally replaced by direct kvmalloc call, and now we can use more suitable kmalloc/kfree for them. Link: https://lkml.kernel.org/r/0d0b6c9b-8af3-29d8-34e2-a565c53780f3@virtuozzo.com Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-07-01 11:06:07 -07:00
Vasily Averin	fc37a3b8b4	ipc sem: use kvmalloc for sem_undo allocation Patch series "ipc: allocations cleanup", v2. Some ipc objects use the wrong allocation functions: small objects can use kmalloc(), and vice versa, potentially large objects can use kmalloc(). This patch (of 2): Size of sem_undo can exceed one page and with the maximum possible nsems = 32000 it can grow up to 64Kb. Let's switch its allocation to kvmalloc to avoid user-triggered disruptive actions like OOM killer in case of high-order memory shortage. User triggerable high order allocations are quite a problem on heavily fragmented systems. They can be a DoS vector. Link: https://lkml.kernel.org/r/ebc3ac79-3190-520d-81ce-22ad194986ec@virtuozzo.com Link: https://lkml.kernel.org/r/a6354fd9-2d55-2e63-dd4d-fa7dc1d11134@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-07-01 11:06:07 -07:00
Linus Torvalds	c54b245d01	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace rlimit handling update from Eric Biederman: "This is the work mainly by Alexey Gladkov to limit rlimits to the rlimits of the user that created a user namespace, and to allow users to have stricter limits on the resources created within a user namespace." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: cred: add missing return error code when set_cred_ucounts() failed ucounts: Silence warning in dec_rlimit_ucounts ucounts: Set ucount_max to the largest positive value the type can hold kselftests: Add test to check for rlimit changes in different user namespaces Reimplement RLIMIT_MEMLOCK on top of ucounts Reimplement RLIMIT_SIGPENDING on top of ucounts Reimplement RLIMIT_MSGQUEUE on top of ucounts Reimplement RLIMIT_NPROC on top of ucounts Use atomic_t for ucounts reference counting Add a reference to ucounts for each cred Increase size of ucounts to atomic_long_t	2021-06-28 20:39:26 -07:00
Varad Gautam	a11ddb37bf	ipc/mqueue, msg, sem: avoid relying on a stack reference past its expiry do_mq_timedreceive calls wq_sleep with a stack local address. The sender (do_mq_timedsend) uses this address to later call pipelined_send. This leads to a very hard to trigger race where a do_mq_timedreceive call might return and leave do_mq_timedsend to rely on an invalid address, causing the following crash: RIP: 0010:wake_q_add_safe+0x13/0x60 Call Trace: __x64_sys_mq_timedsend+0x2a9/0x490 do_syscall_64+0x80/0x680 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f5928e40343 The race occurs as: 1. do_mq_timedreceive calls wq_sleep with the address of `struct ext_wait_queue` on function stack (aliased as `ewq_addr` here) - it holds a valid `struct ext_wait_queue ` as long as the stack has not been overwritten. 2. `ewq_addr` gets added to info->e_wait_q[RECV].list in wq_add, and do_mq_timedsend receives it via wq_get_first_waiter(info, RECV) to call __pipelined_op. 3. Sender calls __pipelined_op::smp_store_release(&this->state, STATE_READY). Here is where the race window begins. (`this` is `ewq_addr`.) 4. If the receiver wakes up now in do_mq_timedreceive::wq_sleep, it will see `state == STATE_READY` and break. 5. do_mq_timedreceive returns, and `ewq_addr` is no longer guaranteed to be a `struct ext_wait_queue ` since it was on do_mq_timedreceive's stack. (Although the address may not get overwritten until another function happens to touch it, which means it can persist around for an indefinite time.) 6. do_mq_timedsend::__pipelined_op() still believes `ewq_addr` is a `struct ext_wait_queue *`, and uses it to find a task_struct to pass to the wake_q_add_safe call. In the lucky case where nothing has overwritten `ewq_addr` yet, `ewq_addr->task` is the right task_struct. In the unlucky case, __pipelined_op::wake_q_add_safe gets handed a bogus address as the receiver's task_struct causing the crash. do_mq_timedsend::__pipelined_op() should not dereference `this` after setting STATE_READY, as the receiver counterpart is now free to return. Change __pipelined_op to call wake_q_add_safe on the receiver's task_struct returned by get_task_struct, instead of dereferencing `this` which sits on the receiver's stack. As Manfred pointed out, the race potentially also exists in ipc/msg.c::expunge_all and ipc/sem.c::wake_up_sem_queue_prepare. Fix those in the same way. Link: https://lkml.kernel.org/r/20210510102950.12551-1-varad.gautam@suse.com Fixes: `c5b2cbdbda` ("ipc/mqueue.c: update/document memory barriers") Fixes: `8116b54e7e` ("ipc/sem.c: document and update memory barriers") Fixes: `0d97a82ba8` ("ipc/msg.c: update and document memory barriers") Signed-off-by: Varad Gautam <varad.gautam@suse.com> Reported-by: Matthias von Faber <matthias.vonfaber@aox-tech.de> Acked-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-22 15:09:07 -10:00
Bhaskar Chowdhury	7497835f7e	ipc/sem.c: spelling fix s/purpuse/purpose/ Link: https://lkml.kernel.org/r/20210319221432.26631-1-unixbhaskar@gmail.com Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-07 00:26:34 -07:00
Bhaskar Chowdhury	b1989a3db4	ipc/sem.c: mundane typo fixes s/runtine/runtime/ s/AQUIRE/ACQUIRE/ s/seperately/separately/ s/wont/won\'t/ s/succesfull/successful/ Link: https://lkml.kernel.org/r/20210326022240.26375-1-unixbhaskar@gmail.com Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-05-07 00:26:33 -07:00
Alexey Gladkov	d7c9e99aee	Reimplement RLIMIT_MEMLOCK on top of ucounts The rlimit counter is tied to uid in the user_namespace. This allows rlimit values to be specified in userns even if they are already globally exceeded by the user. However, the value of the previous user_namespaces cannot be exceeded. Changelog v11: * Fix issue found by lkp robot. v8: * Fix issues found by lkp-tests project. v7: * Keep only ucounts for RLIMIT_MEMLOCK checks instead of struct cred. v6: * Fix bug in hugetlb_file_setup() detected by trinity. Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/970d50c70c71bfd4496e0e8d2a0a32feebebb350.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2021-04-30 14:14:02 -05:00
Alexey Gladkov	6e52a9f053	Reimplement RLIMIT_MSGQUEUE on top of ucounts The rlimit counter is tied to uid in the user_namespace. This allows rlimit values to be specified in userns even if they are already globally exceeded by the user. However, the value of the previous user_namespaces cannot be exceeded. Signed-off-by: Alexey Gladkov <legion@kernel.org> Link: https://lkml.kernel.org/r/2531f42f7884bbfee56a978040b3e0d25cdf6cde.1619094428.git.legion@kernel.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2021-04-30 14:14:01 -05:00
Christian Brauner	549c729771	fs: make helpers idmap mount aware Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2021-01-24 14:27:20 +01:00

1 2 3 4 5 ...

897 Коммитов