WSL2-Linux-Kernel

History

Muchun Song 14c2404884 locking/rwsem: Optimize down_read_trylock() under highly contended case

We found that a process with 10 thousand threads encountered a regression when migrating from Linux-v4.14 to Linux-v5.4. It is a kind of workload which sometimes concurrently allocates lots of memory in different threads. In this case, down_read_trylock() shows up as a hot spot. Therefore, we suppose that rwsem has had a regression at least since Linux-v5.4. In order to debug this problem easily, we wrote a simple benchmark to create a similar situation, as follows.

```c++
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <pthread.h>
#include <sched.h>

#include <cstdio>
#include <cstdlib>
#include <cassert>
#include <thread>
#include <vector>
#include <chrono>

// Start flag: worker threads spin until main() clears it.
volatile int mutex;

// Pin the calling thread to `cpu`, then touch one byte per 4 KiB page
// so that every page in [ptr, ptr + sz) takes a page fault.
void trigger(int cpu, char* ptr, std::size_t sz)
{
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);

	while (mutex);

	for (std::size_t i = 0; i < sz; i += 4096) {
		*ptr = '\0';
		ptr += 4096;
	}
}

int main(int argc, char* argv[])
{
	std::size_t sz = 100;	// mapping size in GiB

	if (argc > 1)
		sz = atoi(argv[1]);

	auto nproc = std::thread::hardware_concurrency();

	std::vector<std::thread> thr;
	sz <<= 30;
	auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
			 MAP_PRIVATE, -1, 0);
	assert(ptr != MAP_FAILED);
	char* cptr = static_cast<char*>(ptr);
	auto run = sz / nproc;
	run = (run >> 12) << 12;	// round each thread's share down to a page multiple

	mutex = 1;

	for (auto i = 0U; i < nproc; ++i) {
		thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
		cptr += run;
	}

	rusage usage_start;
	getrusage(RUSAGE_SELF, &usage_start);
	auto start = std::chrono::system_clock::now();

	mutex = 0;	// release all workers at once

	for (auto& t : thr)
		t.join();

	rusage usage_end;
	getrusage(RUSAGE_SELF, &usage_end);
	auto end = std::chrono::system_clock::now();
	timeval utime;
	timeval stime;
	timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
	timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
	printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
	printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
	printf("real: %lu\n",
	       static_cast<unsigned long>(
		       std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()));

	return 0;
}
```

The above program simply creates `nproc` threads, each of which touches memory (triggering page faults) on a different CPU. We then see a similar profile with `perf top`.

    25.55%  [kernel]  [k] down_read_trylock
    14.78%  [kernel]  [k] handle_mm_fault
    13.45%  [kernel]  [k] up_read
     8.61%  [kernel]  [k] clear_page_erms
     3.89%  [kernel]  [k] __do_page_fault

The hottest instruction in down_read_trylock(), which accounts for about 92%, is cmpxchg, as follows.

    91.89 │      lock   cmpxchg %rdx,(%rdi)

Since the problem was found by migrating from Linux-v4.14 to Linux-v5.4, we easily found that the commit `ddb20d1d3a` ("locking/rwsem: Optimize down_read_trylock()") caused the regression. The reason is that the commit assumes the rwsem is not contended at all. But that is not always true for the mmap lock, which can be contended by thousands of threads. So most threads need to run "cmpxchg" at least twice to acquire the lock. The overhead of an atomic operation is higher than that of non-atomic instructions, which caused the regression.

Using the above benchmark, the real execution time on an x86-64 system before and after the patch was:

                    Before Patch   After Patch
    # of Threads      real (ms)     real (ms)    reduced by
    ------------    ------------   -----------   ----------
          1            65,373        65,206         ~0.0%
          4            15,467        15,378         ~0.5%
         40             6,214         5,528        ~11.0%

For the uncontended case, the new down_read_trylock() performs the same as before. For the contended cases, the new down_read_trylock() is faster than before. The more contended the lock, the larger the gain.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com

2021-11-23 09:45:36 +01:00
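The commit message above attributes the regression to a trylock that assumes the lock word is unlocked and therefore spends an extra `cmpxchg` whenever other readers already hold the lock. The following is a minimal user-space sketch of that idea, not the kernel code itself: the bit layout (`kWriterLocked`, `kReaderBias`, `kReadFailedMask`), the `TrylockResult` type, and both function names are hypothetical and chosen only for illustration. It counts how many compare-and-swap attempts each strategy makes.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical, simplified layout of the lock word (not the kernel's).
constexpr long kWriterLocked   = 1L << 0;       // set while a writer holds the lock
constexpr long kReaderBias     = 1L << 8;       // added once per active reader
constexpr long kReadFailedMask = kWriterLocked; // bits that make a read trylock fail

struct TrylockResult {
    bool acquired;         // true if the read lock was taken
    int  cmpxchg_attempts; // number of compare-and-swap operations issued
};

// Strategy A: guess that the lock is completely free (count == 0) and
// issue a compare-and-swap immediately. If readers are already present,
// the first attempt fails and only serves to fetch the real value.
TrylockResult read_trylock_optimistic(std::atomic<long>& count)
{
    long expected = 0;
    int attempts = 0;
    do {
        ++attempts;
        if (count.compare_exchange_strong(expected, expected + kReaderBias,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed))
            return {true, attempts};
        // On failure, `expected` now holds the value actually observed.
    } while (!(expected & kReadFailedMask));
    return {false, attempts};
}

// Strategy B: read the lock word with a plain atomic load first, then
// attempt the compare-and-swap only if no failure bit is set.
TrylockResult read_trylock_load_first(std::atomic<long>& count)
{
    long expected = count.load(std::memory_order_relaxed);
    int attempts = 0;
    while (!(expected & kReadFailedMask)) {
        ++attempts;
        if (count.compare_exchange_strong(expected, expected + kReaderBias,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed))
            return {true, attempts};
    }
    return {false, attempts};
}
```

With three readers already holding the lock and no concurrent updates, `read_trylock_optimistic` needs two compare-and-swap attempts while `read_trylock_load_first` needs one. On a free lock both need exactly one, which is consistent with the commit message's statement that the uncontended case is unchanged.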
..
Makefile	locking/ww_mutex: Implement rtmutex based ww_mutex API functions	2021-08-17 19:05:26 +02:00
irqflag-debug.c	…
lock_events.c	…
lock_events.h	…
lock_events_list.h	…
lockdep.c	Merge branch 'akpm' (patches from Andrew)	2021-11-09 10:11:53 -08:00
lockdep_internals.h	…
lockdep_proc.c	…
lockdep_states.h	…
locktorture.c	locktorture: Warn on individual lock_torture_init() error conditions	2021-09-13 16:36:16 -07:00
mcs_spinlock.h	…
mutex-debug.c	locking/ww_mutex: Gather mutex_waiter initialization	2021-08-17 19:04:41 +02:00
mutex.c	locking: Remove rcu_read_{,un}lock() for preempt_{dis,en}able()	2021-10-19 17:27:06 +02:00
mutex.h	locking/mutex: Move the 'struct mutex_waiter' definition from <linux/mutex.h> to the internal header	2021-08-17 18:24:31 +02:00
osq_lock.c	…
percpu-rwsem.c	…
qrwlock.c	…
qspinlock.c	…
qspinlock_paravirt.h	…
qspinlock_stat.h	…
rtmutex.c	rtmutex: Wake up the waiters lockless while dropping the read lock.	2021-10-01 13:57:52 +02:00
rtmutex_api.c	locking/rtmutex: Prevent lockdep false positive with PI futexes	2021-08-17 19:06:02 +02:00
rtmutex_common.h	locking/rtmutex: Dont dereference waiter lockless	2021-08-25 15:42:32 +02:00
rwbase_rt.c	locking/rwbase: Optimize rwbase_read_trylock	2021-10-07 13:51:07 +02:00
rwsem.c	locking/rwsem: Optimize down_read_trylock() under highly contended case	2021-11-23 09:45:36 +01:00
semaphore.c	locking/semaphore: Add might_sleep() to down_*() family	2021-08-20 12:33:17 +02:00
spinlock.c	locking: Remove spin_lock_flags() etc	2021-10-30 16:37:28 +02:00
spinlock_debug.c	locking/rwlock: Provide RT variant	2021-08-17 17:50:51 +02:00
spinlock_rt.c	locking/rt: Take RCU nesting into account for __might_resched()	2021-10-01 13:57:51 +02:00
test-ww_mutex.c	locking/ww-mutex: Fix uninitialized use of ret in test_aa()	2021-10-01 13:57:49 +02:00
ww_mutex.h	locking/ww_mutex: Add rt_mutex based lock type and accessors	2021-08-17 19:05:11 +02:00
ww_rt_mutex.c	kernel/locking: Add context to ww_mutex_trylock()	2021-09-17 15:08:41 +02:00